Stemming query in Solr
<p>We have a search system based on Solr, accessed from C# through the SolrNet library, which supports advanced search features such as fuzzy, synonym and stemming search. All of these work, but what users expect from the stemming search appears to be a combination of stemming by reduction and stemming by expansion, so that grammatical variations of a word are covered. A use case makes this clearer:</p>
<ul>
<li>A search for fish would also find fishing.</li>
<li>A search for applied would also find applying, applies, and apply.</li>
</ul>
<p>We implemented stemming using a copyField with SnowballPorterFilterFactory. As a result, <strong>searching for <em>burning</em> returns results for <em>burning</em> and <em>burn</em></strong>, but <strong>searching for <em>Burn</em> does not return results for <em>burning</em>, <em>burnt</em> or <em>burns</em></strong>.</p>
<p>Since all the stemmers supported by Lucene/Solr use stemming by reduction, we are not sure how to go about this. As per the Solr Wiki:</p>
<blockquote> <p>A related technology to stemming is lemmatization, which allows for "stemming" by expansion, taking a root word and 'expanding' it to all of its various forms. Lemmatization can be used either at insertion time or at query time. Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the SynonymFilterFactory</p> </blockquote>
<p>We are not sure exactly how to go about this in Solr. Any ideas?</p>
<p>We were also thinking of using a C#-based stemmer/lemmatizer library to get the root of the word, using a public database such as WordNet to extract the different grammatical variations of that root, and then sending all of these terms to Solr in the query.
We have not yet done much research to find a stable C# stemmer/lemmatizer and a C# API for WordNet, but this approach seems likely to get too convoluted, and there should be a way to do it from within Solr.</p>
<p>Relevant portion of the Solr schema:</p>
<pre><code>&lt;field name="Content" type="text_general" indexed="false" stored="true" required="true"/&gt;
&lt;field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/&gt;
&lt;field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/&gt;

&lt;copyField source="Content" dest="ContentSearch"/&gt;
&lt;copyField source="Content" dest="ContentSearchStemming"/&gt;

&lt;fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"&gt;
  &lt;analyzer type="index"&gt;
    &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
    &lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /&gt;
    &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
  &lt;/analyzer&gt;
  &lt;analyzer type="query"&gt;
    &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
    &lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /&gt;
    &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;

&lt;fieldType name="text_stem" class="solr.TextField" &gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.SnowballPorterFilterFactory"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;
</code></pre>
<p>When I index a document, the content is stored as is in the Content field and copied to ContentSearch and ContentSearchStemming for text-based search and stemming search respectively. So the ContentSearchStemming field does hold the stemmed (reduced) form of the terms. I have checked this with Luke as well as with the Admin Schema Browser --&gt; Term Info.
In the Admin Analysis screen, I have tested and found that if I index the text "burning", it is reduced to and stored as "burn". So far so good.</p>
<p>Now, in the UI:</p>
<ul>
<li>Let's say the user enters the term "burn" and checks the stemming option. The expectation is that, since the user has asked for stemming, results should be returned for the term "burn" as well as for all terms that have "burn" as their stem, i.e. burning, burned, burns, etc.</li>
<li>Let's say the user enters the term "burning" and checks the stemming option. The expectation is that, since the user has asked for stemming, results should be returned for the term "burning" as well as for all terms that have "burn" as their stem, i.e. burn, burned, burns, etc.</li>
</ul>
<p>The query that gets submitted to Solr:</p>
<pre><code>q=ContentSearchStemming:burning
</code></pre>
<p>From the debug info:</p>
<pre><code>&lt;str name="rawquerystring"&gt;ContentSearchStemming:burning&lt;/str&gt;
&lt;str name="querystring"&gt;ContentSearchStemming:burning&lt;/str&gt;
&lt;str name="parsedquery"&gt;ContentSearchStemming:burn&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;ContentSearchStemming:burn&lt;/str&gt;
</code></pre>
<p>So, when the results are returned, I only get hits highlighted with the term "burn", even though the same document contains terms like burning and burns.</p>
<p>I thought that stemming should work like this:</p>
<ol>
<li>The stemming filter in the query analyzer chain reduces the input word to its stem: burning --&gt; burn.</li>
<li>The query component scans through the indexed terms and matches those whose stem equals the stem of the input term: burns --&gt; burn (matches), burning --&gt; burn (matches).</li>
</ol>
<p>The first point is happening. But it looks like the search is then executed as an exact text match against the stem "burn". Hence, burns or burned are not returned.</p>
<p>I hope I was able to make myself clear.</p>
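<p>To make the client-side expansion idea mentioned earlier more concrete (find the root of the word, look up its grammatical variations, send all of them to Solr), here is a rough sketch. It is written in Python rather than C# purely for brevity, and everything in it is an assumption for illustration: the <code>VARIANTS</code> dictionary is hand-written and stands in for a real resource such as WordNet, the function names are hypothetical, and querying the unstemmed <code>ContentSearch</code> field with the expanded terms is just one possible choice. It also does not escape special query characters.</p>

```python
# Hypothetical lemma dictionary: root word -> its grammatical variants.
# In practice this would come from a resource such as WordNet; the entries
# below are hand-written for illustration only.
VARIANTS = {
    "burn": ["burning", "burns", "burned", "burnt"],
    "apply": ["applied", "applying", "applies"],
    "fish": ["fishing"],
}

# Reverse index: any known form (root or variant) -> root.
ROOT_OF = {root: root for root in VARIANTS}
for root, forms in VARIANTS.items():
    for form in forms:
        ROOT_OF[form] = root


def expand_term(term):
    """Return the root plus all known variants of `term`, root first."""
    key = term.lower()
    root = ROOT_OF.get(key)
    if root is None:
        return [key]  # unknown word: search for it as typed (lowercased)
    return [root] + VARIANTS[root]


def build_query(field, term):
    """Build a Solr query string that ORs every expanded form in `field`."""
    forms = expand_term(term)
    if len(forms) == 1:
        return f"{field}:{forms[0]}"
    return f"{field}:(" + " OR ".join(forms) + ")"


# "Burn" and "burning" expand to the same set of forms.
print(build_query("ContentSearch", "Burn"))
print(build_query("ContentSearch", "burning"))
```

<p>With this approach the expansion happens before the query reaches Solr, so the resulting string could be passed as the <code>q</code> parameter unchanged. Whether this is preferable to doing the expansion inside Solr with a synonym dictionary, as the Wiki suggests, is exactly what we are unsure about.</p>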
 
