  Lucene: best way to do "starts-with" queries
    <p>I want to be able to do the following type of query:</p>
    <p>The data to index consists of (let's say) music videos, where only the title is of interest. I simply want to index these and then build queries such that, whatever word or words the user searches for, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words anywhere in the title. All of this should be case-insensitive.</p>
    <p>Example:</p>
    <p>For documents:</p>
    <ul> <li>Video1Title = Sea is blue</li> <li>Video2Title = Wild sea</li> <li>Video3Title = Wild sea Whatever</li> <li>Video4Title = Seaside Whatever</li> </ul>
    <p>If I search "sea" I want to get</p>
    <ul> <li>"Video1Title = Sea is blue"</li> </ul>
    <p>first, followed by all the other documents that contain "sea" in the title, but not at the beginning.</p>
    <p>If I search "Wild sea" I want to get</p>
    <ul> <li>Video2Title = Wild sea</li> <li>Video3Title = Wild sea Whatever</li> </ul>
    <p>first, followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as the title prefix.</p>
    <p>If I search "Seasi" I don't want to get anything (I don't care for keyword tokenization and prefix queries).</p>
    <p>Now, AFAICS, there's no direct way to tell Lucene "find me documents where word1 and word2 and etc. are in positions 1 and 2 and 3 and etc."</p>
    <p>There are "workarounds" to simulate that behaviour:</p>
    <ul> <li><p>Index the field twice. In <code>field1</code> you have the words tokenized (using perhaps <code>StandardAnalyzer</code>) and in <code>field2</code> you have them all clumped together into one token (using <code>KeywordAnalyzer</code>). Then you search for something like:</p> <p>+(field1:word1 word2 word3) (field2:"word1 word2 word3*")</p></li> </ul>
    <p>effectively telling Lucene "Documents must contain word1 or word2 or word3 in the title, and furthermore those that match 'title starts with word1 word2 word3' are better (get a higher score)".</p>
    <ul> <li>Add a "lucene_start_token" to the beginning of the field when indexing, such that <code>Video2Title = Wild sea</code> is indexed as "<code>title:lucene_start_token Wild sea</code>", and so on for the rest.</li> </ul>
    <p>Then do a query such as:</p>
    <p>+(title:sea) (title:"lucene_start_token sea")</p>
    <p>and have Lucene return all documents that contain my search word(s) in the title, giving a better score to those that match "lucene_start_token + search words".</p>
    <p>My question is then: are there better ways to do this (maybe using <a href="https://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/PhraseQuery.html" rel="nofollow">PhraseQuery</a> and <a href="https://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/index/Term.html" rel="nofollow">Term</a> <a href="https://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/PhraseQuery.html#add%28org.apache.lucene.index.Term,%20int%29" rel="nofollow">position</a>)? If not, which of the above is better performance-wise?</p>
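    <p>To pin down the ranking described above, here is a minimal plain-Java sketch of the expected behaviour. It does not use Lucene at all: the class <code>StartsWithRanker</code> and its methods are hypothetical names made up for illustration, and the tokenizer is only a rough stand-in for what an analyzer such as <code>StandardAnalyzer</code> would do. It can serve as a reference against which either workaround is checked.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical reference model of the desired ranking; NOT Lucene code.
final class StartsWithRanker {

    // Lower-case the text and split on anything that is not a letter or digit.
    // Assumption: this roughly approximates a case-insensitive word analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // 2 = title starts with the query words, in order;
    // 1 = title contains at least one query word as a whole token;
    // 0 = no match (so "Seasi" does not match "Seaside").
    static int score(String title, String query) {
        List<String> titleTokens = tokenize(title);
        List<String> queryTokens = tokenize(query);
        if (queryTokens.isEmpty()) {
            return 0;
        }
        boolean isPrefix = titleTokens.size() >= queryTokens.size()
                && titleTokens.subList(0, queryTokens.size()).equals(queryTokens);
        if (isPrefix) {
            return 2;
        }
        for (String q : queryTokens) {
            if (titleTokens.contains(q)) {
                return 1;
            }
        }
        return 0;
    }

    // Return matching titles, prefix matches first. List.sort is stable,
    // so titles with equal scores keep their input order.
    static List<String> search(List<String> titles, String query) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (score(title, query) > 0) {
                hits.add(title);
            }
        }
        hits.sort(Comparator.comparingInt((String t) -> score(t, query)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Sea is blue", "Wild sea", "Wild sea Whatever", "Seaside Whatever");
        System.out.println(search(titles, "sea"));      // [Sea is blue, Wild sea, Wild sea Whatever]
        System.out.println(search(titles, "Wild sea")); // [Wild sea, Wild sea Whatever, Sea is blue]
        System.out.println(search(titles, "Seasi"));    // []
    }
}
```

    <p>The two score levels mirror the boolean query shape in both workarounds: a required "contains any word" clause plus an optional "starts with the phrase" clause that only raises the score.</p>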