<p>In general, <em>relevance</em> is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but modified here for sentences for educational purposes).</p> <h2>Naive ranking</h2> <p>Here's an example of a naive ranking algorithm. The ranking could be as simple as:</p> <ol> <li>Sentences are ranked based on the average proximity between the query terms (e.g. the number of words separating each possible pair of query terms, where smaller is better), so that the sentence "Rock climbing is awesome" is ranked higher than "I am not a fan of climbing because I am lazy like a rock."</li> <li>More word matches are ranked higher, e.g. "Climbing is fun" is ranked higher than "Jogging is fun."</li> <li>Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" is ranked higher than "I am a rock."</li> </ol> <h2>Some common search engine ranking</h2> <h3>BM25</h3> <p>BM25 is a good robust algorithm for scoring documents with relation to the query. For reference purposes, here's a Wikipedia article about the <a href="http://en.wikipedia.org/wiki/Okapi_BM25" rel="nofollow">BM25 ranking algorithm</a>. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.</p> <p>Here it goes. Assuming your query consists of keywords q<sub>1</sub>, q<sub>2</sub>, ... , q<sub>m</sub>, the score of a sentence <strong>S</strong> with respect to the query <strong>Q</strong> is calculated as follows:</p> <blockquote> <p>SCORE(S, Q) = SUM(i=1..m) IDF(q<sub>i</sub>) * f(q<sub>i</sub>, S) * (k<sub>1</sub> + 1) / (f(q<sub>i</sub>, S) + k<sub>1</sub> * (1 - b + b * |S| / AVG_SENT_LENGTH))</p> </blockquote> <p>k<sub>1</sub> and b are free parameters (commonly chosen as k<sub>1</sub> in [1.2, 2.0] and b = 0.75; you can find good values empirically), f(q<sub>i</sub>, S) is the term frequency of q<sub>i</sub> in the sentence <strong>S</strong> (you can treat it as just the number of times the term occurs), |S| is the length of your sentence (in words), and AVG_SENT_LENGTH is the average sentence length of the sentences in your document. Finally, IDF(q<sub>i</sub>) is the inverse document frequency (or, in this case, inverse sentence frequency) of q<sub>i</sub>, which is usually computed as:</p> <blockquote> <p>IDF(q<sub>i</sub>) = log ((N - n(q<sub>i</sub>) + 0.5) / (n(q<sub>i</sub>) + 0.5))</p> </blockquote> <p>Where <strong>N</strong> is the total number of sentences, and n(q<sub>i</sub>) is the number of sentences containing q<sub>i</sub>.</p> <h3>Speed</h3> <p>Assume you don't store an inverted index or any additional data structure for fast access. The terms that can be pre-computed are <em>N</em> and <em>AVG_SENT_LENGTH</em>.</p> <p>First, notice that the more query terms a sentence matches, the higher it will be scored (because of the sum over terms). So if you score only the top <em>k</em> candidate sentences, you need to compute the values f(q<sub>i</sub>, S), |S|, and n(q<sub>i</sub>), which will take <code>O(AVG_SENT_LENGTH * m * k)</code> time, or, if you are ranking all the sentences, <code>O(DOC_LENGTH * m)</code> time in the worst case, where <em>k</em> is the number of sentences with the highest number of terms matched and <em>m</em> is the number of query terms. 
This follows because each sentence is about AVG_SENT_LENGTH words long, and you scan it once per query term, i.e. <em>m</em> times, for each of the <em>k</em> sentences.</p> <h3>Inverted index</h3> <p>Now let's look at an <a href="http://en.wikipedia.org/wiki/Inverted_index" rel="nofollow">inverted index</a> to allow fast text searches. We will treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:</p> <blockquote> <p>word<sub>i</sub>: (sent_id<sub>1</sub>, tf<sub>1</sub>), (sent_id<sub>2</sub>, tf<sub>2</sub>), ... , (sent_id<sub>k</sub>, tf<sub>k</sub>)</p> </blockquote> <p>Basically, you have a hashmap where the key is a <code>word</code> and the value is a list of pairs <code>(sent_id_j, tf_j)</code> corresponding to the ids of the sentences containing the word and its frequency in each. For example, it could be:</p> <blockquote> <p>rock: (1, 1), (5, 2)</p> </blockquote> <p>This tells us that the word <em>rock</em> occurs in the first sentence 1 time and in the fifth sentence 2 times.</p> <p>This pre-processing step gives you <code>O(1)</code> access to the term frequency of any particular word, so it will be as fast as you want.</p> <p>You would also want another hashmap to store the sentence lengths, which should be a fairly easy task.</p> <p>How do you build an inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about them. In short, you traverse your document, continuously creating pairs and incrementing frequencies in the hashmap of words. Here are some <a href="http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/indexBuilding_topost.pdf" rel="nofollow">slides</a> on building the index.</p>
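<p>The naive ranking above can be sketched in Python. This is a minimal illustration of the three rules, with some simplifications that are not part of the original description: punctuation is crudely stripped, and for repeated words only one position is kept.</p>

```python
def naive_rank(sentences, query_terms):
    """Rank sentences: more term matches first, then smaller average
    gap between matched terms, then alphabetical as a tie-breaker."""
    def sort_key(sentence):
        # Lowercase and strip trailing punctuation for crude matching.
        words = [w.strip(".,!?") for w in sentence.lower().split()]
        # Position of each word (last occurrence wins; fine for a sketch).
        pos = {w: i for i, w in enumerate(words)}
        matched = [t for t in query_terms if t in pos]
        # Rule 1: average distance between all pairs of matched terms.
        gaps = [abs(pos[a] - pos[b])
                for i, a in enumerate(matched) for b in matched[i + 1:]]
        avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
        # Rule 2: negate the match count so more matches sort first.
        # Rule 3: the sentence itself breaks remaining ties alphabetically.
        return (-len(matched), avg_gap, sentence)
    return sorted(sentences, key=sort_key)
```

<p>Rule 3 falls out of Python's tuple comparison: when match count and average gap are equal, sentences are ordered by the string itself.</p>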
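<p>The inverted index and the BM25 formula above can be combined into a short Python sketch. The parameter choices k<sub>1</sub> = 1.5 and b = 0.75 follow the ranges given earlier; stemming, lemmatization, and any tokenization beyond whitespace splitting are skipped, as in the discussion.</p>

```python
import math
from collections import defaultdict

def build_index(sentences):
    """Inverted index: word -> {sentence_id: term frequency},
    plus a list of sentence lengths (in words)."""
    index = defaultdict(dict)
    lengths = []
    for sid, sent in enumerate(sentences):
        words = sent.lower().split()
        lengths.append(len(words))
        for w in words:
            index[w][sid] = index[w].get(sid, 0) + 1
    return index, lengths

def bm25_score(query_terms, sid, index, lengths, k1=1.5, b=0.75):
    """BM25 score of sentence `sid` for the query, treating each
    sentence as a 'document'."""
    n = len(lengths)                # N: total number of sentences
    avg_len = sum(lengths) / n      # AVG_SENT_LENGTH
    score = 0.0
    for q in query_terms:
        postings = index.get(q, {})
        nq = len(postings)          # n(q_i): sentences containing q_i
        tf = postings.get(sid, 0)   # f(q_i, S), an O(1) lookup
        idf = math.log((n - nq + 0.5) / (nq + 0.5))
        denom = tf + k1 * (1 - b + b * lengths[sid] / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score
```

<p>With the index built once, each score only touches the postings of the query terms, which is exactly where the <code>O(1)</code> term-frequency access described above pays off.</p>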