<p>In general, <em>relevance</em> is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but modified here for sentences for educational purposes).</p> <h2>Naive ranking</h2> <p>Here's an example of a naive ranking algorithm. The ranking could be as simple as:</p> <ol> <li>Sentences are ranked based on the average proximity between the query terms (e.g. the number of words separating each possible pair of query terms, where smaller is better), so that the sentence "Rock climbing is awesome" is ranked higher than "I am not a fan of climbing because I am lazy like a rock."</li> <li>More word matches are ranked higher, e.g. "Climbing is fun" is ranked higher than "Jogging is fun."</li> <li>Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" is ranked higher than "I am a rock."</li> </ol> <h2>Some common search engine ranking</h2> <h3>BM25</h3> <p>BM25 is a good robust algorithm for scoring documents with relation to the query. For reference purposes, here's a Wikipedia article about the <a href="http://en.wikipedia.org/wiki/Okapi_BM25" rel="nofollow">BM25 ranking algorithm</a>. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.</p> <p>Here it goes. Assuming your query consists of keywords q<sub>1</sub>, q<sub>2</sub>, ... , q<sub>m</sub>, the score of a sentence <strong>S</strong> with respect to the query <strong>Q</strong> is calculated as follows:</p> <blockquote> <p>SCORE(S, Q) = SUM(i=1..m) IDF(q<sub>i</sub>) * f(q<sub>i</sub>, S) * (k<sub>1</sub> + 1) / (f(q<sub>i</sub>, S) + k<sub>1</sub> * (1 - b + b * |S| / AVG_SENT_LENGTH))</p> </blockquote> <p>k<sub>1</sub> and b are free parameters (commonly chosen as k<sub>1</sub> in [1.2, 2.0] and b = 0.75; you can find good values empirically), f(q<sub>i</sub>, S) is the term frequency of q<sub>i</sub> in the sentence <strong>S</strong> (you can treat it as just the number of times the term occurs), |S| is the length of your sentence (in words), and AVG_SENT_LENGTH is the average sentence length of the sentences in your document. Finally, IDF(q<sub>i</sub>) is the inverse document frequency (or, in this case, inverse sentence frequency) of q<sub>i</sub>, which is usually computed as:</p> <blockquote> <p>IDF(q<sub>i</sub>) = log ((N - n(q<sub>i</sub>) + 0.5) / (n(q<sub>i</sub>) + 0.5))</p> </blockquote> <p>Where <strong>N</strong> is the total number of sentences, and n(q<sub>i</sub>) is the number of sentences containing q<sub>i</sub>.</p> <h3>Speed</h3> <p>Assume you don't store an inverted index or any additional data structure for fast access. The terms that can be pre-computed are <em>N</em> and <em>AVG_SENT_LENGTH</em>.</p> <p>First, notice that the more query terms a sentence matches, the higher it will be scored (because of the sum over terms). So if you score only the top <em>k</em> candidate sentences, you need to compute the values f(q<sub>i</sub>, S), |S|, and n(q<sub>i</sub>), which will take <code>O(AVG_SENT_LENGTH * m * k)</code> time, or, if you are ranking all the sentences, <code>O(DOC_LENGTH * m)</code> time in the worst case, where <em>k</em> is the number of sentences with the highest number of terms matched and <em>m</em> is the number of query terms. 
This follows because each sentence is about AVG_SENT_LENGTH words long, and you scan it once per query term, i.e. <em>m</em> times, for each of the <em>k</em> sentences.</p> <h3>Inverted index</h3> <p>Now let's look at an <a href="http://en.wikipedia.org/wiki/Inverted_index" rel="nofollow">inverted index</a> to allow fast text searches. We will treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:</p> <blockquote> <p>word<sub>i</sub>: (sent_id<sub>1</sub>, tf<sub>1</sub>), (sent_id<sub>2</sub>, tf<sub>2</sub>), ... , (sent_id<sub>k</sub>, tf<sub>k</sub>)</p> </blockquote> <p>Basically, you have a hashmap where the key is a <code>word</code> and the value is a list of pairs <code>(sent_id_j, tf_j)</code> corresponding to the ids of the sentences containing the word and its frequency in each. For example, it could be:</p> <blockquote> <p>rock: (1, 1), (5, 2)</p> </blockquote> <p>This tells us that the word <em>rock</em> occurs in the first sentence 1 time and in the fifth sentence 2 times.</p> <p>This pre-processing step gives you <code>O(1)</code> access to the term frequency of any particular word, so it will be as fast as you want.</p> <p>You would also want another hashmap to store the sentence lengths, which should be a fairly easy task.</p> <p>How do you build an inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about them. In short, you traverse your document, continuously creating pairs and incrementing frequencies in the hashmap of words. Here are some <a href="http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/indexBuilding_topost.pdf" rel="nofollow">slides</a> on building the index.</p>
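<p>The naive ranking above can be sketched in Python. This is a minimal illustration of the three rules, with some simplifications that are not part of the original description: punctuation is crudely stripped, and for repeated words only one position is kept.</p>

```python
def naive_rank(sentences, query_terms):
    """Rank sentences: more term matches first, then smaller average
    gap between matched terms, then alphabetical as a tie-breaker."""
    def sort_key(sentence):
        # Lowercase and strip trailing punctuation for crude matching.
        words = [w.strip(".,!?") for w in sentence.lower().split()]
        # Position of each word (last occurrence wins; fine for a sketch).
        pos = {w: i for i, w in enumerate(words)}
        matched = [t for t in query_terms if t in pos]
        # Rule 1: average distance between all pairs of matched terms.
        gaps = [abs(pos[a] - pos[b])
                for i, a in enumerate(matched) for b in matched[i + 1:]]
        avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
        # Rule 2: negate the match count so more matches sort first.
        # Rule 3: the sentence itself breaks remaining ties alphabetically.
        return (-len(matched), avg_gap, sentence)
    return sorted(sentences, key=sort_key)
```

<p>Rule 3 falls out of Python's tuple comparison: when match count and average gap are equal, sentences are ordered by the string itself.</p>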
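<p>The inverted index and the BM25 formula above can be combined into a short Python sketch. The parameter choices k<sub>1</sub> = 1.5 and b = 0.75 follow the ranges given earlier; stemming, lemmatization, and any tokenization beyond whitespace splitting are skipped, as in the discussion.</p>

```python
import math
from collections import defaultdict

def build_index(sentences):
    """Inverted index: word -> {sentence_id: term frequency},
    plus a list of sentence lengths (in words)."""
    index = defaultdict(dict)
    lengths = []
    for sid, sent in enumerate(sentences):
        words = sent.lower().split()
        lengths.append(len(words))
        for w in words:
            index[w][sid] = index[w].get(sid, 0) + 1
    return index, lengths

def bm25_score(query_terms, sid, index, lengths, k1=1.5, b=0.75):
    """BM25 score of sentence `sid` for the query, treating each
    sentence as a 'document'."""
    n = len(lengths)                # N: total number of sentences
    avg_len = sum(lengths) / n      # AVG_SENT_LENGTH
    score = 0.0
    for q in query_terms:
        postings = index.get(q, {})
        nq = len(postings)          # n(q_i): sentences containing q_i
        tf = postings.get(sid, 0)   # f(q_i, S), an O(1) lookup
        idf = math.log((n - nq + 0.5) / (nq + 0.5))
        denom = tf + k1 * (1 - b + b * lengths[sid] / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score
```

<p>With the index built once, each score only touches the postings of the query terms, which is exactly where the <code>O(1)</code> term-frequency access described above pays off.</p>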