StackOverflow2013


How to count term frequency for a set of documents?
<p>I have a Lucene index with the following documents:</p>
<pre><code>doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
</code></pre>
<p>These 5 documents use 14 different terms:</p>
<pre><code>[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
</code></pre>
<p>The frequency of each term:</p>
<pre><code>[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
</code></pre>
<p>For easier reading:</p>
<pre><code>[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
</code></pre>
<p>What I want to know is how to obtain the term frequency vector for a set of documents.</p>
<p>For example:</p>
<pre><code>Set&lt;Document&gt; docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
</code></pre>
<p>would result in the output:</p>
<pre><code>[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
</code></pre>
<p>With all zeros removed:</p>
<pre><code>[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
</code></pre>
<p>Note that the result vector contains only the term frequencies within the given set of documents, NOT the overall frequencies across the whole index. The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.</p>
<p>A naive implementation would be to iterate over all documents in the <code>docs</code> set, create a map, and count each term. But I need a solution that also works with a document set of 100,000 or 500,000 documents.</p>
<p>Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that can be built at index time so that such a term vector can be obtained easily and quickly?</p>
<p>I'm not a Lucene expert, so I'm sorry if the solution is obvious or trivial.</p>
<p>Maybe worth mentioning: the solution should be fast enough for a web application, where it is applied to client search queries.</p>
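<p>For reference, the naive approach mentioned above could look roughly like the following sketch. It is plain Java with no Lucene calls: it assumes each document is already available as a list of its terms, and the names <code>NaiveTermFrequency</code> and <code>countTerms</code> are made up for illustration.</p>

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts terms over a set of documents.
final class NaiveTermFrequency {

    // Assumption: each document is represented as the list of its terms.
    static Map<String, Integer> countTerms(Collection<List<String>> docs) {
        // LinkedHashMap keeps terms in first-seen order, which makes the output easy to read.
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                // Insert 1 for a new term, otherwise add 1 to the existing count.
                frequencies.merge(term, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> doc2 = List.of("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = List.of("amarr", "laser", "armor", "planet");

        // Terms that do not occur in doc2 or doc3 never enter the map,
        // so there are no zero entries to remove afterwards.
        System.out.println(countTerms(List.of(doc2, doc3)));
        // prints {gallente=1, dodixie=1, armor=2, planet=2, amarr=1, laser=1}
    }
}
```

<p>The running time grows with the total number of terms in the selected documents, which is exactly why this does not look viable to me for sets of several hundred thousand documents per query.</p>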
