Extract tf-idf vectors with lucene
<p>I have indexed a set of documents using Lucene. I have also stored the DocumentTermVector for each document's content. I wrote a program that gets the term frequency vector for each document, but how can I get the tf-idf vector of each document?</p>
<p>Here is my code that outputs the term frequencies in each document:</p>
<pre><code>Directory dir = FSDirectory.open(new File(indexDir));
IndexReader ir = IndexReader.open(dir);
for (int docNum = 0; docNum &lt; ir.numDocs(); docNum++) {
    System.out.println(ir.document(docNum).getField("filename").stringValue());
    TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
    if (tfv == null) {
        // ignore empty fields
        continue;
    }
    String terms[] = tfv.getTerms();
    int termCount = terms.length;
    int freqs[] = tfv.getTermFrequencies();
    for (int t = 0; t &lt; termCount; t++) {
        System.out.println(terms[t] + " " + freqs[t]);
    }
}
</code></pre>
<p>Is there any built-in function in Lucene for me to do that?</p>
<hr>
<p>Nobody helped, so I did it myself:</p>
<pre><code>Directory dir = FSDirectory.open(new File(indexDir));
IndexReader ir = IndexReader.open(dir);
for (int docNum = 0; docNum &lt; ir.numDocs(); docNum++) {
    TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
    if (tfv == null) {
        // ignore empty fields
        continue;
    }
    String tterms[] = tfv.getTerms();
    int termCount = tterms.length;
    int freqs[] = tfv.getTermFrequencies();
    for (int t = 0; t &lt; termCount; t++) {
        // N / df; the cast avoids integer division, which would truncate the ratio
        double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
        System.out.println(tterms[t] + " " + freqs[t] * Math.log(idf));
    }
}
</code></pre>
<p>Is there any way to find the ID number of each term?</p>
<hr>
<p>Nobody helped, so I did it myself again:</p>
<pre><code>// Collect every term of the "title" field. The enumeration returns terms in
// sorted order, so a term's position in the list can serve as its ID.
List&lt;String&gt; list = new ArrayList&lt;String&gt;();
TermEnum terms = ir.terms(new Term("title", ""));
try {
    while (terms.term() != null &amp;&amp; "title".equals(terms.term().field())) {
        list.add(terms.term().text());
        if (!terms.next())
            break;
    }
} finally {
    terms.close();
}
for (int docNum = 0; docNum &lt; ir.numDocs(); docNum++) {
    TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
    if (tfv == null) {
        // ignore empty fields
        continue;
    }
    String tterms[] = tfv.getTerms();
    int termCount = tterms.length;
    int freqs[] = tfv.getTermFrequencies();
    for (int t = 0; t &lt; termCount; t++) {
        // N / df; the cast avoids integer division, which would truncate the ratio
        double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
        // binarySearch relies on the list being sorted
        System.out.println(Collections.binarySearch(list, tterms[t]) + " " + tterms[t] + " " + freqs[t] * Math.log(idf));
    }
}
</code></pre>
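<p>The tf * ln(N / df) weighting used above can be tried without a Lucene index. The following is only a minimal plain-Java sketch: the class name <code>TfIdfSketch</code>, the methods <code>idf</code> and <code>weigh</code>, and the sample counts are all made up for illustration and are not part of Lucene. One detail it isolates is that N / df has to be computed in floating point, because dividing two Java ints truncates the ratio before the logarithm is taken.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; none of these names come from the Lucene API.
final class TfIdfSketch {

    // idf = ln(N / df), with the division done in floating point.
    static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            throw new IllegalArgumentException("docFreq must be positive");
        }
        return Math.log((double) numDocs / docFreq);
    }

    // Weighs each term of one document by tf * idf.
    // termFreqs: term -> frequency in this document
    // docFreqs:  term -> number of documents containing the term
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // A term missing from docFreqs gets df = 0, which idf() rejects.
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            weights.put(e.getKey(), e.getValue() * idf(numDocs, df));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up counts: a 10-document collection and one document's term frequencies.
        Map<String, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put("lucene", 3);
        termFreqs.put("index", 1);
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("lucene", 4);
        docFreqs.put("index", 10);

        for (Map.Entry<String, Double> e : weigh(termFreqs, docFreqs, 10).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

<p>With int division, a term that occurs in 4 of 10 documents would be weighted with ln(2) rather than ln(2.5), so the cast changes the resulting vector whenever N is not a multiple of df.</p>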
 
