  1. Better way of calculating document similarity using Lucene
    <p>I’m indexing a collection of documents with Lucene, specifying TermVector at indexing time. I then read the index to retrieve the terms and their frequencies, and compute a TF-IDF score vector for each document. Finally, using the TF-IDF vectors, I calculate the pairwise cosine similarity between documents using <a href="http://en.wikipedia.org/wiki/Cosine_similarity" rel="nofollow">Wikipedia's cosine similarity equation</a>.</p>
    <p>This is my problem: say I have two identical documents, “A” and “B”, in this collection (each has more than 200 sentences). If I calculate the pairwise cosine similarity between A and B, I get a cosine value of 1, which is perfectly OK. But if I remove a single sentence from document “B”, the cosine similarity between the two documents drops to around 0.85. The documents are almost identical, but the cosine value does not reflect that. I understand the problem is with the equation that I’m using.</p>
    <p>Is there a better way or equation that I can use for calculating cosine similarity between documents?</p>
    <p><strong>Edited</strong></p>
    <p>This is how I calculate cosine similarity. <code>doc1[]</code> and <code>doc2[]</code> are the TF-IDF vectors of the corresponding documents. The vectors contain only the <code>scores</code>, not the <code>words</code>.</p>
<pre><code>private double cosineSimBetweenTwoDocs(float doc1[], float doc2[]) {
    double temp;
    int doc1Len = doc1.length;
    int doc2Len = doc2.length;
    float numerator = 0;
    float temSumDoc1 = 0;
    float temSumDoc2 = 0;
    double equlideanNormOfDoc1 = 0;
    double equlideanNormOfDoc2 = 0;
    if (doc1Len &gt; doc2Len) {
        for (int i = 0; i &lt; doc2Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        equlideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        equlideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    } else {
        for (int i = 0; i &lt; doc1Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        equlideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        equlideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    }
    temp = numerator / (equlideanNormOfDoc1 * equlideanNormOfDoc2);
    return temp;
}
</code></pre>
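Because the vectors above hold only scores, index i in doc1[] and index i in doc2[] may refer to different terms once a sentence is removed, and the loop also ignores the tail of the longer array. The following is a minimal sketch of one possible alternative, not a confirmed fix: it assumes each document's TF-IDF weights are kept in a map keyed by term, so that only weights of the same term are multiplied. The class name SparseCosine and the map representation are hypothetical and are not part of the original code or of Lucene.

```java
import java.util.Map;

// Hypothetical helper: cosine similarity over term-keyed TF-IDF weights.
final class SparseCosine {

    // Multiplies weights only where both documents contain the same term.
    static double cosine(Map<String, Double> doc1, Map<String, Double> doc2) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : doc1.entrySet()) {
            Double other = doc2.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double norm1 = norm(doc1);
        double norm2 = norm(doc2);
        // An empty or all-zero vector has no direction; report 0 instead of NaN.
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;
        }
        return dot / (norm1 * norm2);
    }

    // Euclidean norm over all of a document's weights, shared terms or not.
    private static double norm(Map<String, Double> doc) {
        double sum = 0.0;
        for (double weight : doc.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }
}
```

With this layout, a term that appears in only one document adds nothing to the dot product but still counts toward that document's norm, which matches the cosine formula applied to vectors defined over the full vocabulary.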