<p>As I said in my comment, I think you made a mistake somewhere. The vectors contain <code>&lt;word, frequency&gt;</code> pairs, not just words. Therefore, when you delete a sentence, only the frequencies of the words in that sentence decrease, by one per deleted occurrence; the entries for the other words are not shifted. Consider the following example:</p>
<p>Document a:</p>
<pre><code>A B C A A B C. D D E A B. D A B C B A.
</code></pre>
<p>Document b (document a with its second sentence removed):</p>
<pre><code>A B C A A B C. D A B C B A.
</code></pre>
<p>Vector a:</p>
<pre><code>A:6, B:5, C:3, D:3, E:1
</code></pre>
<p>Vector b:</p>
<pre><code>A:5, B:4, C:3, D:1, E:0
</code></pre>
<p>These vectors give the following cosine similarity:</p>
<pre><code>(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (Sqrt(6^2+5^2+3^2+3^2+1^2) * Sqrt(5^2+4^2+3^2+1^2+0^2))
= 62 / (8.94427 * 7.14143)
= 0.970648
</code></pre>
<p><strong>Edit</strong> I also think your source code does not work correctly. Consider the following code, which works with the example above:</p>
<pre><code>import java.util.HashMap;
import java.util.Map;

public class DocumentVector {
    // Maps each word to the number of times it occurs in the document.
    Map&lt;String, Integer&gt; wordMap = new HashMap&lt;String, Integer&gt;();

    public void incCount(String word) {
        Integer oldCount = wordMap.get(word);
        wordMap.put(word, oldCount == null ? 1 : oldCount + 1);
    }

    double getCosineSimilarityWith(DocumentVector otherVector) {
        // Words missing from either vector contribute 0 to the inner product,
        // so iterating over this vector's words is enough.
        double innerProduct = 0;
        for (String w : this.wordMap.keySet()) {
            innerProduct += this.getCount(w) * otherVector.getCount(w);
        }
        return innerProduct / (this.getNorm() * otherVector.getNorm());
    }

    // Euclidean length of the frequency vector.
    double getNorm() {
        double sum = 0;
        for (Integer count : wordMap.values()) {
            sum += count * count;
        }
        return Math.sqrt(sum);
    }

    int getCount(String word) {
        return wordMap.containsKey(word) ? wordMap.get(word) : 0;
    }

    public static void main(String[] args) {
        String doc1 = "A B C A A B C. D D E A B. D A B C B A.";
        String doc2 = "A B C A A B C. D A B C B A.";

        // Split on every run of non-letter characters.
        DocumentVector v1 = new DocumentVector();
        for (String w : doc1.split("[^a-zA-Z]+")) {
            v1.incCount(w);
        }

        DocumentVector v2 = new DocumentVector();
        for (String w : doc2.split("[^a-zA-Z]+")) {
            v2.incCount(w);
        }

        System.out.println("Similarity = " + v1.getCosineSimilarityWith(v2));
    }
}
</code></pre>
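<p>If the documents can be arbitrary text, two edge cases may be worth guarding against. This is only a sketch, not part of the code above: the class <code>CosineSketch</code> and its methods are hypothetical names, and returning 0 for a vector with no words is an assumption rather than a standard convention. It skips empty tokens (which <code>split</code> can produce, for example when the text starts with punctuation) and avoids the <code>0.0 / 0.0</code> division, which is <code>NaN</code> in Java, when one vector has norm 0.</p>

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Hypothetical helper: builds a word -> frequency map, skipping empty tokens.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.split("[^a-zA-Z]+")) {
            if (!w.isEmpty()) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Euclidean length of a frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Assumption: a vector with no words has similarity 0 to everything.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double na = norm(a);
        double nb = norm(b);
        if (na == 0 || nb == 0) {
            return 0.0;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = count("A B C A A B C. D D E A B. D A B C B A.");
        Map<String, Integer> b = count("A B C A A B C. D A B C B A.");
        System.out.println("Similarity = " + cosine(a, b));
        System.out.println("Similarity with empty document = " + cosine(a, count("")));
    }
}
```

<p>On the two example documents this should give the same value as the hand calculation above, since the tokenization and the formula are unchanged for well-formed input.</p>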
 
