Error in computing text similarity using scikit-learn
<p>I'm a beginner with the vector space model (VSM), and I tried the code from <a href="http://pyevolve.sourceforge.net/wordpress/?p=1589" rel="nofollow noreferrer">this site</a>. It's a very good introduction to VSM, but I got different results from the author's. This might be a compatibility problem, as <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" rel="nofollow noreferrer">scikit-learn</a> seems to have changed a lot since the introduction was written. It might also be that I misunderstood the explanation.<br/> I used the code below and got the wrong answer. Can someone figure out what is wrong with it? The output of the code and the expected answer are posted below.</p>
<p>I have done the computation by hand, so I know that the website's results are correct. There is another <a href="https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity">Stack Overflow question</a> that uses the same code, but it doesn't get the same results as the website either.<br/> </p>
<pre><code>import numpy, scipy, sklearn

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Build the vocabulary from the training documents, dropping English stop words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(train_set)

# Term counts of the test documents over that vocabulary
smatrix = vectorizer.transform(test_set)

# Fit the idf weights on the test counts, then apply tf-idf with L2 normalisation
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
tfidf.fit(smatrix)
#print(smatrix.todense())
print(tfidf.idf_)
tf_idf_matrix = tfidf.transform(smatrix)
print(tf_idf_matrix.todense())
</code></pre>
<p><strong>results vector of tf-idf</strong> <br/> #[ 2.09861229 1. 1.40546511 1. ]</p>
<p><strong>right vector of tf-idf</strong> <br/> #[0.69314718, -0.40546511, -0.40546511, 0]</p>
<p><strong>results tf_idf_matrix</strong> <br/> #[[ 0. 0.50154891 0.70490949 0.50154891]<br/> #[ 0. 0.50854232 0. 0.861037 ]]</p>
<p><strong>right answer</strong> <br/> # [[ 0. -0.70710678 -0.70710678 0. ]<br/> # [ 0. -0.89442719 -0.4472136 0. ]]</p>
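One way to narrow down the discrepancy is to recompute the idf vector by hand with two candidate formulas. The sketch below is not the author's or the website's code. It assumes the smoothed formula documented for `TfidfTransformer(smooth_idf=True)`, `ln((1 + n) / (1 + df)) + 1`, and, as a guess at the tutorial's definition, the unshifted `ln(n / (1 + df))`. The document frequencies are counted by hand from `test_set` over the vocabulary fitted on `train_set`. The helper names `idf_smoothed` and `idf_tutorial` are made up for illustration.

```python
import math

# Vocabulary fitted on train_set after English stop-word removal,
# in the alphabetical order CountVectorizer uses for its columns.
vocabulary = ["blue", "bright", "sky", "sun"]

# Number of test_set documents containing each term (counted by hand).
doc_freq = {"blue": 0, "bright": 2, "sky": 1, "sun": 2}
n_docs = 2


def idf_smoothed(df, n):
    """Smoothed idf as documented for TfidfTransformer(smooth_idf=True)."""
    return math.log((1 + n) / (1 + df)) + 1


def idf_tutorial(df, n):
    """Assumed tutorial-style idf: no '+ 1' offset and no smoothing of n."""
    return math.log(n / (1 + df))


smoothed = [round(idf_smoothed(doc_freq[t], n_docs), 8) for t in vocabulary]
tutorial = [round(idf_tutorial(doc_freq[t], n_docs), 8) for t in vocabulary]

print("smoothed:", smoothed)
print("tutorial:", tutorial)
```

Under these assumptions the first list reproduces the values printed by `tfidf.idf_` in the question, and the second contains the same numbers as the "right" vector, though in a different term order. That suggests, without proving it, that the gap comes from the idf definition and the column ordering rather than from a mistake in the hand computation.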
 
