Note that there are some explanatory texts on larger screens.

plurals
  1. POPython: tf-idf-cosine: to find document similarity
    text
    copied!<p>I was following a tutorial which was available at <a href="http://blog.christianperone.com/?p=1589" rel="nofollow noreferrer">Part 1</a> &amp; <a href="http://blog.christianperone.com/?p=1747" rel="nofollow noreferrer">Part 2</a>. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from <a href="https://stackoverflow.com/questions/11911469/tfidf-for-search-queries">stackoverflow</a>, included is the code mentioned in the above link (just so as to make life easier)</p> <pre><code>from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from nltk.corpus import stopwords import numpy as np import numpy.linalg as LA train_set = ["The sky is blue.", "The sun is bright."] # Documents test_set = ["The sun in the sky is bright."] # Query stopWords = stopwords.words('english') vectorizer = CountVectorizer(stop_words = stopWords) #print vectorizer transformer = TfidfTransformer() #print transformer trainVectorizerArray = vectorizer.fit_transform(train_set).toarray() testVectorizerArray = vectorizer.transform(test_set).toarray() print 'Fit Vectorizer to train set', trainVectorizerArray print 'Transform Vectorizer to test set', testVectorizerArray transformer.fit(trainVectorizerArray) print print transformer.transform(trainVectorizerArray).toarray() transformer.fit(testVectorizerArray) print tfidf = transformer.transform(testVectorizerArray) print tfidf.todense() </code></pre> <p>as a result of the above code I have the following matrix</p> <pre><code>Fit Vectorizer to train set [[1 0 1 0] [0 1 0 1]] Transform Vectorizer to test set [[0 1 1 1]] [[ 0.70710678 0. 0.70710678 0. ] [ 0. 0.70710678 0. 0.70710678]] [[ 0. 0.57735027 0.57735027 0.57735027]] </code></pre> <p>I am not sure how to use this output in order to calculate cosine similarity, I know how to implement cosine similarity with respect to two vectors of similar length but here I am not sure how to identify the two vectors.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload