<p>First off, if you want to extract count features, apply TF-IDF normalization, and apply row-wise Euclidean normalization, you can do all three in one operation with <code>TfidfVectorizer</code>:</p>
<pre><code>&gt;&gt;&gt; from sklearn.feature_extraction.text import TfidfVectorizer
&gt;&gt;&gt; from sklearn.datasets import fetch_20newsgroups
&gt;&gt;&gt; twenty = fetch_20newsgroups()
&gt;&gt;&gt; tfidf = TfidfVectorizer().fit_transform(twenty.data)
&gt;&gt;&gt; tfidf
&lt;11314x130088 sparse matrix of type '&lt;type 'numpy.float64'&gt;'
    with 1787553 stored elements in Compressed Sparse Row format&gt;
</code></pre>
<p>Now, to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, as the tfidf vectors are already row-normalized. The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:</p>
<pre><code>&gt;&gt;&gt; tfidf[0:1]
&lt;1x130088 sparse matrix of type '&lt;type 'numpy.float64'&gt;'
    with 89 stored elements in Compressed Sparse Row format&gt;
</code></pre>
<p>scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel:</p>
<pre><code>&gt;&gt;&gt; from sklearn.metrics.pairwise import linear_kernel
&gt;&gt;&gt; cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
&gt;&gt;&gt; cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])
</code></pre>
<p>Hence, to find the most related documents, we can use <code>argsort</code> and some negative array slicing (the most related documents have the highest cosine similarity values, hence they sit at the end of the sorted indices array). Note that the slice <code>[:-5:-1]</code> used here returns four indices, not five; use <code>[:-6:-1]</code> to get the top 5:</p>
<pre><code>&gt;&gt;&gt; related_docs_indices = cosine_similarities.argsort()[:-5:-1]
&gt;&gt;&gt; related_docs_indices
array([    0,   958, 10576,  3277])
&gt;&gt;&gt; cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
</code></pre>
<p>The first result is a sanity check: the query document itself comes out as the most similar document, with a cosine similarity score of 1. It has the following text:</p>
<pre><code>&gt;&gt;&gt; print(twenty.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----
</code></pre>
<p>The second most similar document is a reply that quotes the original message and hence shares many words with it:</p>
<pre><code>&gt;&gt;&gt; print(twenty.data[958])
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26

In article &lt;1993Apr20.174246.14375@wam.umd.edu&gt; lerxst@wam.umd.edu (where's my thing) writes:
&gt;
&gt;  I was wondering if anyone out there could enlighten me on this car I saw
&gt; the other day. It was a 2-door sports car, looked to be from the late 60s/
&gt; early 70s. It was called a Bricklin. The doors were really small. In addition,
&gt; the front bumper was separate from the rest of the body. This is
&gt; all I know. If anyone can tellme a model name, engine specs, years
&gt; of production, where this car is made, history, or whatever info you
&gt; have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.

&gt;    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour                          rseymour@reed.edu
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project                 Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR
</code></pre>
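<p>To see why a plain dot product is enough once the rows are normalized, and how the negative slice behaves, here is a minimal standard-library sketch. It does not use scikit-learn; the toy vectors and the helper names (<code>l2_normalize</code>, <code>dot</code>, <code>cosine</code>) are made up for illustration and are not part of any library:</p>

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity: dot(a, b) / (|a| * |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy term-weight vectors, one row per document.
docs = [
    [3.0, 0.0, 1.0, 2.0],
    [2.0, 0.0, 1.0, 2.0],
    [0.0, 4.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 5.0, 0.0],
    [3.0, 1.0, 0.0, 0.0],
]
unit_docs = [l2_normalize(d) for d in docs]

# On unit-length rows, the dot product with row 0 is the cosine similarity.
sims = [dot(unit_docs[0], u) for u in unit_docs]

# Indices in ascending order of similarity, analogous to numpy's argsort().
order = sorted(range(len(sims)), key=sims.__getitem__)

top4 = order[:-5:-1]  # the slice stops before position -5: four indices
top5 = order[:-6:-1]  # one step further: five indices
```

<p>The same slicing semantics apply to numpy arrays, which is why <code>argsort()[:-5:-1]</code> above yields four results rather than five.</p>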