<p>It's not your fault: the current version of <code>sklearn</code> uses a different formula from the one in the tutorial.</p>
<p>The current version of <code>sklearn</code> uses this formula (<a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L954" rel="nofollow">source</a>):</p>
<pre><code>idf = log ( n_samples / df ) + 1
</code></pre>
<p>where <code>n_samples</code> is the total number of documents (<code>|D|</code> in the tutorial) and <code>df</code> is the number of documents in which the term appears (<code>{d:t_1 \in D}</code> in the tutorial).</p>
<p>To avoid division by zero, it applies smoothing by default (option <code>smooth_idf=True</code> in <code>TfidfVectorizer</code>, see the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" rel="nofollow">documentation</a>), which increments <code>df</code> and <code>n_samples</code> like this, so that both values are at least 1:</p>
<pre><code>df += 1
n_samples += 1
</code></pre>
<p>The tutorial, on the other hand, uses this formula:</p>
<pre><code>idf = log ( n_samples / (1+df) )
</code></pre>
<p>So you can't get exactly the same result as the tutorial unless you change the formula in the source code.</p>
<p><strong>Edit</strong>:</p>
<p>Strictly speaking, the right formula is <code>log(n_samples/df)</code>, but since it causes division by zero in practice, people modify the formula so that it can be used in all cases. The most common variant is the one you mentioned, <code>log(n_samples/(1+df))</code>, but it is also not wrong to use <code>log(n_samples/df)+1</code>, given that the values have already been smoothed beforehand. Reading the code history, though, it seems that they chose this form so that they won't have negative IDF values (as discussed in this <a href="https://github.com/scikit-learn/scikit-learn/pull/514" rel="nofollow">pull request</a> and later updated in <a href="https://github.com/scikit-learn/scikit-learn/commit/0d1daad65a6e39282d65e8315b820ddcafe56066#L1R566" rel="nofollow">this fix</a>). Another way to remove negative IDF values is simply to convert them to 0. I have yet to find out which method is more commonly used.</p>
<p>They did agree that the way they do it is not the standard way. So you can safely say that <code>log(n_samples/(1+df))</code> is the correct way.</p>
<p>If you want to edit the formula, I must first warn you that <strong>this will affect every user that uses the code, so make sure you know what you're doing.</strong></p>
<p>You can go to the source code (on Unix it's at <code>/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py</code>; I'm not using Windows right now, but there you can search for the file "text.py") and edit the formula directly. You might need administrator/root access, depending on the platform you use.</p>
<p><em>Additional note</em>:</p>
<p>The order of terms in the vocabulary is also different (at least on my machine), so to get exactly the same result (once the formula is the same), you also need to pass in the same vocabulary as shown in the tutorial. So, using your code:</p>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer

# train_set is the list of documents defined in your code
vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)  # You don't need stop_words if you use vocabulary
vectorizer.fit_transform(train_set)
print 'Vocabulary:', vectorizer.vocabulary_
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
</code></pre>
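<p>To see how far apart the two formulas can be, here is a minimal sketch in plain Python. It only transcribes the formulas quoted above and does not call <code>sklearn</code> itself; the function names <code>idf_sklearn</code> and <code>idf_tutorial</code> and the corpus size of 4 documents are made up for illustration.</p>

```python
import math


def idf_sklearn(n_samples, df, smooth=True):
    """IDF as quoted above for sklearn: log(n_samples / df) + 1.

    With smooth=True, both counts are incremented by 1 first,
    mirroring the smooth_idf=True behaviour described in the answer.
    """
    if smooth:
        n_samples += 1
        df += 1
    return math.log(float(n_samples) / df) + 1


def idf_tutorial(n_samples, df):
    """IDF as used in the tutorial: log(n_samples / (1 + df))."""
    return math.log(float(n_samples) / (1 + df))


# Hypothetical corpus of 4 documents; df is the number of documents
# that contain the term.
n_samples = 4
for df in (1, 2, 4):
    print("df=%d  sklearn=%.4f  tutorial=%.4f"
          % (df, idf_sklearn(n_samples, df), idf_tutorial(n_samples, df)))
```

<p>For a term that appears in every document (<code>df == n_samples</code>), the tutorial formula gives a negative value, <code>log(4/5)</code> in this example, while the smoothed <code>+ 1</code> form gives exactly 1. That is the negative-IDF case mentioned above.</p>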