StackOverflow2013


plurals

PO
primarykey
Id
18692538
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
11
CommunityOwnedDate
CreationDate
2013-09-09T06:25:36.713
FavoriteCount
0
LastActivityDate
2013-09-10T02:56:08.420
LastEditDate
2013-09-10T02:56:08.420
LastEditorUserId
895932
OwnerUserId
895932
ParentId
18687879
PostTypeId
2
Score
4
ViewCount
0
LastEditorDisplayName
text
Body
It's not your fault: the current <code>sklearn</code> uses a different formula from the one in the tutorial. The current version of <code>sklearn</code> uses this formula (<a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L954" rel="nofollow">source</a>): <pre><code>idf = log ( n_samples / df ) + 1 </code></pre> where <code>n_samples</code> is the total number of documents (<code>|D|</code> in the tutorial) and <code>df</code> is the number of documents in which the term appears (<code>{d:t_1 \in D}</code> in the tutorial). To avoid division by zero, it applies smoothing by default (option <code>smooth_idf=True</code> in <code>TfidfVectorizer</code>, see the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" rel="nofollow">documentation</a>), which increments <code>df</code> and <code>n_samples</code> so that both are at least 1: <pre><code>df += 1
n_samples += 1
</code></pre> The tutorial, on the other hand, uses this formula: <pre><code>idf = log ( n_samples / (1+df) ) </code></pre> So you can't get exactly the same result as the tutorial unless you change the formula in the source code. Edit: Strictly speaking, the right formula is <code>log(n_samples/df)</code>, but since it can divide by zero in practice, people modify it so that it works in all cases. The most common variant is the one you mentioned, <code>log(n_samples/(1+df))</code>, but it is also not wrong to use <code>log(n_samples/df)+1</code>, given that the values have already been smoothed beforehand.
Reading the code history, though, it seems they chose this form to avoid negative IDF values (as discussed in this <a href="https://github.com/scikit-learn/scikit-learn/pull/514" rel="nofollow">pull request</a> and later updated in <a href="https://github.com/scikit-learn/scikit-learn/commit/0d1daad65a6e39282d65e8315b820ddcafe56066#L1R566" rel="nofollow">this fix</a>). Another way to remove negative IDF values is simply to convert them to 0. I have yet to find out which method is more commonly used. They did agree that the way they do it is not the standard way, so you can safely say that <code>log(n_samples/(1+df))</code> is the more standard form. If you want to edit the formula, I must first warn you that the change will affect every user of that installation, so make sure you know what you're doing. You can go to the source code (on Unix it's at <code>/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py</code>; I'm not using Windows now, but there you can search for the file "text.py") and edit the formula directly. You might need administrator/root access, depending on your platform. Additional note: the order of terms in the vocabulary is also different (at least on my machine), so to get exactly the same result (assuming the formula is the same), you also need to pass in the exact vocabulary shown in the tutorial. So, using your code: <pre><code>from sklearn.feature_extraction.text import CountVectorizer

# train_set is the list of training documents from your code
vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
# You don't need stop_words if you pass an explicit vocabulary
vectorizer = CountVectorizer(vocabulary=vocabulary)
vectorizer.fit_transform(train_set)
print('Vocabulary: %s' % vectorizer.vocabulary_)
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
</code></pre>
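To see how far apart the two formulas discussed in this answer can be, here is a minimal sketch that evaluates both by hand. It does not call <code>sklearn</code>; the function names and the document count of 4 are hypothetical, chosen only for illustration, and the smoothed form is written out from the description in this answer rather than copied from the library source.

```python
import math


def idf_sklearn_smoothed(n_samples, df):
    """IDF as described in this answer for sklearn with smooth_idf=True:
    add 1 to both counts, then compute log(n_samples / df) + 1."""
    return math.log((n_samples + 1.0) / (df + 1.0)) + 1.0


def idf_tutorial(n_samples, df):
    """IDF as used in the tutorial: log(n_samples / (1 + df))."""
    return math.log(n_samples / (1.0 + df))


# Hypothetical corpus size, for illustration only.
n_samples = 4

for df in range(1, n_samples + 1):
    print("df=%d  sklearn-style=%.4f  tutorial-style=%.4f"
          % (df, idf_sklearn_smoothed(n_samples, df), idf_tutorial(n_samples, df)))
```

For a term that appears in every document (<code>df == n_samples</code>), the tutorial-style value is negative, while the smoothed form stays at 1 or above. This is consistent with the motivation mentioned in the pull request discussion.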
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POerror in computing text similarity using scikit learn
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USjusthalf
UserOwnerUserId
1. USjusthalf
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POerror in computing text similarity using scikit learn
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
