<p>I'm not sure of your implementation, but the <a href="http://en.wikipedia.org/wiki/Dot_product" rel="nofollow noreferrer">cosine distance</a> of two vectors is equal to the normalized dot product of those vectors.</p>

<p>The dot product of two vectors can be expressed as a · b = a<sup>T</sup>b. As a result, if the vectors have different lengths, you can't take the dot product to compute the cosine.</p>

<p>In a standard TF*IDF approach the terms in your matrix should be indexed by <code>term, document</code>; as a result, any term not appearing in a document should appear as a zero in your matrix.</p>

<p>The way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.</p>

<p>On the other hand, if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.</p>

<p>A full explanation of TF*IDF follows:</p>

<p>In a classic TF*IDF you construct a term-document matrix <code>a</code>. Each value in matrix <code>a</code> is characterized as a<sub>i,j</sub>, where <code>i</code> is the term and <code>j</code> is the document. This value is a combination of local, global and normalized weights (although if you normalize your documents, the normalized weight should be 1). Thus a<sub>i,j</sub> = f<sub>i,j</sub>*D/d<sub>i</sub>, where f<sub>i,j</sub> is the frequency of word <code>i</code> in doc <code>j</code>, <code>D</code> is the total number of documents, and d<sub>i</sub> is the number of documents containing term <code>i</code>.</p>

<p>Your query is a vector of terms designated as <code>b</code>. Each entry b<sub>i,q</sub> refers to term <code>i</code> in query <code>q</code>: b<sub>i,q</sub> = f<sub>i,q</sub>, where f<sub>i,q</sub> is the frequency of term <code>i</code> in query <code>q</code>. In this case each query is a vector, and multiple queries form a matrix.</p>

<p>We can then calculate unit vectors so that the dot product produces the correct cosine. To obtain unit vectors, we divide each document column of the matrix <code>a</code> and the query <code>b</code> by its Euclidean (<a href="http://mathworld.wolfram.com/FrobeniusNorm.html" rel="nofollow noreferrer">Frobenius</a>) norm.</p>

<p>Finally, we can compute the cosine distance by taking the transpose of the vector <code>b</code> for a given query, one query (or vector) per calculation. This is denoted as b<sup>T</sup>a. The final result is a vector with a score for each document, where a higher score denotes a higher document rank.</p>
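The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the corpus, the query, and all names (<code>documents</code>, <code>vocab</code>, <code>norm</code>, etc.) are made up for the example, and it uses the raw <code>f*D/d</code> weighting from the explanation rather than the more common log-scaled IDF.

```python
# Sketch of TF*IDF scoring via the normalized dot product b^T a.
# The corpus, query, and all variable names here are hypothetical.
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
query = "cat dog"

# Vocabulary: one row per term, one column per document.
vocab = sorted({t for doc in documents for t in doc.split()})
D = len(documents)

# d_i: number of documents containing term i.
df = {t: sum(t in doc.split() for doc in documents) for t in vocab}

# Term-document matrix a[i][j] = f_ij * D / d_i (the weighting described above).
counts = [Counter(doc.split()) for doc in documents]
a = [[counts[j][t] * D / df[t] for j in range(D)] for t in vocab]

# Query vector b_i = f_iq; query terms outside the vocabulary are ignored.
qcounts = Counter(query.split())
b = [qcounts[t] for t in vocab]

def norm(v):
    """Euclidean norm, guarded against the all-zero vector."""
    return math.sqrt(sum(x * x for x in v)) or 1.0

# Normalize each document column and the query to unit length, so that
# the dot product b^T a yields the cosine for each document.
cols = [[a[i][j] for i in range(len(vocab))] for j in range(D)]
unit_cols = [[x / norm(c) for x in c] for c in cols]
unit_b = [x / norm(b) for x in b]

# One score per document; higher score means higher rank.
scores = [sum(bi * ci for bi, ci in zip(unit_b, col)) for col in unit_cols]
ranking = sorted(range(D), key=lambda j: -scores[j])
print(scores, ranking)
```

With this toy corpus, the second document (containing both "cat" and "dog") ranks first, and the third (which shares no exact tokens with the query) scores zero, matching the expectation that absent terms contribute nothing.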