You should start by converting your documents into [TF-log(1 + IDF) vectors](http://en.wikipedia.org/wiki/Vector_space_model): term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.

Another solution is to use, for instance, abs(hash(term)) as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than Python dicts.

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for a new document to label, you can compute the [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity) between the document vector and each category vector and choose the most similar category as the label for your document.

If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as explained in [this example](http://github.com/ogrisel/scikit-learn/blob/master/examples/plot_logistic_l1_l2_coef.py) from [scikit-learn](http://scikit-learn.org/) (this is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.

For larger datasets, you should try [vowpal wabbit](http://github.com/JohnLangford/vowpal_wabbit), which is probably the fastest rabbit on earth for large-scale document classification problems (but there are no easy-to-use Python wrappers AFAIK).
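
As a rough sketch of the dict-based route (the str.split tokenization, the helper names such as tfidf_vector, and the toy two-category corpus below are illustrative assumptions, not part of the recipe above):

```python
import math
from collections import defaultdict


def tf_vector(tokens):
    """Sparse term-frequency vector: dict mapping term -> count / total count."""
    counts = defaultdict(int)
    for term in tokens:
        counts[term] += 1
    total = float(len(tokens))
    return {term: count / total for term, count in counts.items()}


def idf_weights(documents):
    """log(1 + N / df) weight for every term seen in the corpus."""
    df = defaultdict(int)
    for tokens in documents:
        for term in set(tokens):
            df[term] += 1
    n_docs = float(len(documents))
    return {term: math.log(1.0 + n_docs / freq) for term, freq in df.items()}


def tfidf_vector(tokens, idf):
    """TF-log(1 + IDF) vector as a sparse dict; terms unseen at fit time get weight 0."""
    return {term: tf * idf.get(term, 0.0) for term, tf in tf_vector(tokens).items()}


def category_centroids(labeled_docs, idf):
    """Average the TF-log(1 + IDF) vectors of all documents sharing a label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for tokens, label in labeled_docs:
        counts[label] += 1
        for term, weight in tfidf_vector(tokens, idf).items():
            sums[label][term] += weight
    return {label: {t: w / counts[label] for t, w in vec.items()}
            for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids, idf):
    """Pick the category whose centroid is most similar to the document vector."""
    vec = tfidf_vector(tokens, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy usage with two categories; in practice you would have your 150 categories.
labeled = [("the cat sat on the mat".split(), "pets"),
           ("stock prices fell sharply today".split(), "finance")]
idf = idf_weights([tokens for tokens, _ in labeled])
centroids = category_centroids(labeled, idf)
print(classify("my cat is on the mat".split(), centroids, idf))  # -> pets
```

And a sketch of the scipy.sparse plus scikit-learn route, assuming a current scikit-learn where the L1-penalised liblinear model is spelled LogisticRegression(penalty="l1", solver="liblinear") (this may differ from the API used in the linked example script); the hashing width and toy data are again only illustrative:

```python
import math
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

N_FEATURES = 2 ** 20  # abs(hash(term)) is folded into a fixed-width index space


def hashed_tfidf(documents, n_train, df):
    """Pack TF-log(1 + IDF) vectors into a scipy.sparse CSR matrix via hashing."""
    rows, cols, vals = [], [], []
    for i, tokens in enumerate(documents):
        counts, total = Counter(tokens), float(len(tokens))
        for term, count in counts.items():
            if df[term] == 0:  # term unseen in the training corpus: drop it
                continue
            rows.append(i)
            cols.append(abs(hash(term)) % N_FEATURES)
            vals.append((count / total) * math.log(1.0 + n_train / df[term]))
    return csr_matrix((vals, (rows, cols)), shape=(len(documents), N_FEATURES))


# Toy labeled corpus; in practice this is your 150-category training set.
train_docs = ["the cat sat on the mat".split(),
              "dogs chase the cat around".split(),
              "stock prices fell sharply today".split(),
              "the market rallied on strong earnings".split()]
train_labels = ["pets", "pets", "finance", "finance"]

n_train = len(train_docs)
df = Counter(term for tokens in train_docs for term in set(tokens))
X_train = hashed_tfidf(train_docs, n_train, df)

# L1-penalised logistic regression backed by liblinear.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, train_labels)

# Label an unseen document, reusing the training-set document frequencies.
X_new = hashed_tfidf(["my cat chased the dogs".split()], n_train, df)
print(clf.predict(X_new))

# sklearn.metrics routines give precision / recall / f1 for a model and a dataset.
print(classification_report(train_labels, clf.predict(X_train)))
```

The hashing trick avoids keeping a term-to-index vocabulary in memory, at the cost of occasional collisions, which is usually an acceptable trade-off for document classification.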