StackOverflow2013

plurals

POText classification using SVM works with unigrams but not higher order n-grams
primarykey
Id
9789555
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
7
CommunityOwnedDate
CreationDate
2012-03-20T15:17:13.547
FavoriteCount
3
LastActivityDate
2012-03-20T15:17:13.547
LastEditDate
LastEditorUserId
0
OwnerUserId
613076
ParentId
0
PostTypeId
1
Score
0
ViewCount
2012
LastEditorDisplayName
text
Body
I'm using LibSVM (in Java, fwiw) to classify text samples into one of two categories: English or Spanish. I'm training on three texts in each language, for a total of roughly 50,000 words each. I then test on a number of shorter texts and check that they are classified correctly. Some of the test data is drawn from the training data (trivial, but done as a sanity check) and the rest is new. To build the SVM vectors, I parse the text into n-grams and then hash these n-grams to get numerical representations. For instance, the following vector: <pre><code>2.0 1:9.0 2:3.0 3:1.0 4:7.0 5:4.0 ... </code></pre> with label 2 means that 9 n-grams hashed to value 1, 3 n-grams hashed to value 2, and so on. This works well with unigrams, but as soon as I switch to bigrams or higher-order n-grams, classification fails entirely. The size of my feature set is bounded at 4999 (i.e., I mod each hash so that it is no bigger than 4999). I've tried increasing and decreasing this bound, but to no avail. Does anybody know where the problem might be coming from? Might my corpora be too small, or is it a problem with my approach to tokenization / building feature vectors? Thanks in advance for your help.
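For reference, the feature construction described above could look roughly like the sketch below. It is only an illustration under several assumptions that the question does not state: whitespace tokenization, lowercasing, Java's built-in String.hashCode() as the hash, and 1-based bucket indices. The class name NgramFeatures and all of its methods are hypothetical, not part of LibSVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NgramFeatures {
    // Assumed bound on the feature index, taken from the 4999 mentioned in the question.
    static final int NUM_BUCKETS = 4999;

    /** Splits text on whitespace and joins every run of n consecutive tokens. */
    public static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    /** Maps an n-gram to a feature index in [1, NUM_BUCKETS]; floorMod avoids negative values. */
    public static int bucket(String ngram) {
        return Math.floorMod(ngram.hashCode(), NUM_BUCKETS) + 1;
    }

    /** Counts n-grams per bucket; the TreeMap keeps indices in ascending order. */
    public static Map<Integer, Integer> counts(String text, int n) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String gram : ngrams(text, n)) {
            counts.merge(bucket(gram), 1, Integer::sum);
        }
        return counts;
    }

    /** Formats one sample as a sparse line: "label index:value index:value ...". */
    public static String toLibsvmLine(int label, Map<Integer, Integer> counts) {
        StringBuilder sb = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "the cat sat on the mat";
        System.out.println(toLibsvmLine(1, counts(sample, 1))); // unigram features
        System.out.println(toLibsvmLine(1, counts(sample, 2))); // bigram features
    }
}
```

Printing the unigram and bigram lines for the same sample is a quick way to compare how the counts spread across buckets in each case, and to check that the bigram vectors are not ending up empty or nearly identical across samples.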
Tags
<classification><svm><libsvm><n-gram><document-classification>
Title
Text classification using SVM works with unigrams but not higher order n-grams
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USGeoffroy
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POText classification using SVM works with unigrams but not higher order n-grams
 UserUserId
 USRyan R. Rosario
 VoteTypeVoteTypeId
 VTFavorite
2. VO
 singulars
 PostPostId
 POText classification using SVM works with unigrams but not higher order n-grams
 UserUserId
 USFrank Visaggio
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
