Best scikit classifier for text classification task
<p>I am using scikit-learn to classify short phrases by their meaning. Some examples:</p>
<pre><code>"Yes" - label.yes
"Yeah" - label.yes
...
"I don't know" - label.i_don't_know
"I am not sure" - label.i_don't_know
"I have no idea" - label.i_don't_know
</code></pre>
<p>Everything worked pretty well using TfidfVectorizer and a MultinomialNB classifier.</p>
<p>The problem occurred when I added a new text/label pair:</p>
<pre><code>"I" - label.i
</code></pre>
<p>Predicting the class for "I" still returns label.i_don't_know, even though that exact text is in the training data. This is probably because the unigram "I" occurs more often in label.i_don't_know than in label.i.</p>
<p>Is there a classifier that gives comparable or better performance on this task and guarantees that every training example is predicted correctly?</p>
<p>This code illustrates the problem further:</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Instantiate classifier and vectorizer
clf = MultinomialNB(alpha=.01)
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 2))

# Apply vectorizer to training data
traindata = ['yes', 'yeah', 'i do not know', 'i am not sure', 'i have no idea', 'i']
X_train = vectorizer.fit_transform(traindata)

# Label ids: 0 = label.yes, 1 = label.i_don't_know, 2 = label.i
y_train = [0, 0, 1, 1, 1, 2]

# Train classifier
clf.fit(X_train, y_train)

print(clf.predict(vectorizer.transform(['i'])))
</code></pre>
<p>The code outputs label 1, but the correct classification would be label 2.</p>
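One thing that may be worth checking before switching classifiers is how the phrases are tokenized. scikit-learn's text vectorizers default to `token_pattern=r"(?u)\b\w\w+\b"`, which keeps only tokens of two or more word characters. The sketch below applies that pattern with Python's `re` module rather than scikit-learn itself, so it only approximates what the vectorizer does (n-gram generation is left out). The helper name `tokenize` and the constant names are hypothetical, introduced for illustration.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# only tokens with at least two word characters are kept.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Assumed alternative that also keeps single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical helper: lowercase the text and extract unigram tokens."""
    return re.findall(pattern, text.lower())


traindata = ['yes', 'yeah', 'i do not know', 'i am not sure', 'i have no idea', 'i']

for phrase in traindata:
    print(repr(phrase), '->', tokenize(phrase))

# With the default pattern the phrase 'i' yields no tokens at all.
print(tokenize('i'))                       # []
print(tokenize('i', SINGLE_CHAR_PATTERN))  # ['i']
```

If this is what happens in the example, the phrase "i" maps to an all-zero feature vector, and the prediction would be driven by the class priors, which favor label 1 with its three training examples. Passing a pattern such as `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` is one way to keep single-character tokens. Whether that alone changes the prediction here has not been verified.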