You should start by converting your documents into [TF-log(1 + IDF) vectors](http://en.wikipedia.org/wiki/Vector_space_model): term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.

Another solution is to use, for instance, abs(hash(term)) as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than Python dicts.

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for a new document to label, you can compute the [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity) between the document vector and each category vector and choose the most similar category as the label for your document.

If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as explained in [this example](http://github.com/ogrisel/scikit-learn/blob/master/examples/plot_logistic_l1_l2_coef.py) from [scikit-learn](http://scikit-learn.org/) (this is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.

For larger datasets, you should try [vowpal wabbit](http://github.com/JohnLangford/vowpal_wabbit), which is probably the fastest rabbit on earth for large-scale document classification problems (but there are no easy-to-use Python wrappers AFAIK).
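
As a rough sketch of the dict-based route (the str.split tokenization, the helper names such as tfidf_vector, and the toy two-category corpus below are illustrative assumptions, not part of the recipe above):

```python
import math
from collections import defaultdict


def tf_vector(tokens):
    """Sparse term-frequency vector: dict mapping term -> count / total count."""
    counts = defaultdict(int)
    for term in tokens:
        counts[term] += 1
    total = float(len(tokens))
    return {term: count / total for term, count in counts.items()}


def idf_weights(documents):
    """log(1 + N / df) weight for every term seen in the corpus."""
    df = defaultdict(int)
    for tokens in documents:
        for term in set(tokens):
            df[term] += 1
    n_docs = float(len(documents))
    return {term: math.log(1.0 + n_docs / freq) for term, freq in df.items()}


def tfidf_vector(tokens, idf):
    """TF-log(1 + IDF) vector as a sparse dict; terms unseen at fit time get weight 0."""
    return {term: tf * idf.get(term, 0.0) for term, tf in tf_vector(tokens).items()}


def category_centroids(labeled_docs, idf):
    """Average the TF-log(1 + IDF) vectors of all documents sharing a label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for tokens, label in labeled_docs:
        counts[label] += 1
        for term, weight in tfidf_vector(tokens, idf).items():
            sums[label][term] += weight
    return {label: {t: w / counts[label] for t, w in vec.items()}
            for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids, idf):
    """Pick the category whose centroid is most similar to the document vector."""
    vec = tfidf_vector(tokens, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy usage with two categories; in practice you would have your 150 categories.
labeled = [("the cat sat on the mat".split(), "pets"),
           ("stock prices fell sharply today".split(), "finance")]
idf = idf_weights([tokens for tokens, _ in labeled])
centroids = category_centroids(labeled, idf)
print(classify("my cat is on the mat".split(), centroids, idf))  # -> pets
```

And a sketch of the scipy.sparse plus scikit-learn route, assuming a current scikit-learn where the L1-penalised liblinear model is spelled LogisticRegression(penalty="l1", solver="liblinear") (this may differ from the API used in the linked example script); the hashing width and toy data are again only illustrative:

```python
import math
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

N_FEATURES = 2 ** 20  # abs(hash(term)) is folded into a fixed-width index space


def hashed_tfidf(documents, n_train, df):
    """Pack TF-log(1 + IDF) vectors into a scipy.sparse CSR matrix via hashing."""
    rows, cols, vals = [], [], []
    for i, tokens in enumerate(documents):
        counts, total = Counter(tokens), float(len(tokens))
        for term, count in counts.items():
            if df[term] == 0:  # term unseen in the training corpus: drop it
                continue
            rows.append(i)
            cols.append(abs(hash(term)) % N_FEATURES)
            vals.append((count / total) * math.log(1.0 + n_train / df[term]))
    return csr_matrix((vals, (rows, cols)), shape=(len(documents), N_FEATURES))


# Toy labeled corpus; in practice this is your 150-category training set.
train_docs = ["the cat sat on the mat".split(),
              "dogs chase the cat around".split(),
              "stock prices fell sharply today".split(),
              "the market rallied on strong earnings".split()]
train_labels = ["pets", "pets", "finance", "finance"]

n_train = len(train_docs)
df = Counter(term for tokens in train_docs for term in set(tokens))
X_train = hashed_tfidf(train_docs, n_train, df)

# L1-penalised logistic regression backed by liblinear.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, train_labels)

# Label an unseen document, reusing the training-set document frequencies.
X_new = hashed_tfidf(["my cat chased the dogs".split()], n_train, df)
print(clf.predict(X_new))

# sklearn.metrics routines give precision / recall / f1 for a model and a dataset.
print(classification_report(train_labels, clf.predict(X_train)))
```

The hashing trick avoids keeping a term-to-index vocabulary in memory, at the cost of occasional collisions, which is usually an acceptable trade-off for document classification.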