How to perform classification
<p>I'm trying to classify documents into two categories (category1 and category2) using Weka.</p>
<p>I've gathered a training set of 600 documents covering both categories, and the total number of documents to be classified is 1,000,000.</p>
<p>To perform the classification, I apply the StringToWordVector filter with the following options set to true:</p>
<ul>
<li>IDF transform</li>
<li>TF transform</li>
<li>OutputWordCounts</li>
</ul>
<p>I'd like to ask a few questions about this process.</p>
<p>1) How many documents should I use as the training set so that over-fitting is avoided?</p>
<p>2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it make no difference?</p>
<p>3) As the classification method I usually choose naiveBayes, but the results I get are the following:</p>
<pre><code>-------------------------
Correctly Classified Instances         393               70.0535 %
Incorrectly Classified Instances       168               29.9465 %
Kappa statistic                          0.415
Mean absolute error                      0.2943
Root mean squared error                  0.5117
Relative absolute error                 60.9082 %
Root relative squared error            104.1148 %
----------------------------
</code></pre>
<p>and if I use SMO the results are:</p>
<pre><code>------------------------------
Correctly Classified Instances         418               74.5098 %
Incorrectly Classified Instances       143               25.4902 %
Kappa statistic                          0.4742
Mean absolute error                      0.2549
Root mean squared error                  0.5049
Relative absolute error                 52.7508 %
Root relative squared error            102.7203 %
Total Number of Instances              561
------------------------------
</code></pre>
<p>So in document classification, which one is the "better" classifier? Which one is better for small data sets like the one I have? I've read that naiveBayes performs better with big data sets, but if I increase my data set, will that cause over-fitting? Also, regarding the Kappa statistic, is there an accepted threshold, or does it not matter in this case because there are only two categories?</p>
<p>Sorry for the long post, but I've been trying for a week to improve the classification results with no success, even though I tried to pick documents that fit each category better.</p>
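<p>To make the Kappa question more concrete, here is a minimal Python sketch (not Weka code) of how Cohen's kappa relates to plain accuracy, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the agreement expected by chance. The function names and the small confusion matrix are hypothetical and only for illustration. The accuracy and kappa values are the ones reported in the two summaries above.</p>

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


def implied_chance_agreement(accuracy, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    return (accuracy - kappa) / (1 - kappa)


# Accuracy and kappa as reported in the NaiveBayes and SMO summaries.
nb_chance = implied_chance_agreement(393 / 561, 0.415)
smo_chance = implied_chance_agreement(418 / 561, 0.4742)
print(f"NaiveBayes implied chance agreement: {nb_chance:.3f}")
print(f"SMO implied chance agreement:        {smo_chance:.3f}")

# Hypothetical 2x2 confusion matrix, used only to exercise cohen_kappa.
example = [[40, 10], [20, 30]]
print(f"Example kappa: {cohen_kappa(example):.3f}")
```

<p>In both runs the implied chance agreement comes out near 0.5, so the kappa values say that the classifiers recover roughly 40 to 50 percent of the gap between chance-level and perfect agreement. Kappa is defined the same way for two classes as for many, so having only two categories does not make it meaningless.</p>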