
  1. Why does Weka RandomForest give me a different result than Scikit RandomForestClassifier?
    <p>I am getting peculiar differences in results between WEKA and scikit while using the same RandomForest technique and the same dataset. With scikit I am getting an AUC around 0.62 (all the time, for I did extensive testing). However, with WEKA, I'm getting results close to 0.79. That's a huge difference!</p> <p>The dataset I tested the algorithms on is KC1.arff, of which I put a copy in my public dropbox folder <a href="https://dl.dropbox.com/u/30688032/KC1.arff" rel="nofollow">https://dl.dropbox.com/u/30688032/KC1.arff</a>. For WEKA, I simply downloaded the .jar file from <a href="http://www.cs.waikato.ac.nz/ml/weka/downloading.html" rel="nofollow">http://www.cs.waikato.ac.nz/ml/weka/downloading.html</a>. In WEKA, I set the cross-validation parameter to 10-fold, the dataset to KC1.arff, and the algorithm to "RandomForest -l 19 -K 0 -S 1", then ran it. Once you generate the results in WEKA, they can be saved as a .csv or .arff file. Read that file and check the column 'Area_under_ROC'; it should be somewhat close to 0.79.</p> <p>Below is the code for scikit's RandomForest:</p> <pre><code>import numpy as np
from pandas import *
from sklearn.ensemble import RandomForestClassifier

def read_arff(f):
    from scipy.io import arff
    data, meta = arff.loadarff(f)
    return DataFrame(data)

def kfold(clr, X, y, folds=10):
    from sklearn.cross_validation import StratifiedKFold
    from sklearn import metrics
    auc_sum = 0
    kf = StratifiedKFold(y, folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clr.fit(X_train, y_train)
        pred_test = clr.predict(X_test)
        print metrics.auc_score(y_test, pred_test)
        auc_sum += metrics.auc_score(y_test, pred_test)
    print 'AUC: ', auc_sum / folds
    print "----------------------------"

# read the dataset
X = read_arff('KC1.arff')
y = X['Defective']

# changes N and Y to 0 and 1 respectively
s = np.unique(y)
mapping = Series([x[0] for x in enumerate(s)], index=s)
y = y.map(mapping)
del X['Defective']

# initialize random forests (by default it is set to 10 trees)
rf = RandomForestClassifier()

# run algorithm
kfold(rf, np.array(X), y)
# You will get an average AUC around 0.62 as opposed to 0.79 in WEKA
</code></pre> <p>Please keep in mind that the real AUC value, as shown in relevant papers' experimental results, is around 0.79, so the problem lies in my implementation that uses the scikit random forests.</p> <p>Your kind help will be highly appreciated!!</p> <p>Thank you very much!</p>
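One hedged observation (my reading, not confirmed in the post): the scikit code above feeds hard 0/1 predictions into the AUC function, whereas WEKA's Area_under_ROC is computed from class probabilities, and scoring probabilities typically raises the reported AUC. A minimal sketch on synthetic data (not KC1), using the modern scikit-learn API since `sklearn.cross_validation` and `metrics.auc_score` no longer exist:

```python
# Sketch (assumption, not from the original post): comparing AUC computed
# from hard 0/1 predictions vs. from predicted probabilities, as WEKA does.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy, imbalanced binary problem standing in for KC1
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

auc_hard = roc_auc_score(y_te, rf.predict(X_te))              # thresholded labels
auc_prob = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # class probabilities

print(auc_hard, auc_prob)
```

If this hypothesis applies, `auc_prob` is the number comparable to WEKA's Area_under_ROC column.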
    1. First, you should make sure that you're using the same parameters for the RF implementation in scikit. Second, as the name suggests, there's some randomness associated with the results -- you mention that you did extensive testing, but it might not have been extensive enough. Third, the partition of your data will also affect the results. In particular, you should make sure that the folds you generate are stratified.
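A minimal sketch of the parameter matching this comment suggests, assuming the WEKA call was `RandomForest -I 19 -K 0 -S 1` (19 trees, default feature-subset size, seed 1); the mapping to scikit-learn is approximate, and the API shown is the modern one:

```python
# Approximate scikit-learn counterpart of WEKA "RandomForest -I 19 -K 0 -S 1".
# Assumptions: -I 19 -> 19 trees, -S 1 -> random seed 1. WEKA's -K 0 default
# uses log2(#attributes)+1 features per split, which has no exact scikit
# equivalent, so "sqrt" is used here as a rough stand-in.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=19,      # WEKA -I 19 (scikit's old default was only 10 trees)
    max_features="sqrt",  # rough stand-in for WEKA's -K 0 default
    random_state=1,       # WEKA -S 1
).fit(X, y)

print(len(rf.estimators_))
```

The tree-count mismatch alone (10 in old scikit defaults vs. 19 here) can move AUC noticeably, though probably not by 0.17.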
    2. I did do VERY extensive testing! With scikit, the values never exceeded 0.64, and the AUC values I get are always close to 0.57. With WEKA, I also did lots of testing, and I always get values close to 0.79, so I don't think randomness is the factor here. For both algorithms I used 10-fold cross-validation, which gave me the same results as a 70% training / 30% testing split, so I don't think my validation method is a factor either. However, you might be right about the parameters; I tried my best to set them to be the same, which is why I am asking if you can kindly find the flaw :) :)! Thank you!
    3. My wild guess is that your folds in scikit are not stratified.
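To illustrate the guess, a small sketch (toy labels, not the poster's KC1 data) contrasting plain and stratified folds with the modern `sklearn.model_selection` API:

```python
# With StratifiedKFold each test fold preserves the overall class ratio;
# with plain unshuffled KFold on sorted labels, folds collapse to a single
# class, which wrecks per-fold AUC. Toy imbalanced labels, not KC1.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 80 + [1] * 20)   # sorted, imbalanced labels
X = np.zeros((100, 1))              # features are irrelevant to the split

plain = [y[test].mean() for _, test in KFold(n_splits=10).split(X)]
strat = [y[test].mean() for _, test in StratifiedKFold(n_splits=10).split(X, y)]

print(plain)  # early folds are all class 0, the last folds all class 1
print(strat)  # every fold keeps the overall 20% positive rate
```

The question's own code does call `StratifiedKFold`, so if this guess holds, the issue would be in how the splits interact with the rest of the pipeline rather than the splitter choice itself.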