StackOverflow2013

Why does Weka RandomForest give me a different result than Scikit RandomForestClassifier?
<p>I am getting peculiar differences in results between WEKA and scikit while using the same RandomForest technique and the same dataset. With scikit I get an AUC around 0.62 (consistently; I tested extensively). With WEKA, however, I'm getting results close to 0.79. That's a huge difference!</p>
<p>The dataset I tested the algorithms on is KC1.arff, of which I put a copy in my public dropbox folder <a href="https://dl.dropbox.com/u/30688032/KC1.arff" rel="nofollow">https://dl.dropbox.com/u/30688032/KC1.arff</a>. For WEKA, I simply downloaded the .jar file from <a href="http://www.cs.waikato.ac.nz/ml/weka/downloading.html" rel="nofollow">http://www.cs.waikato.ac.nz/ml/weka/downloading.html</a>. In WEKA, I set the cross-validation parameter to 10-fold, the dataset to KC1.arff, and the algorithm to "RandomForest -l 19 -K 0 -S 1", then ran it. Once WEKA has generated the results, save them as a .csv or .arff file. Read that file and check the column 'Area_under_ROC'; it should be close to 0.79.</p>
<p>Below is the code for scikit's RandomForest:</p>
<pre><code>import numpy as np
from pandas import *
from sklearn.ensemble import RandomForestClassifier

def read_arff(f):
    from scipy.io import arff
    data, meta = arff.loadarff(f)
    return DataFrame(data)

def kfold(clr,X,y,folds=10):
    from sklearn.cross_validation import StratifiedKFold
    from sklearn import metrics
    auc_sum=0
    kf = StratifiedKFold(y, folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clr.fit(X_train, y_train)
        pred_test = clr.predict(X_test)
        print metrics.auc_score(y_test,pred_test)
        auc_sum+=metrics.auc_score(y_test,pred_test)
    print 'AUC: ', auc_sum/folds
    print "----------------------------"

#read the dataset
X=read_arff('KC1.arff')
y=X['Defective']

#changes N, and Y to 0, and 1 respectively
s = np.unique(y)
mapping = Series([x[0] for x in enumerate(s)], index = s)
y=y.map(mapping)
del X['Defective']

#initialize random forests (by default it is set to 10 trees)
rf=RandomForestClassifier()

#run algorithm
kfold(rf,np.array(X),y)

#You will get an average AUC around 0.62 as opposed to 0.79 in WEKA
</code></pre>
<p>Please keep in mind that the real AUC value, as shown in the experimental results of relevant papers, is around 0.79, so the problem lies in my implementation that uses the scikit random forests.</p>
<p>Your kind help will be highly appreciated!</p>
<p>Thank you very much!</p>
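One detail of the kfold function above that may be worth checking: it passes hard 0/1 predictions from predict() to the AUC metric, while ROC AUC is normally computed from a ranking score such as a class probability. The sketch below is not the original code; it assumes a current scikit-learn (sklearn.model_selection, roc_auc_score) and Python 3, and it uses synthetic stand-in data from make_classification instead of KC1.arff. The helper name cv_auc, its use_proba flag, and n_estimators=19 (chosen only to echo the 19 in the Weka option string) are illustrative assumptions. It reports the mean AUC both ways so the two can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cv_auc(clf, X, y, folds=10, use_proba=True, seed=1):
    """Mean ROC AUC over stratified folds (hypothetical helper)."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        if use_proba:
            # Score test rows by the predicted probability of class 1.
            y_score = clf.predict_proba(X[test_idx])[:, 1]
        else:
            # Hard 0/1 labels, as in the question's kfold function.
            y_score = clf.predict(X[test_idx])
        scores.append(roc_auc_score(y[test_idx], y_score))
    return float(np.mean(scores))


# Synthetic, imbalanced stand-in data (an assumption; this is not KC1.arff).
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

# n_estimators=19 is an illustrative choice, not a confirmed Weka equivalent.
rf = RandomForestClassifier(n_estimators=19, random_state=1)

auc_proba = cv_auc(rf, X, y, use_proba=True)
auc_labels = cv_auc(rf, X, y, use_proba=False)
print("AUC from probabilities:", auc_proba)
print("AUC from hard labels:  ", auc_labels)
```

To try this on the real data, load KC1.arff as in the question and pass the feature array and the 0/1 label array to cv_auc. Whether this accounts for the gap to WEKA's 0.79 would need to be confirmed on that dataset.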
