
  1. Why does Weka RandomForest give me a different result than Scikit RandomForestClassifier?
    <p>I am getting peculiar differences in results between WEKA and scikit while using the same RandomForest technique and the same dataset. With scikit I am getting an AUC around 0.62 (all the time, for I did extensive testing). However, with WEKA, I'm getting results close to 0.79. That's a huge difference!</p> <p>The dataset I tested the algorithms on is KC1.arff, of which I put a copy in my public dropbox folder <a href="https://dl.dropbox.com/u/30688032/KC1.arff" rel="nofollow">https://dl.dropbox.com/u/30688032/KC1.arff</a>. For WEKA, I simply downloaded the .jar file from <a href="http://www.cs.waikato.ac.nz/ml/weka/downloading.html" rel="nofollow">http://www.cs.waikato.ac.nz/ml/weka/downloading.html</a>. In WEKA, I set the cross-validation parameter to 10-fold, the dataset to KC1.arff, and the algorithm to "RandomForest -l 19 -K 0 -S 1", then ran it. Once you generate the results in WEKA, they can be saved as a .csv or .arff file. Read that file and check the column 'Area_under_ROC'; it should be somewhat close to 0.79.</p> <p>Below is the code for scikit's RandomForest:</p> <pre><code>import numpy as np
from pandas import *
from sklearn.ensemble import RandomForestClassifier

def read_arff(f):
    from scipy.io import arff
    data, meta = arff.loadarff(f)
    return DataFrame(data)

def kfold(clr, X, y, folds=10):
    from sklearn.cross_validation import StratifiedKFold
    from sklearn import metrics
    auc_sum = 0
    kf = StratifiedKFold(y, folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clr.fit(X_train, y_train)
        pred_test = clr.predict(X_test)
        print metrics.auc_score(y_test, pred_test)
        auc_sum += metrics.auc_score(y_test, pred_test)
    print 'AUC: ', auc_sum / folds
    print "----------------------------"

# read the dataset
X = read_arff('KC1.arff')
y = X['Defective']

# changes N and Y to 0 and 1 respectively
s = np.unique(y)
mapping = Series([x[0] for x in enumerate(s)], index=s)
y = y.map(mapping)
del X['Defective']

# initialize random forests (by default it is set to 10 trees)
rf = RandomForestClassifier()

# run algorithm
kfold(rf, np.array(X), y)
# You will get an average AUC around 0.62 as opposed to 0.79 in WEKA
</code></pre> <p>Please keep in mind that the real AUC value, as shown in relevant papers' experimental results, is around 0.79, so the problem lies in my implementation that uses the scikit random forests.</p> <p>Your kind help will be highly appreciated!!</p> <p>Thank you very much!</p>
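One hedged observation (my reading, not confirmed in the post): the scikit code above feeds hard 0/1 predictions into the AUC function, whereas WEKA's Area_under_ROC is computed from class probabilities, and scoring probabilities typically raises the reported AUC. A minimal sketch on synthetic data (not KC1), using the modern scikit-learn API since `sklearn.cross_validation` and `metrics.auc_score` no longer exist:

```python
# Sketch (assumption, not from the original post): comparing AUC computed
# from hard 0/1 predictions vs. from predicted probabilities, as WEKA does.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy, imbalanced binary problem standing in for KC1
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

auc_hard = roc_auc_score(y_te, rf.predict(X_te))              # thresholded labels
auc_prob = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # class probabilities

print(auc_hard, auc_prob)
```

If this hypothesis applies, `auc_prob` is the number comparable to WEKA's Area_under_ROC column.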
    1. First, you should make sure that you're using the same parameters for the RF implementation in scikit. Second, as the name suggests, there's some randomness associated with the results -- you mention that you did extensive testing, but it might not have been extensive enough. Third, the partition of your data will also affect the results. In particular, you should make sure that the folds you generate are stratified.
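A minimal sketch of the parameter matching this comment suggests, assuming the WEKA call was `RandomForest -I 19 -K 0 -S 1` (19 trees, default feature-subset size, seed 1); the mapping to scikit-learn is approximate, and the API shown is the modern one:

```python
# Approximate scikit-learn counterpart of WEKA "RandomForest -I 19 -K 0 -S 1".
# Assumptions: -I 19 -> 19 trees, -S 1 -> random seed 1. WEKA's -K 0 default
# uses log2(#attributes)+1 features per split, which has no exact scikit
# equivalent, so "sqrt" is used here as a rough stand-in.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=19,      # WEKA -I 19 (scikit's old default was only 10 trees)
    max_features="sqrt",  # rough stand-in for WEKA's -K 0 default
    random_state=1,       # WEKA -S 1
).fit(X, y)

print(len(rf.estimators_))
```

The tree-count mismatch alone (10 in old scikit defaults vs. 19 here) can move AUC noticeably, though probably not by 0.17.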
    2. I did do VERY extensive testing! With scikit, the values never exceeded 0.64, and the AUC values I get are always close to 0.57. With WEKA, I also did lots of testing, and I always get values close to 0.79, so I don't think randomness is the factor here. For both algorithms I used 10-fold cross-validation, which gave me the same results as a 70% training / 30% testing split, so I don't think my validation method is a factor either. However, you might be right about the parameters; I tried my best to set them to be the same, which is why I am asking if you can kindly find the flaw :) :)! Thank you!
    3. My wild guess is that your folds in scikit are not stratified.
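To illustrate the guess, a small sketch (toy labels, not the poster's KC1 data) contrasting plain and stratified folds with the modern `sklearn.model_selection` API:

```python
# With StratifiedKFold each test fold preserves the overall class ratio;
# with plain unshuffled KFold on sorted labels, folds collapse to a single
# class, which wrecks per-fold AUC. Toy imbalanced labels, not KC1.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 80 + [1] * 20)   # sorted, imbalanced labels
X = np.zeros((100, 1))              # features are irrelevant to the split

plain = [y[test].mean() for _, test in KFold(n_splits=10).split(X)]
strat = [y[test].mean() for _, test in StratifiedKFold(n_splits=10).split(X, y)]

print(plain)  # early folds are all class 0, the last folds all class 1
print(strat)  # every fold keeps the overall 20% positive rate
```

The question's own code does call `StratifiedKFold`, so if this guess holds, the issue would be in how the splits interact with the rest of the pipeline rather than the splitter choice itself.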