  1. New factor levels not present in the training data
    <p>When trying to use the output of <code>randomForest</code> to classify new data (or even the original training data), I get the following error:</p>
<pre><code>&gt; res.rf5 &lt;- predict(model.rf5, train.rf5)
Error in predict.randomForest(model.rf5, train.rf5) :
  New factor levels not present in the training data
</code></pre>
    <p>What does this error mean? Why does this error occur even when I try to predict the same data I used to train?</p>
    <p>A small example that can be used to reproduce the error is below.</p>
<pre><code>train.rf5 &lt;- structure(
  list(A = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L),
                     .Label = c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]"),
                     class = c("ordered", "factor")),
       B = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L),
                     .Label = c("1", "2", "4", "5"),
                     class = c("ordered", "factor")),
       C = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
                     .Label = c("FALSE", "TRUE"),
                     class = "factor")),
  .Names = c("A", "B", "C"),
  row.names = c(7L, 8L, 10L, 11L, 13L, 15L, 16L, 17L, 18L, 19L),
  class = "data.frame")

#              A B     C
# 7    (19.9,40] 4 FALSE
# 8  (-0.1,19.9] 1 FALSE
# 10 (-0.1,19.9] 1  TRUE
# 11 (-0.1,19.9] 1 FALSE
# 13 (-0.1,19.9] 1 FALSE
# 15 (-0.1,19.9] 1  TRUE
# 16  (80.1,100] 2  TRUE
# 17 (-0.1,19.9] 1 FALSE
# 18 (-0.1,19.9] 1 FALSE
# 19  (80.1,100] 5  TRUE

require(randomForest)
model.rf5 &lt;- randomForest(C ~ ., data = train.rf5)
res.rf5 &lt;- predict(model.rf5, train.rf5)  # Causes error
</code></pre>
    <p>I see some possibly related questions on SO, but I don't think they solve my issue directly.</p>
    <ol>
    <li><a href="https://stackoverflow.com/questions/1195826/dropping-factor-levels-in-a-subsetted-data-frame-in-r" title="dropping factor levels in a subsetted data frame in R">dropping factor levels in a subsetted data frame in R</a></li>
    <li><a href="https://stackoverflow.com/questions/17059432/random-forest-package-in-r-shows-error-during-prediction-if-there-are-new-fact">Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?</a></li>
    </ol>
    <p>Unlike 1), I do not have factor levels that are not represented in the data, and unlike 2), the factor levels in my train and test data are identical.</p>
    <p>Edit: Additional information:</p>
<pre><code>sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] randomForest_4.6-7

loaded via a namespace (and not attached):
[1] tools_3.0.1
</code></pre>