How to remove training data from party:::ctree models?
<p>I created several ctree models (about 40 to 80) that I want to evaluate rather often.</p>
<p>The problem is that the model objects are very large (40 models require more than 2.8 GB of memory), and it appears that they store the training data, perhaps as <code>modelname@data</code> and <code>modelname@responses</code>, rather than just the information needed to predict new data.</p>
<p>Most other R learning packages have an option controlling whether the data is included in the model object, but I couldn't find any hint of one in the documentation. I also tried to assign empty ModelEnv objects with</p>
<pre><code>modelname@data &lt;- new("ModelEnv")
</code></pre>
<p>but this had no effect on the size of the resulting RData file.</p>
<p>Does anyone know whether ctree really stores the training data, and how to remove everything from a ctree model that is irrelevant for new predictions, so that I can fit many of them in memory?</p>
<p>Thanks a lot,</p>
<p>Stefan</p>
<hr>
<p>Thank you for your feedback, that was already very helpful.</p>
<p>I used <code>dput</code> and <code>str</code> to take a deeper look at the object and found that no training data is included in the model, but there is a <code>responses</code> slot, which seems to hold the training labels and row names. I also noticed that each node holds a weight vector with one entry per training sample. After inspecting the code for a while, I ended up googling a bit and found the following comment in the <code>party</code> NEWS log:</p>
<pre><code> CHANGES IN party VERSION 0.9-13 (2007-07-23)

o update `mvt.f'
o improve the memory footprint of RandomForest objects substancially (by removing the weights slots from each node).
</code></pre>
<p>It turns out there is a C function in the party package, called <code>R_remove_weights</code>, that removes these weights. It is defined as follows:</p>
<pre><code>SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
    C_remove_weights(subtree, LOGICAL(removestats)[0]);
    return(R_NilValue);
}
</code></pre>
<p>It also works fine:</p>
<pre><code># cc is my model object
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")
.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")
</code></pre>
<p>As you can see, it reduces the object size substantially, from roughly 2.5 MB to 1.5 MB.</p>
<p>Strangely, though, the corresponding RData files are far larger than that, and removing the weights has no effect on their size:</p>
<pre><code>$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData
</code></pre>
<p>Decompressing the file shows that the 2.5 MB object occupies nearly 100 MB on disk:</p>
<pre><code>$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz
$ ls -lh cc_before*
-rw-r--r-- 1 user user 98M Aug 24 15:45 cc_before
</code></pre>
<p>Any ideas what could cause this?</p>
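One way to investigate the gap between the in-memory size and the file size, sketched under assumptions: `object.size` does not count the environment attached to a function, whereas `save` serializes every non-global environment an object references. The toy example below compares the two measures slot by slot. The `ToyModel` class and the `slot_sizes` and `make_predictor` helpers are hypothetical and not part of party; whether a ctree object behaves this way is an assumption that would need checking on the real model.

```r
library(methods)

# Hypothetical helper: report object.size() and serialized size for each slot
# of an S4 object. serialize(x, NULL) returns a raw vector, so its length is
# the uncompressed number of bytes that save() would have to write.
slot_sizes <- function(obj) {
  nms <- slotNames(obj)
  data.frame(
    slot = nms,
    object_size = vapply(nms, function(s) as.numeric(object.size(slot(obj, s))),
                         numeric(1), USE.NAMES = FALSE),
    serialized = vapply(nms, function(s) as.numeric(length(serialize(slot(obj, s), NULL))),
                        numeric(1), USE.NAMES = FALSE)
  )
}

# Hypothetical toy model: the returned closure never uses `training`,
# but its enclosing environment keeps the vector alive.
make_predictor <- function(n) {
  training <- rnorm(n)
  function(x) x + 1
}

setClass("ToyModel", representation(coef = "numeric", predict_fun = "function"))
m <- new("ToyModel", coef = c(1, 2, 3), predict_fun = make_predictor(1e6))

sizes <- slot_sizes(m)
print(sizes)
```

Running `slot_sizes(cc)` on the real model would show which slot, if any, serializes to far more bytes than its reported `object.size`.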