How to speed up summarise and ddply?
<p>I have a data frame with 2 million rows and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors) and get the weighted mean of 3 other columns, with weights defined by my data set. The following is reasonably quick:</p>
<pre><code>system.time(a2 &lt;- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
   user  system elapsed
 91.358   4.747 115.727
</code></pre>
<p>The problem is that I want to use weighted.mean instead of mean to calculate my aggregate columns.</p>
<p>If I try the following ddply on the same data frame (note that I cast it to an immutable data frame with idata.frame), it does not finish after 20 minutes:</p>
<pre><code>x &lt;- ddply(idata.frame(aggdf),
           c("fac1","fac2","fac3"),
           summarise,
           w=sum(w),
           col1=weighted.mean(col1, w),
           col2=weighted.mean(col2, w),
           col3=weighted.mean(col3, w))
</code></pre>
<p>This operation seems to be CPU-hungry, but not very RAM-intensive.</p>
<p>EDIT: I ended up writing this little function, which "cheats" a bit by taking advantage of a property of the weighted mean: it is the sum of the weighted values divided by the sum of the weights. The multiplication and the division are therefore done once on the whole object rather than on each slice, and only a plain sum is computed per group.</p>
<pre><code>weighted_mean_cols &lt;- function(df, bycols, aggcols, weightcol) {
    # Multiply every value column by the weight column, once, on the whole data frame
    df[,aggcols] &lt;- df[,aggcols]*df[,weightcol]
    # Sum the weights and the weighted values within each group
    # (df[bycols] stays a list of columns even when bycols has length 1)
    df &lt;- aggregate(df[,c(weightcol, aggcols)], by=df[bycols], sum)
    # Divide the weighted sums by the summed weights to get the weighted means
    df[,aggcols] &lt;- df[,aggcols]/df[,weightcol]
    df
}
</code></pre>
<p>When I run it as:</p>
<pre><code>a2 &lt;- weighted_mean_cols(aggdf, c("fac1","fac2","fac3"), c("col1","col2","col3"), "w")
</code></pre>
<p>I get good performance and somewhat reusable, elegant code.</p>
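<p>As a sanity check on the shortcut described in the EDIT, here is a minimal sketch on a tiny made-up data frame. The values and the helper column <code>wx</code> are assumptions for illustration only and are not taken from the original data. It compares the "weight once, sum per group, divide" result with calling weighted.mean on each group slice. It assumes there are no missing values and that the weights in every group sum to a non-zero value.</p>

```r
# Small made-up data frame (hypothetical values, for illustration only)
df <- data.frame(
  fac1 = c("a", "a", "b", "b", "b"),
  col1 = c(1, 3, 2, 4, 6),
  w    = c(1, 3, 1, 1, 2),
  stringsAsFactors = TRUE
)

# Shortcut: weight the whole column once, then sum per group and divide
df$wx <- df$col1 * df$w
sums <- aggregate(cbind(w, wx) ~ fac1, data = df, FUN = sum)
shortcut <- sums$wx / sums$w

# Reference: weighted.mean() applied to each group slice separately
pieces <- split(df, df$fac1)
reference <- vapply(pieces, function(d) weighted.mean(d$col1, d$w), numeric(1))

# Both approaches should agree up to floating-point error
stopifnot(isTRUE(all.equal(shortcut, unname(reference))))
```

<p>The per-group work in the shortcut is a plain sum, which is presumably why it scales better across 780,000 groups than calling weighted.mean on each slice.</p>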