Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Here's a much faster <code>data.table</code> function:</p> <pre><code>DATATABLE &lt;- function() { dt &lt;- data.table(events, key=c('custId', 'saleDate')) dt[, maxrow := 1:.N==.N, by = custId] return(dt[maxrow==TRUE, list(custId, DelivDate)]) } </code></pre> <p>Note that this function creates a <code>data.table</code> and sorts the data, which is a step you'd only need to perform once. If you remove this step (perhaps you have a multi-step data processing pipeline, and create the <code>data.table</code> once, as a first step), the function is more than twice as fast.</p> <p>I also modified all the previous functions to return the result, for easier comparison:</p> <pre><code>DDPLY &lt;- function() { return(ddply(events, .(custId), .inform = T, function(x) { x[x$saleDate == max(x$saleDate),"DelivDate"]})) } AGG1 &lt;- function() { return(merge(events, aggregate(saleDate ~ custId, events, max)))} SQLDF &lt;- function() { return(sqldf("select custId, DelivDate, max(saleDate) `saleDate` from events group by custId"))} DOCALL &lt;- function() { return(do.call(rbind, lapply(split(events, events$custId), function(x){ x[which.max(x$saleDate), ] }) )) } </code></pre> <p>Here's the results for 10k rows, repeated 10 times:</p> <pre><code>library(rbenchmark) library(plyr) library(data.table) library(sqldf) events &lt;- do.call(rbind, lapply(1:500, function(x) events22)) events$custId &lt;- sample(1:nrow(events), nrow(events)) benchmark(a &lt;- DDPLY(), b &lt;- DATATABLE(), c &lt;- AGG1(), d &lt;- SQLDF(), e &lt;- DOCALL(), order = "elapsed", replications=10)[1:5] test replications elapsed relative user.self 2 b &lt;- DATATABLE() 10 0.13 1.000 0.13 4 d &lt;- SQLDF() 10 0.42 3.231 0.41 3 c &lt;- AGG1() 10 12.11 93.154 12.03 1 a &lt;- DDPLY() 10 32.17 247.462 32.01 5 e &lt;- DOCALL() 10 56.05 431.154 55.85 </code></pre> <p>Since all the functions return their results, we can verify they all return the same answer:</p> <pre><code>c &lt;- c[order(c$custId),] dim(a); dim(b); dim(c); dim(d); dim(e) all(a$V1==b$DelivDate) all(a$V1==c$DelivDate) all(a$V1==d$DelivDate) all(a$V1==e$DelivDate) </code></pre> <p>/Edit: On the smaller, 20 row dataset, <code>data.table</code> is still the fastest, but by a thinner margin:</p> <pre><code> test replications elapsed relative user.self 2 b &lt;- DATATABLE() 100 0.22 1.000 0.22 3 c &lt;- AGG1() 100 0.42 1.909 0.42 5 e &lt;- DOCALL() 100 0.48 2.182 0.49 1 a &lt;- DDPLY() 100 0.55 2.500 0.55 4 d &lt;- SQLDF() 100 1.00 4.545 0.98 </code></pre> <p>/Edit2: If we remove the <code>data.table</code> creation from the function we get the following results:</p> <pre><code>dt &lt;- data.table(events, key=c('custId', 'saleDate')) DATATABLE2 &lt;- function() { dt[, maxrow := 1:.N==.N, by = custId] return(dt[maxrow==TRUE, list(custId, DelivDate)]) } benchmark(a &lt;- DDPLY(), b &lt;- DATATABLE2(), c &lt;- AGG1(), d &lt;- SQLDF(), e &lt;- DOCALL(), order = "elapsed", replications=10)[1:5] test replications elapsed relative user.self 2 b &lt;- DATATABLE() 10 0.09 1.000 0.08 4 d &lt;- SQLDF() 10 0.41 4.556 0.39 3 c &lt;- AGG1() 10 11.73 130.333 11.67 1 a &lt;- DDPLY() 10 31.59 351.000 31.50 5 e &lt;- DOCALL() 10 55.05 611.667 54.91 </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload