StackOverflow2013

<p>I, too, would recommend <code>data.table</code> here, but since you asked for an <code>aggregate</code> solution, here is one which combines <code>aggregate</code> and <code>merge</code> to get all the columns:</p>
<pre><code>merge(events22, aggregate(saleDate ~ custId, events22, max))
</code></pre>
<p>Or just <code>aggregate</code> if you only want the "custId" and "DelivDate" columns:</p>
<pre><code>aggregate(list(DelivDate = events22$saleDate),
          list(custId = events22$custId),
          function(x) events22[["DelivDate"]][which.max(x)])
</code></pre>
<p>Be careful with this one: <code>which.max(x)</code> returns a position within each customer's group, but it is used here to index the whole <code>DelivDate</code> column, so compare its output with the <code>merge</code> version before relying on it.</p>
<p>Finally, here's an option using <code>sqldf</code>:</p>
<pre><code>library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate`
       from events22 group by custId")
</code></pre>
<hr>
<h3>Benchmarks</h3>
<p>I'm not a benchmarking or <code>data.table</code> expert, but it surprised me that <code>data.table</code> is not faster here. <em>My suspicion is that the results would be quite different on a larger dataset</em>, for instance your 400k-line one. Anyway, here's some benchmarking code <a href="https://stackoverflow.com/a/13713220/1270695">modeled after @mnel's answer here</a> so you can run the tests on your actual dataset for future reference.</p>
<pre><code>library(rbenchmark)
</code></pre>
<p>First, set up a function for each approach you want to benchmark.</p>
<pre><code>library(plyr)        # for ddply
library(data.table)
# Assumed setup: dt is a data.table copy of events22
dt <- as.data.table(events22)

DDPLY <- function() {
  x <- ddply(events22, .(custId), .inform = TRUE, function(x) {
    x[x$saleDate == max(x$saleDate), "DelivDate"]
  })
}
DATATABLE <- function() {
  x <- dt[, .SD[which.max(saleDate), ], by = custId]
}
AGG1 <- function() {
  x <- merge(events22, aggregate(saleDate ~ custId, events22, max))
}
AGG2 <- function() {
  x <- aggregate(list(DelivDate = events22$saleDate),
                 list(custId = events22$custId),
                 function(x) events22[["DelivDate"]][which.max(x)])
}
SQLDF <- function() {
  x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate`
              from events22 group by custId")
}
DOCALL <- function() {
  do.call(rbind, lapply(split(events22, events22$custId), function(x) {
    x[which.max(x$saleDate), ]
  }))
}
</code></pre>
<p>Second, do the benchmarking.</p>
<pre><code>benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(),
          order = "elapsed")[1:5]
#          test replications elapsed relative user.self
# 4      AGG2()          100   0.285    1.000     0.284
# 3      AGG1()          100   0.891    3.126     0.896
# 6    DOCALL()          100   1.202    4.218     1.204
# 2 DATATABLE()          100   1.251    4.389     1.248
# 1     DDPLY()          100   1.254    4.400     1.252
# 5     SQLDF()          100   2.109    7.400     2.108
</code></pre>
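<p>If you don't have the real data handy, something like the following sketch could serve as a stand-in for <code>events22</code> and as a quick sanity check that two of the approaches above agree. The column names are taken from the code above, but the values, the sizes, the use of <code>Date</code> columns, and the helper names <code>agg1</code> and <code>ref</code> are made up for illustration and are not from the question. The sale dates are deliberately all distinct, because with ties the <code>merge</code> approach would keep every tied row while <code>which.max</code> keeps only the first.</p>
<pre><code># Hypothetical stand-in for events22: all values are invented.
set.seed(1)
n <- 1000
events22 <- data.frame(
  custId   = sample(1:100, n, replace = TRUE),
  # sample(n) is a permutation of 1..n, so every saleDate is distinct
  saleDate = as.Date("2012-01-01") + sample(n)
)
events22$DelivDate <- events22$saleDate + sample(1:10, n, replace = TRUE)

# The merge + aggregate approach
agg1 <- merge(events22, aggregate(saleDate ~ custId, events22, max))

# Reference: latest row per customer via split/lapply
ref <- do.call(rbind, lapply(split(events22, events22$custId), function(x) {
  x[which.max(x$saleDate), ]
}))

# Put both in custId order and check that they pick the same rows
agg1 <- agg1[order(agg1$custId), ]
ref  <- ref[order(ref$custId), ]
stopifnot(nrow(agg1) == nrow(ref),
          all(agg1$DelivDate == ref$DelivDate))
</code></pre>
<p>The same kind of check can be applied to any of the other functions before timing them, since a fast answer is only useful if it is also the right one.</p>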
