StackOverflow2013

POR use ddply or aggregate
<p>I have a data frame with 3 columns: custId, saleDate, DelivDate.</p>

<pre><code>> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-14 14:04:59 11/14/12 17:29
2 280367076 2012-11-14 17:04:44 11/14/12 20:48
3 280380097 2012-11-14 17:38:34 11/14/12 20:45
4 280380095 2012-11-14 20:45:44 11/14/12 23:59
5 280380095 2012-11-14 20:31:39 11/14/12 23:49
6 280380095 2012-11-14 19:58:32 11/15/12 00:10
</code></pre>

<p>Here's the dput:</p>

<pre><code>> dput(events22)
structure(list(custId = c(280356593L, 280367076L, 280380097L,
280380095L, 280380095L, 280380095L, 280364279L, 280364279L,
280398506L, 280336395L, 280364376L, 280368458L, 280368458L,
280368456L, 280368456L, 280364225L, 280391721L, 280353458L,
280387607L, 280387607L), saleDate = structure(c(1352901899.215,
1352912684.484, 1352914714.971, 1352925944.429, 1352925099.247,
1352923112.636, 1352922476.55, 1352920666.968, 1352915226.534,
1352911135.077, 1352921349.592, 1352911494.975, 1352910529.86,
1352924755.295, 1352907511.476, 1352920108.577, 1352906160.883,
1352905925.134, 1352916810.309, 1352916025.673), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), DelivDate = c("11/14/12 17:29",
"11/14/12 20:48", "11/14/12 20:45", "11/14/12 23:59", "11/14/12 23:49",
"11/15/12 00:10", "11/14/12 23:35", "11/14/12 22:59", "11/14/12 20:53",
"11/14/12 19:52", "11/14/12 23:01", "11/14/12 19:47", "11/14/12 19:42",
"11/14/12 23:31", "11/14/12 23:33", "11/14/12 22:45", "11/14/12 18:11",
"11/14/12 18:12", "11/14/12 19:17", "11/14/12 19:19")), .Names = c("custId",
"saleDate", "DelivDate"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), class = "data.frame")
</code></pre>

<p>I'm trying to find the <code>DelivDate</code> for the most recent <code>saleDate</code> for each <code>custId</code>.</p>

<p>I can do that using plyr::ddply like this:</p>

<pre><code>dd1 <- ddply(events22, .(custId), .inform = T, function(x) {
  x[x$saleDate == max(x$saleDate), "DelivDate"]
})
</code></pre>

<p>My question is whether there is a faster way to do this, as the ddply method is a bit time consuming (the full data set is ~400k lines). I've looked at using <code>aggregate()</code> but don't know how to get a value other than the one I'm sorting by.</p>

<p>Any suggestions?</p>

<p>EDIT:</p>

<p>Here are the benchmark results for 10k lines at 10 iterations:</p>

<pre><code>         test replications elapsed relative user.self
2      AGG2()           10    5.96    1.000      5.93
1      AGG1()           10   20.87    3.502     20.75
5 DATATABLE()           10   61.32   10.289     60.31
3     DDPLY()           10   80.04   13.430     79.63
4    DOCALL()           10   90.43   15.173     88.39
</code></pre>

<p>EDIT2: Although AGG2() is the quickest, it doesn't give the correct answer.</p>

<pre><code>> head(agg2)
     custId            saleDate      DelivDate
1 280336395 2012-11-14 16:38:55 11/14/12 19:52
2 280353458 2012-11-14 15:12:05 11/14/12 18:12
3 280356593 2012-11-14 14:04:59 11/14/12 17:29
4 280364225 2012-11-14 19:08:28 11/14/12 22:45
5 280364279 2012-11-14 19:47:56 11/14/12 23:35
6 280364376 2012-11-14 19:29:09 11/14/12 23:01
> agg2 <- AGG2()
> head(agg2)
     custId      DelivDate
1 280336395 11/14/12 17:29
2 280353458 11/14/12 17:29
3 280356593 11/14/12 17:29
4 280364225 11/14/12 17:29
5 280364279 11/14/12 17:29
6 280364376 11/14/12 17:29
> agg2 <- DDPLY()
> head(agg2)
     custId             V1
1 280336395 11/14/12 19:52
2 280353458 11/14/12 18:12
3 280356593 11/14/12 17:29
4 280364225 11/14/12 22:45
5 280364279 11/14/12 23:35
6 280364376 11/14/12 23:01
</code></pre>
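<p>For reference, one base-R way to pick a column other than the one being maximised is to sort by <code>custId</code> and <code>saleDate</code> and then keep the last row of each group. This is only a minimal sketch, not one of the benchmarked functions above: the <code>events</code> data frame and its values are a small hypothetical sample in the same shape as <code>events22</code>, and it has not been timed against the 400k-line data set.</p>

```r
# Hypothetical sample with the same columns as events22 (values are illustrative).
events <- data.frame(
  custId = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate = as.POSIXct(
    c("2012-11-14 14:04:59", "2012-11-14 17:04:44", "2012-11-14 17:38:34",
      "2012-11-14 20:45:44", "2012-11-14 20:31:39", "2012-11-14 19:58:32"),
    tz = "UTC"
  ),
  DelivDate = c("11/14/12 17:29", "11/14/12 20:48", "11/14/12 20:45",
                "11/14/12 23:59", "11/14/12 23:49", "11/15/12 00:10"),
  stringsAsFactors = FALSE
)

# Sort so that, within each custId, the most recent saleDate comes last.
ord <- events[order(events$custId, events$saleDate), ]

# duplicated(..., fromLast = TRUE) flags every row except the last one per custId,
# so negating it keeps exactly the row with the latest saleDate.
latest <- ord[!duplicated(ord$custId, fromLast = TRUE), c("custId", "DelivDate")]

print(latest)
```

<p>Unlike the <code>ddply</code> call, which returns every row tied for the maximum <code>saleDate</code>, this sketch keeps a single row per <code>custId</code>.</p>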
