
1. Sum duplicates then remove all but first occurrence
I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an `id` variable. I have another continuous variable `x`, whose values I would like to sum for each duplicate `id`. The observations are time dependent (there are `year` and `month` variables), and I'd like to keep the chronologically first observation of each duplicate `id` and add the subsequent dupes to that first observation.

I've included dummy data that resembles what I have (`dat1`), and a data set that shows the structure of my desired outcome (`outcome`).

I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for `x`, but I lose my year and month columns, which I need to retain for the first occurrence of each duplicate `id`. The second strategy doesn't sum the values of `x` correctly.

Any suggestions for how to get my desired outcome would be much appreciated.

```r
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))

# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))

# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))

# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]

# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE), .(id),
              summarise, x = sum(x))

# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes, all.x = TRUE),
                  FUN = sum)
```
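For reference, one way to get this kind of outcome (a sketch against the dummy data above; `dat5`, `x_sums`, and `dat6` are illustrative names, not from the original post) is to sum `x` per `id` while taking the `year` and `month` of each group's first row, relying on `dat1` already being in chronological order:

```r
library(plyr)

# plyr sketch: within each id, keep the first (earliest) year/month and sum x
dat5 <- ddply(dat1, .(id), summarise,
              year  = year[1],
              month = month[1],
              x     = sum(x))

# ddply orders its result by id; restore the first-appearance order if that matters
dat5 <- dat5[match(unique(dat1$id), dat5$id), ]

# base R equivalent: sum x per id, then attach the sums to the first row of each id
x_sums <- aggregate(x ~ id, data = dat1, FUN = sum)
dat6 <- dat1[!duplicated(dat1$id), ]
dat6$x <- x_sums$x[match(dat6$id, x_sums$id)]
```

Assuming the rows of `dat1` are ordered chronologically (as in the dummy data), both versions should reproduce the structure of `outcome`.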
 
