StackOverflow2013

plurals

POHow to split a data.table by groups and subset by occurrences in a column?
primarykey
Id
17943623
data
AcceptedAnswerId
17945725
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-07-30T09:46:05.950
FavoriteCount
0
LastActivityDate
2013-07-31T00:53:41.530
LastEditDate
2013-07-30T10:58:18.140
LastEditorUserId
1274242
OwnerUserId
1274242
ParentId
0
PostTypeId
1
Score
4
ViewCount
4540
LastEditorDisplayName
text
Body
I have a large dataset, 287046 x 18, that looks like this (only a partial representation):
<pre><code>tdf
         geneSymbol     peaks
16         AK056486 Pol2_only
13         AK310751   no_peak
7          BC036251   no_peak
10         DQ575786   no_peak
4          DQ597235   no_peak
5          DQ599768   no_peak
11         DQ599872   no_peak
12         DQ599872   no_peak
2           FAM138F   no_peak
15           FAM41C   no_peak
34116         GAPDH      both
283034        GAPDH Pol2_only
6      LOC100132062   no_peak
9      LOC100133331   no_peak
14     LOC100288069      both
8            M37726   no_peak
3             OR4F5   no_peak
17           SAMD11      both
18           SAMD11      both
19           SAMD11      both
20           SAMD11      both
21           SAMD11      both
22           SAMD11      both
23           SAMD11      both
24           SAMD11      both
25           SAMD11      both
1            WASH7P Pol2_only
</code></pre>
What I want to do is extract (1) the geneSymbols that are either "Pol2_only" or "both", and then (2) just the geneSymbols that are "Pol2_only" but not "both". For example, GAPDH would fulfil condition 1 but not 2. I've tried plyr with something like this (there is an extra condition there, please ignore it):
<pre><code>## grab genes with both peaks
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm)
  subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")),
  .parallel=TRUE)
## grab genes pol2 only peaks
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm)
  subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"),
  .parallel=TRUE)
</code></pre>
But it takes a long time and still returns the wrong answer. For instance, the answer for 2 is:
<pre><code>pol2.only.peaks
  geneSymbol     peaks
1   AK056486 Pol2_only
2      GAPDH Pol2_only
3     WASH7P Pol2_only
</code></pre>
As you can see, GAPDH should not be there because one of its rows is "both".
My implementation in data.table (which is much faster and thus preferred) yields the same result:
<pre><code>filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
   geneSymbol     peaks
1:   AK056486 Pol2_only
2:      GAPDH Pol2_only
3:     WASH7P Pol2_only
</code></pre>
The issue seems to be that the subsetting works on a row-by-row basis, whereas I need it to be applied to each geneSymbol subgroup as a whole. Could you please help me subset on the group? A data.table solution would be welcome because it is faster, but plyr (or even base R) is fine. A solution that adds an extra column noting the nature of the peak would be perfect. This is what I mean:
<pre><code>tdf
         geneSymbol     peaks    newCol
16         AK056486 Pol2_only Pol2_only
13         AK310751   no_peak   no_peak
7          BC036251   no_peak   no_peak
10         DQ575786   no_peak   no_peak
4          DQ597235   no_peak   no_peak
5          DQ599768   no_peak   no_peak
11         DQ599872   no_peak   no_peak
12         DQ599872   no_peak   no_peak
2           FAM138F   no_peak   no_peak
15           FAM41C   no_peak   no_peak
34116         GAPDH      both      both
283034        GAPDH Pol2_only      both
6      LOC100132062   no_peak   no_peak
9      LOC100133331   no_peak   no_peak
14     LOC100288069      both      both
8            M37726   no_peak   no_peak
3             OR4F5   no_peak   no_peak
17           SAMD11      both      both
18           SAMD11      both      both
19           SAMD11      both      both
20           SAMD11      both      both
21           SAMD11      both      both
22           SAMD11      both      both
23           SAMD11      both      both
24           SAMD11      both      both
25           SAMD11      both      both
1            WASH7P Pol2_only Pol2_only
</code></pre>
Notice again that GAPDH is now "both" in both of its rows.
Here is the data:
<pre><code>dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251",
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F",
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069",
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11",
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only",
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak",
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak",
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both",
"both", "both", "both", "both", "both", "both", "Pol2_only")),
.Names = c("geneSymbol", "peaks"), row.names = c(16L, 13L, 7L, 10L,
4L, 5L, 11L, 12L, 2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")
</code></pre>
Thank you!
Edit: I've found a workaround for the problem. The selection was being done row by row. All that is needed is a hack: require that ALL values in the returned logical vector are TRUE. So here is what I did with the plyr function:
<pre><code>ddply(tdf, .(geneSymbol), function(dfrm)
  subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")),
  .parallel=TRUE)
  geneSymbol     peaks
1   AK056486 Pol2_only
2     WASH7P Pol2_only
</code></pre>
Note the use of all() around the conditions. Now the result is as expected, that is, "Pol2_only" only (redundancy alert) genes :) What is still left to be done is the implementation in data.table, which I tried but failed to do. Any help? I have not written an answer to my own question, in the hope that someone comes along with a better solution in data.table.
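For illustration, the group-wise idea described above (evaluate the condition once per geneSymbol rather than once per row) could be sketched in data.table roughly as follows. This is a minimal sketch, not the accepted answer: the small `tdf` below is a hand-picked subset of the posted data, `label_group` is a hypothetical helper, and the precedence "both" over "Pol2_only" over "no_peak" is an assumption read off the desired newCol output.

```r
library(data.table)

# Small stand-in for tdf: a few rows copied from the dput() output above.
tdf <- data.frame(
  geneSymbol = c("AK056486", "GAPDH", "GAPDH", "SAMD11", "SAMD11",
                 "WASH7P", "AK310751"),
  peaks      = c("Pol2_only", "both", "Pol2_only", "both", "both",
                 "Pol2_only", "no_peak"),
  stringsAsFactors = FALSE
)
dt <- as.data.table(tdf)

# (1) Genes with at least one "Pol2_only" or "both" row.
# With by =, the if () condition is evaluated once per geneSymbol group;
# groups for which it is FALSE return NULL and are dropped.
pol2_any <- dt[, if (any(peaks %in% c("Pol2_only", "both"))) .SD,
               by = geneSymbol]

# (2) Genes whose rows are ALL "Pol2_only" (so GAPDH is excluded).
pol2_only <- dt[, if (all(peaks == "Pol2_only")) .SD, by = geneSymbol]

# Hypothetical helper: one label per group.
# Assumed precedence: "both" > "Pol2_only" > "no_peak".
label_group <- function(p) {
  if (any(p == "both")) "both"
  else if (any(p == "Pol2_only")) "Pol2_only"
  else "no_peak"
}

# Add the group-level label to every row of the group, by reference.
dt[, newCol := label_group(peaks), by = geneSymbol]
```

The key difference from the `.SD[...]` attempt is the `by = geneSymbol` argument combined with `all()` or `any()`, which collapses the row-wise logical vector into a single decision per group. Other peak values such as "CBP20_only" are not handled by the helper and would need their own rule.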
Tags
<r><split><data.table><plyr><subset>
Title
How to split a data.table by groups and subset by occurrences in a column?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USfridaymeetssunday
UserOwnerUserId
1. USfridaymeetssunday
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POHow to split a data.table by groups and subset by occurrences in a column?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POHow to split a data.table by groups and subset by occurrences in a column?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
