Note that there are some explanatory texts on larger screens.

plurals
  1. POHow to split a data.table by groups and use subset by occourences in a columns?
    primarykey
    data
    text
    <p>I have a large dataset, 287046 x 18, that looks like this (only a partial representation): </p> <pre><code>tdf geneSymbol peaks 16 AK056486 Pol2_only 13 AK310751 no_peak 7 BC036251 no_peak 10 DQ575786 no_peak 4 DQ597235 no_peak 5 DQ599768 no_peak 11 DQ599872 no_peak 12 DQ599872 no_peak 2 FAM138F no_peak 15 FAM41C no_peak 34116 GAPDH both 283034 GAPDH Pol2_only 6 LOC100132062 no_peak 9 LOC100133331 no_peak 14 LOC100288069 both 8 M37726 no_peak 3 OR4F5 no_peak 17 SAMD11 both 18 SAMD11 both 19 SAMD11 both 20 SAMD11 both 21 SAMD11 both 22 SAMD11 both 23 SAMD11 both 24 SAMD11 both 25 SAMD11 both 1 WASH7P Pol2_only </code></pre> <p>What I want to do is extract (1) the geneSymbols that are either "Pol2_only" or "both" and then; (2) just the geneSymbols that are "Pol2_only" but not "both". For example, GAPDH would fulfil condition 1 but not 2. </p> <p>I've tried plyr with something like this (there is an extra condition there, please ignore):</p> <pre><code>## grab genes with both peaks pol2.peaks &lt;- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" &amp; peaks == "CBP20_only")), .parallel=TRUE) ## grab genes pol2 only peaks pol2.only.peaks &lt;- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" &amp; peaks == "Pol2_only" &amp; peaks != "CBP20_only"), .parallel=TRUE) </code></pre> <p>But it takes a long time and still returns the wrong answer. For instance, the answer for 2 is:</p> <pre><code>pol2.only.peaks geneSymbol peaks 1 AK056486 Pol2_only 2 GAPDH Pol2_only 3 WASH7P Pol2_only </code></pre> <p>As you can see GAPDH should not be there. My implementation in data.table (which is much prefer and thus preferred) also yields the same result:</p> <pre><code>filem.dt &lt;- as.data.table(tdf) setkey(filem.dt, "geneSymbol") test.dt &lt;- filem.dt[ , .SD[ peaks != "both" &amp; peaks == "Pol2_only" &amp; peaks != "CBP20_only"]] test.dt geneSymbol peaks 1: AK056486 Pol2_only 2: GAPDH Pol2_only 3: WASH7P Pol2_only </code></pre> <p>The issue seems to be that the subsetting is working on a row-by-row basis whereas, I need it to be applied on the subgroup of geneSymbol as a whole. </p> <p>Could please help me subset on the group? A data.table solution would be welcome because it is faster but plyr (or even base R) is fine. A solution that adds an extra column noting the nature of the peak would be perfect. This is what I mean:</p> <pre><code>tdf geneSymbol peaks newCol 16 AK056486 Pol2_only Pol2_only 13 AK310751 no_peak no_peak 7 BC036251 no_peak no_peak 10 DQ575786 no_peak no_peak 4 DQ597235 no_peak no_peak 5 DQ599768 no_peak no_peak 11 DQ599872 no_peak no_peak 12 DQ599872 no_peak no_peak 2 FAM138F no_peak no_peak 15 FAM41C no_peak no_peak 34116 GAPDH both both 283034 GAPDH Pol2_only both 6 LOC100132062 no_peak no_peak 9 LOC100133331 no_peak no_peak 14 LOC100288069 both both 8 M37726 no_peak no_peak 3 OR4F5 no_peak no_peak 17 SAMD11 both both 18 SAMD11 both both 19 SAMD11 both both 20 SAMD11 both both 21 SAMD11 both both 22 SAMD11 both both 23 SAMD11 both both 24 SAMD11 both both 25 SAMD11 both both 1 WASH7P Pol2_only Pol2_only </code></pre> <p>Notice again the GAPDH that is now "both" in the 2 rows. Here is the data:</p> <pre><code>dput(tdf) structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", "DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", "FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", "M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", "no_peak", "both", "no_peak", "no_peak", "both", "both", "both", "both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", "peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame") </code></pre> <p>Thank you!</p> <p><em><strong></em> edit **</strong> I've found a workaround for the problem. The selection was being done row-by-row. All it is needed is a hack, that is, that in the logical vector that is returned ALL values are true. So here is what I did with the plyr function:</p> <pre><code>ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" &amp; peaks == "Pol2_only" &amp; peaks != "CBP20_only")), .parallel=TRUE) geneSymbol peaks 1 AK056486 Pol2_only 2 WASH7P Pol2_only </code></pre> <p>Note the use of all in alongside the conditions. Now the results is the expected, that is, "Pol2_only" only (redundancy alert) genes :) What is still left to be done is the implementation in data.table which I tried but failed to do. Any help? </p> <p>I have not written an answer to my question in expectation that someone comes along with a better solution in data.table.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload