Note that there are some explanatory texts on larger screens.

plurals
  1. POSpeed up 3-column R search in large data.frame
    primarykey
    data
    text
    <p>Alright. I've heavily edited this question to a) make it make more sense, and b) reflect where I'm at with the problem.</p> <p>I have two data sets -- let's call them set1 and set2 -- each of about 6 million rows. Currently, I have them loaded into R as data.tables.</p> <pre><code>&gt;set1&lt;-data.table(read.csv('~/file1.csv', stringsAsFactors=F)) &gt;setkey(set1, id1) &gt;head(set1) id1 start_unixtime end_unixtime seconds_diff id2 1: 1674 1354741858 1354741858 0 227167461 2: 1674 1354752386 1354752951 565 227246263 3: 1674 1354764412 1354764412 0 227358796 4: 1674 1354773044 1354773776 732 227421295 5: 1674 1354778651 1354778651 0 227448774 6: 1674 1354810424 1354810424 0 227631113 &gt;set2&lt;-data.table(read.csv('~/file2.csv', stringsAsFactors=F)) &gt;setkey(set2, id1) &gt;head(set2) id1 unix_timestamp event_name 1: 1674 1355202784 join 2: 1674 1354351118 join 3: 1674 1354349648 play 4: 1674 1354780517 join 5: 1674 1355278891 join 6: 1674 1354617262 join </code></pre> <p>One problematic detail to point out: set2 has no unique keys. Only the tuple of each row is actually unique. In set1, id2 is unique. Fun times!</p> <p>The operation I'm performing goes like this: for every row in <code>set2</code>, I need to take the unix_timestamp, find the row in <code>set1</code> where <code>start_unixtimestamp &lt;= unix_timestamp &lt;= end_unixtimestamp</code> and id1 matches, then assign the corresponding <code>set1.id2</code> to the appropriate row in <code>set2</code>. Every row in <code>set2</code> has an entry in <code>set1</code>, but not every row in <code>set1</code> has an entry in <code>set2</code>. One <code>id2</code> can be assigned to many rows in <code>set2</code>. What I need to wind up with is this (NOTE: the following data is fake, as I haven't been able to produce any actual success yet.):</p> <pre><code>&gt;head(set2) id1 unix_timestamp event_name id2 1: 1674 1355202784 join 227167461 2: 1674 1354351118 join 227157309 3: 1674 1354349648 play 227157309 4: 1674 1354780517 join 227157309 5: 1674 1355278891 join 271089456 6: 1674 1354617262 join 221729485 </code></pre> <p>Here is a <em>mess</em> of data table I've cooked up:</p> <pre><code>set2[, id2 := set1[set2[, id1], list(start_unixtime, end_unixtime, id2)][(start_unixtime &lt;= unix_timestamp &amp; unix_timestamp &lt;= end_unixtime), id2, by=id2]][, list(id2)][, id2:= id2] </code></pre> <p>To talk through what I understand to be going on:</p> <ol> <li><code>set2</code> calls the assignment operator <code>:=</code></li> <li>The right hand side invokes <code>set1</code>, which starts by <code>joining</code> id1 rows from set2.</li> <li>The columns <code>start_unixtime</code>, <code>end_unixtime</code>, and <code>id2</code> are selected.</li> <li>From that result, a second set of selections are done, which gets <code>id2</code> where the <code>utc_timestamp</code> of <code>id2</code> is between <code>start_unixtime</code> and <code>end_unixtime</code>.</li> <li>...and here, I think I'm doing something badly wrong -- because at this step, I always seem to have two columns, each labeled <code>id2</code> and containing identical results. So, I select one column...</li> <li>...and specify it for assignment. (I do not know why this is done twice. I found <a href="https://stackoverflow.com/questions/9705488/using-data-table-i-and-j-arguments-in-functions?rq=1">this SO post</a>, which uses this second <code>:=</code>, and <a href="https://stackoverflow.com/questions/10703299/r-using-data-table-operations-to-calculate-new-columns?rq=1">this one</a> which does not, and I have simply no idea why.</li> </ol> <p>...which does not work. @mnel has proposed something like:</p> <pre><code>set2[set1, nomatch=0][unix_timestamp %between c(start_unixtime, end_unixtime, incbounds=T)] </code></pre> <p>...which works when I try it with his test data, but not with my data. It occurs to me that my data might be of some type (character?) which <code>data.table</code> (or R period) might not be coercing properly? I might be dense, but I can't seem to figure out how to call <code>as.integer()</code> on a specified column of a <code>data.table</code>. </p> <p><strong>Edit</strong>: yeah, my data was all character, and I forgot that <code>data.table</code> inherits from <code>data.frame</code>. So, a little <code>set1$start_unixtime &lt;- as.integer($set1$start_unixtime)</code> and at least I'm sure everything is all integers. However, when I run that command, I still get this:</p> <pre><code>&gt;head(set2) Empty data.table (0 rows) of 8 cols: id1,utc_timestamp,event_name,start_unixtime,end_unixtime,seconds_diff... </code></pre> <p><strong>Addition</strong> Here are snippets of my actual data:</p> <pre><code>set1 &lt;- as.data.table(list(id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L), start_unixtime = c(1354741858L, 1354752386L, 1354764412L, 1354773044L, 1354778651L, 1354810424L), end_unixtime = c(1354741858L, 1354752951L, 1354764412L, 1354773776L, 1354778651L, 1354810424L), seconds_diff = c(0L, 565L, 0L, 732L, 0L, 0L), id2 = c(227167461L, 227246263L, 227358796L, 227421295L, 227448774L, 227631113L)) set2 &lt;- as.data.table(list( id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L), utc_timestamp = c(1354752431L, 1354780517L, 1354811978L, 1354824385L, 1354833271L, 1354862753L), event_name = c("joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2","joinRegularTable_2", "joinRegularTable_2")) </code></pre>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload