# Join results in more than 2^31 rows (internal vecseq reached physical limit)
I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)

Here is what my code looks like:

```r
library(parallel)
library(data.table)
```

`data1` has several million rows and 3 columns. The columns are `tag`, `prod` and `v`. There are 750K unique values of `tag`, anywhere from 1 to 1000 `prod`s per `tag`, and 5000 possible values for `prod`. `v` takes any positive real value.

```r
setkey(data1, tag)
merge(data1, data1, allow.cartesian=TRUE)
```

I get the following error:

> Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
> Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
> Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq

## New example showing by-without-by

```r
country = fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")
```

It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:

```r
setkey(country, country)
setkey(prod, prod, period)

# Really long one-liner that finds unique market pairs from the self-join,
# merges them with the second table and calculates covariances from the
# merged table.
covars <- setkey(
  setkey(
    unique(country[country, allow.cartesian=T][, c("prod", "prod.1"), with=F]),
    prod
  )[prod, allow.cartesian=T],
  prod.1, period
)[prod, ][, list(pcov = cov(value, value.1)), by=list(prod, prod.1)]

clevel <- setkey(country[country, allow.cartesian=T], prod, prod.1)[
  covars, nomatch=0][, list(countryvar = sum(share*share.1*pcov)), by="country"]

> clevel
   country countryvar
1:       1   2.858667
2:       2   1.869667
```

When I try this approach for any reasonable size of data, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the `j` specification would also be appreciated. (I am not sure how else to try the `J` specification, given that I have to compute covariances from the intersection of the two data tables.)
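In the spirit of the error message's advice to run `j` once per group instead of materializing the join, one way to sidestep the pairwise blow-up is to note that `sum(share*share.1*pcov)` over all ordered product pairs of a country is exactly the quadratic form `share' %*% cov(series) %*% share`. A minimal sketch against the toy tables above, not the author's code: it assumes every product has a complete value series over the same periods (true for this example), and `V` and `clevel2` are names of my own invention.

```r
library(data.table)

# Toy tables from the question.
country <- fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod <- fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")

# Period-by-product matrix of value series (rows = periods, cols = products).
V <- tapply(prod$value, list(prod$period, prod$prod), identity)

# One small j per country: share' %*% cov(series) %*% share reproduces
# sum(share*share.1*pcov) over all ordered product pairs, without ever
# building the pairwise self-join.
clevel2 <- country[, {
  S <- cov(V[, as.character(product), drop = FALSE])
  list(countryvar = c(t(share) %*% S %*% share))
}, by = country]

clevel2
#    country countryvar
# 1:       1   2.858667
# 2:       2   1.869667
```

The per-country covariance matrix is at most 5000 x 5000 here (one row/column per possible `prod`), so nothing in this sketch grows with the number of tag pairs.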