# Join results in more than 2^31 rows (internal vecseq reached physical limit)
I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)

Here is what my code looks like:

```r
library(parallel)
library(data.table)
```

`data1` has several million rows and 3 columns. The columns are `tag`, `prod` and `v`. There are 750K unique values of `tag`, anywhere from 1 to 1000 `prod`s per `tag`, and 5000 possible values for `prod`. `v` takes any positive real value.

```r
setkey(data1, tag)
merge(data1, data1, allow.cartesian=TRUE)
```

I get the following error:

> Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
> Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
> Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq

## New example showing by-without-by

```r
country = fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")
```

It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:

```r
setkey(country, country)
setkey(prod, prod, period)

# Really long one-liner that finds unique market pairs from the self-join,
# merges them with the second table and calculates covariances from the
# merged table.
covars <- setkey(
  setkey(
    unique(country[country, allow.cartesian=T][, c("prod", "prod.1"), with=F]),
    prod
  )[prod, allow.cartesian=T],
  prod.1, period
)[prod, ][, list(pcov = cov(value, value.1)), by=list(prod, prod.1)]

clevel <- setkey(country[country, allow.cartesian=T], prod, prod.1)[
  covars, nomatch=0][, list(countryvar = sum(share*share.1*pcov)), by="country"]

> clevel
   country countryvar
1:       1   2.858667
2:       2   1.869667
```

When I try this approach for any reasonable size of data, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the `j` specification would also be appreciated. (I am not sure how else to try the `J` specification, given that I have to compute covariances from the intersection of the two data tables.)
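In the spirit of the error message's advice to run `j` once per group instead of materializing the join, one way to sidestep the pairwise blow-up is to note that `sum(share*share.1*pcov)` over all ordered product pairs of a country is exactly the quadratic form `share' %*% cov(series) %*% share`. A minimal sketch against the toy tables above, not the author's code: it assumes every product has a complete value series over the same periods (true for this example), and `V` and `clevel2` are names of my own invention.

```r
library(data.table)

# Toy tables from the question.
country <- fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod <- fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")

# Period-by-product matrix of value series (rows = periods, cols = products).
V <- tapply(prod$value, list(prod$period, prod$prod), identity)

# One small j per country: share' %*% cov(series) %*% share reproduces
# sum(share*share.1*pcov) over all ordered product pairs, without ever
# building the pairwise self-join.
clevel2 <- country[, {
  S <- cov(V[, as.character(product), drop = FALSE])
  list(countryvar = c(t(share) %*% S %*% share))
}, by = country]

clevel2
#    country countryvar
# 1:       1   2.858667
# 2:       2   1.869667
```

The per-country covariance matrix is at most 5000 x 5000 here (one row/column per possible `prod`), so nothing in this sketch grows with the number of tag pairs.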