ff / ffbase merge creating rows
<p>I'm using the <code>ff</code> and <code>ffbase</code> packages to combine two <code>ffdf</code> objects, but when I use <code>merge</code>, the target <code>ffdf</code> grows from 1 million rows to 8 million rows.</p>
<p>ffdf1 is 1 million rows by 6 columns:</p>
<pre><code>&gt; summary(ffdf1)
       Length  Class     Mode
userid 1000000 ff_vector list
V2     1000000 ff_vector list
V3     1000000 ff_vector list
V4     1000000 ff_vector list
V5     1000000 ff_vector list
V6     1000000 ff_vector list
</code></pre>
<p>ffdf2 is ~20 million rows by 3 columns, like this:</p>
<pre><code>userid gender age
     1      1   3
     2      1   2
     3      2   5
     4      0   4
     5      2   3
   ...    ... ...
</code></pre>
<p>I use the following code to merge the two:</p>
<pre><code>ffdf3 &lt;- merge(ffdf1, ffdf2, by.x = "userid", by.y = "userid", all.x = TRUE, sort = FALSE)
</code></pre>
<p>The result is this:</p>
<pre><code>&gt; summary(ffdf3)
       Length  Class     Mode
userid 8000000 ff_vector list
V2     8000000 ff_vector list
V3     8000000 ff_vector list
V4     8000000 ff_vector list
V5     8000000 ff_vector list
V6     8000000 ff_vector list
gender 8000000 ff_vector list
age    8000000 ff_vector list
</code></pre>
<p>Any ideas why the length goes from 1 million to 8 million rows?</p>
<p>EDIT:</p>
<p>When I try this:</p>
<pre><code>ffdf3 &lt;- merge(ffdf1, ffdf2, by.x = "userid", by.y = "userid", all.x = FALSE, sort = FALSE)
</code></pre>
<p>I get:</p>
<pre><code>&gt; summary(ffdf3)
       Length Class     Mode
userid 740383 ff_vector list
V2     740383 ff_vector list
V3     740383 ff_vector list
V4     740383 ff_vector list
V5     740383 ff_vector list
V6     740383 ff_vector list
gender 740383 ff_vector list
age    740383 ff_vector list
</code></pre>
<p>Also, here is the output from running the merge:</p>
<pre><code>2012-05-13 14:49:06, x has 2 chunks, y has 8 chunks
2012-05-13 14:49:06, working on x chunk 1:500000
2012-05-13 14:49:07, working on y chunk 1:2958661
2012-05-13 14:49:16, working on y chunk 2958662:5917322
2012-05-13 14:49:32, working on y chunk 5917323:8875983
2012-05-13 14:49:45, working on y chunk 8875984:11834644
2012-05-13 14:49:57, working on y chunk 11834645:14793305
2012-05-13 14:50:09, working on y chunk 14793306:17751966
2012-05-13 14:50:20, working on y chunk 17751967:20710627
2012-05-13 14:50:30, working on y chunk 20710628:23669283
2012-05-13 14:50:40, working on x chunk 500001:1000000
2012-05-13 14:50:41, working on y chunk 1:2958661
2012-05-13 14:50:52, working on y chunk 2958662:5917322
2012-05-13 14:51:03, working on y chunk 5917323:8875983
2012-05-13 14:51:14, working on y chunk 8875984:11834644
2012-05-13 14:51:24, working on y chunk 11834645:14793305
2012-05-13 14:51:36, working on y chunk 14793306:17751966
2012-05-13 14:51:47, working on y chunk 17751967:20710627
2012-05-13 14:51:58, working on y chunk 20710628:23669283
</code></pre>
<p>Also, <code>ffdf1</code> contains 677840 unique <code>userid</code> values, so there are duplicates among its 1 million rows.</p>
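One general way a join can grow like this is repeated keys on both sides. The sketch below is a minimal illustration in base R, not `ff`/`ffbase`, and it assumes without confirmation that `userid` also repeats in `ffdf2`. The data frames `x` and `y` are hypothetical toy stand-ins for `ffdf1` and `ffdf2`. It shows that each left row is paired with every right row sharing its key, and that the resulting row count can be predicted from the key frequencies alone.

```r
# Hypothetical toy data, not the asker's: userid 1 repeats on both sides.
x <- data.frame(userid = c(1, 1, 2, 3), V2 = c("a", "b", "c", "d"))
y <- data.frame(userid = c(1, 1, 1, 2),
                gender = c(1, 1, 2, 0),
                age    = c(3, 3, 4, 2))

# Left join: every x row is paired with every y row that has the same userid.
left <- merge(x, y, by = "userid", all.x = TRUE, sort = FALSE)
nrow(left)  # 8, although x has only 4 rows

# Predict the row count from key frequencies alone.
x_counts <- table(x$userid)
y_counts <- table(y$userid)
shared   <- intersect(names(x_counts), names(y_counts))

# Matched keys contribute (rows in x) * (rows in y) each.
matched <- sum(as.numeric(x_counts[shared]) * as.numeric(y_counts[shared]))
# Keys found only in x contribute their x rows once (kept by all.x = TRUE).
unmatched <- sum(x_counts[setdiff(names(x_counts), shared)])
matched + unmatched  # 8, the left-join size; `matched` alone (7) is the inner-join size

# Keeping one y row per userid restores one output row per x row.
# Which duplicate to keep is a modelling decision; this keeps the first.
y_unique <- y[!duplicated(y$userid), ]
deduped  <- merge(x, y_unique, by = "userid", all.x = TRUE, sort = FALSE)
nrow(deduped)  # 4
```

Running the same frequency check on the real `userid` columns would show whether duplicated keys account for the jump. If they do, the right-hand table needs to be reduced to one row per `userid` before merging.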