<p>Both @Gavin and @Martin deserve credit for this answer, or at least for leading me in the right direction. I'm mostly answering separately to make it easier to read.</p>
<p>In the order I asked:</p>
<ol>
<li><p>Yes, 2^31 is a hard limit in <code>R</code>, though it seems to matter what type the vector is. That is a bit strange, given that the stated problem is the <em>length</em> of the vector rather than the amount of memory (which I have plenty of). Do <strong>not</strong> convert <code>strata</code> or <code>id</code> variables to <code>factors</code>; that will just fix their length and nullify the effects of subsetting (which is the way to get around this problem).</p></li>
<li><p><code>sql</code> could probably help, provided I learn how to use it correctly. I ran the following test:</p>
<pre><code>library(survey)    # provides svydesign, svyby, svymean
library(multicore) # make svy fast!

# Both states together: New York (36) and Rhode Island (44)
ri.ny &lt;- subset(ipums, statefips_num %in% c(36, 44))
ri.ny.design &lt;- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
svyby(~incwage, ~strata, ri.ny.design, svymean, data=ri.ny, na.rm=TRUE, multicore=TRUE)

# Rhode Island on its own
ri &lt;- subset(ri.ny, statefips_num==44)
ri.design &lt;- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
ri.mean &lt;- svymean(~incwage, ri.design, data=ri, na.rm=TRUE)

# New York on its own
ny &lt;- subset(ri.ny, statefips_num==36)
ny.design &lt;- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
ny.mean &lt;- svymean(~incwage, ny.design, data=ny, na.rm=TRUE, multicore=TRUE)
</code></pre>
<p>The means came out the same, which seems like a reasonable test.</p>
<p>So, in theory, provided I can split up the calculation using either <code>plyr</code> or <code>sql</code>, the results should still be fine.</p></li>
<li><p>See 2.</p></li>
<li><p>Throwing a lot of memory at <code>Stata</code> definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculations I want (much more quickly and with more stability as well), but I can't figure out how to get the output into the form I want. I will probably ask a separate question on this. I think the short version here is that for big survey data, <code>Stata</code> is much better out of the box.</p></li>
<li><p>In many ways, yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the <code>svydesign</code> function correctly, but I didn't really know what was going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. @Gavin's general suggestion of trying out small data with external results to compare against is invaluable, and something I should have started doing ages ago. Many thanks to both @Gavin and @Martin.</p></li>
</ol>
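<p>A minimal sketch of the same kind of check on toy data, using only base R rather than the <code>survey</code> package: the weighted mean of a subset should match the corresponding entry of a split-by-group calculation. The data frame <code>toy</code> and its values are made up for illustration; only the column names mirror the ones used above. It compares point estimates only and says nothing about standard errors.</p>

```r
# Hypothetical toy data: two states, with person weights (values are invented)
toy <- data.frame(
  statefips_num = c(36, 36, 36, 44, 44),
  incwage       = c(10, 20, 30, 40, 60),
  perwt         = c(1, 1, 2, 1, 3)
)

# Split-apply: weighted mean of incwage within each state
by_state <- sapply(
  split(toy, toy$statefips_num),
  function(d) weighted.mean(d$incwage, d$perwt)
)

# Subset first, then compute the weighted mean for one state only
ny <- subset(toy, statefips_num == 36)
ny_mean <- weighted.mean(ny$incwage, ny$perwt)

# The two routes should agree for the point estimate
stopifnot(isTRUE(all.equal(by_state[["36"]], ny_mean)))
```

<p>On real data the same comparison would be run on a small extract first, in the spirit of @Gavin's suggestion.</p>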
 
