Note that there are some explanatory texts on larger screens.

plurals
  1. POFrequency weighting in R, comparing results with Stata
    primarykey
    data
    text
    <p>I'm trying to analyze data from the University of Minnesota IPUMS dataset for the <a href="http://usa.ipums.org/usa/sampdesc.shtml#us1990a" rel="nofollow noreferrer">1990 US census</a> in <strong><code>R</code></strong>. I'm using the <a href="http://faculty.washington.edu/tlumley/survey/" rel="nofollow noreferrer"><code>survey</code></a> package because the data is <a href="http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Survey_weights" rel="nofollow noreferrer" title="Wikipedia&#39;s explanation of survey weighting.">weighted</a>. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of <code>hhincome</code> (<a href="http://internal.usa.ipums.org/usa-action/variables/HHINCOME" rel="nofollow noreferrer" title="Description of Household Income Variable">household income</a>). To do this I created a <strong>survey design object</strong> using the <a href="http://faculty.washington.edu/tlumley/survey/example-design.html" rel="nofollow noreferrer" title="Documentation of svydesign function"><code>svydesign()</code></a> function with the following code:</p> <pre><code>&gt; require(foreign) &gt; ipums.household &lt;- read.dta("/path/to/stata_export.dta") &gt; ipums.household[ipums.household$hhincome==9999999, "hhincome"] &lt;- NA # Fix missing &gt; ipums.hh.design &lt;- svydesign(id=~1, weights=~hhwt, data=ipums.household) &gt; svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE) mean SE [1,] 37029 17.365 </code></pre> <p>So far so good. However, I get a different standard error if I attempt the same calculation in <strong><code>Stata</code></strong> (using <a href="http://www.stanford.edu/~mrosenfe/soc_meth_proj3/Intro%20to%20STATA%20for%20Soc%20180.htm" rel="nofollow noreferrer" title="Introduction to STATA for Stanford undergrads, search for fweight">code meant for a different portion of the same dataset</a>):</p> <pre><code>use "C:\I\Hate\Backslashes\stata_export.dta" replace hhincome = . if hhincome == 9999999 (933734 real changes made, 933734 to missing) mean hhincome [fweight = hhwt] # The code from the link above. Mean estimation Number of obs = 91746420 -------------------------------------------------------------- | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ hhincome | 37028.99 3.542749 37022.05 37035.94 -------------------------------------------------------------- </code></pre> <p>And, looking at another way to skin this cat, the author of <code>survey</code>, has <a href="http://pj.freefaculty.org/R/Rtips.html#1.12" rel="nofollow noreferrer" title="Post from Thomas Lumley">this suggestion</a> for frequency weighting:</p> <pre><code>expanded.data&lt;-as.data.frame(lapply(compressed.data, function(x) rep(x,compressed.data$weights))) </code></pre> <p>However, I can't seem to get this code to work:</p> <pre><code>&gt; hh.dataframe &lt;- data.frame(ipums.household$hhincome, ipums.household$hhwt) &gt; expanded.hh.dataframe &lt;- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt))) Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument </code></pre> <p>Which I can't seem to fix. This may be related to <a href="http://r.789695.n4.nabble.com/error-with-source-invalid-times-value-td3234425.html" rel="nofollow noreferrer" title="Thread on invaid times bug">this issue</a>.</p> <p>So in sum:</p> <ol> <li>Why don't I get the same answers in <strong><code>Stata</code></strong> and <strong><code>R</code></strong>?</li> <li>Which one is right (or am I doing something wrong in both cases)?</li> <li>Assuming I got the <code>rep()</code> solution working, would that replicate <strong><code>Stata</code></strong>'s results?</li> <li>What's the <strong>right</strong> way to do it? Kudos if the answer allows me to use the <a href="http://had.co.nz/plyr/" rel="nofollow noreferrer" title="plyr homepage"><code>plyr</code></a> package for doing arbitrary calculations, rather than being limited to the functions implemented in <code>survey</code> (<code>svymean()</code>, <code>svyglm()</code> etc.)</li> </ol> <h1>Update</h1> <p>So after the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe here in case someone else has this problem in future.</p> <h2>Initial Stata Preparation</h2> <p>Since IPUMS don't currently publish scripts for importing their data into <strong><code>R</code></strong>, you'll need to start from <strong><code>Stata</code></strong>, <strong><code>SAS</code></strong>, or <strong><code>SPSS</code></strong>. I'll stick with <strong><code>Stata</code></strong> for now. Begin by running the import script from IPUMS. Then before continuing add the following variable:</p> <pre><code>generate strata = statefip*100000 + puma </code></pre> <p>This creates a unique integer for each <code>PUMA</code> of the form 240001, with first two digits as the state fip code (24 in the case of Maryland) and the last four a <code>PUMA</code> id which is unique on a per state basis. If you're going to use <strong><code>R</code></strong> you might also find it helpful to run this as well</p> <pre><code>generate statefip_num = statefip * 1 </code></pre> <p>This will create an additional variable without labels, since importing <code>.dta</code> files into <strong><code>R</code></strong> apply the labels and lose the underlying integers.</p> <h2>Stata and <code>svyset</code></h2> <p>As Keith explained, survey sampling is handled by <strong><code>Stata</code></strong> by invoking <code>svyset</code>.</p> <p>For an individual level analysis I now use:</p> <pre><code>svyset serial [pweight=perwt], strata(strata) </code></pre> <p>This sets the weighting to <code>perwt</code>, the stratification to the variable we created above, and uses the household <code>serial</code> number to account for clustering. If we were using multiple years, we might want to try</p> <pre><code>generate double yearserial = year*100000000 + serial </code></pre> <p>to account for longitudinal clustering as well. </p> <p>For household level analysis (without years):</p> <pre><code>svyset serial [pweight=hhwt], strata(strata) </code></pre> <p>Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing <code>serial</code> with <code>yearserial</code> will take into account a time series.</p> <h2>Doing it in <code>R</code></h2> <p>Assuming you're importing a <code>.dta</code> file with the additional <code>strata</code> variable explained above and analysing at the individual letter:</p> <pre><code>require(foreign) ipums &lt;- read.dta('/path/to/data.dta') require(survey) ipums.design &lt;- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt) </code></pre> <p>Or at the household level:</p> <pre><code>ipums.hh.design &lt;- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt) </code></pre> <p>Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload