StackOverflow2013

plurals

POFrequency weighting in R, comparing results with Stata
primarykey
Id
5446078
data
AcceptedAnswerId
5446328
AnswerCount
3
ClosedDate
CommentCount
4
CommunityOwnedDate
CreationDate
2011-03-26T23:20:37.847
FavoriteCount
6
LastActivityDate
2013-06-17T06:27:12.240
LastEditDate
2013-06-17T06:25:48.297
LastEditorUserId
1820446
OwnerUserId
678486
ParentId
0
PostTypeId
1
Score
15
ViewCount
5918
LastEditorDisplayName
text
Body
I'm trying to analyze data from the University of Minnesota IPUMS dataset for the <a href="http://usa.ipums.org/usa/sampdesc.shtml#us1990a" rel="nofollow noreferrer">1990 US census</a> in <code>R</code>. I'm using the <a href="http://faculty.washington.edu/tlumley/survey/" rel="nofollow noreferrer"><code>survey</code></a> package because the data is <a href="http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Survey_weights" rel="nofollow noreferrer" title="Wikipedia's explanation of survey weighting.">weighted</a>. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of <code>hhincome</code> (<a href="http://internal.usa.ipums.org/usa-action/variables/HHINCOME" rel="nofollow noreferrer" title="Description of Household Income Variable">household income</a>). To do this I created a survey design object using the <a href="http://faculty.washington.edu/tlumley/survey/example-design.html" rel="nofollow noreferrer" title="Documentation of svydesign function"><code>svydesign()</code></a> function with the following code: <pre><code>> require(foreign) > ipums.household <- read.dta("/path/to/stata_export.dta") > ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing > ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household) > svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE) mean SE [1,] 37029 17.365 </code></pre> So far so good. However, I get a different standard error if I attempt the same calculation in <code>Stata</code> (using <a href="http://www.stanford.edu/~mrosenfe/soc_meth_proj3/Intro%20to%20STATA%20for%20Soc%20180.htm" rel="nofollow noreferrer" title="Introduction to STATA for Stanford undergrads, search for fweight">code meant for a different portion of the same dataset</a>): <pre><code>use "C:\I\Hate\Backslashes\stata_export.dta" replace hhincome = . 
if hhincome == 9999999 (933734 real changes made, 933734 to missing) mean hhincome [fweight = hhwt] # The code from the link above. Mean estimation Number of obs = 91746420 -------------------------------------------------------------- | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ hhincome | 37028.99 3.542749 37022.05 37035.94 -------------------------------------------------------------- </code></pre> And, looking at another way to skin this cat, the author of <code>survey</code>, has <a href="http://pj.freefaculty.org/R/Rtips.html#1.12" rel="nofollow noreferrer" title="Post from Thomas Lumley">this suggestion</a> for frequency weighting: <pre><code>expanded.data<-as.data.frame(lapply(compressed.data, function(x) rep(x,compressed.data$weights))) </code></pre> However, I can't seem to get this code to work: <pre><code>> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt) > expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt))) Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument </code></pre> Which I can't seem to fix. This may be related to <a href="http://r.789695.n4.nabble.com/error-with-source-invalid-times-value-td3234425.html" rel="nofollow noreferrer" title="Thread on invaid times bug">this issue</a>. So in sum: <ol> <li>Why don't I get the same answers in <code>Stata</code> and <code>R</code>?</li> <li>Which one is right (or am I doing something wrong in both cases)?</li> <li>Assuming I got the <code>rep()</code> solution working, would that replicate <code>Stata</code>'s results?</li> <li>What's the right way to do it? 
Kudos if the answer allows me to use the <a href="http://had.co.nz/plyr/" rel="nofollow noreferrer" title="plyr homepage"><code>plyr</code></a> package for doing arbitrary calculations, rather than being limited to the functions implemented in <code>survey</code> (<code>svymean()</code>, <code>svyglm()</code> etc.)</li> </ol> <h1>Update</h1> After the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe it here in case someone else has this problem in future. <h2>Initial Stata Preparation</h2> Since IPUMS don't currently publish scripts for importing their data into <code>R</code>, you'll need to start from <code>Stata</code>, <code>SAS</code>, or <code>SPSS</code>. I'll stick with <code>Stata</code> for now. Begin by running the import script from IPUMS. Then, before continuing, add the following variable: <pre><code>generate strata = statefip*100000 + puma </code></pre> This creates a unique integer for each <code>PUMA</code> of the form 240001, with first two digits as the state FIPS code (24 in the case of Maryland) and the last four a <code>PUMA</code> id which is unique on a per state basis. If you're going to use <code>R</code> you might also find it helpful to run this as well: <pre><code>generate statefip_num = statefip * 1 </code></pre> This will create an additional variable without labels, since importing a <code>.dta</code> file into <code>R</code> applies the labels and loses the underlying integers. <h2>Stata and <code>svyset</code></h2> As Keith explained, survey sampling is handled in <code>Stata</code> by invoking <code>svyset</code>. For an individual level analysis I now use: <pre><code>svyset serial [pweight=perwt], strata(strata) </code></pre> This sets the weighting to <code>perwt</code>, the stratification to the variable we created above, and uses the household <code>serial</code> number to account for clustering. 
If we were using multiple years, we might want to try <pre><code>generate double yearserial = year*100000000 + serial </code></pre> to account for longitudinal clustering as well. For household level analysis (without years): <pre><code>svyset serial [pweight=hhwt], strata(strata) </code></pre> This should be self-explanatory (though I think in this case <code>serial</code> is actually superfluous). Replacing <code>serial</code> with <code>yearserial</code> will take into account a time series. <h2>Doing it in <code>R</code></h2> Assuming you're importing a <code>.dta</code> file with the additional <code>strata</code> variable explained above and analysing at the individual level: <pre><code>require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
# weights must be a formula (or a vector), so note the ~ before perwt
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~perwt)
</code></pre> Or at the household level: <pre><code>ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~hhwt) </code></pre> Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.
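To make the frequency-weighting part of the question concrete, here is a minimal base-R sketch on toy data. The data frame <code>toy</code> and its values are hypothetical (they are not IPUMS data), and the standard error shown is only the simple "expanded data" calculation, which is my assumption about what treating the weights as counts amounts to. It also shows a likely cause of the <code>invalid 'times' argument</code> error: a data frame built from <code>df$col</code> expressions gets columns named after those expressions, so <code>$hhwt</code> returns <code>NULL</code>.

```r
# Toy data: hypothetical incomes and integer weights (not from IPUMS).
toy <- data.frame(hhincome = c(20000, 35000, 50000, 80000),
                  hhwt     = c(3, 1, 4, 2))

# Building a data frame from toy$... expressions renames the columns,
# so the weight column is no longer reachable as $hhwt.
renamed <- data.frame(toy$hhincome, toy$hhwt)
names(renamed)            # "toy.hhincome" "toy.hhwt"
is.null(renamed$hhwt)     # TRUE, and rep(x, NULL) fails with "invalid 'times' argument"

# Expand by repeating row indices, keeping the original column names.
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$hhwt), , drop = FALSE]

# The point estimate is the same either way.
weighted_mean <- weighted.mean(toy$hhincome, toy$hhwt)
expanded_mean <- mean(expanded$hhincome)

# Treating weights as counts makes the sample size sum(hhwt),
# which is what shrinks the standard error.
n_fweight  <- nrow(expanded)
se_fweight <- sd(expanded$hhincome) / sqrt(n_fweight)
```

With real census weights the expanded data frame would have as many rows as the weighted population (the Stata output above reports 91746420 observations), so this is only practical as an illustration of why the two standard errors differ, not as a way to analyse the full file.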
Tags
<r><stata>
Title
Frequency weighting in R, comparing results with Stata
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USNick Cox
UserOwnerUserId
1. USGriffith Rees
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POFrequency weighting in R, comparing results with Stata
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POFrequency weighting in R, comparing results with Stata
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POFrequency weighting in R, comparing results with Stata
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
