
How to handle missing data (NA) in for loops in R
<p>I am trying to calculate Chi Square discrepancies for my data and for a similar simulated dataset in order to evaluate the fit of a model (using Bayesian inference). My problem is that my dataset contains missing values but my simulated dataset does not, so I can't compare the discrepancy statistics from the two.</p>

<p>Here is some simulated data similar to what I'm working with:</p>

<pre><code>p &lt;- array(runif(3000*195*6, 0, 1), c(3000, 195, 6))
N &lt;- array(rpois(3000*195, 10), c(3000, 195))
y &lt;- array(0, c(195, 6))
for(j in 1:195){
  for(k in 1:6){
    y[j,k] &lt;- rbinom(1, N[1,j], p[1,j,k])  # N[1,j], not N[j]: N is a 3000 x 195 matrix
  }
}
# use integer indices for the NA positions; runif() returns non-integers,
# which are silently truncated when used as indices
foo &lt;- sample(1:195, 50)
bar &lt;- sample(1:6, 50, replace = TRUE)
for(i in 1:50){
  y[foo[i], bar[i]] &lt;- NA
}
</code></pre>

<p>This gives my response variable y with some missing values (NA). Now I calculate, essentially, a Chi Square discrepancy for my data y and for a simulated "ideal" dataset y.new. However, y.new does not have any missing values, so if I leave the missing data out of y but not out of y.new, fit.new will always come out larger than fit.</p>

<pre><code>eval  &lt;- array(NA, c(3000, 195, 6))
E     &lt;- array(NA, c(3000, 195, 6))
E.new &lt;- array(NA, c(3000, 195, 6))
y.new &lt;- array(NA, c(3000, 195, 6))  # must be 3-D: it is indexed as y.new[i,j,k] below
for(i in 1:3000){
  for(j in 1:195){
    for(k in 1:6){
      eval[i,j,k]  &lt;- p[i,j,k] * N[i,j]
      E[i,j,k]     &lt;- ((y[j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
      y.new[i,j,k] &lt;- rbinom(1, N[i,j], p[i,j,k])  # create new "ideal" dataset
      E.new[i,j,k] &lt;- ((y.new[i,j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
    }
  }
}
# very slow! think about how to vectorize instead of nested for loops
fit     &lt;- sum(E)
fit.new &lt;- sum(E.new)
</code></pre>

<p>My question is how best to handle the missing values. Currently the sums come out as NA because of the missing values in y, and even if they didn't, fit and fit.new wouldn't be comparable. My idea is to find the locations of the missing values in y and drop those same [j,k] cells from all the other arrays that I'm using.
Any suggestions on how best to do this?</p>

<p>EDIT: I'm getting a very strange result. Whether I run the code as above or with sweep, E[1,,] is much smaller than E[i,,] for i &gt; 1. What is particularly strange is that eval[1,,] and eval[i,,] for i &gt; 1 appear to be the same. I even tried replicating y[j,k] into a y[i,j,k] array with every slice y[i,,] equal, just to see whether handling matrices of different sizes was the problem. Does anyone know why this would be the case? In theory, with this simulated data, I would expect all the slices E[i,,] and E.new[i,,] to be roughly similar. Below is some summary output to show what I'm talking about. This seems like a new question, but it relates to my original one: I assumed the NAs were causing the problem, but it seems that may not be the only thing going on.</p>

<pre><code>&gt; summary(eval[1,,])
       V1                 V2                 V3                V4
 Min.   : 0.01167   Min.   : 0.01476   Min.   : 0.0293   Min.   : 0.01953
 1st Qu.: 2.60909   1st Qu.: 2.35093   1st Qu.: 2.5239   1st Qu.: 1.85789
 Median : 4.85460   Median : 5.12719   Median : 5.2480   Median : 4.35639
 Mean   : 5.09371   Mean   : 5.39451   Mean   : 5.3891   Mean   : 4.72061
 3rd Qu.: 6.91273   3rd Qu.: 7.44676   3rd Qu.: 7.5431   3rd Qu.: 7.06119
 Max.   :15.81298   Max.   :14.94309   Max.   :14.9851   Max.   :16.25751

&gt; summary(eval1[2,,])
       V1                 V2                 V3                V4
 Min.   : 0.06346   Min.   : 0.06468   Min.   : 0.2092   Min.   : 0.006769
 1st Qu.: 2.44825   1st Qu.: 1.93702   1st Qu.: 2.4226   1st Qu.: 2.426689
 Median : 4.16865   Median : 4.01536   Median : 5.0771   Median : 4.833679
 Mean   : 4.85646   Mean   : 4.64887   Mean   : 5.3450   Mean   : 5.169656
 3rd Qu.: 6.64691   3rd Qu.: 6.96278   3rd Qu.: 7.7034   3rd Qu.: 7.229125
 Max.   :13.00335   Max.   :13.79093   Max.   :17.2673   Max.   :17.915080

&gt; summary(E[1,,])
       V1                V2                V3                 V4
 Min.   :0.00001   Min.   :0.00000   Min.   :0.000003   Min.   :0.000008
 1st Qu.:0.02744   1st Qu.:0.02723   1st Qu.:0.023008   1st Qu.:0.035854
 Median :0.11750   Median :0.11889   Median :0.109138   Median :0.146706
 Mean   :0.39880   Mean   :0.41636   Mean   :0.353876   Mean   :0.479533
 3rd Qu.:0.46435   3rd Qu.:0.40993   3rd Qu.:0.390625   3rd Qu.:0.604021
 Max.   :4.43466   Max.   :4.83871   Max.   :6.254577   Max.   :5.231650
 NA's   :10        NA's   :8         NA's   :8          NA's   :10

&gt; summary(E[2,,])
       V1                 V2                  V3
 Min.   :  0.0000   Min.   :  0.00003   Min.   :  0.00002
 1st Qu.:  0.8213   1st Qu.:  0.42091   1st Qu.:  0.36853
 Median :  2.0454   Median :  2.31697   Median :  2.39892
 Mean   :  8.0619   Mean   :  9.40838   Mean   :  6.38919
 3rd Qu.:  5.6755   3rd Qu.:  6.34782   3rd Qu.:  4.89749
 Max.   :395.9499   Max.   :172.83324   Max.   :120.93648
 NA's   :10         NA's   :8           NA's   :8
</code></pre>

<p>Thanks, Dan</p>
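The drop-the-same-cells idea described in the question might look like the sketch below, in vectorized form. This is only a sketch under my own assumptions, not a tested solution for the real dataset: the reduced `n.iter`, the `set.seed` call, and the helper names (`n.iter`, `n.site`, `miss`, `eval.i`, etc.) are all mine, and `y` is generated here in one vectorized call rather than with the original nested loops. The key step is building a logical mask with `is.na(y)` once, then summing only the unmasked cells of both discrepancy arrays so that `fit` and `fit.new` are computed over exactly the same [j,k] cells.

```r
set.seed(1)   # assumed seed, only for reproducibility of the sketch

n.iter <- 30  # reduced from 3000 so the sketch runs quickly
n.site <- 195
n.rep  <- 6

p <- array(runif(n.iter * n.site * n.rep), c(n.iter, n.site, n.rep))
N <- array(rpois(n.iter * n.site, 10), c(n.iter, n.site))

# Observed data simulated from draw 1, with some cells set to NA
y <- matrix(rbinom(n.site * n.rep, rep(N[1, ], n.rep), as.vector(p[1, , ])),
            n.site, n.rep)
y[sample(length(y), 50)] <- NA

miss <- is.na(y)  # 195 x 6 logical mask of the missing cells

fit <- fit.new <- numeric(n.iter)
for (i in 1:n.iter) {
  eval.i  <- p[i, , ] * N[i, ]  # expected counts; N[i, ] recycles down columns,
                                # so cell [j,k] gets p[i,j,k] * N[i,j]
  y.new.i <- matrix(rbinom(n.site * n.rep, rep(N[i, ], n.rep), as.vector(p[i, , ])),
                    n.site, n.rep)
  E.i     <- (y - eval.i)^2 / (eval.i + 0.5)
  E.new.i <- (y.new.i - eval.i)^2 / (eval.i + 0.5)
  # Drop the SAME cells from both sums so fit and fit.new stay comparable
  fit[i]     <- sum(E.i[!miss])
  fit.new[i] <- sum(E.new.i[!miss])
}
```

On the EDIT: one possible explanation, guessed from the posted code rather than a confirmed diagnosis, is that y is simulated from p[1,,] in the setup, so the slice E[1,,] compares y against the very parameters that generated it and would be expected to be systematically smaller than E[i,,] for i &gt; 1, with or without the NAs.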
 


 