Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>If you <em>really</em> want to do this (remove the largest (absolute) residuals), then we can employ the linear model to estimate the least squares solution and associated residuals and then select the middle n% of the data. Here is an example:</p> <p>Firstly, generate some dummy data:</p> <pre><code>require(MASS) ## for mvrnorm() set.seed(1) dat &lt;- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2)) dat &lt;- data.frame(dat) names(dat) &lt;- c("X","Y") plot(dat) </code></pre> <p>Next, we fit the linear model and extract the residuals:</p> <pre><code>res &lt;- resid(mod &lt;- lm(Y ~ X, data = dat)) </code></pre> <p>The <code>quantile()</code> function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the upper and lower 0.05 quantiles:</p> <pre><code>res.qt &lt;- quantile(res, probs = c(0.05,0.95)) </code></pre> <p>Select those observations with residuals in the middle 90% of the data:</p> <pre><code>want &lt;- which(res &gt;= res.qt[1] &amp; res &lt;= res.qt[2]) </code></pre> <p>We can then visualise this, with the red points being those we will retain:</p> <pre><code>plot(dat, type = "n") points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8) points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8) abline(mod, col = "blue", lwd = 2) </code></pre> <p><img src="https://i.stack.imgur.com/gaOp1.png" alt="The plot produced from the dummy data showing the selected points with the smallest residuals"></p> <p>The correlations for the full data and the selected subset are:</p> <pre><code>&gt; cor(dat) X Y X 1.0000000 0.8935235 Y 0.8935235 1.0000000 &gt; cor(dat[want,]) X Y X 1.0000000 0.9272109 Y 0.9272109 1.0000000 &gt; cor(dat[-want,]) X Y X 1.000000 0.739972 Y 0.739972 1.000000 </code></pre> <p>Be aware that here we might be throwing out perfectly good data, because we just choose the 5% with largest positive residuals and 5% with the largest negative. An alternative is to select the 90% with smallest <em>absolute</em> residuals:</p> <pre><code>ares &lt;- abs(res) absres.qt &lt;- quantile(ares, prob = c(.9)) abswant &lt;- which(ares &lt;= absres.qt) ## plot - virtually the same, but not quite plot(dat, type = "n") points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8) points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8) abline(mod, col = "blue", lwd = 2) </code></pre> <p>With this slightly different subset, the correlation is slightly lower:</p> <pre><code>&gt; cor(dat[abswant,]) X Y X 1.0000000 0.9272032 Y 0.9272032 1.0000000 </code></pre> <p>Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of the outliers, and discard only those values above a certain threshold Cook's distance. <a href="http://en.wikipedia.org/wiki/Cook&#39;s_distance" rel="noreferrer">Wikipedia</a> has info on Cook's distance and proposed thresholds. The <code>cooks.distance()</code> function can be used to retrieve the values from <code>mod</code>:</p> <pre><code>&gt; head(cooks.distance(mod)) 1 2 3 4 5 6 7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03 </code></pre> <p>and if you compute the threshold(s) suggested on Wikipedia and remove only those that exceed the threshold. For these data:</p> <pre><code>&gt; any(cooks.distance(mod) &gt; 1) [1] FALSE &gt; any(cooks.distance(mod) &gt; (4 * nrow(dat))) [1] FALSE </code></pre> <p>none of the Cook's distances exceed the proposed thresholds (not surprising given the way I generated the data.)</p> <p>Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy and bit like data dredging to me.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload