<p>This may already be obvious to the OP, but just to make sure: you have to be careful, because trying to maximize correlation may actually tend to <em>include</em> outliers. (@Gavin touched on this point in his answer/comments.) I would <em>first</em> remove outliers, <em>then</em> calculate a correlation. More generally, we want a correlation estimate that is robust to outliers (and there are many such methods in R).</p>
<p>To illustrate this dramatically, let's create two vectors <code>x</code> and <code>y</code> that are uncorrelated:</p>
<pre><code>set.seed(1)
x &lt;- rnorm(1000)
y &lt;- rnorm(1000)
&gt; cor(x,y)
[1] 0.006401211
</code></pre>
<p>Now let's add an outlier point <code>(500,500)</code>:</p>
<pre><code>x &lt;- c(x, 500)
y &lt;- c(y, 500)
</code></pre>
<p>Now the correlation of <em>any</em> subset that includes the outlier point will be close to 1, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular, for the full data set:</p>
<pre><code>&gt; cor(x,y)
[1] 0.995741
</code></pre>
<p>If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the <code>robust</code> package:</p>
<pre><code>require(robust)
&gt; covRob(cbind(x,y), corr = TRUE)
Call:
covRob(data = cbind(x, y), corr = TRUE)

Robust Estimate of Correlation:
            x           y
x  1.00000000 -0.02594260
y -0.02594260  1.00000000
</code></pre>
<p>You can play around with the parameters of <code>covRob</code> to decide how to trim the data. <strong><em>UPDATE:</em></strong> There is also the <code>rlm</code> function (robust linear regression) in the <code>MASS</code> package.</p>
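<p>As a rough sketch of the "remove outliers first, then correlate" idea using only base R, one option is to flag points by their distance from the median in units of the MAD, and to compare that with a rank-based (Spearman) correlation. The helper <code>is_outlier</code> and its cutoff <code>k = 5</code> are illustrative assumptions, not part of the <code>robust</code> package or of the approach above, and the cutoff would need tuning for real data:</p>
<pre><code>set.seed(1)
x &lt;- c(rnorm(1000), 500)   # same construction as above: noise plus one outlier
y &lt;- c(rnorm(1000), 500)

# Rank-based correlation: the outlier contributes only its rank, not its magnitude
rho_spearman &lt;- cor(x, y, method = "spearman")

# Hypothetical helper: TRUE where a value lies more than k MADs from the median
is_outlier &lt;- function(v, k = 5) abs(v - median(v)) / mad(v) &gt; k

# Keep only the points that are not flagged in either coordinate
keep &lt;- !(is_outlier(x) | is_outlier(y))
rho_trimmed &lt;- cor(x[keep], y[keep])

rho_spearman
rho_trimmed
</code></pre>
<p>Both estimates should come out close to zero here, in contrast to the plain Pearson correlation of the full data. Screening each coordinate separately is the simplest choice; it will miss points that are unusual only jointly, which is where a multivariate method such as <code>covRob</code> is more appropriate.</p>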
 
