StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>[Huge thanks to John D Cook for much of what I learned in putting together this answer!]</p> <p>First, here's the reason not to use sum-of-squares: <a href="http://www.johndcook.com/blog/2008/09/26/" rel="nofollow noreferrer">http://www.johndcook.com/blog/2008/09/26/</a></p> <p>What you should do instead:</p> <p>Keep track of the count (n), the mean (u), and a quantity (s) from which sample variance and standard error can be determined. (Adapted from <a href="http://www.johndcook.com/standard_deviation.html" rel="nofollow noreferrer">http://www.johndcook.com/standard_deviation.html</a>.)</p> <p>Initialize <code>n = u = s = 0</code>. </p> <p>For each new datapoint, <code>x</code>:</p> <pre><code>u0 = u; n ++; u += (x - u) / n; s += (x - u0) * (x - u); </code></pre> <p>The sample variance is then <code>s/(n-1)</code>, the variance of the sample mean is <code>s/(n-1)/n</code>, and the standard error of the sample mean is <code>SE = sqrt(s/(n-1)/n)</code>.</p> <p>It remains to compute the Student-t <code>c</code>-confidence interval (<code>c</code> in (0,1)):</p> <pre><code>u [plus or minus] SE*g((1-c)/2, n-1) </code></pre> <p>where <code>g</code> is the inverse cdf (aka quantile) of the Student-t distribution with mean 0 and variance 1, taking a probability and the degrees of freedom (one less than the number of data points):</p> <pre><code>g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1/2) - 1) </code></pre> <p>where <code>irib</code> is the inverse regularized incomplete beta function:</p> <pre><code>irib(s0,s1,a,b) = z such that rib(s0,z,a,b) = s1 </code></pre> <p>where <code>rib</code> is the regularized incomplete beta function:</p> <pre><code>rib(x0,x1,a,b) = B(x0,x1,a,b) / B(a,b) </code></pre> <p>where <code>B(a,b)</code> is the Euler beta function and <code>B(x0,x1,a,b)</code> is the incomplete beta function:</p> <pre><code>B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) = integral_0^1 t^(a-1)*(1-t)^(b-1) dt B(x0,x1,a,b) = integral_x0^x1 t^(a-1)*(1-t)^(b-1) dt </code></pre> <p>Typical numerical/statistics libraries will have implementations of the beta function (or the inverse cdf of the Student-t distribution directly). For C, the de facto standard is the Gnu Scientific Library (GSL). Often a 3-argument version of the beta function is given; the generalization to 4 arguments is as follows:</p> <pre><code>B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b) rib(x0,x1,a,b) = rib(x1,a,b) - rib(x0,a,b) </code></pre> <p>Finally, here is an implementation in Mathematica:</p> <pre><code>(* Take current {n,u,s} and new data point; return new {n,u,s}. *) update[{n_,u_,s_}, x_] := {n+1, u+(x-u)/(n+1), s+(x-u)(x-(u+(x-u)/(n+1)))} Needs["HypothesisTesting`"]; g[p_, df_] := InverseCDF[StudentTDistribution[df], p] (* Mean CI given n,u,s and confidence level c. *) mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]}, {u+d, u-d}] </code></pre> <p>Compare to</p> <pre><code>StudentTCI[u, SE, n-1, ConfidenceLevel->c] </code></pre> <p>or, when the entire list of data points is available,</p> <pre><code>MeanCI[list, ConfidenceLevel->c] </code></pre> <p>Finally, if you don't want to load math libraries for things like the beta function, you can hardcode a lookup table for <code>-g((1-c)/2, n-1)</code>. Here it is for <code>c=.95</code> and <code>n=2..100</code>:</p> <p>12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934, 2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168, 2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226, 2.1603686564627917, 2.1447866879178012, 2.131449545559774, 2.1199052992212533, 2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626, 2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254, 2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243, 2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976, 2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462, 2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715, 2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627, 2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314, 2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857, 2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696, 2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578, 1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692, 1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386, 1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.993463566661884, 1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793, 1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452, 1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192, 1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672, 1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985, 1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508, 1.9847231860139618, 1.98446745450849, 1.9842169515863888</p> <p>which is asymptotically approaching the inverse CDF of a normal(0,1) distribution for <code>c=.95</code>, which is:</p> <pre><code>-sqrt(2)*InverseErf(-c) = 1.959963984540054235524594430520551527955550... </code></pre> <p>See <a href="http://mathworld.wolfram.com/InverseErf.html" rel="nofollow noreferrer">http://mathworld.wolfram.com/InverseErf.html</a> for the inverse <code>erf()</code> function. Notice that <code>g((1-.95)/2,n-1)</code> doesn't round to 1.96 until there are at least 474 data points. It rounds to 2.0 when there are 29 data points.</p> <p>As a rule of thumb, you should use Student-t instead of the normal approximation for <code>n</code> up to at least 300, not 30 per conventional wisdom. Cf. <a href="http://www.johndcook.com/blog/2008/11/12/" rel="nofollow noreferrer">http://www.johndcook.com/blog/2008/11/12/</a>.</p> <p>See also "Improving Compressed Counting" by Ping Li of Cornell.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload