**Skewness and Kurtosis**

For on-line algorithms for skewness and kurtosis (along the lines of the one for the variance), see the [parallel algorithms for higher-moment statistics](http://en.wikipedia.org/wiki/Algorithms%5Ffor%5Fcalculating%5Fvariance#Higher-order_statistics) on the same wiki page (a Python sketch appears at the end of this answer).

**Median**

The median is tough without sorted data. If you know how many data points you have, in theory you only have to partially sort, e.g. by using a [selection algorithm](http://en.wikipedia.org/wiki/Selection_algorithm). However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.

**Median and Mode with Frequency Counts**

If the values are integers, I would count [frequencies](http://en.wikipedia.org/wiki/Frequency_(statistics)), probably cutting off the highest and lowest values beyond some point where I am sure they are no longer relevant. For floats (or too many distinct integers), I would create buckets/intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table (sketched at the end of this answer).

**Normally Distributed Random Variables**

If the values are normally distributed, I would use the sample [mean](http://en.wikipedia.org/wiki/Mean#Population_and_sample_means), [variance](http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance), [skewness](http://en.wikipedia.org/wiki/Skewness#Sample_skewness), and [kurtosis](http://en.wikipedia.org/wiki/Kurtosis#Sample_kurtosis) of a small subset as maximum likelihood estimators. You already know the (on-line) algorithms to calculate those. E.g. read in a couple of hundred thousand or million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set, e.g. that you don't introduce a bias by picking the first 100,000 values (a reservoir-sampling sketch appears at the end of this answer). The same approach can also be used to estimate the mode and median for the normal case (for both, the sample mean is an estimator).

**Further comments**

All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps; a sketch for merging partial moment statistics appears at the end of this answer.

I have always assumed (with the exception of the section on the normal distribution) that we are talking about sample moments, median, and mode, not about estimators for theoretical moments given a known distribution.

In general, sampling the data (i.e. only looking at a subset) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distribution) and the moments, mode, and median actually exist for that distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) of the [Cauchy distribution](http://en.wikipedia.org/wiki/Cauchy_distribution) do not exist. In that case, the sample mean of a "small" subset might be massively off from the sample mean of the whole sample.
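---

**Python sketches**

To make the skewness/kurtosis section concrete, here is a minimal one-pass accumulator following the higher-moment update from the wiki page linked above. The class and property names are my own; treat this as a sketch, not a reference implementation.

```python
class RunningMoments:
    """One-pass accumulator for mean, variance, skewness, and kurtosis,
    following the higher-order on-line update from the Wikipedia page
    "Algorithms for calculating variance"."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0   # sum of squared deviations from the mean
        self.M3 = 0.0   # sum of cubed deviations
        self.M4 = 0.0   # sum of fourth-power deviations

    def push(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1

    @property
    def variance(self):        # population variance
        return self.M2 / self.n

    @property
    def skewness(self):
        return self.n ** 0.5 * self.M3 / self.M2 ** 1.5

    @property
    def kurtosis(self):        # excess kurtosis
        return self.n * self.M4 / (self.M2 * self.M2) - 3.0
```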
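If a sample does fit in memory, the median can be found without a full sort via quickselect, in expected linear time. As said above, this does not scale to billions of values, so it is mostly useful on a random subset; the helper below is illustrative.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) by repeated partitioning.
    Expected O(n) time, but the data must fit in memory."""
    pivot = random.choice(values)
    lo = [v for v in values if v < pivot]
    eq = [v for v in values if v == pivot]
    hi = [v for v in values if v > pivot]
    if k < len(lo):
        return quickselect(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return quickselect(hi, k - len(lo) - len(eq))

def median(values):
    # Lower median; for an even count you could also average
    # the two middle order statistics.
    return quickselect(values, (len(values) - 1) // 2)
```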
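For the frequency-count approach, here is a sketch with fixed-width buckets; `bucket_width` is an arbitrary illustrative parameter (for plain integers you can count the values themselves instead of bucket indices).

```python
from collections import Counter

def bucketed_mode_median(stream, bucket_width=0.1):
    """Approximate mode and median from a stream of floats via fixed-width
    buckets; both results are accurate to within one bucket width."""
    counts = Counter()
    n = 0
    for x in stream:
        counts[int(x // bucket_width)] += 1   # index of the containing bucket
        n += 1
    # Mode: midpoint of the most frequent bucket.
    mode = (counts.most_common(1)[0][0] + 0.5) * bucket_width
    # Median: walk the buckets in order until half the observations are covered.
    seen = 0
    for bucket in sorted(counts):
        seen += counts[bucket]
        if 2 * seen >= n:
            median = (bucket + 0.5) * bucket_width
            break
    return mode, median
```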
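To "pick randomly from your set" without holding all of it in memory, reservoir sampling (Vitter's Algorithm R) draws a uniform random sample from a stream of unknown length; the subset size `k` below is arbitrary.

```python
import random

def reservoir_sample(stream, k=100_000):
    """Uniform random sample of k items from a stream of unknown length
    (Algorithm R), avoiding the bias of just taking the first k values."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randrange(i + 1)   # uniform in [0, i]
            if j < k:
                sample[j] = x
    return sample
```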
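Finally, for the parallel case: per-chunk `(n, mean, M2, M3, M4)` aggregates, as maintained by the accumulator above, can be merged pairwise using the combination formulas from the same wiki page. A sketch, with the function name my own:

```python
def merge_moments(a, b):
    """Combine two (n, mean, M2, M3, M4) aggregates computed on disjoint
    chunks, using the pairwise formulas from "Algorithms for calculating
    variance". Apply repeatedly to reduce any number of chunks."""
    na, mean_a, M2a, M3a, M4a = a
    nb, mean_b, M2b, M3b, M4b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    M2 = M2a + M2b + delta ** 2 * na * nb / n
    M3 = (M3a + M3b
          + delta ** 3 * na * nb * (na - nb) / n ** 2
          + 3.0 * delta * (na * M2b - nb * M2a) / n)
    M4 = (M4a + M4b
          + delta ** 4 * na * nb * (na ** 2 - na * nb + nb ** 2) / n ** 3
          + 6.0 * delta ** 2 * (na ** 2 * M2b + nb ** 2 * M2a) / n ** 2
          + 4.0 * delta * (na * M3b - nb * M3a) / n)
    return n, mean, M2, M3, M4
```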
 
