Note that there are some explanatory texts on larger screens.

plurals
  1. PODetermine whether there is a subset of size n which has a standard deviation <= s
    primarykey
    data
    text
    <p>Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.</p> <p>To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size <code>n</code> which has a standard deviation &lt;= <code>s</code>. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.</p> <p>So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?</p> <ul> <li>EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying <code>n</code> in advance is okay. But feel free to correct me!</li> </ul> <p>And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in <code>n log n</code>:</p> <ul> <li>First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is <code>n log n</code>)</li> <li><p>Second, take the first <code>n</code> numbers and compute their standard deviation. If our array of numbers is 0-based, then the first <code>n</code> numbers are <code>[0, n-1]</code>. To get standard deviation, compute <code>s1</code> and <code>s2</code> as follows:</p> <ul> <li><code>s1 = sum of numbers</code></li> <li><code>s2 = sum of squares of numbers</code></li> </ul> <p>Then, <a href="http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods" rel="nofollow">wikipedia</a> says that the standard deviation is <code>sqrt(n*s2 - s1^2)/n</code>. Record this value as the highest standard deviation seen so far.</p></li> <li>Find the standard deviation of <code>[1, n]</code>, <code>[2, n+1]</code>, <code>[3, n+2]</code> ... until you hit the the last <code>n</code> numbers. To do each computation takes only constant time if you keep track of <code>s1</code> and <code>s2</code> running totals: for example, to get std dev of <code>[1, n]</code>, just subtract the 0th element from the <code>s1</code> and <code>s2</code> totals and add the nth element, then recalculate standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes linear time.</li> </ul> <p>So total time complexity <code>n log n</code>.</p> <p>Is my assessment right? Is there a better way to do this? I really need this to run <em>fast</em> on fairly large sets, so the faster the better! Space is less of an issue (I <em>think</em>).</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload