StackOverflow2013


plurals

PODetermine whether there is a subset of size n which has a standard deviation <= s
primarykey
Id
17266642
data
AcceptedAnswerId
17267046
AnswerCount
1
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-06-24T00:34:25.297
FavoriteCount
0
LastActivityDate
2013-06-24T01:51:49.060
LastEditDate
2013-06-24T00:47:57.223
LastEditorUserId
801901
OwnerUserId
801901
ParentId
0
PostTypeId
1
Score
1
ViewCount
379
LastEditorDisplayName
text
Body
Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere in which the numbers are very densely packed. To make things more precise, I thought I'd pose a more specific problem: given a set of <code>N</code> numbers, I would like to determine whether there is a subset of size <code>n</code> which has a standard deviation <= <code>s</code>. If there are many such subsets, I'd like to find the one with the lowest standard deviation. So question #1: does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers? <ul> <li>EDIT: I don't actually care about determining which numbers belong to this "clump"; I'm much more interested in determining where the clump is centred, which is why I think that specifying <code>n</code> in advance is okay. But feel free to correct me!</li> </ul> And question #2: assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with the lowest time complexity)? So far I think I have a solution that runs in <code>O(N log N)</code>: <ul> <li>First, note that the subset of a given size with the lowest standard deviation must consist of numbers that are consecutive in sorted order. So step 1 is to sort the numbers (this is <code>O(N log N)</code>).</li> <li>Second, take the first <code>n</code> numbers and compute their standard deviation. If our array of numbers is 0-based, then the first <code>n</code> numbers are at indices <code>[0, n-1]</code>. To get the standard deviation, compute <code>s1</code> and <code>s2</code> as follows: <ul> <li><code>s1 = sum of numbers</code></li> <li><code>s2 = sum of squares of numbers</code></li> </ul> Then, <a href="http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods" rel="nofollow">wikipedia</a> says that the (population) standard deviation is <code>sqrt(n*s2 - s1^2)/n</code>. 
Record this value as the lowest standard deviation seen so far.</li> <li>Find the standard deviation of <code>[1, n]</code>, <code>[2, n+1]</code>, <code>[3, n+2]</code> ... until you hit the last <code>n</code> numbers, updating the recorded minimum whenever a window has a lower standard deviation. Each computation takes only constant time if you keep track of <code>s1</code> and <code>s2</code> as running totals: for example, to get the standard deviation of <code>[1, n]</code>, just subtract the 0th element from <code>s1</code> and its square from <code>s2</code>, add the nth element to <code>s1</code> and its square to <code>s2</code>, then recalculate the standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes <code>O(N)</code> time.</li> </ul> So the total time complexity is <code>O(N log N)</code>, dominated by the sort. Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
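For illustration, here is a minimal Python sketch of the sort-then-slide approach described in the question. The names `tightest_window` and `has_clump` are hypothetical, chosen only for this example, and the sketch assumes the population standard deviation formula quoted above. It also returns the mean of the best window, since the question is mostly about where the clump is centred.

```python
import math


def tightest_window(values, n):
    """Return (lowest_std, mean, window) over all size-n subsets.

    Assumes the lowest-deviation subset is a run of n consecutive
    values in sorted order, as argued in the question.
    """
    if n <= 0 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")

    xs = sorted(values)                      # O(N log N)

    # Running totals for the first window, indices [0, n-1].
    s1 = sum(xs[:n])                         # sum of values
    s2 = sum(x * x for x in xs[:n])          # sum of squares

    def std(s1, s2):
        # Population standard deviation: sqrt(n*s2 - s1^2) / n.
        # max(..., 0) guards against tiny negative values caused by
        # floating-point rounding; this formula can lose precision
        # when the values are large relative to their spread.
        return math.sqrt(max(n * s2 - s1 * s1, 0)) / n

    best_std, best_start = std(s1, s2), 0

    # Slide the window one position at a time: O(1) work per step.
    for start in range(1, len(xs) - n + 1):
        leaving, entering = xs[start - 1], xs[start + n - 1]
        s1 += entering - leaving
        s2 += entering * entering - leaving * leaving
        current = std(s1, s2)
        if current < best_std:
            best_std, best_start = current, start

    window = xs[best_start:best_start + n]
    return best_std, sum(window) / n, window


def has_clump(values, n, s):
    """True if some size-n subset has standard deviation <= s."""
    return tightest_window(values, n)[0] <= s
```

Calling `tightest_window([1, 100, 2, 3, 200, 101], 3)` picks the window `[1, 2, 3]`, centred on 2. If the inputs are floats with a large magnitude and a small spread, a numerically steadier update (for example, shifting all values by a constant before summing) may be worth considering.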
Tags
<time-complexity><subset><standard-deviation>
Title
Determine whether there is a subset of size n which has a standard deviation <= s
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USOrd
UserOwnerUserId
1. USOrd
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PODetermine whether there is a subset of size n which has a standard deviation <= s
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
