StackOverflow2013


plurals

PO
primarykey
Id
13575018
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2012-11-26T23:34:37.183
FavoriteCount
0
LastActivityDate
2012-11-27T23:29:55.803
LastEditDate
2012-11-27T23:29:55.803
LastEditorUserId
1667256
OwnerUserId
1667256
ParentId
13560272
PostTypeId
2
Score
5
ViewCount
0
LastEditorDisplayName
text
Body
<p>@Ben Allison's answer is a good approach if you want to count the total number of lines. Since you mentioned Bayes and the prior, I will add an answer in that direction that calculates the percentage of each group. (See my comments on your question: if you already have an idea of the total and want to do a <code>groupby</code>, estimating the percentage of each group makes more sense.)</p>
<h2>The recursive Bayesian update:</h2>
<p>I will start by assuming you have only two groups, <code>group1</code> and <code>group2</code> (this can be extended to multiple groups, as explained later).</p>
<p>We denote by <code>M(m,n)</code> the event of seeing <code>m</code> <code>group1</code>s in the first <code>n</code> lines (rows) processed. You then necessarily see <code>n-m</code> <code>group2</code>s, because we assume these are the only two possible groups. The conditional probability of the event <code>M(m,n)</code> given the proportion of <code>group1</code> (<code>s</code>) is given by the binomial distribution with <code>n</code> trials. We are trying to estimate <code>s</code> in a Bayesian way.</p>
<p>The conjugate prior for the binomial is the beta distribution. For simplicity, we choose <code>Beta(1,1)</code> as the prior, which is the uniform distribution on (0,1); of course, you can pick your own values of <code>alpha</code> and <code>beta</code>. So for this prior, <code>alpha=1</code> and <code>beta=1</code>. 
</p>
<p>The recursive update formulas for a binomial likelihood with a beta prior are as follows:</p>
<pre><code>if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
</code></pre>
<p>The posterior of <code>s</code> is also a beta distribution (here <code>alpha</code> and <code>beta</code> are the prior parameters):</p>
<pre><code>                s^(m+alpha-1) (1-s)^(n-m+beta-1)
p(s| M(m,n)) = ----------------------------------- = Beta (m+alpha, n-m+beta)
                     B(m+alpha, n-m+beta)
</code></pre>
<p>where <code>B</code> is the <a href="http://en.wikipedia.org/wiki/Beta_function" rel="nofollow">beta function</a>. To report the estimate, you can rely on the mean and variance of the <a href="http://en.wikipedia.org/wiki/Beta_distribution" rel="nofollow"><code>Beta</code> distribution</a>, where <code>alpha</code> and <code>beta</code> are now the updated parameters:</p>
<pre><code>mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
</code></pre>
<h2>The python code: <code>groupby.py</code></h2>
<p>A few lines of python to process your data from <code>stdin</code> and estimate the percentage of <code>group1</code> would look something like this:</p>
<pre class="lang-py prettyprint-override"><code>import sys

# Beta(1,1) prior: uniform on (0,1)
alpha = 1.
beta = 1.

for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue  # skip lines that belong to neither group

    # posterior mean and variance after this row
    mean = alpha/(alpha+beta)
    var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
    print('mean = %.3f, var = %.3f' % (mean, var))
</code></pre>
<h2>The sample data</h2>
<p>I feed a few lines of data to the code:</p>
<pre><code>group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
</code></pre>
<h2>The approximate estimation result</h2>
<p>And here is what I get as results:</p>
<pre><code>mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
</code></pre>
<p>The result shows that group1 is estimated to make up 64.7% of the data after the 15th row processed (based on our beta(1,1) prior). You might notice that the variance tends to shrink as we accumulate more observations.</p>
<h2>Multiple groups</h2>
<p>If you have more than 2 groups, just change the underlying distribution from binomial to multinomial; the corresponding <a href="http://en.wikipedia.org/wiki/Conjugate_prior" rel="nofollow">conjugate prior</a> is then the Dirichlet distribution. Everything else changes analogously.</p>
<h2>Further notes</h2>
<p>You said you would like the approximate estimate in 3-4 seconds. In this case, you just sample a portion of your data and feed the output to the above script, e.g., </p>
<pre><code>head -n100000 YOURDATA.txt | python groupby.py
</code></pre>
<p>That's it. Hope it helps. </p>
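<p>The multi-group extension mentioned under "Multiple groups" could be sketched roughly as below. This is only an illustration under stated assumptions, not part of the original script: the function names <code>posterior_summary</code> and <code>estimate_shares</code> are hypothetical, the set of group labels is assumed to be known in advance, and every group gets the same symmetric prior pseudo-count (1 by default, mirroring <code>Beta(1,1)</code>). It uses the fact that each component of a Dirichlet distribution is marginally beta distributed.</p>

```python
def posterior_summary(counts):
    """Return {label: (mean, variance)} for a Dirichlet with these parameters.

    Each component i is marginally Beta(a_i, a_0 - a_i), where a_0 is the
    sum of all parameters.
    """
    total = sum(counts.values())
    summary = {}
    for label, a in counts.items():
        mean = a / total
        var = a * (total - a) / (total ** 2 * (total + 1.0))
        summary[label] = (mean, var)
    return summary


def estimate_shares(lines, groups, prior=1.0):
    """Estimate the share of each group from an iterable of text lines.

    `groups` (assumed known up front) lists the valid labels; `prior` is the
    symmetric Dirichlet pseudo-count given to every group.
    """
    # Start from the prior pseudo-counts.
    counts = {g: float(prior) for g in groups}
    for line in lines:
        label = line.strip()
        if label in counts:
            # Conjugate update: add one to the observed group's parameter.
            counts[label] += 1.0
        # Lines with unknown labels are skipped, as in the two-group script.
    return posterior_summary(counts)
```

<p>With only two groups this reduces to the beta update above; passing <code>sys.stdin</code> as <code>lines</code> would let it be used in the same pipeline.</p>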
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POFastest approximate counting algorithm
  singulars
  PostTypePostTypeId
  PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USgreeness
UserOwnerUserId
1. USgreeness
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POFastest approximate counting algorithm
  singulars
  PostTypePostTypeId
  PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
  singulars
  PostPostId
  PO
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
2. VO
  singulars
  PostPostId
  PO
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
3. VO
  singulars
  PostPostId
  PO
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
CommentsPostId
1. CONote that this is exactly what I was suggesting, with the addition of the prior. Unless your prior variance is tiny (i.e. alpha + beta is on the same order of magnitude as the sample size), the posterior mean and the ML estimate (what I was suggesting) are going to be identical for most practical purposes. I'm not clear what the advantage of a Bayesian method would be here (which is why I didn't respond to that part of the question :))
  singulars
  PostPostId
  PO
  UserUserId
  USBen Allison
2. COAgree. The OP might be able to take advantage of the Bayesian approach by using an appropriate prior and/or sampling only a small portion of his data. Otherwise, the posterior mean and the maximum-likelihood estimate are almost identical.
  singulars
  PostPostId
  PO
  UserUserId
  USgreeness
