    <p>@Ben Allison's answer is a good approach if you want to count the total number of lines. Since you mentioned Bayes and a prior, I will add an answer in that direction that estimates the percentage of each group. (See my comments on your question: if you already have an idea of the total and you want to do a <code>groupby</code>, estimating the percentage of each group makes more sense.)</p>
    <h2>The recursive Bayesian update:</h2>
    <p>I will start by assuming you have only two groups, <code>group1</code> and <code>group2</code> (this can be extended to multiple groups, as explained later).</p>
    <p>Let <code>M(m,n)</code> denote the event that <code>m</code> of the first <code>n</code> lines (rows) you processed are <code>group1</code>. You will then have seen <code>n-m</code> <code>group2</code>s, because we assume these are the only two possible groups. The conditional probability of <code>M(m,n)</code> given the percentage of <code>group1</code> (<code>s</code>) is therefore given by the binomial distribution with <code>n</code> trials. We are trying to estimate <code>s</code> in a Bayesian way.</p>
    <p>The conjugate prior for the binomial is the beta distribution. For simplicity, we choose <code>Beta(1,1)</code>, which is the uniform distribution on (0,1), as the prior, so <code>alpha=1</code> and <code>beta=1</code> to start with (of course, you can pick your own values for <code>alpha</code> and <code>beta</code>).
</p>
<p>The recursive update for a binomial likelihood with a beta prior is:</p>
<pre><code>if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
</code></pre>
<p>The posterior of <code>s</code> is also a beta distribution:</p>
<pre><code>                s^(m+alpha-1) (1-s)^(n-m+beta-1)
p(s| M(m,n)) = ---------------------------------- = Beta(m+alpha, n-m+beta)
                      B(m+alpha, n-m+beta)
</code></pre>
<p>where <code>B</code> is the <a href="http://en.wikipedia.org/wiki/Beta_function" rel="nofollow">beta function</a>, and <code>alpha</code> and <code>beta</code> in this formula are the prior parameters. After the recursive updates above, the running <code>alpha</code> and <code>beta</code> already equal <code>m+alpha</code> and <code>n-m+beta</code>, so to report the estimate you can use the <a href="http://en.wikipedia.org/wiki/Beta_distribution" rel="nofollow"><code>Beta</code> distribution's</a> mean and variance directly on the updated values:</p>
<pre><code>mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
</code></pre>
<h2>The python code: <code>groupby.py</code></h2>
<p>A few lines of Python that read your data from <code>stdin</code> and estimate the percentage of <code>group1</code> would look something like this:</p>
<pre class="lang-py prettyprint-override"><code>import sys

# Beta(1,1) prior: uniform on (0,1)
alpha = 1.
beta = 1.

for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        # skip blank or unrecognised lines
        continue

    # posterior mean and variance after this row
    mean = alpha/(alpha+beta)
    var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
    print('mean = %.3f, var = %.3f' % (mean, var))
</code></pre>
<h2>The sample data</h2>
<p>I feed a few lines of data to the code:</p>
<pre><code>group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
</code></pre>
<h2>The approximate estimation result</h2>
<p>And here is what I get as results:</p>
<pre><code>mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
</code></pre>
<p>The result shows that <code>group1</code> is estimated to make up 64.7% of the rows once the 15th row has been processed (based on our <code>Beta(1,1)</code> prior). You might notice that the variance tends to shrink as more observations come in.</p>
<h2>Multiple groups</h2>
<p>If you have more than 2 groups, just change the underlying distribution from binomial to multinomial; the corresponding <a href="http://en.wikipedia.org/wiki/Conjugate_prior" rel="nofollow">conjugate prior</a> is then the Dirichlet distribution. The rest of the changes are analogous.</p>
<h2>Further notes</h2>
<p>You said you would like the approximate estimate in 3-4 seconds. In that case, feed only a portion of your data to the above script, e.g. the first 100000 lines:</p>
<pre><code>head -n100000 YOURDATA.txt | python groupby.py
</code></pre>
<p>That's it. Hope it helps.</p>
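<p>As a rough illustration of the multiple-groups case, here is a minimal sketch of the Dirichlet update. The function name <code>dirichlet_update</code> and its <code>prior</code> parameter are made up for this example, and it assumes a symmetric Dirichlet prior over only those groups that actually appear in the data, which is a simplification if some groups may be unseen so far.</p>

```python
from collections import Counter


def dirichlet_update(labels, prior=1.0):
    """Return {group: (posterior mean, posterior variance)}.

    Hypothetical helper: assumes a symmetric Dirichlet prior with
    pseudo-count `prior` for every group seen in `labels`.
    """
    # Count each non-empty label; blank lines are skipped.
    counts = Counter(label for label in labels if label)
    # Conjugate update: posterior parameter = prior pseudo-count + count.
    alphas = {group: prior + n for group, n in counts.items()}
    total = sum(alphas.values())
    # Marginal of each component of a Dirichlet is Beta(a, total - a).
    return {
        group: (a / total, a * (total - a) / (total ** 2 * (total + 1)))
        for group, a in alphas.items()
    }
```

<p>With exactly two groups this reduces to the beta update above. To use it on a stream, something like <code>dirichlet_update(line.strip() for line in sys.stdin)</code> should work after importing <code>sys</code>.</p>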