Note that there are some explanatory texts on larger screens.

plurals
  1. POEfficient algorithm to randomly select items with frequency
    primarykey
    data
    text
    <p>Given an array of <code>n</code> word-frequency pairs:</p> <pre>[ (w<sub>0</sub>, f<sub>0</sub>), (w<sub>1</sub>, f<sub>1</sub>), ..., (w<sub>n-1</sub>, f<sub>n-1</sub>) ]</pre> <p>where <code>w<sub>i</sub></code> is a word, <code>f<sub>i</sub></code> is an integer frequencey, and the sum of the frequencies <code>&sum;f<sub>i</sub> = m</code>,</p> <p>I want to use a pseudo-random number generator (pRNG) to select <code>p</code> words <code>w<sub>j<sub>0</sub></sub>, w<sub>j<sub>1</sub></sub>, ..., w<sub>j<sub>p-1</sub></sub></code> such that the probability of selecting any word is proportional to its frequency:</p> <pre>P(w<sub>i</sub> = w<sub>j<sub>k</sub></sub>) = P(i = j<sub>k</sub>) = f<sub>i</sub> / m</pre> <p>(Note, this is selection with replacement, so the same word <em>could</em> be chosen every time).</p> <p>I've come up with three algorithms so far:</p> <ol> <li><p>Create an array of size <code>m</code>, and populate it so the first <code>f<sub>0</sub></code> entries are <code>w<sub>0</sub></code>, the next <code>f<sub>1</sub></code> entries are <code>w<sub>1</sub></code>, and so on, so the last <code>f<sub>p-1</sub></code> entries are <code>w<sub>p-1</sub></code>.<pre>[ w<sub>0</sub>, ..., w<sub>0</sub>, w<sub>1</sub>,..., w<sub>1</sub>, ..., w<sub>p-1</sub>, ..., w<sub>p-1</sub> ]</pre> Then use the pRNG to select <code>p</code> indices in the range <code>0...m-1</code>, and report the words stored at those indices.<br> This takes <code>O(n + m + p)</code> work, which isn't great, since <code>m</code> can be much much larger than n.</p></li> <li><p>Step through the input array once, computing<pre>m<sub>i</sub> = &sum;<sub>h&le;i</sub>f<sub>h</sub> = m<sub>i-1</sub> + f<sub>i</sub></pre> and after computing <code>m<sub>i</sub></code>, use the pRNG to generate a number <code>x<sub>k</sub></code> in the range <code>0...m<sub>i</sub>-1</code> for each <code>k</code> in <code>0...p-1</code> and select <code>w<sub>i</sub></code> for <code>w<sub>j<sub>k</sub></sub></code> (possibly replacing the current value of <code>w<sub>j<sub>k</sub></sub></code>) if <code>x<sub>k</sub> &lt; f<sub>i</sub></code>.<br> This requires <code>O(n + np)</code> work.</p></li> <li>Compute <code>m<sub>i</sub></code> as in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:<pre>[ (w<sub>0</sub>, f<sub>0</sub>, m<sub>0</sub>), (w<sub>1</sub>, f<sub>1</sub>, m<sub>1</sub>), ..., (w<sub>n-1</sub>, f<sub>n-1</sub>, m<sub>n-1</sub>) ]</pre> and then, for each k in <code>0...p-1</code>, use the pRNG to generate a number <code>x<sub>k</sub></code> in the range <code>0...m-1</code> then do binary search on the array of triples to find the <code>i</code> s.t. <code>m<sub>i</sub>-f<sub>i</sub> &le; x<sub>k</sub> &lt; m<sub>i</sub></code>, and select <code>w<sub>i</sub></code> for <code>w<sub>j<sub>k</sub></sub></code>.<br> This requires <code>O(n + p log n)</code> work.</li> </ol> <p><strong>My question is</strong>: Is there a more efficient algorithm I can use for this, or are these as good as it gets?</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload