<p>It appears that you're building what the IR (information retrieval) community calls an inverted index. In that case, I agree with the overall approach you're taking, but I also recommend using the <code>Counter</code> class <strong>in conjunction with</strong> <code>defaultdict</code>...</p>
<pre class="lang-py prettyprint-override"><code>import collections
counters.form = collections.defaultdict(collections.Counter)
</code></pre>
<p><code>counters.form</code> would then act as a sort of index of a compressed world model, where the absence of observations isn't an error (nor <code>False</code>), just 0.</p>
<p>Using your <code>form</code> data as an example, we populate the inverted index like this...</p>
<pre><code>#-- Build the example data into the proposed structure...
counters.form['a'].update({'email1': 4, 'email2': 3})
counters.form['the'].update({'email1': 2, 'email3': 4})
counters.form['or'].update({'email1': 2, 'email3': 1})
</code></pre>
<p>Now, to get the frequency of a form in this data, we index it as if it were a 2D array...</p>
<pre><code>print(counters.form['a']['email2'])
</code></pre>
<p>...which should print <code>3</code> and is more or less the same as the structure you are currently using. The real difference between the two approaches appears when you have no observations. For instance...</p>
<pre><code>print(counters.form['noword']['some-email'])
</code></pre>
<p>...using your current structure (<code>collections.defaultdict(dict)</code>), the lookup of 'noword' on <code>counters.form</code> would miss, and the defaultdict would automatically associate a newly constructed, empty dictionary with <code>counters.form['noword']</code>. However, when this empty dict is then queried for the key 'some-email', it has no such key, resulting in a <code>KeyError</code> exception for 'some-email'.</p>
<p>If instead we use the suggested structure (<code>collections.defaultdict(collections.Counter)</code>), then the lookup of 'noword' on <code>counters.form</code> would miss, and a new <code>collections.Counter</code> would be associated with the key 'noword'. When the counter is then queried (in the second lookup) for 'some-email', the counter will respond with 0, which is (I believe) the desired behavior.</p>
<p>Some other recipes...</p>
<pre><code>#-- Show distinct emails which contain 'someword'
emails = list(counters.form['someword'])

#-- Show tally of all observations of 'someword'
tally = sum(counters.form['someword'].values())
</code></pre>
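<p>To tie the pieces together, here is a minimal, self-contained sketch of the same idea in Python 3. The sample email texts, the <code>build_index</code> function, and the <code>frequency</code> helper are assumptions made up for illustration (they are not from the question); the texts are chosen so that the counts match the example data above.</p>

```python
from collections import Counter, defaultdict

# Hypothetical sample data (an assumption, not from the question):
# email id -> body text, chosen to reproduce the counts used above.
emails = {
    "email1": "a the or a the a or a",
    "email2": "a a a",
    "email3": "the the or the the",
}


def build_index(docs):
    """Build an inverted index: word form -> Counter({doc_id: occurrences})."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for form in text.lower().split():
            index[form][doc_id] += 1
    return index


def frequency(index, form, doc_id):
    """Read-only lookup (hypothetical helper): returns 0 for unseen forms
    without adding a new key to the outer defaultdict."""
    return index.get(form, Counter())[doc_id]


form = build_index(emails)

# Observed pair: behaves like a plain nested dict.
print(form["a"]["email2"])

# Unobserved pair: the Counter answers 0 instead of raising.
print(form["noword"]["some-email"])

# The same miss with defaultdict(dict) raises KeyError on the inner lookup.
plain = defaultdict(dict)
try:
    plain["noword"]["some-email"]
except KeyError as exc:
    print("KeyError:", exc)

# Recipes from the answer, applied to 'the'.
print(sorted(form["the"]))        # distinct emails containing 'the'
print(sum(form["the"].values()))  # tally of all observations of 'the'

# Side effect worth knowing: indexing the defaultdict stored an empty Counter.
print("noword" in form)
```

<p>One design note: indexing a <code>defaultdict</code> with an unseen key stores a new empty <code>Counter</code> under that key, so an index queried with many unknown words will slowly grow. If that matters, a read-only lookup along the lines of the <code>frequency</code> helper above avoids it, since <code>dict.get</code> does not trigger the default factory.</p>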
 
