<p>Each <code>term</code> seems to have a term frequency, a doc id, and a list of positions. Is that right? If so, you could use a dict of dicts:</p>
<pre><code>dct = {'wassup': {'termfreq': 'daily', 'docid': 1, 'pos': [3, 4]}}
</code></pre>
<p>Then, given a term, like 'wassup', you could look up the term frequency with</p>
<pre><code>dct['wassup']['termfreq']
# 'daily'
</code></pre>
<p>Think of a dict as being like a telephone book. It is great at looking up values (phone numbers) given keys (names). It is not so hot at looking up keys given values. Use a dict when you know you only need to look things up in one direction. You may need some other data structure (a database perhaps?) if your lookup patterns are more complex.</p>
<hr>
<p>You might also want to check out the <a href="http://www.nltk.org/" rel="nofollow">Natural Language Toolkit (nltk)</a>. It has a <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.TextCollection-class.html" rel="nofollow">method for calculating <code>tf_idf</code></a> built in:</p>
<pre><code>import nltk

# Given a corpus of texts
text1 = 'Lorem ipsum FOO dolor BAR sit amet'
text2 = 'Ut enim ad FOO minim veniam, '
text3 = 'Duis aute irure dolor BAR in reprehenderit '
text4 = 'Excepteur sint occaecat BAR cupidatat non proident'

# We split the texts into tokens, and form a TextCollection
# (word_tokenize needs NLTK's tokenizer data to be installed)
mytexts = [nltk.word_tokenize(text) for text in [text1, text2, text3, text4]]
mycollection = nltk.TextCollection(mytexts)

# Given a new text
text = 'et FOO tu BAR Brute'
tokens = nltk.word_tokenize(text)

# For each token (roughly, word) in the new text, we compute the tf_idf
for word in tokens:
    print('{w}: {s}'.format(w=word, s=mycollection.tf_idf(word, tokens)))
</code></pre>
<p>yields</p>
<pre><code>et: 0.0
FOO: 0.138629436112
tu: 0.0
BAR: 0.0575364144904
Brute: 0.0
</code></pre>
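<p>To see where those numbers come from, here is a minimal pure-Python sketch. It assumes the common definitions tf = (count of the term in the text) / (number of tokens in the text) and idf = log(number of texts / number of texts containing the term), with idf taken as 0 when no text contains the term. Those definitions are an assumption on my part, but they reproduce the output shown above. The helper names <code>tf</code>, <code>idf</code> and <code>tf_idf</code> are just illustrative, and a plain <code>str.split()</code> stands in for <code>nltk.word_tokenize</code>.</p>

```python
import math

def tf(term, tokens):
    # Fraction of the tokens in this text that are the term
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Log of (number of texts / number of texts containing the term);
    # assumed to be 0.0 when the term appears in no text at all
    matches = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / matches) if matches else 0.0

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Same corpus as above, tokenized by whitespace instead of nltk.word_tokenize
corpus = [text.split() for text in [
    'Lorem ipsum FOO dolor BAR sit amet',
    'Ut enim ad FOO minim veniam, ',
    'Duis aute irure dolor BAR in reprehenderit ',
    'Excepteur sint occaecat BAR cupidatat non proident',
]]

tokens = 'et FOO tu BAR Brute'.split()
scores = {word: tf_idf(word, tokens, corpus) for word in tokens}

for word, score in scores.items():
    print('{w}: {s}'.format(w=word, s=score))
```

<p>For example, FOO is 1 of the 5 tokens in the new text (tf = 0.2) and occurs in 2 of the 4 corpus texts (idf = log 2), which gives 0.2 * log 2, about 0.1386. BAR occurs in 3 of the 4 texts, so it is less distinctive and scores lower.</p>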
 
