StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Check out NLTK package: <a href="http://www.nltk.org" rel="noreferrer">http://www.nltk.org</a> it has everything what you need</p> <p>For the cosine_similarity:</p> <pre><code> def cosine_distance(u, v): """ Returns the cosine of the angle between vectors v and u. This is equal to u.v / |u||v|. """ return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v))) </code></pre> <p>For ngrams:</p> <pre><code> def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None): """ A utility that produces a sequence of ngrams from a sequence of items. For example: >>> ngrams([1,2,3,4,5], 3) [(1, 2, 3), (2, 3, 4), (3, 4, 5)] Use ingram for an iterator version of this function. Set pad_left or pad_right to true in order to get additional ngrams: >>> ngrams([1,2,3,4,5], 2, pad_right=True) [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)] @param sequence: the source data to be converted into ngrams @type sequence: C{sequence} or C{iterator} @param n: the degree of the ngrams @type n: C{int} @param pad_left: whether the ngrams should be left-padded @type pad_left: C{boolean} @param pad_right: whether the ngrams should be right-padded @type pad_right: C{boolean} @param pad_symbol: the symbol to use for padding (default is None) @type pad_symbol: C{any} @return: The ngrams @rtype: C{list} of C{tuple}s """ if pad_left: sequence = chain((pad_symbol,) * (n-1), sequence) if pad_right: sequence = chain(sequence, (pad_symbol,) * (n-1)) sequence = list(sequence) count = max(0, len(sequence) - n + 1) return [tuple(sequence[i:i+n]) for i in range(count)] </code></pre> <p>for tf-idf you will have to compute distribution first, I am using Lucene to do that, but you may very well do something similar with NLTK, use FreqDist:</p> <p><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term" rel="noreferrer">http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term</a></p> <p>if you like pylucene, this will tell you how to comute tf.idf</p> <pre><code> # reader = lucene.IndexReader(FSDirectory.open(index_loc)) docs = reader.numDocs() for i in xrange(docs): tfv = reader.getTermFreqVector(i, fieldname) if tfv: rec = {} terms = tfv.getTerms() frequencies = tfv.getTermFrequencies() for (t,f,x) in zip(terms,frequencies,xrange(maxtokensperdoc)): df= searcher.docFreq(Term(fieldname, t)) # number of docs with the given term tmap.setdefault(t, len(tmap)) rec[t] = sim.tf(f) * sim.idf(df, max_doc) #compute TF.IDF # and normalize the values using cosine normalization if cosine_normalization: denom = sum([x**2 for x in rec.values()])**0.5 for k,v in rec.items(): rec[k] = v / denom </code></pre>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload