StackOverflow2013

<p>Welcome to the world of NLP ^_^</p>
<p>All you need is a little basic knowledge and some tools.</p>
<p>There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" rel="noreferrer">part-of-speech taggers</a>. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:</p>
<pre><code>$ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english
# Word        POS       base form
Without       IN        without
getting       VVG       get
a             DT        a
degree        NN        degree
in            IN        in
information   NN        information
retrieval     NN        retrieval
,             ,         ,
I             PP        I
'd            MD        will
like          VV        like
to            TO        to
know          VV        know
if            IN        if
there         EX        there
exists        VVZ       exist
any           DT        any
algorithms    NNS       algorithm
for           IN        for
counting      VVG       count
the           DT        the
frequency     NN        frequency
that          IN/that   that
words         NNS       word
occur         VVP       occur
in            IN        in
a             DT        a
given         VVN       give
body          NN        body
of            IN        of
text          NN        text
.             SENT      .
</code></pre>
<p>As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VVZ) of "exist." It also identified "a" and "the" as determiners (DT), another word for articles. The POS tagger also tokenized the punctuation.</p>
<p>To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.</p>
<p>Here are some popular POS taggers:</p>
<p><a href="http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/" rel="noreferrer">TreeTagger</a> (binary only: Linux, Solaris, OS X)<br>
<a href="http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/" rel="noreferrer">GENIA Tagger</a> (C++: compile it yourself)<br>
<a href="http://nlp.stanford.edu/software/tagger.shtml" rel="noreferrer">Stanford POS Tagger</a> (Java)</p>
<p>To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting <em>sequences of words</em> rather than just the words themselves. These are called <a href="http://en.wikipedia.org/wiki/N-gram" rel="noreferrer">n-grams</a>. A good place to start is <a href="http://people.sslmit.unibo.it/~baroni/compling04/UnixforPoets.pdf" rel="noreferrer">UNIX for Poets</a>. If you are willing to invest in a book on NLP, I would recommend <a href="http://rads.stackoverflow.com/amzn/click/0262133601" rel="noreferrer">Foundations of Statistical Natural Language Processing</a>.</p>
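<p>As a rough illustration of the two counting steps described above (filter by tag and count base forms, then count word sequences), here is a minimal Python sketch. It assumes the tagger output has already been saved as whitespace-separated "word, POS, base form" lines like the sample above; the names <code>parse_tagged</code>, <code>lemma_frequencies</code>, <code>ngram_frequencies</code> and <code>KEEP_PREFIXES</code> are made up for this example and are not part of any tagger. Which tags to keep is also just an assumption here.</p>

```python
from collections import Counter

# A few lines copied from the tagger output above: word, POS tag, base form.
TAGGED = """\
there EX there
exists VVZ exist
any DT any
algorithms NNS algorithm
for IN for
counting VVG count
the DT the
frequency NN frequency
that IN/that that
words NNS word
occur VVP occur
"""

# Assumption: keep only nouns (NN*) and lexical verbs (VV*); adjust to taste.
KEEP_PREFIXES = ("NN", "VV")


def parse_tagged(text):
    """Yield (word, pos, lemma) triples from whitespace-separated lines."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            yield tuple(parts)


def lemma_frequencies(triples, keep=KEEP_PREFIXES):
    """Count base forms whose tag starts with one of the kept prefixes."""
    return Counter(lemma for _, pos, lemma in triples if pos.startswith(keep))


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens."""
    tokens = list(tokens)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


triples = list(parse_tagged(TAGGED))
lemma_counts = lemma_frequencies(triples)
bigram_counts = ngram_frequencies((lemma for _, _, lemma in triples), n=2)

print(lemma_counts.most_common(3))
print(bigram_counts.most_common(3))
```

<p>On a real text you would feed in the full tagger output instead of the inline sample, and probably raise <code>n</code> to 3 or more to see which longer sequences keep recurring.</p>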
