<p>NLP techniques are relatively ill-equipped to deal with this kind of text.</p>

<p>Phrased differently: it is quite possible to build a solution which includes NLP processes to implement the desired classifier, but the added complexity doesn't necessarily pay off in terms of speed of development or classifier precision improvements.<br>
If one really insists on using NLP techniques, POS-tagging and its ability to identify nouns is the most obvious idea, but chunking and access to WordNet or other lexical sources are other plausible uses of NLTK.</p>

<p>Instead, an ad-hoc solution based on simple regular expressions and a few heuristics such as those suggested by <em>NoBugs</em> is probably an appropriate approach to the problem. Certainly, such solutions bear two main risks:</p>

<ul>
  <li>over-fitting to the portion of the text reviewed/considered while building the rules</li>
  <li>possible messiness/complexity of the solution if too many rules and sub-rules are introduced</li>
</ul>

<p>Running some plain statistical analysis on the complete set (or a very big sample) of the texts to be considered should help guide the selection of a few heuristics and also avoid the over-fitting concerns. I'm quite sure that a relatively small number of rules, associated with a custom dictionary, should be sufficient to produce a classifier with appropriate precision as well as speed/resource performance.</p>

<p>A few ideas:</p>

<ul>
  <li>count all the words (and possibly all the bi-grams and tri-grams) in a sizable portion of the corpus at hand. This info can drive the design of the classifier by allowing the most effort and the most rigid rules to be allocated to the most common patterns.</li>
  <li>manually introduce a short dictionary which associates the most popular words with:
    <ul>
      <li>their POS function (mostly a binary matter here, i.e. nouns vs. modifiers and other non-nouns)</li>
      <li>their synonym root [if applicable]</li>
      <li>their class [if applicable]</li>
    </ul>
  </li>
  <li>if the pattern holds for most of the input text, consider using the last word before the end of the text or before the first comma as the main key to class selection. If the pattern doesn't hold, just give more weight to the first and to the last word.</li>
  <li>consider a first pass where the text is re-written with the most common bi-grams replaced by a single word (even an artificial code word) which would be in the dictionary</li>
  <li>consider also replacing the most common typos or synonyms with their corresponding synonym root. Adding regularity to the input helps improve precision and also helps a few rules / a few entries in the dictionary yield a big return on precision.</li>
  <li>for words not found in the dictionary, assume that words which are mixed with numbers and/or preceded by numbers are modifiers, not nouns</li>
  <li>consider a two-tier classification whereby inputs which cannot be plausibly assigned a class are put in a "manual pile" to prompt additional review, which results in the addition of rules and/or dictionary entries. After a few iterations the classifier should require fewer and fewer improvements and tweaks.</li>
  <li>look for non-obvious features. For example, some corpora are made from a mix of sources, and some of the sources may include particular regularities which help identify the source and/or be applicable as classification hints. For example, some sources may only contain, say, uppercase text (or text typically longer than 50 characters, or truncated words at the end, etc.)</li>
</ul>

<p>I'm afraid this answer falls short of providing Python/NLTK snippets as a primer towards a solution, but frankly such simple NLTK-based approaches are likely to be disappointing at best. Also, we would need a much bigger sample set of the input text to guide the selection of plausible approaches, including ones based on NLTK or NLP techniques at large.</p>
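<p>To make the first idea concrete, here is a minimal sketch of the counting pass and the bi-gram folding pass, using only the standard library. The corpus lines and the tokenization regex are invented for illustration; the real ones would come from your data:</p>

```python
import re
from collections import Counter

def tokens(line):
    # crude tokenizer; adjust the character class to your corpus
    return re.findall(r"[a-z0-9/.'-]+", line.lower())

# toy corpus standing in for the real input texts
corpus = [
    "stainless steel screw, 3/8 in.",
    "stainless steel bolt, 1/2 in.",
    "copper wire, 22 gauge",
]

words = Counter()
bigrams = Counter()
for line in corpus:
    toks = tokens(line)
    words.update(toks)
    bigrams.update(zip(toks, toks[1:]))

# the most frequent words/bi-grams are where rules and dictionary
# entries pay off the most
print(words.most_common(3))
print(bigrams.most_common(2))

# first-pass rewrite: fold the most common bi-gram into a single
# artificial code word so later rules can treat it as one token
top = bigrams.most_common(1)[0][0]
code_word = "_".join(top)
pattern = r"\b%s\s+%s\b" % tuple(re.escape(t) for t in top)
rewritten = [re.sub(pattern, code_word, line.lower()) for line in corpus]
print(rewritten[0])  # 'stainless_steel screw, 3/8 in.'
```

The same `Counter` tables can later be dumped and reviewed manually when seeding the custom dictionary.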
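<p>And a sketch of the classifier itself, combining several of the heuristics above: the last word before the first comma as the class key, digit-bearing tokens treated as modifiers, and a <code>None</code> result standing in for the "manual pile". The dictionary entries and class names are placeholders, not part of the original answer:</p>

```python
import re

# hypothetical seed dictionary mapping known head nouns to classes;
# in practice it would be built from the most frequent corpus words
NOUN_CLASSES = {
    "screw": "fasteners",
    "bolt": "fasteners",
    "cable": "wiring",
    "wire": "wiring",
}

def classify(text):
    """Heuristic classifier: look up the last plausible noun in the
    segment before the first comma; digit-bearing tokens are assumed
    to be modifiers (sizes, gauges), not nouns."""
    head = text.split(",")[0]
    toks = re.findall(r"[A-Za-z0-9/.'-]+", head.lower())
    nouns = [t for t in toks if not re.search(r"\d", t)]
    for token in reversed(nouns):
        if token in NOUN_CLASSES:
            return NOUN_CLASSES[token]
    return None  # unclassified: send to the "manual pile" for review

print(classify("stainless steel screw, 3/8 in."))  # fasteners
print(classify("22AWG copper wire"))               # wiring
print(classify("mystery item 42"))                 # None
```

Inputs that come back as <code>None</code> are exactly the ones whose review should drive new rules and dictionary entries in the iterative, two-tier scheme described above.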
 
