I have a semi-working system that solves this problem, open-sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple [word sense](https://en.wikipedia.org/wiki/Word_sense) options), which is not the same as Named Entity Recognition. My basic approach is somewhat competitive with existing solutions and (crucially) is customisable.

There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!

I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall ([precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but they have low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open-source version tailored to tweets. Here's the current code: https://github.com/ianozsvald/social_media_brand_disambiguator

I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just **brand** disambiguation (companies, people, etc.) when you already have their name. That's why I believe this straightforward approach will work.

I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach: I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 [n-grams](https://en.wikipedia.org/wiki/N-gram). I don't scale with TF-IDF (TF-IDF is useful when you have variable document lengths; my tweets are only one or two sentences, and my testing didn't show any improvement with TF-IDF).

I use the default tokenizer, which is very simple but surprisingly useful. It ignores @ and # (so you lose some context) and of course doesn't expand URLs. I then train using [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), and it seems that this problem is somewhat linearly separable (lots of terms for one class simply don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work). There's a rough sketch of this pipeline below.

The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.

This works for Apple because people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split into one category or the other. This condition may not hold for something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.
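To make that concrete, here is a minimal sketch of the kind of pipeline described above (binary counts over 1-3 grams feeding logistic regression). It is not the project's actual code - see learn1.py in the repository for that - and the file name `labelled_tweets.csv` and the `tweet`/`label` column names are placeholders.

```python
# Minimal sketch of the vectorize-then-classify approach described above.
# "labelled_tweets.csv" and its column names are hypothetical placeholders;
# the real implementation lives in learn1.py in the linked repository.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed two columns: the tweet text and a 0/1 label
# (1 = Apple Inc, 0 = any other use of "apple").
df = pd.read_csv("labelled_tweets.csv")

pipeline = make_pipeline(
    # Binary counts: record only whether a term appears, not how often,
    # over 1-3 grams. No TF-IDF scaling, as discussed above.
    CountVectorizer(binary=True, ngram_range=(1, 3)),
    LogisticRegression(),
)
pipeline.fit(df["tweet"], df["label"])

print(pipeline.predict(["just bought a new apple macbook"]))
```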
I have [a series of blog posts](http://ianozsvald.com/category/socialmediabranddisambiguator/) describing this project, including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).

If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications; that way you can force high precision by trading it against recall (so you get correct results, but fewer of them). You'll have to tune this to your system (there's a sketch of this at the end of the answer).

Here's a possible algorithmic approach using scikit-learn:

- Use a binary CountVectorizer (I don't think term counts in short messages add much information, as most words occur only once)
- Start with a Decision Tree classifier. It'll have explainable performance (see *[Overfitting with a Decision Tree](http://ianozsvald.com/2013/07/07/overfitting-with-a-decision-tree/)* for an example).
- Move to logistic regression
- Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, then work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
- Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach

Things to consider:

- You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 examples per class (see the overfitting note below)
- Improve the tokeniser (very easy with scikit-learn) to keep # and @ in tokens, and maybe add a capitalised-brand detector (as user @user2425429 notes); there's a sketch of a modified tokeniser at the end of the answer
- Consider a non-linear classifier (like @oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
- Try a tweet-specific part-of-speech tagger (in my humble opinion not Stanford's, as @Neil suggests - it performs poorly on poor Twitter grammar in my experience)
- Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1/l2 penalisation)

Regarding overfitting: my dataset of 2000 items is a 10-minute snapshot of 'apple' tweets from Twitter. About 2/3 of the tweets are about Apple Inc and 1/3 are other apple uses. I pull out a balanced subset (about 584 rows per class, I think) and do five-fold cross-validation for training.

Since I only have a 10-minute time window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshot, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.
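As mentioned above, one way to force high precision is to act only on confident predictions. Here is a minimal sketch of that idea, continuing from the fitted `pipeline` in the earlier sketch; the 0.9 threshold and the example tweets are made up and need tuning against your own data.

```python
# Only accept a prediction when the model's probability clears a threshold;
# everything else is left unclassified (higher precision, lower recall).
# THRESHOLD = 0.9 is an arbitrary starting point - tune it on held-out data.
import numpy as np

new_tweets = [
    "apple crumble recipe with custard",
    "queueing outside the apple store for the new iphone",
]

probs = pipeline.predict_proba(new_tweets)       # shape (n_tweets, n_classes)
best = probs.argmax(axis=1)
confidence = probs[np.arange(len(new_tweets)), best]
labels = pipeline.classes_[best]

THRESHOLD = 0.9
for tweet, label, conf in zip(new_tweets, labels, confidence):
    if conf >= THRESHOLD:
        print("%s (%.2f): %s" % (label, conf, tweet))
    else:
        print("unclassified (%.2f): %s" % (conf, tweet))
```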
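On the tokeniser point in the list above: the default token pattern drops # and @, so hashtags and mentions lose their marker. A hypothetical `token_pattern` like the one below keeps the symbol attached to the following word; treat it as a starting point rather than the repository's actual setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern (r"(?u)\b\w\w+\b") strips # and @. This variant
# optionally keeps a leading # or @ so "#apple" and "@apple" survive as tokens.
vectorizer = CountVectorizer(
    binary=True,
    ngram_range=(1, 3),
    token_pattern=r"(?u)[#@]?\b\w\w+\b",
)
vectorizer.fit(["Loving the new #apple keynote from @apple"])
print(sorted(vectorizer.vocabulary_))
```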
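Finally, a sketch of the evaluation step from the overfitting note: downsample to a balanced set, then run five-fold cross-validation. The file and column names are placeholders again, the pipeline is the same binary 1-3 gram / logistic regression combination as above, and `groupby().sample()` assumes a reasonably recent pandas.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labelled_tweets.csv")  # hypothetical labelled dataset

# Downsample each class to the size of the smallest class so the
# cross-validation folds are balanced.
per_class = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=per_class, random_state=0)

pipeline = make_pipeline(
    CountVectorizer(binary=True, ngram_range=(1, 3)),
    LogisticRegression(),
)
scores = cross_val_score(pipeline, balanced["tweet"], balanced["label"], cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```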