StackOverflow2013

<p><strong>Two-Stage Approach for Multiword Tags</strong></p>
<p>You could <strong>pool all the tweets</strong> into a single larger document and then extract the <strong><em>n</em></strong> most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. With this approach, <strong><em>n</em></strong> is the total number of multiword tags generated for the whole dataset.</p>
<p>For the first stage, you could use the NLTK code posted <a href="https://stackoverflow.com/questions/2661778/tag-generation-from-a-text-content/2664351#2664351">here</a>. The second stage could be accomplished with a simple for loop over all the tweets. However, if speed is a concern, you could use <a href="http://lucene.apache.org/pylucene/" rel="nofollow noreferrer">pylucene</a> to quickly find the tweets that contain each collocation.</p>
<p><strong>Tweet-Level PMI for Single-Word Tags</strong></p>
<p>As also suggested <a href="https://stackoverflow.com/questions/2661778/tag-generation-from-a-text-content/2664351#2664351">here</a>, for single-word tags you could calculate the <a href="http://en.wikipedia.org/wiki/Pointwise_mutual_information" rel="nofollow noreferrer">pointwise mutual information</a> of each individual word and the tweet itself, i.e.</p>
<pre><code>PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
</code></pre>
<p>Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific tweet as opposed to coming across it in the larger collection. You could then tag the tweet with the few terms that have the highest <code>PMI</code> with the tweet.</p>
<p><strong>General Changes for Tweets</strong></p>
<p>Some changes you might want to make when tagging tweets include:</p>
<ul>
<li><p>Only use a word or collocation as a tag for a tweet if it also occurs in a <strong>certain number or percentage of other tweets</strong>. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet and nowhere else, e.g. misspellings and keyboard noise like #@$#@$%!.</p></li>
<li><p>Scale the number of tags with the length of each tweet. You might be able to extract 2 or 3 interesting tags from longer tweets. But for a short 2-word tweet, you probably <strong>don't want to use every single word and collocation to tag it</strong>. It's probably worth experimenting with different cut-offs for how many tags to extract given the tweet length.</p></li>
</ul>
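<p>As a rough illustration of the single-word approach, here is a minimal sketch in plain Python. It assumes the tweets are already tokenized, and it estimates the probabilities from raw token counts, which is one possible choice and not something the PMI formula itself prescribes. The function name <code>tag_tweets</code>, the parameters <code>min_df</code> and <code>max_tags</code>, and the "one tag per three tokens" rule are all hypothetical, there only to show the document-frequency filter and the length-based cut-off.</p>

```python
import math
from collections import Counter


def tag_tweets(tweets, min_df=2, max_tags=3):
    """Tag each tokenized tweet with its highest-PMI words.

    tweets   -- list of token lists (tokenization is assumed to be done already)
    min_df   -- hypothetical threshold: a word must appear in at least this many tweets
    max_tags -- hypothetical upper bound on tags per tweet
    """
    total_tokens = sum(len(tokens) for tokens in tweets)
    # How often each word occurs in the whole collection.
    term_count = Counter(w for tokens in tweets for w in tokens)
    # In how many distinct tweets each word occurs.
    doc_freq = Counter(w for tokens in tweets for w in set(tokens))

    tagged = []
    for tokens in tweets:
        scores = {}
        for word, count_in_tweet in Counter(tokens).items():
            # Skip words seen in too few tweets (misspellings, keyboard noise).
            if doc_freq[word] < min_df:
                continue
            # Assumed estimates: all probabilities are fractions of the token total.
            p_joint = count_in_tweet / total_tokens
            p_term = term_count[word] / total_tokens
            p_tweet = len(tokens) / total_tokens
            # PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
            scores[word] = math.log(p_joint / (p_term * p_tweet))

        # Illustrative cut-off: roughly one tag per three tokens, at least one.
        n_tags = min(max_tags, max(1, len(tokens) // 3))
        # Highest PMI first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tagged.append(ranked[:n_tags])
    return tagged
```

<p>Calling <code>tag_tweets([["python", "news"], ["weather", "today"], ...])</code> returns one list of tags per tweet, in the same order as the input. The thresholds are starting points to experiment with, not recommended values.</p>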
