The first thing to notice is that this can be approximated as a local problem. That is, whether a word is "trending" really depends only on recent data. So you can immediately truncate your data to the most recent `N` days, where `N` is some experimentally determined optimal value. This significantly cuts down on the amount of data you have to look at.

In fact, the [NPR article](http://www.npr.org/2011/12/07/143013503/how-twitters-trending-algorithm-picks-its-topics) suggests this.

Then you need to somehow look at growth, and this is precisely what the derivative captures. The first thing to do is normalize the data: divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.

For the first derivative, do something like this:

```
d[i] = (data[i] - data[i+k])/k
```

for some experimentally determined value of `k` (which, in this case, is a number of days). Similarly, the second derivative can be expressed as:

```
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(k*k)
```

Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system to these derivatives. This is a purely experimental procedure that really depends on what you want to consider "trending." For example, you might want to give the acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data, because derivatives are very sensitive to noise. You can do this by carefully choosing your value of `k` as well as by discarding words with very low frequencies altogether.

I also notice that you multiply by the log sum of the frequencies. I presume this is to give the growth of popular words more weight, because more popular words are less likely to trend in the first place. The standard way of measuring how popular a word is is to look at its [inverse document frequency](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) (IDF). I would divide by the IDF of a word to give the growth of more popular words more weight:

```
IDF[word] = log(D/df[word])
```

where `D` is the total number of documents (e.g. for Twitter it would be the total number of tweets) and `df[word]` is the number of documents containing `word` (e.g. the number of tweets containing that word).

A high IDF corresponds to an unpopular word, whereas a low IDF corresponds to a popular word. A sketch of the whole pipeline follows below.
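Putting the pieces together, here is a rough, self-contained sketch of the approach in Python. The window size `N`, step `k`, derivative weights, minimum-frequency cutoff, and the final scoring formula are all illustrative assumptions to be tuned experimentally, and `counts`, `doc_freq`, and `total_docs` are hypothetical inputs you would build from your own data. The series is stored most-recent-first to match the indexing in the formulas above, and "first data point" is interpreted as the most recent day.

```python
import math


def trending_scores(counts, doc_freq, total_docs,
                    N=14, k=2,
                    w_velocity=1.0, w_acceleration=0.5,
                    min_frequency=5):
    """Score words by how strongly they are trending.

    counts[word]   : list of daily counts, most recent day first
                     (counts[word][0] is today, [1] is yesterday, ...).
    doc_freq[word] : number of documents containing the word.
    total_docs     : total number of documents (e.g. tweets).

    All parameter defaults are illustrative assumptions, not values
    recommended above; tune them experimentally.
    """
    scores = {}
    for word, series in counts.items():
        # Truncate to the most recent N days (the "local" window).
        data = series[:N]

        # Discard very infrequent words: derivatives are too noisy for them.
        if sum(data) < min_frequency:
            continue

        # Need indices up to 2k for the second difference, and a nonzero
        # first data point to normalize by.
        if len(data) < 2 * k + 1 or data[0] == 0:
            continue

        # Normalize by the first (most recent) data point so relative
        # growth is compared, not absolute counts.
        base = data[0]
        data = [x / base for x in data]

        # Finite differences over a step of k days (data[0] is newest).
        velocity = (data[0] - data[k]) / k
        acceleration = (data[0] - 2 * data[k] + data[2 * k]) / (k * k)

        # Weighted combination of the derivatives.
        growth = w_velocity * velocity + w_acceleration * acceleration

        # Divide by IDF so popular (low-IDF) words get more weight.
        df = doc_freq.get(word, 0)
        if df == 0:
            continue
        idf = math.log(total_docs / df)
        if idf > 0:
            growth /= idf

        scores[word] = growth

    # Highest score first = most "trending".
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Using only the first two derivatives keeps the scheme simple; if you add higher-order differences, give them correspondingly smaller weights, since they amplify noise the most.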