Understanding algorithms for measuring trends
<p>What's the rationale behind the formula used in the <code>hive_trend_mapper.py</code> program of <a href="http://www.cloudera.com/blog/2009/07/tracking-trends-with-hadoop-and-hive-on-ec2/" rel="nofollow noreferrer">this Hadoop tutorial</a> on calculating Wikipedia trends?</p>
<p>There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.</p>
<p>In the daily trend, <code>pageviews</code> is an array of the number of page views per day for this topic, one element per day, and <code>total_pageviews</code> is the sum of this array:</p>
<pre><code># pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 + int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error
</code></pre>
<p>I know what it's doing superficially: it looks at the change over the past day (<code>slope</code>) and scales it by the log of <code>1+total_pageviews</code> (since <code>log(1)==0</code>, this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but one that is tempered as it grows. This way, the total pageviews stop making a difference for topics that are "popular enough," while big changes on insignificant topics don't get weighted as much.</p>
<p>But <em>why</em> do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter <em>more</em> for items that have a low constant popularity, and <em>less</em> for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take <code>y2-y1</code> and be done with it?</p>
<p>And what would the <code>error</code> be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how <code>trend</code> is used either. Is <code>trend</code> what's plotted in the end product?</p>
<p>Where can I read up on the theory here (preferably at an introductory level)? Is there a name for this madness? Is this a textbook formula somewhere?</p>
<p>Thanks in advance for any answers (or discussion!).</p>
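<p>To make the weighting described above concrete, here is a minimal sketch that wraps the quoted formula in a function and applies it to two made-up topics. The function name <code>daily_trend</code> and both data series are my own illustration, not part of the tutorial, and the sketch assumes <code>total_pageviews</code> is simply the sum of the array.</p>

```python
from math import log, sqrt


def daily_trend(pageviews):
    """Apply the quoted baseline formula to a list of daily page views.

    Hypothetical wrapper: the name and signature are illustrative only.
    Assumes at least two days of data and a non-zero total, since the
    error term divides by sqrt(total_pageviews).
    """
    total_pageviews = sum(pageviews)
    slope = pageviews[-1] - pageviews[-2]        # change over the last day
    trend = slope * log(1.0 + total_pageviews)   # slope weighted by log of volume
    error = 1.0 / sqrt(total_pageviews)          # shrinks as volume grows
    return trend, error


# Made-up data: both topics gain exactly 100 views on the last day.
obscure = [10] * 29 + [110]        # tenfold jump on a low-traffic topic
popular = [10000] * 29 + [10100]   # 1% bump on a high-traffic topic

obscure_trend, obscure_error = daily_trend(obscure)
popular_trend, popular_error = daily_trend(popular)

print("obscure:", obscure_trend, obscure_error)
print("popular:", popular_trend, popular_error)
```

<p>With these numbers the slope is 100 in both cases, yet the popular topic's <code>trend</code> comes out roughly twice as large as the obscure one's, even though its relative change is far smaller. That is exactly the behaviour I'm asking about.</p>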