<p>The data in <code>[publicdata:samples.github_timeline]</code> looks like a series of snapshots of every repository at different timestamps. If that is the case, then to calculate the change in fork count per repository per month you should not use <code>SUM(repository_forks)</code>. Instead, take the first and the last snapshot of every month and subtract one from the other to get the delta.</p>
<p>The screenshot below is the result of this query:</p>
<pre class="lang-sql prettyprint-override"><code>SELECT repository_name, created_at, repository_forks
FROM [publicdata:samples.github_timeline]
WHERE repository_name = 'Bukkit'
ORDER BY created_at;
</code></pre>
<p><img src="https://i.stack.imgur.com/Ydauk.png" alt="enter image description here"></p>
<p>However, I don't understand why <code>repository_forks</code> for <code>Bukkit</code> is zero at <code>2012-03-11 08:30:21</code>. It might be a data error. If it is, I would treat such rows as outliers and filter them out with a threshold. Note the threshold I set, <code>where repository_forks &gt; 10</code>, in order to skip the bad data. The query uses <code>MAX</code> minus <code>MIN</code> within each month as a stand-in for last minus first, which is equivalent as long as the fork count only grows.</p>
<pre class="lang-sql prettyprint-override"><code>SELECT top100.repository_name,
       substr(created_at, 0, 7) month,
       max(repository_forks) - min(repository_forks) monthly_increase,
       min(repository_forks) monthly_begin_at,
       max(repository_forks) monthly_end_with
FROM [githubarchive:github.timeline] timeline
JOIN (SELECT repository_name, MAX(repository_forks) AS forks
      FROM [githubarchive:github.timeline]
      WHERE (created_at CONTAINS "2012-04-01")
      GROUP BY repository_name
      ORDER BY forks DESC
      LIMIT 100) top100
  ON timeline.repository_name = top100.repository_name
WHERE repository_forks &gt; 10
GROUP BY top100.repository_name, month
ORDER BY top100.repository_name, month;
</code></pre>
<p>And the result looks like:</p>
<p><img src="https://i.stack.imgur.com/U7edO.png" alt="enter image description here"></p>
<p>If I am wrong and <code>repository_forks</code> already records a change rather than a running total, you can sum it as you did, and the query is actually simpler:</p>
<pre><code>SELECT repository_name,
       substr(created_at, 0, 7) AS month,
       SUM(repository_forks) AS forks
FROM [publicdata:samples.github_timeline] timeline
JOIN (SELECT repository_url, MAX(repository_forks) AS forks
      FROM [publicdata:samples.github_timeline]
      WHERE (created_at CONTAINS "2012-04-01")
      GROUP BY repository_url
      ORDER BY forks DESC
      LIMIT 100) top100
  ON timeline.repository_url = top100.repository_url
GROUP BY repository_name, month
ORDER BY repository_name, month DESC;
</code></pre>
<p><img src="https://i.stack.imgur.com/AhkSu.png" alt="enter image description here"></p>
<h2>Update:</h2>
<p>Yes. I changed the dataset to point to <code>githubarchive:github.timeline</code>, so I now have data through December 2012. The corresponding SQL and results above are updated. The data quality is still not good, though: I still see a lot of outlier data points.</p>
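<p>For comparison, the literal "last snapshot minus first snapshot" calculation described in this answer can be sketched outside BigQuery. This is only a minimal illustration in Python: the sample rows are made-up values, and <code>monthly_fork_delta</code> and <code>min_forks</code> are hypothetical names, not part of the dataset or of any BigQuery API. It assumes <code>created_at</code> is a string in the <code>YYYY-MM-DD HH:MM:SS</code> form shown above.</p>

```python
from collections import defaultdict

# Hypothetical snapshot rows: (repository_name, created_at, repository_forks).
# The fork counts are invented for illustration, not taken from the dataset.
SNAPSHOTS = [
    ("Bukkit", "2012-03-01 10:00:00", 1200),
    ("Bukkit", "2012-03-11 08:30:21", 0),     # suspicious zero, treated as an outlier
    ("Bukkit", "2012-03-30 23:00:00", 1260),
    ("Bukkit", "2012-04-02 09:00:00", 1265),
    ("Bukkit", "2012-04-28 18:00:00", 1310),
]


def monthly_fork_delta(rows, min_forks=10):
    """Return {(repo, 'YYYY-MM'): last snapshot - first snapshot} per month."""
    groups = defaultdict(list)
    for name, created_at, forks in rows:
        # Same threshold idea as `repository_forks > 10` in the SQL.
        if forks > min_forks:
            groups[(name, created_at[:7])].append((created_at, forks))

    deltas = {}
    for key, snaps in groups.items():
        # Timestamps in this format sort chronologically as plain strings.
        snaps.sort()
        deltas[key] = snaps[-1][1] - snaps[0][1]
    return deltas


if __name__ == "__main__":
    for (name, month), delta in sorted(monthly_fork_delta(SNAPSHOTS).items()):
        print(name, month, delta)
```

<p>With these sample rows the March delta is 60 and the April delta is 45. Taking the first and last snapshot is not distorted by a bad value in the middle of the month, but it is still wrong when the bad value happens to be the first or last row, so the threshold filter remains useful.</p>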