Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>This is a pretty standard problem for which there is a well-known solution. You simply sort the log files on each computer by URL and then merge them through a priority queue of size k (the number of items you want) on the "master" computer. This technique has been around since the 1960s, and is still in use today (although slightly modified) in the form of <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>.</p> <p>On each computer, extract the URL and the count from the log file, and sort by URL. Because the log files are larger than will fit into memory, you need to do an on-disk merge. That entails reading a chunk of the log file, sorting by URL, writing the chunk to disk. Reading the next chunk, sorting, writing to disk, etc. At some point, you have M log file chunks, each sorted. You can then do an M-way merge. But instead of writing items to disk, you present them, in sorted order (sorted by URL, that is), to the "master".</p> <p>Each machine sorts its own log.</p> <p>The "master" computer merges the data from the separate computers and does the top K selection. This is actually two problems, but can be combined into one.</p> <p>The master creates two priority queues: one for the merge, and one for the top K selection. The first is of size N, where N is the number of computers it's merging data from. The second is of size K: the number of items you want to select. I use a min heap for this, as it's easy and reasonably fast.</p> <p>To set up the merge queue, initialize the queue and get the first item from each of the "worker" computers. In the pseudo-code below, "get lowest item from merge queue" means getting the root item from the merge queue and then getting the next item from whichever working computer presented that item. So if the queue contains <code>[1, 2, 3]</code>, and the items came from computers B, C, A (in that order), then taking the lowest item would mean getting the next item from computer B and adding it to the priority queue.</p> <p>The master then does the following:</p> <pre><code>working = get lowest item from merge queue while (items left to merge) { temp = get lowest item from merge queue while (temp.url == working.url) { working.count += temp.count temp = get lowest item from merge queue } // Now have merged counts for one url. if (topK.Count &lt; desired_count) { // topK queue doesn't have enough items yet. // so add this one. topK.Add(working); } else if (topK.Peek().count &lt; working.count) { // the count for this url is larger // than the smallest item on the heap // replace smallest on the heap with this one topK.RemoveRoot() topK.Add(working) } working = temp; } // Here you need to check the last item: if (topK.Peek().count &lt; working.count) { // the count for this url is larger // than the smallest item on the heap // replace smallest on the heap with this one topK.RemoveRoot() topK.Add(working) } </code></pre> <p>At this point, the <code>topK</code> queue has the K items with the highest counts.</p> <p>So each computer has to do a merge sort, which is O(n log n), where <code>n</code> is the number of items in that computer's log. The merge on the master is O(n), where <code>n</code> is the sum of all the items from the individual computers. Picking the top k items is O(n log k), where <code>n</code> is the number of <em>unique</em> urls.</p> <p>The sorts are done in parallel, of course, with each computer preparing its own sorted list. But the "merge" part of the sort is done at the same time the master computer is merging, so there is some coordination, and all machines are involved at that stage.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload