<p><strong>Can the MapReduce concept really be applied to weblog analysis?</strong></p>
<p>Yes.</p>
<p>You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile; for Apache logfiles I'd go for a larger number), feed them to some mappers that extract something specific (like browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each value appeared (simplified):</p>
<pre><code>192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1
</code></pre>
<p>You can extract the browsers, ignoring the version, using a map operation to get this list:</p>
<pre><code>FireFox
FireFox
FireFox
IE
</code></pre>
<p>Then reduce to get this: FireFox,3 and IE,1</p>
<p><strong>Is MapReduce the most clever way of doing it?</strong></p>
<p>It's clever, but you would need to be very big to gain any benefit: think splitting PETABYTES of logs.</p>
<p>For this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push the results to another queue. Jobs that are not completed within some timeframe are made available again for other clients to process. These clients would be small programs that each do something specific.</p>
<p>You could start with 1 client and expand to 1000. You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual-core PCs, and so on.</p>
<p>With pull, you could have 100 or 10 clients working, multicore machines could run multiple clients, and whatever a client finishes becomes available for the next step. You don't need to do any hashing or up-front assignment of the work to be done. It's 100% dynamic.</p>
<p><a href="http://img355.imageshack.us/img355/7355/mqlogs.png">http://img355.imageshack.us/img355/7355/mqlogs.png</a></p>
<p><strong>How would you split the web log files between the various computing instances?</strong></p>
<p>By number of elements, or by number of lines if it's a text-based logfile.</p>
<p>To try out MapReduce, I suggest playing with Hadoop.</p>
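<p>As a rough illustration of the chunk, map, and reduce steps described in this answer, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (<code>chunks</code>, <code>map_browsers</code>, <code>reduce_counts</code>) are hypothetical, the chunk size of 2 is chosen only to make the tiny sample split, and the log format is assumed to be the simplified <code>ip,browser version,username</code> layout from the example. In a real setup each chunk would go to a separate mapper rather than being processed in a loop.</p>

```python
from collections import Counter
from itertools import islice


def chunks(lines, size):
    """Split an iterable of log lines into lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def map_browsers(chunk):
    """Map step: emit the browser name of each line, ignoring the version."""
    names = []
    for line in chunk:
        # Assumed simplified format: ip,browser version,username
        _ip, browser, _user = line.strip().split(",")
        # "FireFox x.x" -> "FireFox", "IE 7.0" -> "IE"
        names.append(browser.rsplit(" ", 1)[0])
    return names


def reduce_counts(mapped_chunks):
    """Reduce step: count how often each browser name was emitted."""
    total = Counter()
    for names in mapped_chunks:
        total.update(names)
    return total


# The sample log lines from the answer.
LOG = [
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.2,FireFox y.y,username1",
    "192.168.1.7,IE 7.0,username1",
]

# Chunk size 2 is only for illustration; real chunks would be far larger.
counts = reduce_counts(map_browsers(c) for c in chunks(LOG, 2))
print(dict(counts))  # {'FireFox': 3, 'IE': 1}
```

<p>Because the reduce step only adds up counts, the result does not depend on how the lines were split into chunks, which is what makes splitting by line count a safe choice here.</p>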