StackOverflow2013


plurals

PO
primarykey
Id
939237
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2009-06-02T12:26:32.497
FavoriteCount
0
LastActivityDate
2009-06-02T13:21:02.857
LastEditDate
2009-06-02T13:21:02.857
LastEditorUserId
25544
OwnerUserId
25544
ParentId
939100
PostTypeId
2
Score
16
ViewCount
0
LastEditorDisplayName
text
Body
Can the MapReduce concept really be applied to weblogs analysis?

Yes. You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile; for Apache logfiles I'd go for a larger number), feed them to some mappers that extract something specific (such as browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified):

<pre><code>192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1
</code></pre>

You can extract browsers, ignoring version, using a map operation to get this list:

<pre><code>FireFox
FireFox
FireFox
IE
</code></pre>

Then reduce to get this:

FireFox,3
IE,1

Is MapReduce the most clever way of doing it?

It's clever, but you would need to be very big in order to gain any benefit... Splitting PETABYTES of logs.

To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue. Jobs that are not executed within some timeframe are made available for other clients to process. These clients would be small programs that do something specific.

You could start with 1 client and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual-core PCs...

With pull: you could have 100 or 10 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

<a href="http://img355.imageshack.us/img355/7355/mqlogs.png">http://img355.imageshack.us/img355/7355/mqlogs.png</a>

How would you split the web log files between the various computing instances?

By number of elements, or by lines if it's a text-based logfile.

In order to test MapReduce, I'd like to suggest that you play with Hadoop.
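
The chunk, map, and reduce steps described in the answer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Hadoop code: the function names (`chunks`, `map_browsers`, `reduce_counts`), the chunk size of 2, and the comma-separated `ip,browser version,username` layout are taken from the simplified sample above or invented for the example, and everything runs in a single process.

```python
from collections import Counter
from itertools import islice

# The simplified sample log lines from the answer (ip,browser version,username).
SAMPLE_LOG = [
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.2,FireFox y.y,username1",
    "192.168.1.7,IE 7.0,username1",
]


def chunks(lines, size):
    """Split an iterable of log lines into lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def map_browsers(chunk):
    """Map step: extract the browser name, ignoring the version, from each line."""
    names = []
    for line in chunk:
        _ip, browser, _user = line.split(",")
        # "FireFox x.x" -> "FireFox": drop everything after the last space.
        names.append(browser.rsplit(" ", 1)[0])
    return names


def reduce_counts(mapped_chunks):
    """Reduce step: count how many times each browser name appeared."""
    totals = Counter()
    for names in mapped_chunks:
        totals.update(names)
    return totals


# Hypothetical chunk size of 2; a real logfile would use far larger chunks.
mapped = [map_browsers(chunk) for chunk in chunks(SAMPLE_LOG, 2)]
counts = reduce_counts(mapped)

for browser, n in counts.items():
    print(f"{browser},{n}")
```

Because each call to `map_browsers` depends only on its own chunk, the map step is the part that could be handed to separate workers or machines; the reduce step then only has to merge the per-chunk results.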
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO How is MapReduce a good method to analyse http server logs?
 singulars
 PostTypePostTypeId
 PT Question
PostTypePostTypeId
1. PT Answer
UserLastEditorUserId
1. US Osama Al-Maadeed
UserOwnerUserId
1. US Osama Al-Maadeed
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. PO How is MapReduce a good method to analyse http server logs?
 singulars
 PostTypePostTypeId
 PT Question
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT AcceptedByOriginator
CommentsPostId
1. CO First of all, sorry for the delay. Thanks a lot for your very high-quality answer. It helps a lot!
 singulars
 PostPostId
 PO
 UserUserId
 US Nicolas
2. CO As an alternative to splitting the log files, you could parallelize your "log analysis" script across n cores. If you then ran this script on a virtualized cluster (of, say, 96 cores), your code would run without any changes. You need to identify and isolate the "smallest" unit of work that is side-effect free and deals with immutable data. This may require you to redesign your code. Besides, Hadoop is comparatively harder to set up (and where I live, expertise is harder to find).
 singulars
 PostPostId
 PO
 UserUserId
 US Imran.Fanaswala
