StackOverflow2013


plurals

PO
primarykey
Id
3839102
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2010-10-01T12:25:24.123
FavoriteCount
0
LastActivityDate
2010-10-01T12:25:24.123
LastEditDate
LastEditorUserId
0
OwnerUserId
114196
ParentId
3836553
PostTypeId
2
Score
1
ViewCount
0
LastEditorDisplayName
text
Body
Having Hadoop process a 100 GiB Apache log file "line by line" is essentially the same as what you want: a large body of text split into pieces. The normal way to do that in Hadoop (since you tagged the question with it) is to use <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup" rel="nofollow">TextInputFormat</a>, which uses <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup" rel="nofollow">LineRecordReader</a>, which in turn uses <a href="http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?view=markup" rel="nofollow">LineReader</a> to split the text file on the "end-of-line" separator. Your case differs in only one respect: you split on a different separator. Sorting the resulting values in Hadoop is done with what is called "Secondary Sort" (<a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java?view=markup" rel="nofollow">see the Hadoop example</a> and <a href="http://books.google.nl/books?id=bKPEwR-Pt6EC&pg=PA227&lpg=PA227&dq=hadoop+definitive+guide+secondary+sort&source=bl&hl=en#v=onepage&q&f=false" rel="nofollow">the explanation in Tom's book</a>).
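To make the "split on something different" idea concrete, here is a minimal plain-Java sketch with no Hadoop dependency. `DelimitedRecordReader` and `readAll` are hypothetical names invented for this illustration, not Hadoop classes. The sketch also assumes a single input stream, so it ignores the question of records that straddle input split boundaries, which a real record reader would have to handle.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Hadoop class): streams records split on a custom separator. */
class DelimitedRecordReader {
    private final Reader in;
    private final String separator;
    private boolean eof = false;

    DelimitedRecordReader(Reader in, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        this.in = in;
        this.separator = separator;
    }

    /** Returns the next record without its separator, or null when the input is exhausted. */
    String next() throws IOException {
        if (eof) {
            return null;
        }
        StringBuilder record = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            record.append((char) c);
            if (endsWithSeparator(record)) {
                // Drop the separator and hand back only the record text.
                record.setLength(record.length() - separator.length());
                return record.toString();
            }
        }
        eof = true;
        // A trailing record without a closing separator is still returned.
        return record.length() > 0 ? record.toString() : null;
    }

    private boolean endsWithSeparator(StringBuilder sb) {
        int offset = sb.length() - separator.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < separator.length(); i++) {
            if (sb.charAt(offset + i) != separator.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Convenience for small inputs and tests: collects every record into a list. */
    static List<String> readAll(Reader in, String separator) {
        try {
            DelimitedRecordReader reader = new DelimitedRecordReader(in, separator);
            List<String> records = new ArrayList<>();
            for (String r = reader.next(); r != null; r = reader.next()) {
                records.add(r);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("alpha##beta##gamma"), "##"));
    }
}
```

Only one record is held in memory at a time when `next()` is called in a loop; wrapping the source in a `BufferedReader` would avoid unbuffered single-character reads on a large file.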
So what I would recommend doing is <ol> <li>Make your own variation on <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup" rel="nofollow">TextInputFormat</a>/<a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup" rel="nofollow">LineRecordReader</a>/<a href="http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?view=markup" rel="nofollow">LineReader</a> that reads and extracts the individual parts of your string based on your separator. </li> <li>Create a map that rewrites the information so that Secondary Sort can be applied.</li> <li>Create the correct partition, group and key comparator classes/methods to do the sorting.</li> <li>Create a reduce in which you receive the sorted information, which you can then process further.</li> </ol> HTH
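The partition, group and sort roles from steps 2 to 4 can be sketched in plain Java, again without Hadoop on the classpath. Everything here (`SecondarySortSketch`, `CompositeKey`, `partition`, `sortAndGroup`) is a hypothetical name made up for the illustration, and the in-memory sort merely stands in for the shuffle that Hadoop performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Hadoop-free illustration of the roles involved in a secondary sort. */
class SecondarySortSketch {

    /** Composite key emitted by the map step: natural key plus the field to sort on. */
    static final class CompositeKey {
        final String naturalKey;
        final int order;

        CompositeKey(String naturalKey, int order) {
            this.naturalKey = naturalKey;
            this.order = order;
        }
    }

    /** Key comparator role: natural key first, then the secondary field. */
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.order);

    /** Partitioner role: only the natural key decides which reducer gets the record. */
    static int partition(CompositeKey key, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Stand-in for shuffle plus grouping: sort by the full composite key, then
     * group on the natural key only, so each group sees its values in order.
     */
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.naturalKey, unused -> new ArrayList<>()).add(k.order);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
                new CompositeKey("b", 2),
                new CompositeKey("a", 3),
                new CompositeKey("b", 1),
                new CompositeKey("a", 1));
        // Each entry plays the part of one reduce call with already-sorted values.
        System.out.println(sortAndGroup(keys));
    }
}
```

The point of the design is that partitioning and grouping look only at the natural key while sorting looks at the whole composite key, so the reduce step never has to buffer and sort a group's values itself.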
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO Read a long string into memory
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. US Niels Basjes
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
