<p>If you want Hadoop to process a 100 GiB Apache logfile "line by line", you are essentially doing the same thing as what you want here: a large body of text split into pieces.</p>
<p>The normal way of doing that in Hadoop (as you tagged the question with this) is to use the <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup" rel="nofollow">TextInputFormat</a>, which uses <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup" rel="nofollow">LineRecordReader</a>, which in turn uses <a href="http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?view=markup" rel="nofollow">LineReader</a> to split the text file on the "end-of-line" separator. What you want is essentially the same with one difference: split on a different separator.</p>
<p>Sorting the resulting values (in Hadoop) is essentially done by employing what is called "Secondary Sort" (<a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java?view=markup" rel="nofollow">see the Hadoop example</a> and <a href="http://books.google.nl/books?id=bKPEwR-Pt6EC&amp;pg=PA227&amp;lpg=PA227&amp;dq=hadoop+definitive+guide+secondary+sort&amp;source=bl&amp;hl=en#v=onepage&amp;q&amp;f=false" rel="nofollow">the explanation in Tom's book</a>).</p>
<p>So what I would recommend doing is:</p>
<ol>
<li>Make your own variation on <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup" rel="nofollow">TextInputFormat</a>/<a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup" rel="nofollow">LineRecordReader</a>/<a href="http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?view=markup" rel="nofollow">LineReader</a> that reads and extracts the individual parts of your string based on your separator.</li>
<li>Create a map that rewrites the information into a form suitable for Secondary Sort.</li>
<li>Create the correct partitioner, grouping comparator and key comparator classes/methods to do the sorting.</li>
<li>Create a reduce where you receive the sorted information, which you can then process further.</li>
</ol>
<p>HTH</p>
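<p>To make the four steps above more concrete, here is a rough plain-Java sketch of the same flow outside Hadoop. It is only an illustration under assumptions: the class <code>SeparatorSortSketch</code>, the <code>CompositeKey</code> type, the <code>|</code> separator and the <code>group:number</code> record layout are all made up for this example and are not part of the Hadoop API. In a real job, step 1 would live in your own InputFormat/RecordReader, and step 3 would be supplied as partitioner and comparator classes.</p>

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: none of these names are Hadoop classes.
class SeparatorSortSketch {

    // Hypothetical composite key: the natural key plus the field to sort on.
    static final class CompositeKey {
        final String group;
        final int order;

        CompositeKey(String group, int order) {
            this.group = group;
            this.order = order;
        }
    }

    // Step 1: stream the input and cut a record at every separator character,
    // much like a line reader cuts at end-of-line.
    static List<String> readRecords(Reader in, char separator) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == separator) {
                    records.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            records.add(current.toString()); // trailing record with no separator
        }
        return records;
    }

    // Step 2: the "map" turns a record such as "b:2" into a composite key.
    static CompositeKey toKey(String record) {
        String[] parts = record.split(":", 2);
        return new CompositeKey(parts[0], Integer.parseInt(parts[1]));
    }

    // Step 3: the key comparator orders by group first, then by the secondary field.
    static final Comparator<CompositeKey> KEY_ORDER =
            Comparator.comparing((CompositeKey k) -> k.group)
                      .thenComparingInt(k -> k.order);

    // Step 4: the "reduce" side sees each group's values already sorted.
    static Map<String, List<Integer>> sortAndGroup(String input, char separator) {
        List<CompositeKey> keys = new ArrayList<>();
        for (String rec : readRecords(new StringReader(input), separator)) {
            if (!rec.isEmpty()) {          // skip empty records between separators
                keys.add(toKey(rec));
            }
        }
        keys.sort(KEY_ORDER);
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (CompositeKey k : keys) {
            grouped.computeIfAbsent(k.group, g -> new ArrayList<>()).add(k.order);
        }
        return grouped;
    }
}
```

<p>For example, <code>SeparatorSortSketch.sortAndGroup("b:3|a:2|b:1|a:1", '|')</code> groups the values by their key and hands each group back in ascending order. The important design point carried over from Secondary Sort is that the sort field is moved into the key, so the framework's sort does the ordering rather than the reducer buffering values in memory.</p>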