<p>You can write Hadoop MapReduce transformations either as "streaming" or as a "custom jar". If you use streaming, you can write your code in any language you like, including Python or C++. Your code just reads from STDIN and writes to STDOUT. However, in Hadoop versions before 0.21, Hadoop Streaming could only pass text, not binary data, to your processes. Your files therefore needed to be text files, unless you did some encoding transformations yourself. A <a href="https://issues.apache.org/jira/browse/HADOOP-1722" rel="noreferrer">patch</a> has since been added that allows the use of binary formats with Hadoop Streaming.</p>
<p>If you use a "custom jar" (i.e. you wrote your MapReduce code in Java or Scala using the Hadoop libraries), then you have access to functions that let your job read and write binary data (serialize in binary) and save the results to disk. Later runs that read those results can then be much faster, depending on how much smaller your binary format is than your text format.</p>
<p>So if your Hadoop job is going to be I/O bound, the "custom jar" approach will be faster, since Java is faster (as previous posters have shown) and reading from disk will also be faster.</p>
<p>But you have to ask yourself how valuable your time is. I find myself far more productive with Python, and writing map-reduce code that reads STDIN and writes to STDOUT is really straightforward. So I personally would recommend going the Python route, even if you have to figure the binary encoding stuff out yourself. Since Hadoop 0.21 handles non-UTF-8 byte arrays, and since there is a binary (byte array) alternative to use for Python (<a href="http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/" rel="noreferrer">http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/</a>), which shows the Python code being only about 25% slower than the "custom jar" Java code, I would definitely go the Python route.</p>
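<p>To make the "reads STDIN and writes to STDOUT" point concrete, here is a minimal sketch of a text-mode streaming word count in Python. The function names, the <code>map</code>/<code>reduce</code> command-line argument, and the tab-separated record layout are assumptions chosen for illustration, not something Hadoop requires by those names. It also assumes the reducer receives its input sorted by key.</p>

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input records are already sorted by key, so that all
    records for one word arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(mode):
    # Hypothetical convention: the same script acts as mapper or reducer
    # depending on its first argument.
    step = map_lines if mode == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    main(sys.argv[1])
```

<p>Saved under a hypothetical name such as <code>wc.py</code>, the pipeline can be tried locally without a cluster: <code>cat input.txt | python wc.py map | sort | python wc.py reduce</code>. The <code>sort</code> step stands in for the shuffle that happens between the map and reduce phases.</p>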
 
