# How to use "typedbytes" or "rawbytes" in Hadoop Streaming?
I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allows one to analyze binary data in a language other than Java. (Without this, Streaming interprets certain characters, usually \t and \n, as delimiters and complains about non-UTF-8 characters. Converting all my binary data to Base64 would slow down the workflow, defeating the purpose.)

These binary modes were added by [HADOOP-1722](https://issues.apache.org/jira/browse/HADOOP-1722). On the command line that invokes a Hadoop Streaming job, "-io rawbytes" lets you define your data as a 32-bit integer size followed by raw data of that size, and "-io typedbytes" lets you define your data as a single zero byte (the type code for raw bytes), followed by a 32-bit integer size, followed by raw data of that size. I have created files in these formats (with one record and with many; a sketch of how I write them is below) and verified that they are correct by checking them against the output of [typedbytes.py](https://github.com/klbostee/typedbytes). I've also tried all conceivable variations (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop [0.20 from CDH4](https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode), which has the classes that implement the typedbytes handling, and it does enter those classes when the "-io" switch is set.
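For concreteness, this is roughly how I produce the test files (a simplified sketch, not my exact script; the file names are just examples):

```python
import struct

payload = b"hey"

# rawbytes record: 4-byte big-endian length, then the raw payload.
with open("data.rawbytes", "wb") as f:
    f.write(struct.pack(">i", len(payload)))
    f.write(payload)

# typedbytes record: one type-code byte (0 = raw bytes), a 4-byte big-endian
# length, then the raw payload -- i.e. "\x00\x00\x00\x00\x03hey" for "hey".
with open("data.typedbytes", "wb") as f:
    f.write(struct.pack(">b", 0))
    f.write(struct.pack(">i", len(payload)))
    f.write(payload)
```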
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to allocate a byte array of the length I specify (e.g. 3 bytes). It must be reading the length incorrectly and trying to allocate a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? I'm not sure), which writes it to standard error so that I can see it (the test mapper is sketched at the end of this post). There are always too many bytes at *the beginning* of the record: for instance, if the file is "\x00\x00\x00\x00\x03hey", the mapper sees "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (the extra bytes are reproducible, though I can't see a pattern in them).

From page 5 of [this talk](http://static.last.fm/johan/huguk-20090414/klaas-hadoop-1722.pdf), I learned that there are "loadtb" and "dumptb" subcommands of Streaming, which copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained within it, in exactly the same way.

Moreover, I can find no documentation of this feature. On Feb 7 it was briefly mentioned in the [streaming.html page on Apache](http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html#Specifying+the+Communication+Format) (I e-mailed the link to myself), but that r0.21.0 page has since been taken down, and [the equivalent page for r1.1.1](http://hadoop.apache.org/docs/r1.1.1/streaming.html) makes no mention of rawbytes or typedbytes.

**So my question is:** what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever gotten it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in Hadoop Streaming, which ought to be a fairly broad group.

P.S. I noticed that [Dumbo](https://github.com/klbostee/dumbo), [Hadoopy](http://www.hadoopy.com/en/latest/), and [rmr](https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr) all use this feature, but there ought to be a way to use it directly, without going through a Python-based or R-based framework.
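P.P.S. For reference, the test mapper is essentially the following (a simplified sketch of what I'm running, not the exact script): it hex-dumps whatever Streaming feeds it on standard input to standard error, which is how I observed the extra leading bytes quoted above.

```python
#!/usr/bin/env python
import sys

# Simplified test mapper: hex-dump everything on stdin to stderr so the raw
# bytes show up in the task logs. It deliberately emits nothing on stdout.
data = sys.stdin.buffer.read()
sys.stderr.write(" ".join("%02x" % b for b in data) + "\n")
```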
 
