<p>Okay, I've found a combination that works, but it's weird.</p>
<ol>
<li><p>Prepare a valid typedbytes file in your local filesystem, following the <a href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/typedbytes/package-summary.html" rel="nofollow">documentation</a> or by imitating <a href="https://github.com/klbostee/typedbytes" rel="nofollow">typedbytes.py</a>.</p></li>
<li><p>Use</p>
<pre><code>hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile &lt; local/typedbytes.tb
</code></pre>
<p>to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.</p></li>
<li><p>Use</p>
<pre><code>hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
</code></pre>
<p>to run a map-reduce job in which the mapper gets its input from the SequenceFile. Note that <code>-io typedbytes</code> or <code>-D stream.map.input=typedbytes</code> should <em>not</em> be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries, not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this:</p>
<ol>
<li>32-bit signed integer, representing the length (note: no type character)</li>
<li>block of raw binary with that length: this is the key</li>
<li>'\t' (tab character... why?)</li>
<li>32-bit signed integer, representing the length</li>
<li>block of raw binary with that length: this is the value</li>
<li>'\n' (newline character... ?)</li>
</ol></li>
<li><p>If you additionally want to send raw data from the mapper to the reducer, add</p>
<pre><code>-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
</code></pre>
<p>to your Hadoop command line and format the mapper's output and the reducer's expected input as valid typedbytes. Keys and values again alternate, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by key.</p></li>
</ol>
<p>The only documentation on <code>stream.map.output</code> and <code>stream.reduce.input</code> that I could find was in the <a href="https://issues.apache.org/jira/browse/HADOOP-1722" rel="nofollow">HADOOP-1722</a> exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)</p>
<p>This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the <code>-inputformat</code>. It does, however, provide splitting at binary record boundaries rather than at '\n', which is the really important thing, and strong typing between the mapper and the reducer.</p>
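<p>For step 1, here is a minimal hand-rolled writer sketched in Python, based on the typedbytes wire format (a one-byte type code followed by big-endian data). The file name and the sample pairs are placeholders; for real use, the typedbytes.py library linked above is the safer choice.</p>

```python
import struct

# Minimal typedbytes writer (a sketch, not the typedbytes.py library).
# Each value is a one-byte type code followed by big-endian data.
# Type codes used here: 3 = 32-bit signed int, 7 = UTF-8 string.

def write_string(out, s):
    data = s.encode("utf-8")
    out.write(struct.pack(">bi", 7, len(data)))  # type code, then length
    out.write(data)

def write_int(out, n):
    out.write(struct.pack(">bi", 3, n))          # type code, then int32

# Keys and values simply alternate in the file; "typedbytes.tb" is a
# placeholder name for the local file that loadtb would read.
with open("typedbytes.tb", "wb") as out:
    for key, value in [("apple", 3), ("banana", 7)]:
        write_string(out, key)
        write_int(out, value)
```

<p>The resulting file is what the <code>loadtb</code> step expects on its standard input.</p>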
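<p>The six-part "rawdata" framing from step 3 can be parsed with a short generator. This is a sketch under the framing described above (length-prefixed key, '\t', length-prefixed value, '\n'); the function name is illustrative, and the lengths are assumed to be big-endian, as elsewhere in Hadoop's serialization.</p>

```python
import struct

def read_rawdata_pairs(stream):
    """Yield (key, value) byte pairs from the rawdata framing:
    <int32 length><key bytes>, tab, <int32 length><value bytes>, newline."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return                        # end of input
        (klen,) = struct.unpack(">i", header)
        key = stream.read(klen)
        stream.read(1)                    # skip the '\t' separator
        (vlen,) = struct.unpack(">i", stream.read(4))
        value = stream.read(vlen)
        stream.read(1)                    # skip the trailing '\n'
        yield key, value

# In a streaming mapper the stream would be sys.stdin.buffer:
#   for key, value in read_rawdata_pairs(sys.stdin.buffer):
#       ...
```

<p>Reading by lengths rather than splitting on '\t'/'\n' matters because the raw key and value bytes may themselves contain those characters.</p>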