
Creating a Sequence File outside Java Hadoop Framework
I have been experimenting with generating sequence files for Hadoop outside the Java framework, in Python specifically. There is a python-hadoop module which provides a mostly similar framework for doing this. I have successfully created sequence files with it; the generated files can be copied to HDFS and used as input for Hadoop jobs. LZO and Snappy are fully configured on my local Hadoop installation, and I can generate properly compressed sequence files with those algorithms when I do so via org.apache.hadoop.io.SequenceFile.createWriter in Java.

However, it seems that valid sequence files are not generated when I try LZO or Snappy as the (block) compression scheme in python-hadoop. I'm using a scheme similar to this code:

https://github.com/fenriswolf/python-hadoop/blob/master/python-hadoop/hadoop/io/compress/LzoCodec.py

(where I replace lzo with snappy for Snappy compression). Within the python-hadoop framework those files can be written and read without any errors, but when I feed them to Hadoop as input I get EOF errors:

```
Exception in thread "main" java.io.EOFException
    at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:126)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:98)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:64)
    at java.io.DataInputStream.readByte(DataInputStream.java:265)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1911)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1934)
    at SequenceFileReadDemo.main(SequenceFileReadDemo.java:34)
```

I have consistently seen this particular message only when I use LZO or Snappy.

My suspicion is that Hadoop's LzoCodec and SnappyCodec don't write or read data in the same format as Python's lzo and snappy implementations, but I'm not sure what the expected format should be.

Is there any reason why sequence files with those compression schemes are not generated properly outside the Java Hadoop framework? Again, the whole thing works fine as long as I use Gzip, BZip2, or Default.
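For reference, a minimal sketch of the Java-side writer path described above, which does produce valid block-compressed files; the key/value types, output path, and record contents are illustrative assumptions, not taken from the original post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]); // output path, e.g. on HDFS

        // Block-compressed writer using the stock SnappyCodec; swap in
        // com.hadoop.compression.lzo.LzoCodec for the LZO case.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class,
                CompressionType.BLOCK, new SnappyCodec());
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            for (int i = 0; i < 100; i++) {
                key.set(i);
                value.set("record-" + i); // placeholder payload
                writer.append(key, value);
            }
        } finally {
            writer.close();
        }
    }
}
```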
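The reader named in the stack trace, SequenceFileReadDemo, is essentially the standard SequenceFile.Reader loop; here is a sketch of the equivalent, assuming nothing beyond the stock Hadoop API:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate key/value types recorded in the file header.
            Writable key = (Writable)
                    ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                    ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // next() is where the EOFException surfaces for the Python-written
            // LZO/Snappy files: the block decompressor fails while reading a
            // length prefix from the compressed block.
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s%n", key, value);
            }
        } finally {
            reader.close();
        }
    }
}
```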
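One quick diagnostic (an editor's suggestion, not part of the original post): the SequenceFile header records the codec class name and compression type, so it is worth confirming that the Python writer declares exactly what the Java reader will instantiate. A sketch using only the stock Reader accessors:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class HeaderCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
            // These fields come straight from the file header written by the
            // producer, so a mismatch here points at the Python writer rather
            // than at the compressed block payload.
            System.out.println("key class:        " + reader.getKeyClassName());
            System.out.println("value class:      " + reader.getValueClassName());
            System.out.println("block compressed: " + reader.isBlockCompressed());
            System.out.println("codec:            "
                    + (reader.getCompressionCodec() == null ? "none"
                       : reader.getCompressionCodec().getClass().getName()));
        } finally {
            reader.close();
        }
    }
}
```

If the header checks out, the remaining difference would be in the compressed block payload itself, which matches the suspicion above that the Python lzo/snappy output is not framed the way Hadoop's block codecs expect.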