Note that there are some explanatory texts on larger screens.

plurals
  1. POBroken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
    primarykey
    data
    text
    <p>Everything works fine locally when I do as follows:</p> <pre><code>cat input | python mapper.py | sort | python reducer.py </code></pre> <p>However, when I run the streaming MapReduce job on AWS Elastic Mapreduce, the job does not complete successfully. The <code>mapper.py</code> runs part way through (I know this because of writing to <code>stderr</code> along the way). The mapper is interrupted by a "Broken Pipe" error, which I'm able to retrieve from the syslog of the task attempt after it fails:</p> <pre><code>java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72) at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:124) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished 2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Bad file descriptor at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished 2012-03-26 07:19:05,405 INFO org.apache.hadoop.streaming.PipeMapRed (Thread-13): MRErrorThread done 2012-03-26 07:19:05,408 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-03-26 07:19:05,519 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds. 2012-03-26 07:19:05,520 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation 2012-03-26 07:19:05,522 WARN org.apache.hadoop.mapred.Child (main): Error running child java.io.IOException: log:null R/W/S=7018/3/0 in:NA [rec/s] out:NA [rec/s] minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null HOST=null USER=hadoop HADOOP_USER=null last Hadoop input: |null| last tool output: |text/html 1| Date: Mon Mar 26 07:19:05 UTC 2012 java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72) at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:125) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,525 INFO org.apache.hadoop.mapred.Task (main): Runnning cleanup for the task 2012-03-26 07:19:05,526 INFO org.apache.hadoop.mapred.DirectFileOutputCommitter (main): Nothing to clean up on abort since there are no temporary files written </code></pre> <p>Here is <code>mapper.py</code>. Note that I write to stderr to provide myself with debugging info:</p> <pre><code>#!/usr/bin/env python import sys from warc import ARCFile def main(): warc_file = ARCFile(fileobj=sys.stdin) for web_page in warc_file: print &gt;&gt; sys.stderr, '%s\t%s' % (web_page.header.content_type, 1) #For debugging print '%s\t%s' % (web_page.header.content_type, 1) print &gt;&gt; sys.stderr, 'done' #For debugging if __name__ == "__main__": main() </code></pre> <p>Here is what I get in the stderr for the task attempt when the mapper.py is run:</p> <pre><code>text/html 1 text/html 1 text/html 1 </code></pre> <p>Basically, the loop runs through 3 times and then stops abruptly without python throwing any error. (Note: it <em>should</em> be outputting thousands of lines). Even an uncaught exception should appear in stderr. </p> <p>Because the MapReduce runs completely fine on my local computer, my guess is that this is a problem with how Hadoop is dealing with the output I'm printing from mapper.py. But I'm clueless as to what the problem could be.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload