StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POBroken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
primarykey
Id
9881269
data
AcceptedAnswerId
9920198
AnswerCount
3
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2012-03-26T23:15:07.553
FavoriteCount
4
LastActivityDate
2014-08-14T16:57:26.490
LastEditDate
LastEditorUserId
0
OwnerUserId
378622
ParentId
0
PostTypeId
1
Score
7
ViewCount
7822
LastEditorDisplayName
text
Body
Everything works fine locally when I do as follows: <pre><code>cat input | python mapper.py | sort | python reducer.py </code></pre> However, when I run the streaming MapReduce job on AWS Elastic Mapreduce, the job does not complete successfully. The <code>mapper.py</code> runs part way through (I know this because of writing to <code>stderr</code> along the way). The mapper is interrupted by a "Broken Pipe" error, which I'm able to retrieve from the syslog of the task attempt after it fails: <pre><code>java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72) at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:124) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished 2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Bad file descriptor at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished 2012-03-26 07:19:05,405 INFO org.apache.hadoop.streaming.PipeMapRed (Thread-13): MRErrorThread done 2012-03-26 07:19:05,408 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-03-26 07:19:05,519 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds. 2012-03-26 07:19:05,520 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation 2012-03-26 07:19:05,522 WARN org.apache.hadoop.mapred.Child (main): Error running child java.io.IOException: log:null R/W/S=7018/3/0 in:NA [rec/s] out:NA [rec/s] minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null HOST=null USER=hadoop HADOOP_USER=null last Hadoop input: |null| last tool output: |text/html 1| Date: Mon Mar 26 07:19:05 UTC 2012 java.io.IOException: Broken pipe at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72) at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:125) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249) 2012-03-26 07:19:05,525 INFO org.apache.hadoop.mapred.Task (main): Runnning cleanup for the task 2012-03-26 07:19:05,526 INFO org.apache.hadoop.mapred.DirectFileOutputCommitter (main): Nothing to clean up on abort since there are no temporary files written </code></pre> Here is <code>mapper.py</code>. Note that I write to stderr to provide myself with debugging info: <pre><code>#!/usr/bin/env python import sys from warc import ARCFile def main(): warc_file = ARCFile(fileobj=sys.stdin) for web_page in warc_file: print >> sys.stderr, '%s\t%s' % (web_page.header.content_type, 1) #For debugging print '%s\t%s' % (web_page.header.content_type, 1) print >> sys.stderr, 'done' #For debugging if __name__ == "__main__": main() </code></pre> Here is what I get in the stderr for the task attempt when the mapper.py is run: <pre><code>text/html 1 text/html 1 text/html 1 </code></pre> Basically, the loop runs through 3 times and then stops abruptly without python throwing any error. (Note: it should be outputting thousands of lines). Even an uncaught exception should appear in stderr. Because the MapReduce runs completely fine on my local computer, my guess is that this is a problem with how Hadoop is dealing with the output I'm printing from mapper.py. But I'm clueless as to what the problem could be.
Tags
<python><hadoop><amazon-web-services><mapreduce><elastic-map-reduce>
Title
Broken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USbgcode
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POBroken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POBroken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POBroken Pipe Error causes streaming Elastic MapReduce job on AWS to fail
 UserUserId
 USMatt Andersen
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. This table or related slice is empty.

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.