You do not need to have a reducer. You can set the number of reducers to 0 in the job configuration stage, e.g.

```java
job.setNumReduceTasks(0);
```

Also, to ensure that each mapper processes one complete input file, you can tell Hadoop that the input files are not splittable. `FileInputFormat` has a method

```java
protected boolean isSplitable(JobContext context, Path filename)
```

that can be used to mark a file as not splittable, which means it will be processed by a single mapper. See [here](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#isSplitable%28org.apache.hadoop.mapreduce.JobContext,%20org.apache.hadoop.fs.Path%29) for the documentation. I just re-read your question and realised that your input is probably a file with a list of filenames in it, so you most likely want it to be split, or it will only be processed by one mapper.

What I would do in your situation is use an input which is a list of file names in S3. The mapper input is then a file name, which it downloads and runs your exe against. The output of this exe run is then uploaded to S3, and the mapper moves on to the next file. The mapper then does not need to output anything, though it might be a good idea to output the name of each file processed so you can check against the input afterwards. Using the method I just outlined, you would not need to use the `isSplitable` method.
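A rough sketch of such a mapper is below (untested; the exe path, bucket names and the `.out` naming convention are placeholders you would replace with your own setup):

```java
// Sketch only: assumes each input line is an S3 path such as
// "s3n://my-bucket/input/file-0001.dat", and that "/usr/local/bin/my-exe"
// (a made-up path) reads a local file and writes <file>.out next to it.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExeRunnerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String s3Input = value.toString().trim();   // one S3 path per input line
    if (s3Input.isEmpty()) {
      return;
    }

    // Download the file from S3 to the task's local working directory.
    Path remoteIn = new Path(s3Input);
    Path localIn = new Path(remoteIn.getName());
    FileSystem s3 = FileSystem.get(URI.create(s3Input), conf);
    s3.copyToLocalFile(remoteIn, localIn);

    // Run the external program against the local copy.
    Process p = new ProcessBuilder("/usr/local/bin/my-exe", localIn.toString()).start();
    if (p.waitFor() != 0) {
      throw new IOException("my-exe failed for " + s3Input);
    }

    // Upload the result back to S3 (output location is a placeholder).
    Path localOut = new Path(localIn.getName() + ".out");
    Path remoteOut = new Path("s3n://my-bucket/output/" + localOut.getName());
    s3.copyFromLocalFile(localOut, remoteOut);

    // Optional: emit the processed file name so the output can be
    // checked against the input list afterwards.
    context.write(new Text(s3Input), NullWritable.get());
  }
}
```

You would feed this a plain text file of S3 paths, one per line, and pair it with `job.setNumReduceTasks(0);` as above.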
 
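For completeness, if you did want to go the one-complete-file-per-mapper route instead, overriding `isSplitable` only takes a small subclass (the class name here is just illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Input format that never splits its files, so each input file is
// handled by exactly one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
```

and then register it in the driver with `job.setInputFormatClass(NonSplittableTextInputFormat.class);`.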
