Here are your answers.

1. The mapper and reducer classes can be separate Java classes, anywhere in the package structure, or even in separate jar files, as long as the class loader of the MapTask/ReduceTask is able to load them. The example you showed is intended for quick testing by Hadoop beginners.

2. Yes, you can use any Java libraries. These third-party jars should be made available to the MapTask/ReduceTask either through the `-files` option of the `hadoop jar` command or through the Hadoop API. See [this post](http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/) for more information on adding third-party libraries to the Map/Reduce classpath.

3. Yes, you can configure and pass configurations to the Map/Reduce jobs using either of these approaches.

   3.1 Use the `org.apache.hadoop.conf.Configuration` object to set the configurations in the client program (the Java class with the `main()` method):

   ```java
   Configuration conf = new Configuration();
   conf.set("config1", "value1");
   Job job = new Job(conf, "Whole File input");
   ```

   The Map/Reduce programs have access to the `Configuration` object and can read the values set for these properties using the `get()` method (see the mapper sketch below). This approach is advisable if the configuration settings are small.

   3.2 Use the distributed cache to load the configurations and make them available to the Map/Reduce programs; see the [DistributedCache documentation](http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.htm) for details (a sketch of this approach also follows below). This approach is more advisable for larger configuration data.

4. The `main()` method is the client program, and it is responsible for configuring and submitting the Hadoop job: the Mapper class, Reducer class, input path, output path, InputFormat class, number of reducers, and so on. If a setting is not provided, the default is used. For an example, see the driver sketch below. Additionally, look at the documentation on [job configuration](http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Job+Configuration).

Yes, Map/Reduce programs are still Java SE programs; however, they are distributed across the machines in the Hadoop cluster. Say the Hadoop cluster has 100 nodes and you submit the word count example: the Hadoop framework creates a Java process for each Map and Reduce task and calls the callback methods such as `map()`/`reduce()` on the subset of machines where the data exists. Essentially, your mapper/reducer code gets executed on the machines where the data lives. I would recommend reading Chapter 6 of [Hadoop: The Definitive Guide](http://rads.stackoverflow.com/amzn/click/0596521979).

I hope this helps.
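As a minimal sketch of approach 3.1 (my own illustration, not code from the original post), a mapper can read the property set in the driver through `Context.getConfiguration()`. The property name `config1` matches the snippet above; the class name `ConfigAwareMapper` and the key/value types are hypothetical choices:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that reads a job-level property set by the client program.
public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private String config1;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The same Configuration populated in main() is available to the task.
        Configuration conf = context.getConfiguration();
        config1 = conf.get("config1", "default-value");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the configured value however the job logic requires;
        // here it is simply emitted once per input record as a trivial example.
        context.write(new Text(config1), new IntWritable(1));
    }
}
```

`setup()` runs once per task before any `map()` calls, so it is the natural place to read job-level properties.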
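For approach 3.2, here is a hedged sketch using the old `org.apache.hadoop.filecache.DistributedCache` API described by the 0.20.2 documentation linked above (newer releases expose the same idea through `Job.addCacheFile()`). The cached file path and the use of a Java `Properties` file are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that loads its settings from a file placed in the distributed cache.
// In the driver, before submitting the job (the path is a placeholder):
//   DistributedCache.addCacheFile(new URI("/config/job-settings.properties"), conf);
public class CachedConfigMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Properties jobSettings = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each task finds the cached file on its local disk.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localFiles != null && localFiles.length > 0) {
            try (BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()))) {
                jobSettings.load(reader);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use jobSettings.getProperty(...) as the job logic requires.
        context.write(value, new IntWritable(1));
    }
}
```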
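Finally, the driver sketch referenced in point 4. To keep it self-contained it reuses Hadoop's built-in `TokenCounterMapper` and `IntSumReducer` for a word count job; the class name `WordCountDriver`, the `config1` property, and the reducer count are illustrative assumptions, and the `new Job(conf, ...)` constructor matches the Hadoop releases linked above (newer code would call `Job.getInstance(conf)`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Hypothetical client program: configures and submits the job, then waits for completion.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("config1", "value1");                 // job-level property, as in 3.1

        Job job = new Job(conf, "word count");         // old-style constructor
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class);  // built-in mapper: emits (word, 1)
        job.setReducerClass(IntSumReducer.class);      // built-in reducer: sums the counts

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(2);                      // number of reducers

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and block until it finishes; exit non-zero on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be launched with something like `hadoop jar your-job.jar WordCountDriver <input path> <output path>`; to accept generic options such as `-files` from point 2, the driver would typically implement `Tool` and be run through `ToolRunner`.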