
Defining dependencies for MapReduce projects and Oozie workflows
<p>In my company we are developing MapReduce applications on Hadoop. There is a debate going on over dependency management for these projects and I would like to hear your opinion.</p>
<p>We are using Cloudera's Hadoop distribution (CDH).</p>
<p>Our development workflow:</p>
<ul>
<li>each MapReduce project is hosted in an SVN repository</li>
<li>each of them has a POM file with dependencies defined (and some other stuff too)</li>
<li>we also create Oozie workflow projects, which have these MapReduce projects defined as dependencies in their POM and which are responsible for defining the execution flow of the MapReduce projects</li>
<li>the build artifact of an Oozie project is a jar file containing all the MapReduce jars it uses and their dependencies (we use Maven's assembly plugin to compress it); this is the artifact we later deploy to HDFS (after decompressing)</li>
<li>we build the projects with Maven, managed by Jenkins</li>
<li>successful builds get deployed to an Archiva server</li>
<li>deployment to HDFS is on-demand from Archiva: getting the artifact of the Oozie project build, extracting it and putting it to HDFS</li>
<li>some dependencies (namely the ones used by Oozie: Hive, Sqoop, MySQL connector, Jline, commons-..., etc.) are not needed for building the projects, but they are needed at runtime</li>
</ul>
<p>Still with me?</p>
<p>Now the debate is about defining these dependencies of MapReduce and Oozie projects. There are two standpoints.</p>
<p>One says it's not needed to define these dependencies (i.e. the ones not needed to build the projects) in the POM files; instead, have them in a shared directory in HDFS and always assume they are there.</p>
<p>Pros:</p>
<ul>
<li>devs don't need to take care of these (however, they take care of some others)</li>
<li>most likely, when updating the CDH distribution, it's easier to update these in the shared directory than in each project individually (not sure if this is necessary though)</li>
</ul>
<p>Cons:</p>
<ul>
<li>some dependencies are defined for the projects and some are assumed, which doesn't feel right</li>
<li>the shared directory can become a sink of unused JARs and no one will know which is still used and which is not</li>
<li>code becomes less portable because it assumes these JARs are always there in HDFS with the right version</li>
</ul>
<p>So what do you guys think?</p>
<p>EDIT: I forgot to write it, but it's quite obvious that the 2nd option is to define all dependencies, even if they will repeat for most projects and require some maintenance.</p>
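For context, the maintenance cost of the 2nd option (declaring everything) can be kept low by centralizing versions in a shared parent POM via `dependencyManagement`. A minimal sketch of what that might look like, assuming a hypothetical parent POM that all the MapReduce and Oozie projects inherit from; the `cdh.version` property and the specific artifacts shown are illustrative, not taken from the setup described above:

```xml
<!-- Hypothetical parent POM: declare cluster-related versions once. -->
<properties>
  <cdh.version>2.0.0-cdh4.2.0</cdh.version> <!-- illustrative version -->
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Supplied by the cluster at runtime: compile against it,
         but keep it out of the assembled artifact. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${cdh.version}</version>
      <scope>provided</scope>
    </dependency>
    <!-- Needed only at runtime (e.g. by the Oozie actions), so it is
         bundled by the assembly plugin but not on the compile classpath. -->
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.25</version> <!-- illustrative version -->
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Child POMs would then reference only `groupId`/`artifactId` without versions, so a CDH upgrade becomes a single edit in the parent rather than a change in every project.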