
Defining dependencies for MapReduce projects and Oozie workflows
<p>In my company we are developing MapReduce applications on Hadoop. There is a debate going on over dependency management for these projects and I would like to hear your opinion.</p>
<p>We are using Cloudera's Hadoop distribution (CDH).</p>
<p>Our development workflow:</p>
<ul>
<li>each MapReduce project is hosted in an SVN repository</li>
<li>each of them has a POM file with dependencies defined (and some other stuff too)</li>
<li>we also create Oozie workflow projects, which have these MapReduce projects defined as dependencies in their POM and which are responsible for defining the execution flow of the MapReduce projects</li>
<li>the build artifact of an Oozie project is a jar file containing all the MapReduce jars it uses and their dependencies (we use Maven's assembly plugin to compress it); this is the artifact we later deploy to HDFS (after decompressing)</li>
<li>we build the projects with Maven, managed by Jenkins</li>
<li>successful builds get deployed to an Archiva server</li>
<li>deployment to HDFS is on-demand from Archiva: getting the artifact of the Oozie project build, extracting it and putting it to HDFS</li>
<li>some dependencies (namely the ones used by Oozie: Hive, Sqoop, MySQL connector, Jline, commons-..., etc.) are not needed for building the projects, but they are needed at runtime</li>
</ul>
<p>Still with me?</p>
<p>Now the debate is about defining these dependencies of MapReduce and Oozie projects. There are two standpoints.</p>
<p>One says it's not needed to define these dependencies (i.e. the ones not needed to build the projects) in the POM files; instead, have them in a shared directory in HDFS and always assume they are there.</p>
<p>Pros:</p>
<ul>
<li>devs don't need to take care of these (however, they take care of some others)</li>
<li>most likely, when updating the CDH distribution, it's easier to update these in the shared directory than in each project individually (not sure if this is necessary though)</li>
</ul>
<p>Cons:</p>
<ul>
<li>some dependencies are defined for the projects and some are assumed, which doesn't feel right</li>
<li>the shared directory can become a sink of unused JARs and no one will know which is still used and which is not</li>
<li>code becomes less portable because it assumes these JARs are always there in HDFS with the right version</li>
</ul>
<p>So what do you guys think?</p>
<p>EDIT: I forgot to write it, but it's quite obvious that the 2nd option is to define all dependencies, even if they will repeat for most projects and require some maintenance.</p>
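For context, the maintenance cost of the 2nd option (declaring everything) can be kept low by centralizing versions in a shared parent POM via `dependencyManagement`. A minimal sketch of what that might look like, assuming a hypothetical parent POM that all the MapReduce and Oozie projects inherit from; the `cdh.version` property and the specific artifacts shown are illustrative, not taken from the setup described above:

```xml
<!-- Hypothetical parent POM: declare cluster-related versions once. -->
<properties>
  <cdh.version>2.0.0-cdh4.2.0</cdh.version> <!-- illustrative version -->
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Supplied by the cluster at runtime: compile against it,
         but keep it out of the assembled artifact. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${cdh.version}</version>
      <scope>provided</scope>
    </dependency>
    <!-- Needed only at runtime (e.g. by the Oozie actions), so it is
         bundled by the assembly plugin but not on the compile classpath. -->
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.25</version> <!-- illustrative version -->
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Child POMs would then reference only `groupId`/`artifactId` without versions, so a CDH upgrade becomes a single edit in the parent rather than a change in every project.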