
Writing to multiple HCatalog schemas in a single reducer?
I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation for the change is to take advantage of dynamic partitioning.

One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files: we write each record type out to its own file in a single reduce step. I'm wondering what my options are for doing this with HCatalog.

One option, obviously, is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.

Another option for some jobs is to change our schema so that all records are stored in a single schema. This works well where the data was only broken apart as poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the record types are not consistent.

It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interfaces). I've sketched what I mean in the edit below.

Does anybody have experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.

Thanks.

Andrew
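Edit: two sketches to make the question more concrete. Both are untested and written after a single day with the HCatalog docs, so I may well be misusing the APIs; the database and table names (default, events, metrics) and the class and method names are placeholders of mine, not our real schemas.

For the jobs we can consolidate into one schema, my understanding is that dynamic partitioning only needs a null partition-value map when the output is configured:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hcatalog.mapreduce.OutputJobInfo;

    public class DynamicPartitionJobSketch {
        public static Job configure(Configuration conf) throws IOException {
            Job job = new Job(conf, "consolidated-flow");
            // A null partition-value map asks HCatalog to partition dynamically,
            // pulling the partition column values out of each output record.
            HCatOutputFormat.setOutput(job, OutputJobInfo.create("default", "events", null));
            HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));
            job.setOutputFormatClass(HCatOutputFormat.class);
            return job;
        }
    }

For the jobs with genuinely different record types, here is roughly how I'm picturing the Reader/Writer interfaces: one WriterContext per schema, prepared on the master, shipped to the reducers, and committed once everything has finished:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.hcatalog.common.HCatException;
    import org.apache.hcatalog.data.HCatRecord;
    import org.apache.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hcatalog.data.transfer.HCatWriter;
    import org.apache.hcatalog.data.transfer.WriteEntity;
    import org.apache.hcatalog.data.transfer.WriterContext;

    public class MultiSchemaWriteSketch {

        private static final String[] TABLES = { "events", "metrics" }; // placeholders

        private final Map<String, HCatWriter> masters = new HashMap<String, HCatWriter>();
        private final Map<String, WriterContext> contexts = new HashMap<String, WriterContext>();

        // Master side: prepare one WriterContext per output table. The contexts
        // are serializable, so I assume they can be shipped to the reducers via
        // the job configuration or the distributed cache.
        public MultiSchemaWriteSketch(Map<String, String> config) throws HCatException {
            for (String table : TABLES) {
                WriteEntity entity = new WriteEntity.Builder()
                        .withDatabase("default").withTable(table).build();
                HCatWriter master = DataTransferFactory.getHCatWriter(entity, config);
                masters.put(table, master);
                contexts.put(table, master.prepareWrite());
            }
        }

        public Map<String, WriterContext> getContexts() {
            return contexts;
        }

        // Reducer side: rebuild a writer from each schema's context and route
        // each record type to its own writer, mirroring what our side-effect
        // files do today.
        public static void writeRecords(Map<String, WriterContext> contexts,
                Map<String, Iterator<HCatRecord>> recordsByTable) throws HCatException {
            for (Map.Entry<String, Iterator<HCatRecord>> entry : recordsByTable.entrySet()) {
                HCatWriter writer = DataTransferFactory.getHCatWriter(contexts.get(entry.getKey()));
                writer.write(entry.getValue());
            }
        }

        // Master side again, once every reducer has finished: commit each
        // context so that all of the schemas become visible together.
        public void commitAll() throws HCatException {
            for (Map.Entry<String, HCatWriter> entry : masters.entrySet()) {
                entry.getValue().commit(contexts.get(entry.getKey()));
            }
        }
    }

If someone has done this for real, I'd especially like to know whether per-schema WriterContexts can coexist in one reduce step like this, or whether the commit protocol gets in the way.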

