
So, to clear some things up:

Mahout's `XmlInputFormat` will process XML files and extract the XML between two configured start/end tags. So if your XML looks like the following:

```xml
<main>
  <person>
    <name>Bob</name>
    <dob>1970/01/01</dob>
  </person>
</main>
```

and you've configured the start/end tags to be `<person>` and `</person>`, then your mapper will be passed the following `<LongWritable, Text>` pair to its `map` method:

```
LongWritable: 10
Text: "<person>\n  <name>Bob</name>\n  <dob>1970/01/01</dob>\n</person>"
```

What you do with this data in your mapper is then up to you.

With regard to splits, `XmlInputFormat` extends `TextInputFormat`, so if your input file is splittable (i.e. uncompressed, or compressed with a splittable codec such as Snappy), the file will be processed by one or more mappers as follows:

1. If the input file size (let's say 48 MB) is less than a single HDFS block (let's say 64 MB), and you don't configure the min/max split size properties, you'll get a single mapper to process the file.
2. As above, but you configure the max split size to be 10 MB (`mapred.max.split.size=10485760`): you'll get 5 map tasks to process the file (see the driver sketch after this answer).
3. If the file is bigger than the block size, you'll get a map task for each block, or, if a max split size is configured, a map task for each chunk of the file produced by dividing it by that split size.

When the file is split up into these block- or split-sized chunks, `XmlInputFormat` will seek to the byte offset of the block/split boundary and then scan forwards until it either finds the configured XML start tag or reaches the end of the block/split. If it finds the start tag, it will consume data until it finds the end tag (or the end of the file). If it finds the end tag, a record is passed to your mapper; otherwise your mapper receives no input. To emphasize: the reader may scan past the end of the block/split when trying to find the end tag, but it will only do this if it has found a start tag; otherwise scanning stops at the end of the block/split.

So, to (eventually) answer your question: if you haven't configured a mapper (and are therefore using the default, also known as the identity mapper), then yes, it doesn't matter how big the XML chunk is (MBs, GBs, TBs!), it will be sent to the reducer.

I hope this makes sense.

**EDIT**

To follow up on your comments:

1. Yes, each mapper will attempt to process its split (a range of bytes) of the file.
2. Yes, regardless of what you set the max split size to, your mapper will receive records representing the data between (and including) the start/end tags. The person element will not be split up, no matter how big it is (though obviously, if there are GBs of data between the start and end tags, you'll most probably run out of memory trying to buffer it all into a `Text` object).
3. Continuing from the above: your data will never be split between the start and end tags; a person element will be sent in its entirety to a single mapper, so you can safely use something like a SAX parser to process it further without fear that you're only seeing a portion of the person element (see the mapper sketch below).
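For concreteness, here is a minimal driver sketch wiring up the configuration described above. The `xmlinput.start`/`xmlinput.end` keys are the ones Mahout's `XmlInputFormat` reads, but the import path varies between Mahout versions, and `PersonXmlDriver` and the command-line arguments are illustrative assumptions, not anything from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Package path differs between Mahout versions; adjust to yours.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class PersonXmlDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Tell XmlInputFormat which tags delimit one record.
    conf.set("xmlinput.start", "<person>");
    conf.set("xmlinput.end", "</person>");

    // Optional: cap the split size at 10 MB, as in point 2 above
    // (old-API property name, as used in the answer; newer Hadoop
    // also accepts mapreduce.input.fileinputformat.split.maxsize).
    conf.setLong("mapred.max.split.size", 10485760L);

    Job job = Job.getInstance(conf, "person-xml");
    job.setJarByClass(PersonXmlDriver.class);
    job.setInputFormatClass(XmlInputFormat.class);

    // No mapper class set: the identity mapper forwards each
    // <LongWritable, Text> record (one whole <person> element) as-is.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}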
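And, for point 3 of the edit, a sketch of a mapper that runs a SAX parser over each complete `<person>` record. This is safe precisely because each record is a whole element; `PersonMapper` and the name/dob extraction are hypothetical, matching the example XML above rather than any real schema:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PersonMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final SAXParserFactory factory = SAXParserFactory.newInstance();

  @Override
  protected void map(LongWritable offset, Text personXml, Context context)
      throws IOException, InterruptedException {
    // The value is one complete <person>...</person> element, never a
    // fragment, so a standard SAX parse over the whole string is safe.
    final StringBuilder name = new StringBuilder();
    final StringBuilder dob = new StringBuilder();
    try {
      SAXParser parser = factory.newSAXParser();
      parser.parse(new InputSource(new StringReader(personXml.toString())),
          new DefaultHandler() {
            private String current;

            @Override
            public void startElement(String uri, String local, String qName,
                Attributes atts) {
              current = qName;
            }

            @Override
            public void characters(char[] ch, int start, int len) {
              // characters() may fire more than once per element, so append.
              if ("name".equals(current)) name.append(ch, start, len);
              else if ("dob".equals(current)) dob.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
              current = null;
            }
          });
    } catch (SAXException | ParserConfigurationException e) {
      throw new IOException("Bad person record at offset " + offset, e);
    }
    context.write(new Text(name.toString().trim()),
        new Text(dob.toString().trim()));
  }
}
```

Note the caveat from point 2 still applies: the whole element is buffered into one `Text` value before this mapper ever sees it, so very large person elements can exhaust memory regardless of how you parse them.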
 


 