  1. Hadoop: Cost of reading non-local data from other Datanodes
    <p>By default, Hadoop splits the files to be processed by a Mapper on the file's block boundaries; at least, that is what the FileInputFormat implementation of getSplits() does. Hadoop then tries to run each Mapper on a Datanode that holds a replica of the blocks it will process.</p> <p>Now I'm wondering: if I need to read outside of this InputSplit (in a RecordReader, but that's irrelevant), what does this cost me compared to reading inside the InputSplit, assuming that the data outside of it is not present on the reading Datanode?</p> <p><strong>EDIT:</strong></p> <p>In other words: <strong>I am a RecordReader</strong> and have been assigned an <strong>InputSplit that spans one file block</strong>. I have a local copy of this file block (rather, the datanode I'm running on does), but not the rest of the file. Now I <strong>do need to read outside of this InputSplit</strong>, because I need to read the <strong>file header</strong>, which is at the very beginning. Then I need to skip across the records in the file (by reading just each record's header, which tells me how long the record is, and then skipping that number of bytes). I need to do this until I encounter the first record that's inside the InputSplit. Then I can start reading the actual records within my InputSplit. That is the only way to make sure that I start at a valid record boundary.</p> <p><strong>Question</strong>: When I do read outside of the InputSplit, when is the data from the non-local file blocks copied? Is this done one byte at a time (i.e. once per call of InputStream.read()), or is the entire file block (at the current InputStream position) copied to my local datanode as soon as I call InputStream.read(), and then again when I reach the next non-local file block, and so on? I need to know this so I can estimate how much overhead skipping through the file will produce.</p> <p>Thanks :)</p>
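To make the skipping procedure concrete, here is a minimal sketch of it in plain Java, without any Hadoop dependency. The file layout is an assumption made up for illustration (an 8-byte file header, then records consisting of a 4-byte big-endian length followed by the payload), and the class and method names are hypothetical. In a real RecordReader the stream would presumably be an FSDataInputStream, which extends DataInputStream, so the same loop should apply.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class RecordBoundaryScan {
    // Hypothetical layout, not taken from the question: an 8-byte file header,
    // then records of the form [4-byte big-endian payload length][payload].
    public static final int FILE_HEADER_BYTES = 8;
    public static final int RECORD_HEADER_BYTES = 4;

    /**
     * Returns the offset of the first record that starts at or after splitStart.
     * If no such record exists, the returned offset equals the file length.
     */
    public static long firstRecordAtOrAfter(DataInputStream in, long splitStart)
            throws IOException {
        skipFully(in, FILE_HEADER_BYTES);      // step over the file header
        long pos = FILE_HEADER_BYTES;
        while (pos < splitStart) {
            int payloadLength = in.readInt();  // read only the record header
            skipFully(in, payloadLength);      // skip the payload without parsing it
            pos += RECORD_HEADER_BYTES + payloadLength;
        }
        return pos;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    private static void skipFully(DataInputStream in, int n) throws IOException {
        int remaining = n;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("file ended inside a record");
            }
            remaining -= skipped;
        }
    }

    /** Convenience overload for in-memory data; wraps checked I/O errors. */
    public static long firstRecordAtOrAfter(byte[] file, long splitStart) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            return firstRecordAtOrAfter(in, splitStart);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a zero-filled sample file with the given payload lengths. */
    public static byte[] buildSample(int... payloadLengths) {
        int size = FILE_HEADER_BYTES;
        for (int len : payloadLengths) {
            size += RECORD_HEADER_BYTES + len;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);  // big-endian by default
        buf.position(FILE_HEADER_BYTES);
        for (int len : payloadLengths) {
            buf.putInt(len);
            buf.position(buf.position() + len);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // Records start at offsets 8, 22 and 46; the file is 80 bytes long.
        byte[] file = buildSample(10, 20, 30);
        System.out.println(firstRecordAtOrAfter(file, 30)); // prints 46
    }
}
```

With this layout, the work done before the split is one small header read plus one skip per record, so the number of bytes actually requested from the stream is small even when the split starts deep inside the file. Whether a skip still causes the skipped bytes to be transferred from a remote Datanode is exactly what the question asks, and the sketch does not settle it.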