How to optimize read performance of S3 when using GZip input files for Hadoop
<p>I am getting very dismal performance in the first step of my Hadoop streaming job: the mappers appear to read from S3 at only about 40-50 KB/s.</p> <p><strong>It takes over an hour for ~100MB of data to be read from S3</strong>!</p> <p>The data is stored as thousands of ~5-10KB GZip files in the S3 bucket.</p> <p>I recently decompressed all the files of a sample 100MB dataset and uploaded them as a single GZip file to the same S3 bucket, and my task finished in 3 minutes (vs. the previous 1-hour runs).</p> <p>Encouraged, I decompressed all the files of a sample 2GB dataset and uploaded them as a single GZip file to the same S3 bucket, and <strong>my task again took more than 1 hour, after which I terminated it</strong>.</p> <p>I have not yet experimented with <code>mapred.min.split.size</code> and <code>mapred.max.split.size</code>, and I need some sample values to start with.</p> <p>From the posts I have read online, though, it seems that processing GBs of data from GZip input files in Hadoop streaming tasks should not incur much of a penalty as far as reading them from S3 is concerned.</p> <p>Could you share:</p> <ol> <li>the "blob size" of the files you store on S3,</li> <li>how many of those you process per task, and</li> <li>how long processing them takes?</li> </ol> <p>I am guessing that tuning <code>mapred.min.split.size</code> and <code>mapred.max.split.size</code>, and keeping the above three values optimal with regard to S3, will make a big difference in the execution time of the jobs.</p>
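For concreteness, the consolidation step described above (many small GZip files merged into fewer, larger ones) could be sketched roughly as follows. This is only an illustration under stated assumptions: the small files are taken to be already available as in-memory byte strings, the S3 download and upload are left out, and `batch_by_size` and `merge_gzip_blobs` are hypothetical helper names, not part of Hadoop or any S3 library.

```python
import gzip
import io
from typing import Iterable, Iterator, List


def batch_by_size(sizes: Iterable[int], target_bytes: int) -> Iterator[List[int]]:
    """Group item indices into batches whose total size reaches about target_bytes.

    Hypothetical helper: a batch is closed as soon as its running total
    meets or exceeds the target, so batches may overshoot slightly.
    """
    batch: List[int] = []
    total = 0
    for i, size in enumerate(sizes):
        batch.append(i)
        total += size
        if total >= target_bytes:
            yield batch
            batch, total = [], 0
    if batch:
        # Emit whatever is left over as a final, smaller batch.
        yield batch


def merge_gzip_blobs(blobs: List[bytes]) -> bytes:
    """Decompress each small GZip blob and recompress everything as one GZip blob."""
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for blob in blobs:
            gz.write(gzip.decompress(blob))
    return out.getvalue()
```

Batching to a target size, rather than producing one single file, is a deliberate choice in this sketch: a plain GZip file is not splittable in Hadoop, so each file is read by a single mapper regardless of the split-size settings, and several medium-sized files leave room for parallelism. The target size itself is an assumption to be tuned.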