<p>Same as my response from the Hive mailing list:</p>
<p>To answer your questions:</p>
<p>1) S3 terminology uses the word "object", and I am sure they have good reasons for that, but for us Hive'ers an S3 object is the same as a file stored on S3. The path to the file within its bucket is what Amazon calls the S3 "key", and the corresponding value is the contents of the file, e.g. for s3://my_bucket/tables/log.txt the bucket is my_bucket, the key is tables/log.txt, and the actual content of the file is the S3 object. You can use the AWS web console to create a bucket and use tools like S3cmd (http://s3tools.org/s3cmd) to put data onto S3.</p>
<p>However, you don't necessarily need to use S3. S3 is typically used only when you want persistent storage for your data. Most people store their input logs/files on S3 for Hive processing and also store the final aggregations and results on S3 for future retrieval. If you are just temporarily loading some data into Hive, processing it and exporting it, you don't have to worry about S3. The nodes that form your cluster have ephemeral storage that forms the HDFS. You can just use that. The only side effect is that you will lose all your data in HDFS once you terminate the cluster. If that's OK, don't worry about S3.</p>
<p>EMR instances are basically EC2 instances with some additional setup done on them. Transferring data between EC2 and EMR instances should be simple, I'd think. If your data is on EBS volumes, you could look into adding an EMR bootstrap action that mounts that same EBS volume onto your EMR instances. It might be easier if you can do it without all the fancy mounting business, though.</p>
<p>Also, keep in mind that there might be costs for data transfers across Amazon data centers, so you would want to keep your S3 buckets, EMR cluster and EC2 instances in the same region, if at all possible. Within the same region, there shouldn't be any extra transfer costs.</p>
<p>2) Yeah, EMR supports custom jars. You can specify them at the time you create your cluster. This should require minimal porting changes to your jar itself, since it runs on Hadoop and Hive, and the versions installed on EMR are the same as (well, close enough to) what you installed on your local cluster.</p>
<p>3) Sqoop with EMR should be OK.</p>
<p>References: <a href="http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3CCAGif4YQv1RVSoLt+Yqn8C1jDN3ukLHZ_J+GMFDoPCbcXO7W2tw@mail.gmail.com%3E" rel="nofollow">http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3CCAGif4YQv1RVSoLt+Yqn8C1jDN3ukLHZ_J+GMFDoPCbcXO7W2tw@mail.gmail.com%3E</a></p>
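<p>To make the bucket/key terminology from point 1 concrete, here is a minimal sketch in Python that splits an s3:// URI into its bucket and key. The helper name split_s3_uri is hypothetical and only for illustration; it uses the standard library alone and does not talk to AWS.</p>

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key).

    Hypothetical helper for illustration; it only parses the string
    and never contacts S3.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the authority part of the URI; the key is the
    # path with its leading slash removed.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    bucket, key = split_s3_uri("s3://my_bucket/tables/log.txt")
    print("bucket:", bucket)
    print("key:", key)
```

<p>The same split applies wherever a tool asks for the bucket and the key separately rather than a single URI.</p>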
 
