StackOverflow2013

plurals

PO
primarykey
Id
10307799
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2012-04-25T00:21:48.980
FavoriteCount
0
LastActivityDate
2012-04-25T00:21:48.980
LastEditDate
LastEditorUserId
0
OwnerUserId
1354190
ParentId
10276573
PostTypeId
2
Score
2
ViewCount
0
LastEditorDisplayName
text
Body
Same as my response on the Hive mailing list. To answer your questions:

1) S3 terminology uses the word "object", and I am sure they have good reasons for that, but for us Hive'ers an S3 object is the same as a file stored on S3. The path to the file within its bucket is what Amazon calls the S3 "key", and the corresponding value is the contents of the file. For example, in s3://my_bucket/tables/log.txt the bucket is my_bucket, the key is tables/log.txt, and the actual content of the file is the S3 object. You can use the AWS web console to create a bucket and tools like S3cmd (http://s3tools.org/s3cmd) to put data onto S3.

However, you don't necessarily need to use S3. S3 is typically used only when you want persistent storage of data. Most people store their input logs/files on S3 for Hive processing and also store the final aggregations and results on S3 for future retrieval. If you are just temporarily loading some data into Hive, processing it, and exporting it, you don't have to worry about S3. The nodes that form your cluster have ephemeral storage that forms the HDFS. You can just use that. The only side effect is that you will lose all your data in HDFS once you terminate the cluster. If that's OK, don't worry about S3.

EMR instances are basically EC2 instances with some additional setup done on them. Transferring data between EC2 and EMR instances should be simple, I'd think. If your data is present in EBS volumes, you could look into adding an EMR bootstrap action that mounts that same EBS volume onto your EMR instances. It might be easier if you can do it without all the fancy mounting business, though. Also, keep in mind that there might be costs for data transfers across Amazon data centers, so you would want to keep your S3 buckets, EMR cluster, and EC2 instances in the same region, if at all possible. Within the same region, there shouldn't be any extra transfer costs.

2) Yeah, EMR supports custom jars. You can specify them at the time you create your cluster. This should require minimal porting changes to your jar itself, since it runs on Hadoop and Hive, and the versions installed on EMR are the same as (well, close enough to) what you installed on your local cluster.

3) Sqoop with EMR should be OK.

References: <a href="http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3CCAGif4YQv1RVSoLt+Yqn8C1jDN3ukLHZ_J+GMFDoPCbcXO7W2tw@mail.gmail.com%3E" rel="nofollow">http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3CCAGif4YQv1RVSoLt+Yqn8C1jDN3ukLHZ_J+GMFDoPCbcXO7W2tw@mail.gmail.com%3E</a>
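To make the bucket/key terminology from point 1 concrete, here is a minimal Python sketch that splits an s3:// URI into its bucket name and key. The helper name split_s3_uri is hypothetical and not part of any AWS library; it only uses the standard library's URL parser and does not talk to S3 at all.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the "host" part; the key is the path inside the bucket,
    # without the leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my_bucket/tables/log.txt")
print(bucket, key)  # my_bucket tables/log.txt
```

The content stored under that key is what S3 calls the object, i.e. the file's contents in Hive terms.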
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POQueries related to running Hive/Sqoop on Amazon EMR?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USMark Grover
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTBountyClose
CommentsPostId
1. COThat's fine. But I have a huge amount of data in SQL Server (on the order of GBs). If I have to run the job on a daily/weekly basis, is it efficient to import from SQL Server daily/weekly? If I work around this issue by storing the data in S3, how could I link HDFS and S3? (Hive tables' data is stored in HDFS in the /user/hive/warehouse directory.) I ask this because my data in SQL Server may change daily/weekly.
 singulars
 PostPostId
 PO
 UserUserId
 USBhavesh Shah
2. COEMR's version of Hive natively supports S3 paths. Therefore, instead of using hdfs://path/to/my/file, you can use s3://mybucket/path/to/my/file.
 singulars
 PostPostId
 PO
 UserUserId
 USMark Grover
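
One common way to use an S3 path from Hive, as the last comment suggests, is an external table whose LOCATION points at an S3 directory instead of an HDFS one. The Python sketch below only builds the Hive DDL string; it does not run it. The function name, table name, columns, and the s3://mybucket/path/to/my/table_dir/ prefix are all hypothetical, and the sketch assumes the data files already sit under that prefix (Hive's LOCATION names a directory, not a single file).

```python
from typing import Dict


def external_table_ddl(table: str, columns: Dict[str, str], location: str) -> str:
    """Build a Hive CREATE EXTERNAL TABLE statement (illustrative only)."""
    # Render "name TYPE" pairs in the order they were given.
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols})\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table and S3 prefix, chosen only for this example.
ddl = external_table_ddl(
    "logs",
    {"ts": "STRING", "message": "STRING"},
    "s3://mybucket/path/to/my/table_dir/",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata, so the files on S3 survive cluster termination.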

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.
