StackOverflow2013

plurals

POHow can I group a large dataset
primarykey
Id
3403800
data
AcceptedAnswerId
0
AnswerCount
5
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2010-08-04T08:27:05.720
FavoriteCount
1
LastActivityDate
2011-03-26T06:40:34.537
LastEditDate
2010-08-04T12:39:42.413
LastEditorUserId
410522
OwnerUserId
410522
ParentId
0
PostTypeId
1
Score
4
ViewCount
625
LastEditorDisplayName
text
Body
I have a simple text file containing two columns, both integers: <pre><code>1 5
1 12
2 5
2 341
2 12
</code></pre> and so on. I need to group the dataset by the second value, such that the output will be: <pre><code>5 1 2
12 1 2
341 2
</code></pre> The problem is that the file is very big, around 34 GB. I tried writing a Python script to group the rows into a dictionary with arrays of integers as values, but it still takes way too long. (I guess much of the time is spent allocating the <code>array('i')</code> objects and extending them on <code>append</code>.) I am now planning to write a Pig script to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance). <pre><code>data = load 'Net.txt';
gdata = Group data by $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
store gdata into 'res.txt';
</code></pre> I wanted to know if there is any simpler way of doing this. Update: Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to conduct 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million, and so on. But this turned out to be really slow. The Pig/Hadoop solution is interesting because it keeps everything on disk (well, most of it). For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format of the file means that the user ID given by the second number is following the first one.
The solution I used: <pre><code>import array


class AdjDict(dict):
    """A special dictionary class to hold an adjacency list."""

    def __missing__(self, key):
        """When a key is not found, initialize an integer array for it."""
        self[key] = array.array('i')
        return self[key]


Adj = AdjDict()

with open("net.txt") as f:
    for line in f:
        entry = line.strip().split('\t')
        node = int(entry[1])
        follower = int(entry[0])
        # Only keep one slice of the key range per run
        if node < 10 ** 6:
            Adj[node].append(follower)

# Code for writing Adj matrix to the file:
</code></pre>
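The multi-run idea described above (one slice of the key range per pass) can also be done with a single pass over the input if the rows are first split into bucket files on disk. What follows is only a minimal sketch of that idea, not the method used in the post: the function name `group_by_second_column`, the `num_buckets` parameter, and the whitespace-separated input format are all assumptions made for illustration.

```python
import os
import tempfile
from collections import defaultdict


def group_by_second_column(in_path, out_path, num_buckets=64):
    """Group 'follower node' lines by node using on-disk buckets.

    Hypothetical helper: assumes two whitespace-separated integers per line
    and that each bucket is small enough to group in memory.
    """
    tmp_dir = tempfile.mkdtemp()
    bucket_paths = [os.path.join(tmp_dir, "bucket_%d.txt" % i)
                    for i in range(num_buckets)]
    buckets = [open(p, "w") for p in bucket_paths]
    try:
        # Pass 1: route every row to a bucket chosen by its key (second column).
        with open(in_path) as src:
            for line in src:
                parts = line.split()
                if len(parts) != 2:
                    continue  # skip blank or malformed lines
                follower, node = parts
                buckets[int(node) % num_buckets].write(
                    "%s %s\n" % (node, follower))
    finally:
        for b in buckets:
            b.close()

    # Pass 2: all rows for a given key live in one bucket, so each bucket
    # can be grouped independently and then discarded.
    with open(out_path, "w") as out:
        for path in bucket_paths:
            groups = defaultdict(list)
            with open(path) as bucket:
                for line in bucket:
                    node, follower = line.split()
                    groups[int(node)].append(follower)
            for node in sorted(groups):
                out.write("%d %s\n" % (node, " ".join(groups[node])))
            os.remove(path)
    os.rmdir(tmp_dir)
```

Keys are sorted only within a bucket, so the output is grouped but not globally ordered. The number of buckets would have to be chosen so that the largest bucket fits in memory while staying below the operating system's limit on open files.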
Tags
<python><data-structures><hadoop><apache-pig>
Title
How can I group a large dataset
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USlargescaled
UserOwnerUserId
1. USlargescaled
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POHow can I group a large dataset
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POHow can I group a large dataset
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POHow can I group a large dataset
 UserUserId
 USsarnold
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. COThis reminds me of my SO question, http://stackoverflow.com/questions/3357510/generating-bigram-combinations-from-grouped-data-in-pig Sounds like we might be working on similar things. I have built this functionality using MapReduce in Python by breaking it down into two MapReduce jobs. I'm looking for a way to do this in Pig. I'm happy to share my existing work if you're interested.
 singulars
 PostPostId
 POHow can I group a large dataset
 UserUserId
 USNeil Kodner
