How can I group a large dataset
<p>I have a simple text file containing two columns, both integers:</p>
<pre><code>1 5
1 12
2 5
2 341
2 12
</code></pre>
<p>and so on.</p>
<p>I need to group the dataset by the second value, so that the output is:</p>
<pre><code>5 1 2
12 1 2
341 2
</code></pre>
<p>The problem is that the file is very big, around 34 GB. I tried writing a Python script that groups the values into a dictionary whose values are arrays of integers, but it takes far too long. (I guess much of the time goes into allocating the <code>array('i')</code> objects and extending them on <code>append</code>.)</p>
<p>I am now planning to write a Pig script and run it on a pseudo-distributed Hadoop machine (an Amazon EC2 High Memory Large instance).</p>
<pre><code>data = load 'Net.txt';
gdata = Group data by $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
store gdata into 'res.txt';
</code></pre>
<p>I wanted to know if there is any simpler way of doing this.</p>
<p><strong>Update:</strong> Keeping such a big file in memory is out of the question. For the Python solution, I planned to conduct 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run values from 10 million to 20 million, and so on. But this turned out to be really slow.</p>
<p>The Pig / Hadoop solution is interesting because it keeps everything on disk (well, most of it).</p>
<p>For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format of the file means that the user id given by the second number is following the first one.</p>
<p><strong>Solution which I had used:</strong></p>
<pre><code>import array


class AdjDict(dict):
    """A special dictionary class to hold an adjacency list."""

    def __missing__(self, key):
        """When a key is not found, initialize an integer array for it."""
        self.__setitem__(key, array.array('i'))
        return self[key]


Adj = AdjDict()
for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    # Only nodes in the current range are kept in this run
    if node &lt; 10 ** 6:
        Adj[node].append(follower)
# Code for writing Adj matrix to the file:
</code></pre>
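<p>As a hedged illustration of the range-based runs described in the update, here is a minimal sketch of a two-pass, on-disk variant. It is not the poster's code: the function name <code>group_by_second_column</code>, the <code>num_buckets</code> parameter, and the choice to bucket by <code>node % num_buckets</code> rather than by fixed ranges are all assumptions made for this example. The idea is to read the big file once, scatter each pair into a small bucket file, and then group one bucket at a time in memory.</p>

```python
import os
import tempfile
from array import array
from collections import defaultdict


def group_by_second_column(in_path, out_path, num_buckets=64, sep=None):
    """Group 'follower node' pairs by node using temporary bucket files.

    Hypothetical helper for illustration only.
    Pass 1 scatters every pair into the bucket file node % num_buckets,
    so all pairs of one node end up in the same file.
    Pass 2 loads one bucket at a time, groups it in memory, and writes
    one 'node follower follower ...' line per node to out_path.
    """
    tmp_dir = tempfile.mkdtemp(prefix="buckets_")
    paths = [os.path.join(tmp_dir, "b%d.txt" % i) for i in range(num_buckets)]
    buckets = [open(p, "w") for p in paths]
    try:
        with open(in_path) as src:
            for line in src:
                fields = line.split(sep)  # sep=None splits on tabs or spaces
                if len(fields) < 2:
                    continue  # skip blank or malformed lines
                follower, node = fields[0], int(fields[1])
                buckets[node % num_buckets].write("%s %d\n" % (follower, node))
    finally:
        for b in buckets:
            b.close()

    with open(out_path, "w") as out:
        for p in paths:
            # 'q' is a signed 64-bit integer; assumed here in case ids exceed 32 bits
            groups = defaultdict(lambda: array("q"))
            with open(p) as bucket:
                for line in bucket:
                    follower, node = line.split()
                    groups[int(node)].append(int(follower))
            for node in sorted(groups):
                out.write("%d %s\n" % (node, " ".join(map(str, groups[node]))))
            os.remove(p)
    os.rmdir(tmp_dir)
```

<p>With this layout only one bucket is held in memory at a time, so <code>num_buckets</code> would need to be large enough that each bucket fits in RAM, yet small enough to stay under the operating system's open-file limit. The output is sorted within each bucket but not globally.</p>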