Read large file in parallel?
<p>I have a large file which I need to read in and build a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.</p>
<p>First, make some fake data:</p>
<pre><code>paste &lt;(seq 20000000) &lt;(seq 2 20000001) &gt; largefile.txt
</code></pre>
<p>Now here is a minimal piece of Python code to read it in and make a dictionary.</p>
<pre><code>import sys
from collections import defaultdict

fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
</code></pre>
<p>Timings:</p>
<pre><code>time ./read.py largefile.txt
real 0m55.746s
</code></pre>
<p>However, it is possible to read the whole file much faster:</p>
<pre><code>time cut -f1 largefile.txt &gt; /dev/null
real 0m1.702s
</code></pre>
<blockquote> <p>My CPU has 8 cores. Is it possible to parallelize this program in Python to speed it up?</p> </blockquote>
<p>One possibility might be to read in a large chunk of the input, run 8 processes in parallel on different non-overlapping subchunks, building dictionaries in parallel from the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?</p>
<p><strong>Update</strong>. The fake data was not very good, as it had only one value per key. This is better:</p>
<pre><code>perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' &gt; largefile.txt
</code></pre>
<p>(Related to <a href="https://stackoverflow.com/questions/18086424/read-in-large-file-and-make-dictionary">Read in large file and make dictionary</a>.)</p>
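<p>The chunking idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a measured solution: the helper names <code>chunk_bounds</code>, <code>build_partial</code> and <code>build_dict</code> are hypothetical, the file is split by byte ranges rather than read in successive large chunks, and the cost of sending each partial dictionary back to the parent and merging it there may well cancel out much of the gain from parsing in parallel.</p>

```python
import os
from collections import defaultdict
from multiprocessing import Pool


def chunk_bounds(path, n_chunks):
    """Split the file into contiguous byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # skip ahead to the start of the next full line
            pos = f.tell()
            if pos > bounds[-1] and pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def build_partial(task):
    """Worker: build a key -> list-of-values dictionary for one byte range."""
    path, start, end = task
    partial = defaultdict(list)
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # ignore blank or malformed lines
            partial[parts[0]].append(parts[1])
    return dict(partial)


def build_dict(path, n_workers=8):
    """Parse the byte ranges in parallel, then merge the results in file order."""
    tasks = [(path, start, end) for start, end in chunk_bounds(path, n_workers)]
    merged = defaultdict(list)
    with Pool(n_workers) as pool:
        # imap yields results in task order, so values keep their file order.
        for partial in pool.imap(build_partial, tasks):
            for key, values in partial.items():
                merged[key].extend(values)
    return merged
```

<p>Calling <code>build_dict("largefile.txt")</code> should return the same mapping as the serial loop. On platforms where multiprocessing spawns fresh interpreters, the call needs to sit under an <code>if __name__ == "__main__":</code> guard. Whether it is actually faster would have to be timed on the real data.</p>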