It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

Because the OP mentioned that the lists on each input line would be longer in practice than two elements, I made a slightly more realistic input file using:

```
paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) | head -1000000 > largefile.txt
```

After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (`.split()` took approximately 2x longer than `.append()` on my machine.)

```
1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}
```

So I factored the split into another function and used a pool to distribute the work of splitting the fields:

```python
import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    return l.split()

pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])
```

Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

```
$ time python process.py smallfile.txt
real    0m7.170s
user    0m6.884s
sys     0m0.260s
```

versus the parallel version:

```
$ time python process-mp.py smallfile.txt
real    0m16.655s
user    0m24.688s
sys     0m1.380s
```

Because the `.map()` call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.

To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
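As a rough illustration of that block-based idea, here is a minimal sketch that batches by line count rather than a fixed byte size, which sidesteps the variable-line-length problem. The names `CHUNK_SIZE`, `split_chunk`, and `read_chunks` are mine, not from the original question, and the chunk size is an arbitrary guess; you'd want to tune it and measure whether the reduced per-call pickling overhead actually wins on your data.

```python
# Sketch: send whole blocks of lines to each worker so the pickling
# overhead is amortized over many lines instead of paid per line.
import sys
import collections
import itertools
import multiprocessing as mp

CHUNK_SIZE = 100000  # lines per block handed to a worker (illustrative value)

def split_chunk(lines):
    # Worker: split every line in the block and return the list of field lists.
    return [l.split() for l in lines]

def read_chunks(f, n):
    # Yield successive blocks of up to n lines from the open file.
    while True:
        chunk = list(itertools.islice(f, n))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    d = collections.defaultdict(list)
    with open(sys.argv[1]) as f, mp.Pool(processes=4) as pool:
        # imap keeps results streaming back in order without holding
        # the whole output in memory at once.
        for rows in pool.imap(split_chunk, read_chunks(f, CHUNK_SIZE)):
            for keys in rows:
                d[keys[0]].append(keys[1:])
```

Even then, the parent process still has to unpickle every returned block and do the `dict` insertion serially, so whether this beats the plain single-process loop depends on how expensive the splitting is relative to that merge step.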