What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the `multiprocessing` module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's `multiprocessing` module? Should `Queue` or `JoinableQueue` in `multiprocessing` be used? Or the `Queue` module itself? Or should I map the file iterable over a pool of processes using `multiprocessing`? I've experimented with these approaches, but the overhead of distributing the data line by line is immense. I've settled on a lightweight pipes-and-filters design using `cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2`, which passes a certain percentage of the first process's input directly to the second input (see [this post](https://stackoverflow.com/questions/659865/python-multiprocessing-sharing-a-large-read-only-object-between-processes/659888#659888)), but I'd like to have a solution contained entirely in Python.

Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the `multiprocessing` documentation).

Thanks, Vince

Additional information: Processing time per line varies. Some problems are fast and nearly I/O-bound; some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.

A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or with a queue in `multiprocessing`, it is always over 100% slower.
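
One approach worth sketching, since the per-line overhead is the stated bottleneck: hand each worker a batch of lines rather than a single line, so the pickling and IPC cost is amortized over thousands of lines per dispatch. Below is a minimal sketch of the batched `pool.imap` variant; `process_line()` is a hypothetical stand-in for the real per-line calculation, and the file names are placeholders:

```python
import multiprocessing as mp

def process_line(line):
    # Hypothetical stand-in for the real per-line calculation.
    return line.upper()

def process_chunk(lines):
    # Each task is a whole batch, amortizing IPC cost over many lines.
    return [process_line(line) for line in lines]

def read_chunks(path, chunk_size=10000):
    # Yield lists of lines so each dispatched task is a sizable unit of work.
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == "__main__":
    with mp.Pool() as pool, open("out.txt", "w") as out:
        # imap() preserves input order and keeps memory bounded,
        # unlike map(), which would pull the whole file into memory.
        for results in pool.imap(process_chunk, read_chunks("big_file.txt")):
            out.writelines(results)
```

Tuning `chunk_size` trades dispatch overhead against load balance: larger batches mean fewer pickling round trips, smaller batches keep all workers busy near the end of the file.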
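
On the `Queue` vs. `JoinableQueue` question: a plain `multiprocessing.Queue` with sentinel values is enough for fanning batches out to workers and collecting results; `JoinableQueue` only adds `task_done()`/`join()` bookkeeping on top. A minimal producer/consumer sketch under the same assumptions (batched lines, hypothetical per-line work, placeholder file names):

```python
import multiprocessing as mp

def worker(in_queue, out_queue):
    # Pull batches until the producer sends the None sentinel.
    for lines in iter(in_queue.get, None):
        out_queue.put([line.upper() for line in lines])  # hypothetical work

if __name__ == "__main__":
    in_queue = mp.Queue(maxsize=8)   # bounded, so the reader can't race ahead
    out_queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(in_queue, out_queue))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()

    # Producer: read the file in batches and enqueue them.
    n_batches = 0
    with open("big_file.txt") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == 10000:
                in_queue.put(batch)
                n_batches += 1
                batch = []
        if batch:
            in_queue.put(batch)
            n_batches += 1

    for _ in workers:
        in_queue.put(None)  # one sentinel per worker

    # Collect results; note they arrive in completion order, not file order.
    with open("out.txt", "w") as out:
        for _ in range(n_batches):
            out.writelines(out_queue.get())

    for w in workers:
        w.join()
```

The main difference from the pool version is that you own the ordering problem: if output order must match input order, tag each batch with an index and reorder on the way out (or stick with `pool.imap`).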