Old School.

p1.py

```python
import csv
import pickle
import sys

# Read the CSV and pickle (index, row) pairs to stdout.
with open("someFile", "r", newline="") as source:
    rdr = csv.reader(source)
    for line in enumerate(rdr):
        pickle.dump(line, sys.stdout.buffer)
```

p2.py

```python
import pickle
import sys

# Read (index, row) pairs from stdin, reduce each row,
# and pickle the (index, result) pairs to stdout.
while True:
    try:
        i, row = pickle.load(sys.stdin.buffer)
    except EOFError:
        break
    pickle.dump((i, sum(map(float, row))), sys.stdout.buffer)
```

p3.py

```python
import pickle
import sys

# Read (index, result) pairs from stdin and print them.
while True:
    try:
        i, row = pickle.load(sys.stdin.buffer)
    except EOFError:
        break
    print(i, row)
```

Here's the multi-processing final structure:

```shell
python p1.py | python p2.py | python p3.py
```

Yes, the shell has knit these together at the OS level. It seems simpler to me, and it works very nicely.

Yes, there's slightly more overhead in using pickle (or cPickle on Python 2). The simplification, however, seems worth the effort.

If you want the filename to be an argument to `p1.py`, that's an easy change.

More importantly, a function like the following is very handy:

```python
def get_stdin():
    # Yield pickled items until the upstream process closes the pipe.
    while True:
        try:
            yield pickle.load(sys.stdin.buffer)
        except EOFError:
            return
```

That allows you to do this:

```python
for item in get_stdin():
    process(item)
```

---

This is very simple, but it does not *easily* allow you to have multiple copies of P2.py running.

You have two problems: fan-out and fan-in. P1.py must somehow fan out to multiple P2.py's, and the P2.py's must somehow merge their results into a single P3.py.

The old-school approach to fan-out is a "push" architecture, which is very effective.

Theoretically, multiple P2.py's pulling from a common queue is the optimal allocation of resources. This is often ideal, but it's also a fair amount of programming. Is the programming really necessary?
Or will round-robin processing be good enough?

Practically, you'll find that having P1.py do a simple round-robin deal among multiple P2.py's can be quite good. You'd configure P1.py to deal to *n* copies of P2.py via named pipes, and each P2.py would read from its own pipe.

What if one P2.py gets all the "worst case" data and runs far behind? Yes, round robin isn't perfect. But it's better than a single P2.py, and you can address this bias with simple randomization.

Fan-in from multiple P2.py's to one P3.py is more complex still. At this point the old-school approach stops being advantageous: P3.py needs to read from multiple named pipes, using the `select` library to interleave the reads.
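That `select`-based fan-in can be sketched as follows. To keep the sketch self-contained, ordinary `os.pipe()` descriptors stand in for the named pipes the P2.py copies would write to, and the messages are made up (POSIX only):

```python
import os
import select

# Two pipes standing in for two named pipes fed by P2.py workers.
r1, w1 = os.pipe()
r2, w2 = os.pipe()
os.write(w1, b"from worker 1\n")
os.write(w2, b"from worker 2\n")
os.close(w1)
os.close(w2)

readable = {r1, r2}
received = []
while readable:
    # Block until at least one pipe has data (or EOF) to deliver.
    ready, _, _ = select.select(list(readable), [], [])
    for fd in ready:
        data = os.read(fd, 4096)
        if data:
            received.append(data)
        else:
            # An empty read means that writer closed its pipe.
            os.close(fd)
            readable.discard(fd)
print(b"".join(sorted(received)).decode())
```

The same loop works unchanged when the descriptors come from `os.open()` on real named pipes created with `os.mkfifo()`.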
 
