Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get all of that correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use the pool as a pool: send each job as its own task, and use the batching feature (`chunksize`) to batch them up in some appropriate (and tweakable) way.

That means that `runMatch` just takes a single needleID and needleCompany, and all it does is call `findNeedle` and then do whatever that `# Then store results` part does. And then the main program gets a lot simpler. (Note that using `Pool` as a context manager requires Python 3.3+; on 2.x, call `pool.close()` and `pool.join()` yourself.)

```python
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch,
                                 needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
```

Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared result-storing thing, just return them. Then you don't need `runMatch` at all, just:

```python
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle,
                                          needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            pass  # Store result here
```

Or, alternatively, if you *do* want to do exactly N batches, just create a `Process` for each one:

```python
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    # One Process per batch, each given the batch size and its own starting offset
    processes = [Process(target=runMatch,
                         args=(targetsPerBatch, start))
                 for start in xrange(0, totalTargets, targetsPerBatch)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

---

Also, you seem to be calling `getHayStack()` once for each task (and `getNeedles` as well). I'm not sure how easily you could end up with multiple copies of it live at the same time, but considering that it's by far the largest data structure you have, that would be the first thing I'd try to rule out. In fact, even if it's not a memory-usage problem, `getHayStack` could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.

One way to fix both potential problems at once is to use an initializer in the [`Pool`](http://docs.python.org/3.3/library/multiprocessing.html#multiprocessing.pool.Pool) constructor:

```python
def initPool():
    global _haystack
    _haystack = getHayStack()

def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...

if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
```

---

Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:

```python
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
```

If there's more than a handful of results, this is wasteful; just use the `results.values()` iterable directly.
(In fact, it looks like you're using Python 2.x, in which case `keys` and `values` are *already* lists, so you're just making an extra copy for no good reason.)

But in this case, you can simplify the whole thing even further. You're just looking for the key (resultID) and value (score) with the highest score, right? So (with `import operator` at the top of the module, and `iteritems()` instead of `items()` on 2.x if you want to avoid building the list):

```python
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
```

This also eliminates the repeated passes over `scores`, which should save some CPU.

---

This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.

The first thing to try is just to use much smaller batches: instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.

Next, try `sys.getsizeof(_haystack)` and see what it says (keeping in mind that for a `dict` this only counts the hash table itself, not the keys and values, so you may want to sum `getsizeof` over the contents as well for a rough total). If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it: e.g., use a [`shelve`](http://docs.python.org/3/library/shelve.html) database instead of a plain `dict`.

Also try dumping memory usage (with the [`resource`](http://docs.python.org/3/library/resource.html) module, `getrusage(RUSAGE_SELF)`) at the start and end of the initializer function; there's a sketch of this below. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two: build a `shelve` db in the first child, and open it read-only in the initializer (also sketched below). Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.

On the other hand, if your total VM usage is still low at the time the first task runs (note that `getrusage` doesn't directly have any way to see your total VM size; `ru_maxrss` is often a useful approximation, especially if `ru_nswap` is 0), the problem is with the tasks themselves.

First, `getsizeof` the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be that pickling and unpickling that data takes too much memory, and eventually 8 of them together are big enough to hit the limit.

Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a *real* leak via a buggy C extension module or `ctypes`, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling the multiprocessing out and just running the tasks directly, which is a lot easier to debug (see the last sketch below).
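---

To make the "build the haystack once, share it via `shelve`" idea concrete, here's a minimal sketch. It assumes `getHayStack()` and `numProcesses` are your existing names, that the haystack keys are (or can be converted to) strings as `shelve` requires, and that a hypothetical file path like `haystack.db` is acceptable; adjust to taste.

```python
import shelve
from multiprocessing import Pool, Process

HAYSTACK_DB = 'haystack.db'  # hypothetical path for the on-disk haystack

def buildHaystackDB():
    # Run this in a single child process so the CSV-parsing/dict-building
    # memory spike happens exactly once, and not in the parent.
    haystack = getHayStack()
    db = shelve.open(HAYSTACK_DB, flag='n')  # 'n': always create a new, empty db
    try:
        for key, value in haystack.items():
            db[str(key)] = value  # shelve keys must be strings
    finally:
        db.close()

def initPool():
    # Each worker opens the shelf read-only instead of holding its own
    # full in-memory copy of the haystack.
    global _haystack
    _haystack = shelve.open(HAYSTACK_DB, flag='r')

if __name__ == '__main__':
    builder = Process(target=buildHaystackDB)
    builder.start()
    builder.join()
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ... hand out tasks exactly as before ...
```

The lookups in `findNeedle` don't need to change, since a `Shelf` supports the same `[]` indexing as a `dict`; the trade-off is that each lookup now hits disk (plus the dbm cache) instead of RAM.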
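If you want to instrument the initializer as suggested above, it might look like this sketch (reusing the `initPool`/`getHayStack` names from earlier; note that `resource` is Unix-only, and that `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS):

```python
import resource
import sys

def initPool():
    global _haystack
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    _haystack = getHayStack()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss only ever grows, so the delta is a rough upper bound on
    # what building the haystack cost this worker process.
    print('worker init: ru_maxrss %d -> %d, getsizeof(haystack) = %d'
          % (before, after, sys.getsizeof(_haystack)))
```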
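And here's a throwaway harness for "pull out the multiprocessing and run the tasks directly". It assumes `findNeedle` takes a single (needleID, needleCompany) item, as in the `imap_unordered` version above, and that it reads the haystack from the `_haystack` global as in the initializer version (`iteritems()` matches your 2.x code; use `items()` on 3.x):

```python
import resource
import sys

if __name__ == '__main__':
    _haystack = getHayStack()  # what initPool() would normally do
    for item in needleCompanies.iteritems():
        result = findNeedle(item)
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print('%s: getsizeof(result)=%d ru_maxrss=%d'
              % (item[0], sys.getsizeof(result), rss))
```

If the memory high-water mark climbs steadily across tasks, you're probably holding references between calls; if a single task makes it spike, look at that task's inputs.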