Python multiprocess with pool workers - memory use optimization

I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding things up via parallel processing on an AWS h1.xlarge failed because I'm running out of memory.

Rather than trying to get more memory, as explained in response to [my previous question](https://stackoverflow.com/questions/18706532/python-multiprocess-using-pool-fails-on-aws-ubuntu), I'd like to find out how to optimize the workflow - I'm fairly new to this, so there should be plenty of room. By the way, I've already experimented with [queues](https://stackoverflow.com/questions/9038711/python-pool-with-worker-processes?rq=1) (that also worked, but ran into the same `MemoryError`) and looked through a bunch of very helpful SO contributions, but I'm not quite there yet.

Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:

```python
from collections import defaultdict
from multiprocessing import Pool
import itertools

## Not shown: isnull, levi (string-set ratio scorer), and numProcesses
## are defined/imported elsewhere.

def getHayStack():
    ## loads a few million company names into an id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads a subset of the 30K companies into an id: name dict
    ## (for allocation to workers)
    return needleCompanies

def findNeedle(needle, haystack):
    """Identify the best match in the haystack and return it with its score"""
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(hayCompany):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    bestID = resultIDs[scores.index(max(scores))]
    return [bestID, haystack[bestID], max(scores)]

def runMatch(args):
    """Execute findNeedle and process results for a poolWorker batch"""
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getNeedles(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results

if __name__ == '__main__':
    pool = Pool(processes=numProcesses)
    totalNeedles = len(getNeedles('all'))
    needlesPerBatch = totalNeedles / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(needlesPerBatch),
                                  xrange(0, totalNeedles, needlesPerBatch))
                   ).get(99999999)
    pool.close()
    pool.join()
```

So I guess the questions are: how can I avoid loading the haystack for every worker, e.g. by sharing the data or by taking a different approach such as dividing the much larger haystack across the workers rather than the needles? And how can I otherwise improve memory usage by avoiding or eliminating clutter?
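For instance, would loading the haystack once per worker through a `Pool` initializer, and only passing needle index ranges to the tasks, be a sensible direction? Below is a rough, untested sketch of what I mean - it reuses `getHayStack`, `getNeedles`, and `findNeedle` from above, and the worker count and batch size are just placeholders:

```python
from multiprocessing import Pool

hayCompanies = None  ## one copy per worker process, filled in by the initializer

def initWorker():
    """Runs once in each worker when the pool starts: load the haystack here
    instead of inside every batch call."""
    global hayCompanies
    hayCompanies = getHayStack()

def runBatch(args):
    """Match a slice of needles against the worker's already-loaded haystack."""
    first, last = args
    needleCompanies = getNeedles(first, last)
    results = {}
    for needleID, needleCompany in needleCompanies.iteritems():
        results[needleID] = findNeedle(needleCompany, hayCompanies)
    return results  ## or write straight to disk/db to keep the parent lean

if __name__ == '__main__':
    numProcesses = 4   ## placeholder
    batchSize = 1000   ## placeholder: smaller batches mean smaller result dicts
    totalNeedles = len(getNeedles('all'))
    batches = [(first, min(first + batchSize, totalNeedles))
               for first in xrange(0, totalNeedles, batchSize)]
    pool = Pool(processes=numProcesses, initializer=initWorker)
    for batchResults in pool.imap_unordered(runBatch, batches):
        pass  ## store each batch as it arrives instead of accumulating everything
    pool.close()
    pool.join()
```

If I understand correctly, each worker would then hold exactly one copy of the haystack for its whole lifetime (and on Linux, building the haystack in the parent before creating the pool should even let the workers share it via fork's copy-on-write, at least until reference counting dirties the pages) - but I'd appreciate confirmation that this is the right way to think about it, or a pointer to a better approach.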
 
