Find CPU-hogging plugin in multithreaded Python
    <p>I have a system written in Python that processes large amounts of data using plugins written by several developers with varying levels of experience.</p> <p>Basically, the application starts several worker threads, then feeds them data. Each thread determines the plugin to use for an item and asks it to process the item. A plugin is just a Python module with a specific function defined. The processing usually involves regular expressions, and should not take more than a second or so.</p> <p>Occasionally, one of the plugins will take <strong>minutes</strong> to complete, pegging the CPU at 100% for the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.</p> <p>This is where things get tricky. If I suspect which plugin is the culprit, I can examine its code and find the problem. However, sometimes I'm not so lucky.</p> <ul> <li>I can't go single-threaded. It would probably take <em>weeks</em> to reproduce the problem if I did.</li> <li>Putting a timer on the plugin doesn't help, because when it freezes it takes the GIL with it, and all the other plugins also take minutes to complete.</li> <li>(In case you were wondering, the <a href="http://bugs.python.org/issue1366311" rel="nofollow noreferrer">SRE engine doesn't release the GIL</a>).</li> <li>As far as I can tell, <a href="https://stackoverflow.com/questions/760039">profiling</a> is pretty useless when multithreading.</li> </ul> <p>Short of rewriting the whole architecture into multiprocessing, is there any way I can find out which plugin is eating all my CPU?</p> <p><strong>ADDED</strong>: In answer to some of the comments:</p> <ol> <li><p>Profiling multithreaded code in Python is not useful because the profiler measures the total (wall-clock) time spent in a function and not the active CPU time. Try <code>cProfile.run('time.sleep(3)')</code> to see what I mean.
(credit to <a href="https://stackoverflow.com/questions/653419/how-can-i-profile-a-multithread-program-in-python/653497#653497">rog</a> [last comment]).</p></li> <li><p>Going single-threaded is tricky because only 1 item in 20,000 causes the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while single-threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.</p></li> </ol> <p>That said, it's not a bad idea to try to serialize the specific code that calls the plugins, so that the timing of one will not affect the timing of the others. I'll try that and report back.</p>
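<p>One way to separate "this plugin burned CPU" from "this plugin was stuck waiting for the GIL" is to measure per-thread CPU time around each plugin call instead of wall-clock time. The sketch below assumes Python 3.7 or later, where <code>time.thread_time()</code> is available (it did not exist when this question was first asked). The names <code>timed_call</code>, <code>SLOW_CPU_SECONDS</code> and <code>slow_calls</code> are hypothetical and not part of the system described above; a thread blocked on the GIL should accumulate wall time but very little thread CPU time, so only the real culprit should cross the threshold.</p>

```python
import threading
import time

# Hypothetical threshold: plugins are expected to finish in about a second.
SLOW_CPU_SECONDS = 1.0

# Shared record of offending calls, guarded by a lock because workers run in threads.
slow_calls = []
slow_calls_lock = threading.Lock()


def timed_call(plugin_name, func, item, threshold=SLOW_CPU_SECONDS):
    """Run func(item) and record the call if THIS thread used too much CPU.

    time.thread_time() counts CPU time of the calling thread only, so time
    spent sleeping or waiting for another thread is not charged to this call.
    """
    cpu_start = time.thread_time()
    wall_start = time.perf_counter()
    try:
        return func(item)
    finally:
        cpu = time.thread_time() - cpu_start
        wall = time.perf_counter() - wall_start
        if cpu >= threshold:
            with slow_calls_lock:
                # Keep a truncated repr of the item so the bad input can be found later.
                slow_calls.append((plugin_name, repr(item)[:80], cpu, wall))
```

<p>Each worker thread would call something like <code>timed_call(name, plugin_function, item)</code> in place of calling the plugin function directly, and <code>slow_calls</code> can be dumped to a log at the end of the run. This is a sketch under the assumptions stated above, not a replacement for a proper profiler.</p>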