Throttling Popen() calls
    <p>How much danger is there in starting too many processes with Popen() before the earlier ones have finished?</p>
    <p>I am doing some processing on a directory filled with PDFs. I iterate over each file and do two things using external calls.</p>
    <p>First, I get an HTML representation from the Xpdf-based pdftohtml tool (pdfminer is too slow). This outputs only the first page:</p>
    <pre><code>html = check_output(['pdftohtml.exe','-f','1','-l','1','-stdout','-noframes',pdf]) </code></pre>
    <p>Then, if my conditions are met (I identify that it is the right document), I call tabula-extractor on it to extract a table. This is a slow, long-running process compared to checking the document, and it only happens on maybe 1 in 20 files.</p>
    <p>If I just do <code>call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', .....])</code>, I will spend a long time waiting for the extraction to complete while I could be checking more files (I've got 4 cores and 16 GB of RAM, and Tabula doesn't seem to multithread).</p>
    <p>So instead, I am using Popen() to avoid blocking.</p>
    <pre><code>Popen(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', '-o', csv, '-f', 'CSV', '-a', "'",topBorder, ',', leftBorder, ',', bottomBorder, ',', rightBorder, "'", '-p', '1', pdf]) # where csv is the name of the output file and pdf is the name of the input </code></pre>
    <p>I don't care about the return value (tabula creates a CSV file, so I can always check after the fact whether it was created successfully). Doing it this way means that I can keep checking files while extractions run in the background, and start more tabula processes as needed (again, only about 1 in 20).</p>
    <p>This works, but it gets backlogged and ends up running a ton of tabula processes at once. So my questions are: Is this bad? It makes the computer slow for anything else, but as long as it doesn't crash and is working as fast as it can, I don't really mind (all 4 cores sit at 100% the whole time, but memory usage doesn't go above 5.5 GB, so it appears CPU-bound).</p>
    <p>If it is bad, what is the right way to improve it? Is there a convenient way to, say, queue up tabula processes so that there are always 1-2 running per core, without trying to process 30 files at once?</p>
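One possible way to get the queueing behaviour asked about above is to hand each slow call to a small thread pool, so that the pool size caps how many child processes exist at once. This is only a minimal sketch, assuming Python 3 (or the `futures` backport on Python 2). The names `extract_matching`, `is_match` and `build_command` are hypothetical placeholders, not part of the original code: `is_match` stands in for the quick pdftohtml check and `build_command` for building the tabula argument list.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def extract_matching(pdfs, is_match, build_command, max_workers=4):
    """Run the slow command for matching files, at most max_workers at a time.

    is_match and build_command are hypothetical callables supplied by the
    caller: is_match(pdf) is the quick check, build_command(pdf) returns the
    argument list for the slow external tool.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for pdf in pdfs:
            # The quick check runs in the main thread and never waits
            # for an extraction to finish.
            if is_match(pdf):
                # Each worker thread blocks inside subprocess.call, so no
                # more than max_workers child processes run at once; any
                # further submissions wait in the pool's internal queue.
                futures[pdf] = pool.submit(subprocess.call, build_command(pdf))
        # Collect the exit code of every extraction that was started.
        return {pdf: future.result() for pdf, future in futures.items()}
```

With 4 cores, a `max_workers` value somewhere between 4 and 8 would correspond to the "1-2 running per core" target. The threads themselves do almost no work, since they only wait on the external process.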