<p>I have a program that parses log files and generates a data warehouse. A typical run involves around 200M log file lines and runs for the better part of a day. Well worth optimizing!</p>
<p>Since it's a parser, and parsing some rather variable, idiosyncratic, and untrustworthy text at that, there are around 100 regular expressions, dutifully compiled in advance with re.compile() and applied to each of the 200M log file lines. I was pretty sure they were my bottleneck, and had been pondering how to improve that situation. I had some ideas: on the one hand, make fewer, fancier REs; on the other, more and simpler ones; stuff like that.</p>
<p>I profiled with cProfile, and looked at the result in "runsnake".</p>
<p>RE processing was only about 10% of code execution time. That's not it!</p>
<p>In fact, a large square blob in the runsnake display instantly told me that about 60% of my time was spent in one of those infamous "one line changes" I'd added one day to eliminate non-printing characters (which appear occasionally, but always represent something so bogus I really don't care about it). These were confusing my parser and throwing exceptions, which I <em>did</em> care about because they halted my day of log file analysis.</p>
<blockquote> <p>line = ''.join([c for c in line if curses.ascii.isprint(c)])</p> </blockquote>
<p>There you go: that line touches every <strong>byte</strong> of every one of those 200M lines (and the lines average a couple hundred bytes long). No wonder it's 60% of my execution time!</p>
<p>There are better ways to handle this, I now know, such as str.translate(). But such lines are rare, I don't care about them anyway, and they end up throwing an exception, so now I just catch the exception at the right spot and skip the line. Voilà! The program's around 3X faster, instantly!</p>
<p>So the profiling:</p>
<ol> <li>highlighted, in around one second, where the problem actually was</li> <li>drew my attention away from a mistaken assumption about where the problem was (which might be the even greater pay-off)</li> </ol>
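<p>For anyone who wants to try the same kind of comparison, here is a minimal sketch, not my actual parser. It assumes Python 3 and that lines arrive as bytes; the function names, the sample line, and the workload size are all made up for illustration. It puts a per-byte filter next to a translate()-based one and runs each under cProfile.</p>

```python
import cProfile
import io
import pstats

# Bytes outside printable ASCII (32..126), the range curses.ascii.isprint() accepts.
NON_PRINTABLE = bytes(b for b in range(256) if not 32 <= b <= 126)


def strip_slow(line: bytes) -> bytes:
    """Per-byte Python loop, in the spirit of the original one-liner."""
    return bytes(b for b in line if 32 <= b <= 126)


def strip_fast(line: bytes) -> bytes:
    """Delete the same bytes in one pass with bytes.translate()."""
    return line.translate(None, NON_PRINTABLE)


def clean_all(lines, strip):
    """Hypothetical stand-in for the per-line work of the parser."""
    return [strip(line) for line in lines]


def profile_report(strip, lines, top=5):
    """Run clean_all under cProfile and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(clean_all, lines, strip)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


sample = b"GET /index.html\x00\x07 200\n"  # made-up log line
lines = [sample] * 20_000  # made-up workload, far smaller than a real run

print(profile_report(strip_slow, lines))
print(profile_report(strip_fast, lines))
```

<p>Both functions drop the trailing newline too, since it is not a printing character. In the first report I would expect the generator expression to account for most of the time, but measure on your own data rather than taking that on faith.</p>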
 
