    <p><code>ncalls</code> is relevant only to the extent that comparing the numbers against other counts, such as the number of chars/fields/lines in a file, may highlight anomalies; <code>tottime</code> and <code>cumtime</code> are what really matter. <code>cumtime</code> is the time spent in the function/method <em>including</em> the time spent in the functions/methods that it calls; <code>tottime</code> is the time spent in the function/method <em>excluding</em> the time spent in the functions/methods that it calls.</p>
<p>I find it helpful to sort the stats on <code>tottime</code> and again on <code>cumtime</code>, not on <code>name</code>.</p>
<p><code>bgchar</code> <em>definitely</em> refers to the execution of the script and is not irrelevant, as it takes up 8.9 seconds out of 13.5; that 8.9 seconds does NOT include time in the functions/methods that it calls! Read carefully what @Lie Ryan says about modularising your script into functions, and implement his advice. Likewise what @jonesy says.</p>
<p><code>string</code> is mentioned because you <code>import string</code> and use it in only one place: <code>string.find(elements[0], 'p')</code>. On another line in the output you'll notice that <code>string.find</code> was called only once, so it's not a performance problem in this run of this script. HOWEVER: you use <code>str</code> methods everywhere else. <code>string</code> functions are deprecated nowadays and are implemented by calling the corresponding <code>str</code> method. You would be better off writing <code>elements[0].find('p') == 0</code> for an exact but faster equivalent, and might like to use <code>elements[0].startswith('p')</code>, which would save readers wondering whether that <code>== 0</code> should actually be <code>== -1</code>.</p>
<p>The four methods mentioned by @Bernd Petersohn take up only 3.7 seconds out of a total execution time of 13.541 seconds. 
Before worrying too much about those, modularise your script into functions, run cProfile again, and sort the stats by <code>tottime</code>.</p>
<p><strong>Update after question revised with changed script:</strong></p>
<p>"Question: What can I do about join, split and write operations to reduce the apparent impact they have on the performance of this script?"</p>
<p>Huh? Those 3 together take 2.6 seconds out of the total of 13.8. Your parseJarchLine function is taking 8.5 seconds (which doesn't include time taken by functions/methods that it calls). <code>assert(8.5 &gt; 2.6)</code></p>
<p>Bernd has already pointed you at what you might consider doing with those. You are needlessly splitting the line completely only to join it up again when writing it out. You need to inspect only the first element. Instead of <code>elements = line.split('\t')</code> do <code>elements = line.split('\t', 1)</code> and replace <code>'\t'.join(elements[1:])</code> by <code>elements[1]</code>.</p>
<p>Now let's dive into the body of parseJarchLine. The number of uses in the source and manner of the uses of the <code>long</code> built-in function are astonishing. Also astonishing is the fact that <code>long</code> is not mentioned in the cProfile output.</p>
<p>Why do you need <code>long</code> at all? Files over 2 GB? OK, then you need to consider that since Python 2.2, <code>int</code> overflow causes promotion to <code>long</code> instead of raising an exception. You can take advantage of faster execution of <code>int</code> arithmetic. You also need to consider that doing <code>long(x)</code> when <code>x</code> is already demonstrably a <code>long</code> is a waste of resources.</p>
<p>Here is the parseJarchLine function with removing-waste changes marked [1] and changing-to-int changes marked [2]. 
Good idea: make changes in small steps, re-test, re-profile.</p>
<pre><code>def parseJarchLine(chromosome, line):
    global pLength
    global lastEnd
    elements = line.split('\t')
    if len(elements) &gt; 1:
        if lastEnd != "":
            start = long(lastEnd) + long(elements[0])
            # [1] start = lastEnd + long(elements[0])
            # [2] start = lastEnd + int(elements[0])
            lastEnd = long(start + pLength)
            # [1] lastEnd = start + pLength
            sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:])))
        else:
            lastEnd = long(elements[0]) + long(pLength)
            # [1] lastEnd = long(elements[0]) + pLength
            # [2] lastEnd = int(elements[0]) + pLength
            sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, long(elements[0]), lastEnd, '\t'.join(elements[1:])))
    else:
        if elements[0].startswith('p'):
            pLength = long(elements[0][1:])
            # [2] pLength = int(elements[0][1:])
        else:
            start = long(long(lastEnd) + long(elements[0]))
            # [1] start = lastEnd + long(elements[0])
            # [2] start = lastEnd + int(elements[0])
            lastEnd = long(start + pLength)
            # [1] lastEnd = start + pLength
            sys.stdout.write("%s\t%ld\t%ld\n" % (chromosome, start, lastEnd))
    return
</code></pre>
<p><strong>Update after question about <code>sys.stdout.write</code></strong></p>
<p>If the statement that you commented out was anything like the original one:</p>
<pre><code>sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:])))
</code></pre>
<p>Then your question is ... interesting. Try this:</p>
<pre><code>payload = "%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:]))
sys.stdout.write(payload)
</code></pre>
<p><em>Now</em> comment out the <code>sys.stdout.write</code> statement ...</p>
<p>By the way, someone mentioned in a comment about breaking this into more than one write ... have you considered this? How many bytes on average in elements[1:] ? 
In chromosome?</p>
<p>=== change of topic: It worries me that you initialise <code>lastEnd</code> to <code>""</code> rather than to zero, and that nobody has commented on it. Anyway, you should fix this, which allows a rather drastic simplification plus adding in others' suggestions:</p>
<pre><code>def parseJarchLine(chromosome, line):
    global pLength
    global lastEnd
    elements = line.split('\t', 1)
    if elements[0][0] == 'p':
        pLength = int(elements[0][1:])
        return
    start = lastEnd + int(elements[0])
    lastEnd = start + pLength
    sys.stdout.write("%s\t%ld\t%ld" % (chromosome, start, lastEnd))
    if elements[1:]:
        # keep the tab that '\t'.join(elements[1:]) used to be preceded by
        sys.stdout.write("\t" + elements[1])
    sys.stdout.write("\n")
</code></pre>
<p>Now I'm similarly worried about the two global variables <code>lastEnd</code> and <code>pLength</code> -- the parseJarchLine function is now so small that it can be folded back into the body of its sole caller, <code>extractData</code>, which saves two global variables and a gazillion function calls. You could also save a gazillion lookups of <code>sys.stdout.write</code> by putting <code>write = sys.stdout.write</code> once up the front of <code>extractData</code> and using that instead.</p>
<p>BTW, the script tests for Python 2.5 or better; have you tried profiling on 2.5 and 2.6?</p>
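<p>The folding suggested above can be sketched roughly as follows. This is an illustration only: <code>extract_data</code>, its signature, the <code>out</code> parameter and the sample input are assumptions, not the asker's actual <code>extractData</code>. It is written for Python 3 so that it runs as-is, and it assumes each input line is either <code>pN</code> (a new length) or an offset optionally followed by tab-separated fields.</p>

```python
import io
import sys


def extract_data(chromosome, lines, out=sys.stdout):
    """Hypothetical stand-in for extractData with parseJarchLine folded in."""
    write = out.write   # look the bound method up once, not once per line
    p_length = 0        # local variable instead of the global pLength
    last_end = 0        # starts at zero, not "", so no first-line special case
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Split off only the first field; the rest is written back untouched.
        elements = line.split("\t", 1)
        head = elements[0]
        if head.startswith("p"):
            p_length = int(head[1:])
            continue
        start = last_end + int(head)
        last_end = start + p_length
        if len(elements) > 1:
            write("%s\t%d\t%d\t%s\n" % (chromosome, start, last_end, elements[1]))
        else:
            write("%s\t%d\t%d\n" % (chromosome, start, last_end))


# Made-up sample input, for illustration only.
sample = ["p10\n", "5\tfoo\tbar\n", "3\n"]
buf = io.StringIO()
extract_data("chr1", sample, out=buf)
result = buf.getvalue()
```

<p>Passing the output stream in as a parameter is a choice made here so the sketch can be exercised against an in-memory buffer; re-profiling after a change like this, sorted by <code>tottime</code>, is what shows whether it actually helped.</p>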