<p>Do you mind too much if I talk a bit about profiling, what works and what doesn't?</p>

<p>Let's make up an artificial program, some of whose statements are doing work that can be optimized away - i.e. they are not really necessary. They are "bottlenecks".</p>

<p>Subroutine <code>foo</code> runs a CPU-bound loop that takes one second. Also assume subroutine CALL and RETURN instructions take insignificant or zero time, compared to everything else.</p>

<p>Subroutine <code>bar</code> calls <code>foo</code> 10 times, but 9 of those times are unnecessary, which you don't know in advance and can't tell until your attention is directed there.</p>

<p>Subroutines <code>A</code>, <code>B</code>, <code>C</code>, ..., <code>J</code> are 10 subroutines, and they each call <code>bar</code> once.</p>

<p>The top-level routine <code>main</code> calls each of <code>A</code> through <code>J</code> once.</p>

<p>So the total call tree looks like this:</p>

<pre><code>main
    A
        bar
            foo
            foo
            ... total 10 times for 10 seconds
    B
        bar
            foo
            foo
            ...
    ...
    J
        ...
(finished)
</code></pre>

<p>How long does it all take? 100 seconds, obviously.</p>

<p>Now let's look at profiling strategies. Stack samples (like say 1000 samples) are taken at uniform intervals.</p>

<ol>
<li><p>Is there any self time? Yes. <code>foo</code> takes 100% of the self time. It's a genuine "hot spot". Does that help you find the bottleneck? No, because the bottleneck is not in <code>foo</code>.</p></li>
<li><p>What is the hot path? Well, the stack samples look like this:</p>

<p>main -&gt; A -&gt; bar -&gt; foo (100 samples, or 10%)<br>
main -&gt; B -&gt; bar -&gt; foo (100 samples, or 10%)<br>
...<br>
main -&gt; J -&gt; bar -&gt; foo (100 samples, or 10%)</p></li>
</ol>

<p>There are 10 hot paths, and none of them look big enough to gain you much speedup.</p>

<p>IF YOU HAPPEN TO GUESS, and IF THE PROFILER ALLOWS, you could make <code>bar</code> the "root" of your call tree.
Then you would see this:</p>

<pre><code>bar -&gt; foo (1000 samples, or 100%)
</code></pre>

<p>Then you would know that <code>foo</code> and <code>bar</code> were each independently responsible for 100% of the time and therefore are places to look for optimization. You look at <code>foo</code>, but of course you know the problem isn't there. Then you look at <code>bar</code> and you see the 10 calls to <code>foo</code>, and you see that 9 of them are unnecessary. Problem solved.</p>

<p>IF YOU DIDN'T HAPPEN TO GUESS, and instead the profiler simply showed you the percent of samples containing each routine, you would see this:</p>

<pre><code>main 100%
bar  100%
foo  100%
A     10%
B     10%
...
J     10%
</code></pre>

<p>That tells you to look at <code>main</code>, <code>bar</code>, and <code>foo</code>. You see that <code>main</code> and <code>foo</code> are innocent. You look at where <code>bar</code> calls <code>foo</code> and you see the problem, so it's solved.</p>

<p>It's even clearer if, in addition to showing you the functions, you can be shown the lines where the functions are called. That way, you can find the problem no matter how large the functions are in terms of source text.</p>

<p>NOW, let's change <code>foo</code> so that it does <code>sleep(oneSecond)</code> rather than being CPU bound. How does that change things?</p>

<p>What it means is it still takes 100 seconds by the wall clock, but the CPU time is zero. Sampling in a CPU-only sampler will show <em>nothing</em>.</p>

<p>So now you are told to try instrumentation instead of sampling. Contained among all the things it tells you, it also tells you the percentages shown above, so in this case you could find the problem, assuming <code>bar</code> was not very big.
(There may be reasons to write small functions, but should satisfying the profiler be one of them?)</p>

<p>Actually, the main thing wrong with the sampler was that it can't sample during <code>sleep</code> (or I/O or other blocking), and it doesn't show you code line percents, only function percents.</p>

<p>By the way, 1000 samples gives you nice precise-looking percents. Suppose you took fewer samples. How many do you actually need to find the bottleneck? Well, since the bottleneck is on the stack 90% of the time, if you took only 10 samples, it would be on about 9 of them, so you'd still see it. Even if you took as few as 3 samples, the probability it would appear on two or more of them is 97.2%.**</p>

<p>High sample rates are way overrated when your goal is to find bottlenecks.</p>

<p>Anyway, that's why I rely on <a href="https://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024">random-pausing</a>.</p>

<p>** How did I get 97.2 percent? Think of it as tossing a coin 3 times, a very unfair coin, where "1" means seeing the bottleneck. There are 8 possibilities:</p>

<pre><code>      #1s  probability
0 0 0   0  0.1^3 * 0.9^0 = 0.001
0 0 1   1  0.1^2 * 0.9^1 = 0.009
0 1 0   1  0.1^2 * 0.9^1 = 0.009
0 1 1   2  0.1^1 * 0.9^2 = 0.081
1 0 0   1  0.1^2 * 0.9^1 = 0.009
1 0 1   2  0.1^1 * 0.9^2 = 0.081
1 1 0   2  0.1^1 * 0.9^2 = 0.081
1 1 1   3  0.1^0 * 0.9^3 = 0.729
</code></pre>

<p>so the probability of seeing it 2 or 3 times is .081*3 + .729 = .972</p>
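<p>If you want to double-check that arithmetic, the table is just a brute-force enumeration of three unfair coin tosses, and a short script (Python here, names purely illustrative) reproduces it:</p>

```python
from itertools import product

# In the toy program, the bottleneck (bar -> foo) is on the stack
# 90% of the wall-clock time, so each random sample "hits" it with p = 0.9.
p_hit = 0.9
n_samples = 3

# Enumerate all 2^3 sample outcomes, exactly as in the table above,
# and add up the probability of outcomes with two or more hits.
prob_two_or_more = 0.0
for outcome in product([0, 1], repeat=n_samples):
    hits = sum(outcome)
    prob = p_hit**hits * (1 - p_hit) ** (n_samples - hits)
    if hits >= 2:
        prob_two_or_more += prob

print(round(prob_two_or_more, 3))  # 0.972
```

<p>The same number falls out of the binomial formula directly: 3(0.9)&sup2;(0.1) + (0.9)&sup3; = 0.243 + 0.729 = 0.972.</p>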
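<p>One more footnote: the "percent of samples containing each routine" view from the example above can also be reproduced mechanically. This sketch builds the 1000 uniform-interval stack samples of the toy program (sampling made deterministic for clarity; the names are the ones from the example) and counts, for each routine, the fraction of samples whose stack contains it:</p>

```python
from collections import Counter

callers = [chr(c) for c in range(ord("A"), ord("J") + 1)]  # A .. J

# Each of A..J accounts for 10 seconds of the 100-second run; at
# 10 samples/second that is 100 samples per caller, and every one of
# those samples shows the stack main -> caller -> bar -> foo.
samples = [("main", caller, "bar", "foo") for caller in callers for _ in range(100)]

# Count samples *containing* each routine (not self time).
on_stack = Counter()
for stack in samples:
    for routine in set(stack):
        on_stack[routine] += 1

for routine in ["main", "bar", "foo"] + callers:
    print(f"{routine:4s} {100 * on_stack[routine] // len(samples):3d}%")
```

<p>This prints 100% for <code>main</code>, <code>bar</code>, and <code>foo</code>, and 10% for each of <code>A</code> through <code>J</code> - exactly the listing that points you at the 10 calls inside <code>bar</code>.</p>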