Update: [Keeping my original mistaken post below]

Ok. So we both had the math wrong (both in your question and in my initial answer). The sanity check that says (Total Count * Mean) should be roughly equal to the runtime is simply bogus math.

This is simply demonstrated with the following thought exercise. Consider a very simple scenario: You measure an application that runs perfectly (idle, all intervals showing 0 hiccup) for 100 seconds. Then you ^Z the system for 100 seconds. Then you let it run in the foreground again and immediately ^C it. Total run time is 200 seconds.

Here are some proper, "how a human would describe this in numbers" statements about such a system:

- The 99.99%'ile will be very close to 100,000 msec.
- The 10%'ile, 20%'ile, ... 50%'ile will all be very close to 0 msec.
- During the 1st 100 seconds (and the first 100,000 sample periods), the mean is 0 msec.
- During the 2nd 100 seconds (and the second 100,000 sample periods), the mean is 50,000 msec.
- The overall mean is 25,000 msec (over the total run time of 200 seconds).

But there will (properly) be 200,000 results reported in the TotalCount for the histogram (one for each 1 msec of elapsed time).

So (Mean * Total Count) = 25,000 * 200,000 = 5,000,000,000

That's 25,000x as much as the wall clock time, and there is nothing wrong with that. There is an ~N^2 thing going on in that multiplication (actually N * (N-1) / 2), and that's exactly how it should properly behave...

The key here is that for each percentile, the histogram [correctly] reports the percentage of overall random samples (performed without coordination over time) that would encounter hiccup levels below a certain magnitude. When a single long freeze occurs (e.g. the 2,654 msec max seen in your sample output), that's not 1 point in time that would be affected. There would be as many affected sample points in time as there are milliseconds in the freeze. Properly measured and represented, a single freeze would generate roughly 1 count per msec, with values varying linearly between the max value and 0 (2654 msec, 2653 msec, 2652 msec, ... 3 msec, 2 msec, 1 msec). This is simply the right way to calculate the spread of the possible latencies that a random, uncoordinated sample would have encountered.

I'm somewhat embarrassed, because my original answer repeated the same mistaken assumption that I warn about (with graphs in slides that discuss exactly the above scenario) in my talk titled "How NOT to Measure Latency" (http://www.infoq.com/presentations/latency-pitfalls , under "The coordinated omission problem", about 32 minutes in). I built both jHiccup and HdrHistogram with the right math to avoid this problem, which is why you see the right results in your data.

In fact, had your sanity check shown "good" numbers, with max times in the 1000s of msec, that would be a good indication that the measurement system is exhibiting coordinated omission...
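To see this math in action, here is a quick sketch using HdrHistogram (the histogram bounds, units, and the 1 msec expected interval below are just illustrative choices, not something your setup has to match). recordValueWithExpectedInterval() performs exactly the "1 count per msec, linearly decreasing" back-fill described above:

```java
import org.HdrHistogram.Histogram;

public class FreezeThoughtExercise {
    public static void main(String[] args) {
        // Hiccup magnitudes in msec, trackable up to 1 hour, 3 significant digits.
        Histogram histogram = new Histogram(3_600_000L, 3);

        // First 100 seconds: one sample per msec, all showing 0 hiccup.
        for (int i = 0; i < 100_000; i++) {
            histogram.recordValue(0);
        }

        // The 100-second ^Z freeze, recorded with coordinated omission
        // correction: one count per missed 1 msec sampling interval, with
        // values 100000, 99999, ..., 1 msec.
        histogram.recordValueWithExpectedInterval(100_000, 1);

        System.out.println("Total count:  " + histogram.getTotalCount());      // ~200,000
        System.out.println("Mean (msec):  " + histogram.getMean());            // ~25,000
        System.out.println("Mean * count: "
                + histogram.getMean() * histogram.getTotalCount());            // ~5,000,000,000
        System.out.println("50%'ile:      "
                + histogram.getValueAtPercentile(50.0) + " msec");             // ~0
        System.out.println("99.99%'ile:   "
                + histogram.getValueAtPercentile(99.99) + " msec");            // ~100,000
    }
}
```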
-- Gil.

-------- Original Answer, which was "wrong" -----------

Can you post an example .hgrm data set that shows this? Files I just looked at seem to show a close relationship between the total instrumented run time and the counts in the .hgrm file. The simplest sanity test is that Total Count * Mean (both reported at the bottom of the file) should be very close to the runtime. That data comes from the same histogram that produces the percentile lines...

Note that the counts at the percentile lines are not additive, for a pretty simple reason: two percentile levels can be reported for the same exact value. This is quite common for long runs that have fairly little noise in them. However, your math above (only counting the delta counts) "should" work right.

[Small note: your math is off by 1 line. Each percentile line shows the percentile and the count of values at or below the value on that line, so each value should be multiplied by the delta between the count at the next line (n+1) and the count at "this" line (n). But that should not have made a big difference in your measurement.]

Below is a sample file from a control run that took ~1685 seconds, where jHiccup started instrumenting 30 seconds into the run (so ~1655 seconds of instrumented time). Using Excel on this data set, the result of:

Sum( (TotalCountIncludingThisValue[n+1] - TotalCountIncludingThisValue[n]) * Value[n] )

shows about ~1623.7 seconds, and Mean * Total Count shows ~1659.1 seconds.

jHiccup histogram report, Mon Apr 22 15:28:07 CDT 2013:

Value, Percentile, TotalCountIncludingThisValue

```
 1.01  0.000000000000  47
 1.06  0.100000000000  483774
 1.06  0.200000000000  483774
 1.06  0.300000000000  483774
 1.10  0.400000000000  628589
 1.11  0.500000000000  804685
 1.12  0.550000000000  1024190
 1.12  0.600000000000  1024190
 1.12  0.650000000000  1024190
 1.13  0.700000000000  1388741
 1.13  0.750000000000  1388741
 1.13  0.775000000000  1388741
 1.13  0.800000000000  1388741
 1.13  0.825000000000  1388741
 1.13  0.850000000000  1388741
 1.13  0.875000000000  1388741
 1.13  0.887500000000  1388741
 1.13  0.900000000000  1388741
 1.13  0.912500000000  1388741
 1.14  0.925000000000  1471124
 1.14  0.937500000000  1471124
 1.14  0.943750000000  1471124
 1.14  0.950000000000  1471124
 1.14  0.956250000000  1471124
 1.14  0.962500000000  1471124
 1.14  0.968750000000  1471124
 1.14  0.971875000000  1471124
 1.14  0.975000000000  1471124
 1.14  0.978125000000  1492169
 1.14  0.981250000000  1492169
 1.14  0.984375000000  1492169
 1.14  0.985937500000  1492169
 1.14  0.987500000000  1492169
 1.14  0.989062500000  1492169
 1.15  0.990625000000  1500639
 1.15  0.992187500000  1500639
 1.15  0.992968750000  1500639
 1.15  0.993750000000  1500639
 1.15  0.994531250000  1500639
 1.16  0.995312500000  1504895
 1.16  0.996093750000  1504895
 1.16  0.996484375000  1504895
 1.16  0.996875000000  1504895
 1.16  0.997265625000  1504895
 1.16  0.997656250000  1504895
 1.17  0.998046875000  1506535
 1.17  0.998242187500  1506535
 1.17  0.998437500000  1506535
 1.17  0.998632812500  1506535
 1.17  0.998828125000  1506535
 1.18  0.999023437500  1507442
 1.18  0.999121093750  1507442
 1.18  0.999218750000  1507442
 1.18  0.999316406250  1507442
 1.18  0.999414062500  1507442
 1.18  0.999511718750  1507836
 1.18  0.999560546875  1507836
 1.18  0.999609375000  1507836
 1.18  0.999658203125  1507836
 1.18  0.999707031250  1507836
 1.19  0.999755859375  1508028
 1.19  0.999780273438  1508028
 1.19  0.999804687500  1508028
 1.19  0.999829101563  1508028
 1.20  0.999853515625  1508106
 1.20  0.999877929688  1508106
 1.21  0.999890136719  1508145
 1.21  0.999902343750  1508145
 1.22  0.999914550781  1508163
 1.22  0.999926757813  1508174
 1.25  0.999938964844  1508186
 1.34  0.999945068359  1508195
 1.82  0.999951171875  1508204
 2.42  0.999957275391  1508213
 3.54  0.999963378906  1508222
 4.74  0.999969482422  1508231
 5.09  0.999972534180  1508236
 5.98  0.999975585938  1508241
 6.24  0.999978637695  1508245
 7.01  0.999981689453  1508251
 7.97  0.999984741211  1508254
 8.26  0.999986267090  1508257
 8.96  0.999987792969  1508259
 9.02  0.999989318848  1508261
 9.98  0.999990844727  1508265
10.24  0.999992370605  1508266
11.01  0.999993133545  1508268
11.01  0.999993896484  1508268
11.97  0.999994659424  1508269
13.95  0.999995422363  1508271
14.98  0.999996185303  1508272
14.98  0.999996566772  1508272
16.00  0.999996948242  1508273
16.00  0.999997329712  1508273
17.02  0.999997711182  1508274
18.05  0.999998092651  1508275
18.05  0.999998283386  1508275
18.05  0.999998474121  1508275
18.05  0.999998664856  1508275
19.07  0.999998855591  1508276
19.07  0.999999046326  1508276
19.07  0.999999141693  1508276
19.07  0.999999237061  1508276
19.07  0.999999332428  1508276
19.97  0.999999427795  1508277
19.97  1.000000000000  1508277
#[Mean = 1.10, StdDeviation = 0.06]
#[Max = 19.84, Total count = 1508277]
#[Buckets = 25, SubBuckets = 256]
```
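If you'd rather not do the delta-count sum in Excel, a small sketch like the following recomputes it from the percentile lines. (The parsing details here are assumptions about the report layout shown above; this is not a jHiccup or HdrHistogram utility.)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Recomputes Sum((TotalCount[n+1] - TotalCount[n]) * Value[n]) over the
// percentile lines of an .hgrm report, the same calculation done in Excel.
public class HgrmDeltaSum {
    public static void main(String[] args) throws IOException {
        List<double[]> rows = new ArrayList<>(); // each row: {value, totalCount}
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) continue; // skip headers, footers, blanks
            try {
                rows.add(new double[] {
                        Double.parseDouble(fields[0]),    // Value (msec)
                        Double.parseDouble(fields[2]) }); // TotalCountIncludingThisValue
            } catch (NumberFormatException e) {
                // not a percentile line (e.g. the "Value, Percentile, ..." header)
            }
        }
        double sumMsec = 0;
        for (int n = 0; n + 1 < rows.size(); n++) {
            double deltaCount = rows.get(n + 1)[1] - rows.get(n)[1];
            sumMsec += deltaCount * rows.get(n)[0];
        }
        System.out.printf("Sum of (deltaCount * value): ~%.1f seconds%n",
                sumMsec / 1000.0);
    }
}
```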