I think memory cache misses explain the non-linear growth of the execution time. On my x86_64 PC (Linux + gcc), `sizeof(double)` is 8 and `sizeof(point_t)` is 32, so fewer elements fit in the cache. But a bigger reason for the non-linear growth is that the accesses to the `point_t` structures through the pointer array in your code quickly become highly randomized (`quickselect()` rearranges the pointers, not the structures they point to), so more and more cache misses occur.

I changed the code as follows:

```
--- test2.c
+++ test3.c
@@ -80,14 +80,12 @@
     if (a == NULL) printf("Error in an");
 
-    for (i = 0; i < N[4]; i++) {
-        ap[i] = a + i;
-    }
-
     for (i = 0; i < 5; i++) {
         start = clock();
         for (j = 0; j < ntest; j++) {
             for (k = 0; k < N[i]; k++) {
+                ap[k] = a + k;
                 ap[k]->x = (double) rand() / (double) RAND_MAX;
             }
```

and the growth of the execution time is more linear.

Original `quickselect()` with a `double` array:

```
    1000      0.00000
   10000      0.04000
  100000      0.22000
 1000000      1.98000
10000000     20.73000
```

Original `quickselect()` with a `point_t *` array:

```
    1000      0.01000
   10000      0.02000
  100000      0.71000
 1000000      8.64000
10000000    157.77000
```

Exactly the same `quickselect()` with a `point_t *` array as above, but with the pointers guaranteed to be in consecutive order before each `quickselect()` call, by applying the patch above:

```
    1000      0.00000
   10000      0.02000
  100000      0.40000
 1000000      4.71000
10000000     49.80000
```

Note that even though the modified version does the extra pointer rearrangement inside the timing loop, it is still faster.

I am running a 3.2 GHz Pentium(R) Dual-Core CPU E6700, 64-bit Linux, gcc 4.6, optimization `-O2`. (My machine is not idle, so my benchmark figures show some fluctuation. Also, for more serious benchmarks on Linux I would consider using `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)` for better accuracy, to exclude the time when the kernel has scheduled some other process to run.)
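For what it's worth, a minimal sketch of such a timer (this is an illustration, not part of the benchmarked programs; on older glibc you may need to link with `-lrt`):

```
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime() in <time.h> */
#include <stdio.h>
#include <time.h>

/* CPU time consumed by this process, in seconds.  Unlike wall-clock time,
   CLOCK_PROCESS_CPUTIME_ID does not advance while the kernel has scheduled
   some other process to run. */
static double cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

int main(void)
{
    double start = cpu_seconds();
    /* ... benchmarked code here ... */
    printf("%.5f seconds\n", cpu_seconds() - start);
    return 0;
}
```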
**UPDATE:** For example, `valgrind` (if supported by your platform) can be used to analyse the impact of the cache misses. I modified the program to take two arguments, the array size (corresponding to an element of the array `N[]`) and the test count (corresponding to `ntest`). Execution times without `valgrind`, where `test2` is essentially the unmodified program listed in the question and `test4` is the modified version which rearranges the `ap[]` array before calling the `quickselect()` function:

```
bash$ ./test2 10000000 100
Array of 10000000 elements, 100 times: 154.40000 seconds
bash$ ./test4 10000000 100
Array of 10000000 elements, 100 times: 48.45000 seconds
```

Here is the result of running `valgrind` with the cachegrind tool:

```
bash$ valgrind --tool=cachegrind ./test2 10000000 100
==23563== Cachegrind, a cache and branch-prediction profiler
==23563== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==23563== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==23563== Command: ./test2 10000000 100
==23563==
Array of 10000000 elements, 100 times: 1190.24000 seconds
==23563==
==23563== I   refs:      80,075,384,594
==23563== I1  misses:    1,091
==23563== LLi misses:    1,087
==23563== I1  miss rate: 0.00%
==23563== LLi miss rate: 0.00%
==23563==
==23563== D   refs:      36,670,292,139  (25,550,518,762 rd + 11,119,773,377 wr)
==23563== D1  misses:    4,218,722,190   ( 3,223,975,942 rd +    994,746,248 wr)
==23563== LLd misses:    4,190,889,241   ( 3,198,934,125 rd +    991,955,116 wr)
==23563== D1  miss rate: 11.5% (12.6% + 8.9%)
==23563== LLd miss rate: 11.4% (12.5% + 8.9%)
==23563==
==23563== LL refs:       4,218,723,281   ( 3,223,977,033 rd +    994,746,248 wr)
==23563== LL misses:     4,190,890,328   ( 3,198,935,212 rd +    991,955,116 wr)
==23563== LL miss rate:  3.5% (3.0% + 8.9%)
```

and

```
bash$ valgrind --tool=cachegrind ./test4 10000000 100
==24436== Cachegrind, a cache and branch-prediction profiler
==24436== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==24436== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==24436== Command: ./test4 10000000 100
==24436==
Array of 10000000 elements, 100 times: 862.89000 seconds
==24436==
==24436== I   refs:      82,985,384,485
==24436== I1  misses:    1,093
==24436== LLi misses:    1,089
==24436== I1  miss rate: 0.00%
==24436== LLi miss rate: 0.00%
==24436==
==24436== D   refs:      36,640,292,192  (24,530,518,829 rd + 12,109,773,363 wr)
==24436== D1  misses:    2,814,232,350   ( 2,189,229,679 rd +    625,002,671 wr)
==24436== LLd misses:    2,796,287,872   ( 2,171,294,250 rd +    624,993,622 wr)
==24436== D1  miss rate: 7.6% (8.9% + 5.1%)
==24436== LLd miss rate: 7.6% (8.8% + 5.1%)
==24436==
==24436== LL refs:       2,814,233,443   ( 2,189,230,772 rd +    625,002,671 wr)
==24436== LL misses:     2,796,288,961   ( 2,171,295,339 rd +    624,993,622 wr)
==24436== LL miss rate:  2.3% (2.0% + 5.1%)
```

See the [Valgrind manual](http://valgrind.org/docs/manual/cg-manual.html) for how to read these statistics. An essential point: *"On a modern machine, an L1 miss will typically cost around 10 cycles, an LL miss can cost as much as 200 cycles"*. Calculating the cost of the difference in LLd (*last-level data* cache) misses between the two cases (the difference in miss counts, times the assumed 200 cycles, divided by 3.2e9 cycles/second for the 3.2 GHz CPU) gives

```
bash$ echo $(( (4190889241 - 2796287872) * 200 / 3200000000 )) seconds
87 seconds
```

The D1 misses contribute rather little here: the difference of about 1.4e9 D1 misses at roughly 10 cycles each adds only about 4 seconds, for a total of roughly 91 seconds (assuming the D1 miss cost is independent of the LLd miss cost). Given all our inaccuracies, most notably about the actual cost of an LLd miss on this computer, the D1 misses can just as well be ignored.

The run-time difference between `test2` and `test4` comes to about 106 seconds, which is reasonably close to the 87 seconds estimated above. This could all be made more accurate, but it already seems to demonstrate the effect of the cache misses in this test arrangement.

P.S. `valgrind` writes a log file, in which I could verify that it seemed to detect the L2 and L1 cache sizes and types correctly.
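P.P.S. If you want to reproduce the pointer-order effect in isolation, here is a minimal self-contained sketch along the same lines. This is *not* the program from the question; the four-`double` `point_t` layout is just an assumption to get a 32-byte element. Compile with e.g. `gcc -O2 -std=c99`:

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed 32-byte element; any struct of this size shows the same effect. */
typedef struct {
    double x, y, z, w;
} point_t;

#define N 10000000

/* Touch every element once through the pointer array. */
static double touch_all(point_t **ap, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ap[i]->x;          /* one dependent load per element */
    return sum;
}

int main(void)
{
    point_t *a = malloc(N * sizeof *a);
    point_t **ap = malloc(N * sizeof *ap);
    double sum;
    clock_t start;

    if (a == NULL || ap == NULL)
        return 1;

    for (size_t i = 0; i < N; i++) {
        a[i].x = (double) rand() / (double) RAND_MAX;
        ap[i] = a + i;            /* pointers in consecutive order */
    }

    start = clock();
    sum = touch_all(ap, N);
    printf("consecutive: %.5f s (checksum %.3f)\n",
           (double) (clock() - start) / CLOCKS_PER_SEC, sum);

    /* Fisher-Yates shuffle of the pointers: same data, random access order.
       The modulo bias of rand() is irrelevant for this demonstration. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t) rand() % (i + 1);
        point_t *tmp = ap[i];
        ap[i] = ap[j];
        ap[j] = tmp;
    }

    start = clock();
    sum = touch_all(ap, N);
    printf("shuffled:    %.5f s (checksum %.3f)\n",
           (double) (clock() - start) / CLOCKS_PER_SEC, sum);

    free(ap);
    free(a);
    return 0;
}
```

The shuffled pass should come out clearly slower than the consecutive one; if a single pass is too fast to time reliably on your machine, wrap the timed calls in a repeat loop.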