StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POMemory performance/cache puzzle
text
Body
copied!<p>I have a memory performance puzzle. I'm trying to benchmark how long it takes to fetch a byte from main memory, and how various BIOS settings and memory hardware parameters influence it. I wrote the following code for Windows that, in a loop, flushes the cache by reading another buffer and then reads a target buffer a single byte at a time with varying strides. I figure that once the stride is the cache line size, that's the quantity I'm trying to measure since each read goes to main memory. Here's the benchmark code (note that the size of the buffer is the stride x 1MB and that I pin the thread to core 1): </p> <pre><code>#include <stdio.h> #include <memory.h> #define NREAD (1024*1024) #define CACHE_SIZE (50*1024*1024) char readTest(int stride) { LARGE_INTEGER frequency; LARGE_INTEGER start; LARGE_INTEGER end; int rep, i,ofs; double time, min_time=1e100, max_time=0.0, mean_time=0.0; char *buf = (char *)malloc(NREAD*stride); char *flusher = (char *)malloc(CACHE_SIZE); char jnk=0; for(rep=0; rep<255; rep++) { // read the flusher to flush the cache for(ofs = 0; ofs<CACHE_SIZE; ofs+=64) jnk+=flusher[ofs]; if (QueryPerformanceFrequency(&frequency) == FALSE) exit(-1); if (QueryPerformanceCounter(&start) == FALSE) exit(-2); // here's the timed loop for(ofs=0; ofs<NREAD*stride; ofs+=stride) jnk += buf[ofs]; if (QueryPerformanceCounter(&end) == FALSE) exit(-3); time = (double)(end.QuadPart - start.QuadPart) / (double)frequency.QuadPart*1e6; max_time = time > max_time ? time : max_time; min_time = time < min_time ? time : min_time; mean_time += time; } mean_time /= 255; printf("Stride = %4i, Max: %6.0f us, Min: %6.0f us, Mean: %6.0f us, B/W: %4.0f MB/s\n", stride, max_time, min_time, mean_time, NREAD/min_time); free(buf); free(flusher); return jnk; } int main(int argc, char* argv[]) { SetThreadAffinityMask(GetCurrentThread(), 1); // pin to core 1 to avoid weirdness // run the tests readTest(1); readTest(2); readTest(4); readTest(6); readTest(8); readTest(12); readTest(16); readTest(24); readTest(32); readTest(48); readTest(64); readTest(96); readTest(128); readTest(192); readTest(256); readTest(384); readTest(512); readTest(768); readTest(1024); readTest(1536); return 0; } </code></pre> <p>The inner loop that is timed assembles as:</p> <pre><code> // here's the timed loop for(ofs=0; ofs<NREAD*stride; ofs+=stride) jnk += buf[ofs]; 00F410AF xor eax,eax 00F410B1 test edi,edi 00F410B3 jle readTest+0C2h (0F410C2h) 00F410B5 mov edx,dword ptr [buf] 00F410B8 add bl,byte ptr [eax+edx] 00F410BB add eax,dword ptr [stride] 00F410BE cmp eax,edi 00F410C0 jl readTest+0B5h (0F410B5h) </code></pre> <p>I ran this on a dual-processor E5-2609 machine, and here are the results:</p> <pre><code>Stride = 1, Max: 2362 us, Min: 937 us, Mean: 950 us, B/W: 1119 MB/s Stride = 2, Max: 1389 us, Min: 968 us, Mean: 978 us, B/W: 1083 MB/s Stride = 4, Max: 1694 us, Min: 1026 us, Mean: 1037 us, B/W: 1022 MB/s Stride = 6, Max: 2418 us, Min: 1098 us, Mean: 1124 us, B/W: 955 MB/s Stride = 8, Max: 2835 us, Min: 1234 us, Mean: 1252 us, B/W: 850 MB/s Stride = 12, Max: 4203 us, Min: 1527 us, Mean: 1559 us, B/W: 687 MB/s Stride = 16, Max: 5130 us, Min: 1816 us, Mean: 1849 us, B/W: 577 MB/s Stride = 24, Max: 7370 us, Min: 2408 us, Mean: 2449 us, B/W: 435 MB/s Stride = 32, Max: 10039 us, Min: 2901 us, Mean: 3014 us, B/W: 361 MB/s Stride = 48, Max: 14248 us, Min: 4652 us, Mean: 4731 us, B/W: 225 MB/s Stride = 64, Max: 19149 us, Min: 6340 us, Mean: 6447 us, B/W: 165 MB/s Stride = 96, Max: 28848 us, Min: 8475 us, Mean: 8615 us, B/W: 124 MB/s Stride = 128, Max: 37449 us, Min: 9900 us, Mean: 10160 us, B/W: 106 MB/s Stride = 192, Max: 51718 us, Min: 11282 us, Mean: 11563 us, B/W: 93 MB/s Stride = 256, Max: 62193 us, Min: 11558 us, Mean: 11924 us, B/W: 91 MB/s Stride = 384, Max: 86943 us, Min: 11829 us, Mean: 12260 us, B/W: 89 MB/s Stride = 512, Max: 108661 us, Min: 11847 us, Mean: 12401 us, B/W: 89 MB/s Stride = 768, Max: 167951 us, Min: 11797 us, Mean: 12946 us, B/W: 89 MB/s Stride = 1024, Max: 211700 us, Min: 12893 us, Mean: 13979 us, B/W: 81 MB/s Stride = 1536, Max: 332214 us, Min: 12967 us, Mean: 15077 us, B/W: 81 MB/s </code></pre> <p>Here are my questions:</p> <ul> <li>Why does the performance continue to degrade after the stride is larger than the cache-line size (64 bytes for Sandy Bridge)? I would assume that the worst performance would occur once the stride is large enough to require a cache-line transfer for every read, but even after that the time increases by a factor of two... What am I missing?</li> <li>Why is the max time (which happens on the first iteration of the loop) 2-4x longer than the minimum time? I'm flushing the cache every iteration...</li> </ul>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload