<p>So, I've been thinking all this over.</p> <p>Currently, I have two separate proposals for how CAS is handled - 'cache lock' and MESI.</p> <p>This post is purely with regard to cache lock.</p> <p>Cache lock posits that a core locks the given cache line, and other cores attempting to CAS on that cache line stall until the lock is released.</p> <p>Furthermore, I believe also that CAS always writes its result back to memory before completing.</p> <p>Taking that theory, let's look at the benchmark and try to interpret the results.</p>
<pre><code>Release 7 Lock-Free Freelist Benchmark #1

        M
        N
     S
   L3U
 L2U  L2U
L1D L1D L1I L1I
 P    P
L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1  310134488,31013449,0,1.00
0 1 0 1  136313300,6815665,38365,0.22
0 1 0 1  136401284,6820064,50706,0.22
1 1 1 1  111134328,2778358,23851,0.09
0 0 1 1  334747444,16737372,2421,0.54
1 1 1 1  111105898,2777647,40399,0.09
</code></pre>
<p>So, first the single thread case;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1  310134488,31013449,0,1.00
</code></pre>
<p>Here we have maximum performance. Every 'slot' is used by the single thread.</p> <p>Now we come to two threads on the same core;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 1 1  334747444,16737372,2421,0.54
</code></pre>
<p>Here we still of course have the same number of 'slots' - a CAS takes as long as it takes - but we see they're evenly distributed between the logical processors. This makes sense; one logical core locks the cache line, the other stalls, the first completes, the second gets the lock... they alternate.
The destination remains in L1 cache, with the cache line in the modified state; we never need to re-read the destination from memory, so in that sense we're just like the one-thread case.</p> <p>Now we come to two threads on different cores;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 1 0 1  136401284,6820064,50706,0.22
</code></pre>
<p>Here we see our first big slowdown. Our maximum theoretical scaling is 0.5, but we're at 0.22. How come? Well, each thread is trying to lock the same cache line (in its own cache, of course), which is fine - but the problem is that when a core gets the lock, it needs to re-read the destination from memory, because its cache line will have been marked invalid by the other core modifying its copy of the data. So we attribute the slowdown to the memory reads we're having to do.</p> <p>Now we come to four threads, two per core.</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
1 1 1 1  111105898,2777647,40399,0.09
</code></pre>
<p>Here we see the total number of ops is actually only slightly less than with one thread per core, although of course the scaling is much worse, since we now have four threads, not two.</p> <p>In the one thread per core scenario, every CAS begins with a read of memory, since the other core has invalidated the CASing core's cache line.</p> <p>In this scenario, when a core finishes a CAS and releases the cache lock, three threads are competing for the lock: two on the other core, one on the same core. So two thirds of the time we need to re-read memory at the start of a CAS; one third of the time we do not.</p> <p>So we should be FASTER.
But we are in fact SLOWER.</p>
<pre><code>  0% memory re-reading gives 33,474,744.4 total ops per second (two threads, same core)
 66% memory re-reading gives 11,110,589.8 total ops per second (four threads, two per core)
100% memory re-reading gives 13,640,128.4 total ops per second (two threads, one per core)
</code></pre>
<p>And this puzzles me. The observed facts do not fit the theory.</p>