<p>So, I've been thinking all this over.</p> <p>Currently, I have two separate proposals for how CAS is handled - 'cache lock' and MESI.</p> <p>This post is purely with regard to cache lock.</p> <p>Cache lock posits that a core locks the given cache line, and other cores attempting to CAS on that cache line stall until the lock is released.</p> <p>Furthermore, I believe also that CAS always writes its result back to memory before completing.</p> <p>Taking that theory, let's look at the benchmark and try to interpret the results.</p>
<pre><code>Release 7 Lock-Free Freelist Benchmark #1

        M
        N
     S
   L3U
 L2U  L2U
L1D L1D L1I L1I
 P    P
L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1  310134488,31013449,0,1.00
0 1 0 1  136313300,6815665,38365,0.22
0 1 0 1  136401284,6820064,50706,0.22
1 1 1 1  111134328,2778358,23851,0.09
0 0 1 1  334747444,16737372,2421,0.54
1 1 1 1  111105898,2777647,40399,0.09
</code></pre>
<p>So, first the single thread case;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1  310134488,31013449,0,1.00
</code></pre>
<p>Here we have maximum performance. Every 'slot' is used by the single thread.</p> <p>Now we come to two threads on the same core;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 0 1 1  334747444,16737372,2421,0.54
</code></pre>
<p>Here we still of course have the same number of 'slots' - a CAS takes as long as it takes - but we see they're evenly distributed between the logical processors. This makes sense; one logical core locks the cache line, the other stalls, the first completes, the second gets the lock... they alternate.
The destination remains in L1 cache, with the cache line in the modified state; we never need to re-read the destination from memory, so in that sense we're just like the one-thread case.</p> <p>Now we come to two threads on different cores;</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
0 1 0 1  136401284,6820064,50706,0.22
</code></pre>
<p>Here we see our first big slowdown. Our maximum theoretical scaling is 0.5, but we're at 0.22. How come? Well, each thread is trying to lock the same cache line (in its own cache, of course), which is fine - but the problem is that when a core gets the lock, it needs to re-read the destination from memory, because its cache line will have been marked invalid by the other core modifying its copy of the data. So we attribute the slowdown to the memory reads we're having to do.</p> <p>Now we come to four threads, two per core.</p>
<pre><code>L L L L  total ops,mean ops/sec per thread,standard deviation,scalability
1 1 1 1  111105898,2777647,40399,0.09
</code></pre>
<p>Here we see the total number of ops is actually only slightly less than with one thread per core, although of course the scaling is much worse, since we now have four threads, not two.</p> <p>In the one thread per core scenario, every CAS begins with a read of memory, since the other core has invalidated the CASing core's cache line.</p> <p>In this scenario, when a core finishes a CAS and releases the cache lock, three threads are competing for the lock: two on the other core, one on the same core. So two thirds of the time we need to re-read memory at the start of a CAS; one third of the time we do not.</p> <p>So we should be FASTER.
But we are in fact SLOWER.</p>
<pre><code>  0% memory re-reading gives 33,474,744.4 total ops per second (two threads, same core)
 66% memory re-reading gives 11,110,589.8 total ops per second (four threads, two per core)
100% memory re-reading gives 13,640,128.4 total ops per second (two threads, one per core)
</code></pre>
<p>And this puzzles me. The observed facts do not fit the theory.</p>