StackOverflow2013

plurals

PO
primarykey
Id
11237235
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
11
CommunityOwnedDate
CreationDate
2012-06-28T02:14:03.600
FavoriteCount
0
LastActivityDate
2016-11-14T02:05:00.907
LastEditDate
2017-05-23T12:03:09.047
LastEditorUserId
-1
OwnerUserId
866732
ParentId
11227809
PostTypeId
2
Score
2970
ViewCount
0
LastEditorDisplayName
text
Body
The reason why performance improves drastically when the data is sorted is that the branch misprediction penalty is removed, as explained beautifully in <a href="https://stackoverflow.com/users/922184/mysticial">Mysticial</a>'s answer.
Now, if we look at the code
<pre><code>if (data[c] >= 128)
    sum += data[c];
</code></pre>
we can see that this particular <code>if... else...</code> branch simply adds a value when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which can be compiled into a conditional move instruction, <code>cmovl</code>, on an <code>x86</code> system. The branch, and with it the potential branch misprediction penalty, is removed.
In <code>C</code>, and thus <code>C++</code>, the statement that would compile directly (without any optimization) into the conditional move instruction on <code>x86</code> is the ternary operator <code>... ? ... : ...</code>. So we rewrite the above statement into an equivalent one:
<pre><code>sum += data[c] >= 128 ? data[c] : 0;
</code></pre>
Readability is maintained, so we can check the speedup factor.
On an Intel <a href="http://en.wikipedia.org/wiki/Intel_Core#Core_i7" rel="noreferrer">Core i7</a>-2600K @ 3.4 GHz with Visual Studio 2010 in Release mode, the benchmark results are (format copied from Mysticial):
x86
<pre><code>// Branch - Random
seconds = 8.885
// Branch - Sorted
seconds = 1.528
// Branchless - Random
seconds = 3.716
// Branchless - Sorted
seconds = 3.71
</code></pre>
x64
<pre><code>// Branch - Random
seconds = 11.302
// Branch - Sorted
seconds = 1.830
// Branchless - Random
seconds = 2.736
// Branchless - Sorted
seconds = 2.737
</code></pre>
The results are robust across multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.
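As a minimal sketch of the two variants benchmarked above, the functions below wrap each loop body so the two forms can be compared side by side. The names <code>sum_branch</code> and <code>sum_branchless</code> and the use of <code>std::vector&lt;int&gt;</code> are assumptions made for illustration and are not part of the original benchmark. Whether the ternary form really becomes a conditional move depends on the compiler and its flags.

```cpp
#include <vector>

// Branching form: the conditional jump can be mispredicted on random data.
long long sum_branch(const std::vector<int>& data) {
    long long sum = 0;
    for (int value : data) {
        if (value >= 128)
            sum += value;
    }
    return sum;
}

// Ternary form: a compiler may lower this to a conditional move,
// but that is not guaranteed.
long long sum_branchless(const std::vector<int>& data) {
    long long sum = 0;
    for (int value : data) {
        sum += value >= 128 ? value : 0;
    }
    return sum;
}
```

Both functions compute the same sum for any input; only the generated control flow may differ.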
Now let's look more closely by examining the <code>x86</code> assembly the two forms generate. For simplicity, we use two functions <code>max1</code> and <code>max2</code>.
<code>max1</code> uses the conditional branch <code>if... else ...</code>:
<pre><code>int max1(int a, int b) {
    if (a > b)
        return a;
    else
        return b;
}
</code></pre>
<code>max2</code> uses the ternary operator <code>... ? ... : ...</code>:
<pre><code>int max2(int a, int b) {
    return a > b ? a : b;
}
</code></pre>
On an x86-64 machine, <code>GCC -S</code> generates the assembly below.
<pre><code>:max1
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    -8(%rbp), %eax
    jle     .L2
    movl    -4(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp     .L4
.L2:
    movl    -8(%rbp), %eax
    movl    %eax, -12(%rbp)
.L4:
    movl    -12(%rbp), %eax
    leave
    ret

:max2
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    %eax, -8(%rbp)
    cmovge  -8(%rbp), %eax
    leave
    ret
</code></pre>
<code>max2</code> uses much less code thanks to the <code>cmovge</code> instruction. But the real gain is that <code>max2</code> does not involve branch jumps such as <code>jle</code> and <code>jmp</code>; a mispredicted conditional jump carries a significant performance penalty.
So why does a conditional move perform better?
In a typical <code>x86</code> processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages, so we do not have to wait for one instruction to finish before starting a new one. This is called <a href="http://en.wikipedia.org/wiki/Pipeline_%28computing%29" rel="noreferrer">pipelining</a>.
In a branch case, the instruction that follows is determined by the outcome of the preceding one, so we cannot pipeline across it. We have to either wait or predict.
In a conditional move case, the execution of the conditional move instruction is divided into several stages, but the earlier stages like <code>Fetch</code> and <code>Decode</code> do not depend on the result of the previous instruction; only the later stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.
The book <a href="http://rads.stackoverflow.com/amzn/click/0136108040" rel="noreferrer">Computer Systems: A Programmer's Perspective, second edition</a> explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, the entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a dedicated treatment of Branch Prediction and Misprediction Penalties.
Some modern compilers can optimize our code into faster assembly and some cannot (the code in question was built with Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move on unpredictable data can help us write faster code when the scenario gets so complex that the compiler cannot optimize it automatically.
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POWhy is it faster to process a sorted array than an unsorted array?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USWiSaGaN
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
