StackOverflow2013

<p>Depending on the dimensions of your block, the first condition <code>threadIdx.x < 64</code> (note the <code>.x</code>) may not cause any divergence at all. For example, if you have a block with dimensions <code>(128,1,1)</code>, then the first two warps (groups of 32 threads that execute in lock-step) will enter the <code>if</code> block while the last two will bypass it. Since each <em>whole</em> warp goes one way or the other, there is no divergence.</p>
<p>A conditional like <code>threadIdx.x == 1</code> will cause divergence, but it will have a very modest cost. Indeed, in many cases the CUDA compiler will be able to implement the conditional expression with a single instruction. For instance, operations like <code>min</code>, <code>max</code>, and <code>abs</code> will generally be implemented with a single instruction and cause absolutely no divergence. You can read about such instructions in the <a href="http://docs.google.com/viewer?a=v&q=cache:DMAehF-4h8EJ:developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf+cuda+ptx+2.0&hl=en&gl=us&pid=bl&srcid=ADGEEShfr9tgZNFPpnMZJmPycelnsLK2uYHXkwyQzVvW8iRjMMtGXI3JuEZ9fhdp2teS3h6F9ZzHyumFm93m6PUqyW4-hWBFMQ1gP_F_-gICly1bES3HRSOj6RfKAuX5ZjVOLvuMen4R&sig=AHIEtbQQaiJlvLHXhy7_uPXPH8aP8HdvOA" rel="noreferrer">PTX Manual</a>.</p>
<p>In general you should not be overly concerned about modest amounts of control-flow divergence like the above. Where divergence will bite you is in situations like</p>
<pre><code>if (threadIdx.x % 4 == 0)
    // do expensive operation
else if (threadIdx.x % 4 == 1)
    // do expensive operation
else if (threadIdx.x % 4 == 2)
    // do expensive operation
else
    // do expensive operation
</code></pre>
<p>where an "expensive operation" would be one that requires tens or hundreds of instructions. In this case the divergence caused by the <code>if</code> statements would reduce efficiency by 75%, because every warp has to execute all four branches one after another with only a quarter of its threads active in each.</p>
<p>Keep in mind that thread divergence is a much lesser concern than (1) high-level algorithm choices and (2) memory locality/coalescing. Very few CUDA programmers should ever be concerned with the sort of divergence in your examples.</p>
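<p>To make the warp arithmetic concrete, here is a small host-side sketch. It is plain C++, not device code, and it only models the reasoning in this answer: it assumes a 1-D block, 32 consecutive threads per warp, and that every distinct branch taken inside a warp is executed one after another at equal cost. The names <code>pathsPerWarp</code>, <code>divergentWarps</code>, and <code>efficiencyIfEqualCost</code> are made up for this illustration and are not part of any CUDA API.</p>

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Host-side model only (assumption): a warp is 32 consecutive threads of a
// 1-D block, and distinct branches taken within one warp are serialized.
constexpr int kWarpSize = 32;

// For each warp of a 1-D block with `blockDimX` threads, return how many
// distinct branch ids its threads select. `branchOf` maps a threadIdx.x value
// to the id of the branch that thread would take.
std::vector<int> pathsPerWarp(int blockDimX,
                              const std::function<int(int)>& branchOf) {
    assert(blockDimX > 0);
    std::vector<int> paths;
    for (int warpStart = 0; warpStart < blockDimX; warpStart += kWarpSize) {
        std::set<int> taken;
        for (int lane = 0;
             lane < kWarpSize && warpStart + lane < blockDimX; ++lane) {
            taken.insert(branchOf(warpStart + lane));
        }
        paths.push_back(static_cast<int>(taken.size()));
    }
    return paths;
}

// Number of warps in which the threads take more than one path.
int divergentWarps(int blockDimX, const std::function<int(int)>& branchOf) {
    int count = 0;
    for (int p : pathsPerWarp(blockDimX, branchOf)) {
        if (p > 1) ++count;
    }
    return count;
}

// Rough estimate that assumes all branches are equally expensive: a warp that
// serializes k paths needs k passes to do one pass worth of useful work, so
// the block-wide efficiency is (number of warps) / (total serialized passes).
double efficiencyIfEqualCost(int blockDimX,
                             const std::function<int(int)>& branchOf) {
    const std::vector<int> paths = pathsPerWarp(blockDimX, branchOf);
    int serializedPasses = 0;
    for (int p : paths) serializedPasses += p;
    return static_cast<double>(paths.size()) / serializedPasses;
}
```

<p>Under these assumptions, a <code>(128,1,1)</code> block with the branch function <code>t &lt; 64</code> has no divergent warps, <code>t == 1</code> has exactly one, and <code>t % 4</code> has all four warps divergent with an estimated efficiency of 0.25, which is the 75% loss mentioned above. The model ignores predication and single-instruction selects, so it overstates the cost of cheap conditionals such as <code>threadIdx.x == 1</code>.</p>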
