**Disclaimer**

Note that this answer contains more questions than answers. Also note that I'm guessing a lot, because I don't get huge parts of your question and source code.

**Reconstruction**

So I'm guessing that your global memory is an array of `Circle` structs. You seem to have optimized loading these circles by loading each of their floats separately into shared memory. This way you get contiguous access patterns instead of strided ones. Am I still correct here?

Now that you have cooperatively loaded `blockDim.x` circles into shared memory, you want to read a circle `c` from it for each thread. You seem to have tried three different ways:

1. loading `c` from strided shared memory
   (`c.prevX = smem[threadIdx.x + blockDim.x * 2];`, etc.)
2. loading `c` directly from shared memory
   (`c = *((Circle*)(smem + threadIdx.x * SMEM));`)
3. loading `c` directly from global memory
   (`c = cOut[j];`)

Still correct?

**Evaluation**

1. makes no sense if you load circles into shared memory the way I described above, so you probably tried a different loading pattern there, something along the lines of `[threadIdx.x * 8 + 0]`, as noted in your comment. This variant has the benefit of contiguous global access, but storing into smem causes bank conflicts.
2. is no better, because it hits bank conflicts when reading into registers.
3. is worse, because of strided global memory access.

**Answer**

Bank conflicts are easily resolved by inserting dummy values. Instead of indexing with `[threadIdx.x * 8 + 0]` you would use `[threadIdx.x * 9 + 0]`. Note that you are wasting a bit of shared memory (i.e. every ninth float) to spread the data out across the banks, and that you have to use the same padded layout when loading the data into shared memory in the first place. But notice that you are still doing a lot of work just to get these `Circle` structs into shared memory, which leads me to an

**Even better answer**

Just don't use an array of `Circle` structs in global memory. Invert your memory layout by using multiple arrays of float instead, one for each component of a `Circle`. You can then simply load into registers directly:

```
c.x = gmem_x[j];
c.y = gmem_y[j];
...
```

No shared memory at all, fewer registers thanks to less pointer arithmetic, contiguous global access patterns, and no bank conflicts. All of it for free!

Now you might think there is a downside when preparing the data on the host and getting the results back. My best (and final) guess is that it will still be much faster overall, because you will probably either launch the kernel every frame and visualize the results with a shader without ever transferring the data back to the host, or launch the kernel multiple times in a row before downloading the results. Correct?
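To make the padding trick under **Answer** concrete, here is a minimal CUDA sketch under stated assumptions: a `Circle` made of 8 floats, a padded stride of 9 floats per circle, and illustrative names (`FIELDS`, `STRIDE`, `kernelWithPaddedSmem`) that are not taken from your code. It only shows the cooperative, coalesced load into padded shared memory and the conflict-free per-thread read; the actual per-circle work is left out.

```
#define FIELDS 8                // floats per Circle (assumption)
#define STRIDE (FIELDS + 1)     // one dummy float of padding per circle

struct Circle { float x, y, prevX, prevY, vx, vy, radius, mass; };  // assumed layout

__global__ void kernelWithPaddedSmem(const Circle* __restrict__ cIn, int numCircles)
{
    extern __shared__ float smem[];   // blockDim.x * STRIDE floats, sized at launch

    const float* gIn = reinterpret_cast<const float*>(cIn);
    int blockBase   = blockIdx.x * blockDim.x;   // first circle handled by this block
    int chunkFloats = blockDim.x * FIELDS;       // floats this block loads

    // Cooperative, coalesced load: consecutive threads read consecutive
    // global floats, but each float is stored at its padded smem slot.
    for (int k = threadIdx.x; k < chunkFloats; k += blockDim.x) {
        int circle = k / FIELDS;
        int field  = k % FIELDS;
        if (blockBase + circle < numCircles)
            smem[circle * STRIDE + field] = gIn[blockBase * FIELDS + k];
    }
    __syncthreads();

    if (blockBase + threadIdx.x >= numCircles) return;

    // Conflict-free read: stride 9 is coprime to the 32 banks, so 32
    // consecutive threads hit 32 different banks.
    Circle c;
    c.x     = smem[threadIdx.x * STRIDE + 0];
    c.y     = smem[threadIdx.x * STRIDE + 1];
    c.prevX = smem[threadIdx.x * STRIDE + 2];
    c.prevY = smem[threadIdx.x * STRIDE + 3];
    // ... load the remaining fields and do the per-circle work with c ...
    (void)c;
}
```

The kernel would be launched with `blockDim.x * STRIDE * sizeof(float)` bytes of dynamic shared memory.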
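And a minimal sketch of the structure-of-arrays layout from **Even better answer**. The array names (`gmem_x`, `gmem_y`, ...), the output arrays, and the kernel name are assumptions for illustration; substitute whichever components your `Circle` actually has.

```
struct Circle { float x, y, prevX, prevY; };  // reduced, assumed set of components

__global__ void updateCircles(const float* __restrict__ gmem_x,
                              const float* __restrict__ gmem_y,
                              const float* __restrict__ gmem_prevX,
                              const float* __restrict__ gmem_prevY,
                              float*       out_x,
                              float*       out_y,
                              int          numCircles)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numCircles) return;

    // One fully coalesced load per component, straight into registers:
    // neighbouring threads read neighbouring addresses in each array.
    Circle c;
    c.x     = gmem_x[j];
    c.y     = gmem_y[j];
    c.prevX = gmem_prevX[j];
    c.prevY = gmem_prevY[j];

    // ... per-circle work on c ...

    // Results go back component-wise in the same coalesced pattern.
    out_x[j] = c.x;
    out_y[j] = c.y;
}
```

On the host side this simply means one `cudaMalloc`/`cudaMemcpy` per component array instead of one for the whole struct array.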