Solving collisions: trying to coalesce gmem access using smem, but hitting bank conflicts
<p>I have this code:</p>
<pre><code>struct __declspec(align(32)) Circle
{
    float x, y;
    float prevX, prevY;
    float speedX, speedY;
    float mass;
    float radius;

    void init(const int _x, const int _y,
              const float _speedX = 0.0f, const float _speedY = 0.0f,
              const float _radius = CIRCLE_RADIUS_DEFAULT,
              const float _mass = CIRCLE_MASS_DEFAULT);
};
</code></pre>
<p>And the second one:</p>
<pre><code>/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
__syncthreads();
/*float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;*/
/*c.x = smem[threadIdx.x];
c.y = smem[threadIdx.x + blockDim.x]; //there must be [threadIdx.x * 8 + 0]
c.prevX = smem[threadIdx.x + blockDim.x * 2]; //[threadIdx.x * 8 + 1] etc.
c.prevY = smem[threadIdx.x + blockDim.x * 3];
c.speedX = smem[threadIdx.x + blockDim.x * 4];
c.speedY = smem[threadIdx.x + blockDim.x * 5];
c.mass = smem[threadIdx.x + blockDim.x * 6];
c.radius = smem[threadIdx.x + blockDim.x * 7];*/
c = cOut[j];
//c = *((Circle*)(smem + threadIdx * SMEM));
</code></pre>
<p>There are two gmem (global memory) accesses: 1) reading a Circle and detecting collisions with it, and 2) writing the Circle back after changing its speed and position. I also have circlesConst, an array of Circle in constant memory, which was filled with cudaMemcpyToSymbol().
It is used to check the main circle C (held in a register after being read from gmem) for intersections with the circles stored in it.</p>
<p>As far as I can tell, I am using constant memory well and getting its full performance :') (Am I wrong?)</p>
<p>When I read about coalesced access to gmem (is there coalesced access for other types of memory? I could not find any information about it), I wanted to try it myself. As you can see, the Circle structure has 8 float members of 32 bits each. I tried to do it (the commented-out code), but first, I get a wrong answer (because I read from smem incorrectly, as mentioned below), and second, performance drops by 33%. Why? I do not think it depends on the fields being mapped wrongly.</p>
<p>The second question: as I wrote in the comment next to the reads from smem into C, I have to read in a different way, but if I do, there will be a lot of bank conflicts, so performance will drop even further. So how can I load Circles coalesced without bank conflicts and, after that, write them back?</p>
<p>P.S. Is a structure larger than 4 floats kept in registers?</p>
<hr>
<p><strong>Update:</strong> The newest version is:</p>
<pre><code>#define CF (9) //9 because the primary struct has 8 floats, so 1 extra is wasted as padding
i = blockIdx.x * blockDim.x;
smem[threadIdx.x + blockDim.x * 0 + blockDim.x * 0 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 0);
smem[threadIdx.x + blockDim.x * 1 + blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 1);
smem[threadIdx.x + blockDim.x * 2 + blockDim.x * 2 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3 + blockDim.x * 3 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4 + blockDim.x * 4 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5 + blockDim.x * 5 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6 + blockDim.x * 6 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7 + blockDim.x * 7 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 7);
__syncthreads(); //all stores must finish before a thread reads floats that other threads loaded
c.x = smem[threadIdx.x * CF + 0];
c.y = smem[threadIdx.x * CF + 1];
c.prevX = smem[threadIdx.x * CF + 2];
c.prevY = smem[threadIdx.x * CF + 3];
c.speedX = smem[threadIdx.x * CF + 4];
c.speedY = smem[threadIdx.x * CF + 5];
c.mass = smem[threadIdx.x * CF + 6];
c.radius = smem[threadIdx.x * CF + 7];
</code></pre>
<p>Is this the right way to coalesce gmem access using smem? I mean, I am worried about <code>blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1)</code>. I guess I got no speedup because this does not let gmem reads coalesce across more than one Circle, but I cannot see how to make the reads coalesce across two Circles.</p>
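<p>The padded index arithmetic in the update can be checked on the host before touching the kernel. The following is only a sketch in plain C++, not the kernel itself: the helper names (<code>paddedIndex</code>, <code>questionIndex</code>, <code>formulasAgree</code>, <code>worstReadConflict</code>) are hypothetical, and it assumes 32 shared-memory banks of 4-byte words. It compares the expression from the update against the layout it seems to aim for (Circle <code>f / 8</code>, field <code>f % 8</code>, stored at <code>f + f / 8</code>), and counts how many threads of a warp land in the same bank when thread <code>tid</code> reads slot <code>tid * stride + field</code>.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical constants for this sketch; they are not part of the original kernel.
constexpr int kFields = 8;            // floats per Circle
constexpr int kStride = kFields + 1;  // CF in the question: 8 floats + 1 padding slot
constexpr int kBanks  = 32;           // assumption: 32 banks of 4-byte words
constexpr int kWarp   = 32;           // threads per warp

// Padded shared-memory slot for the f-th float of the block's chunk of Circles:
// Circle f / 8, field f % 8  ->  (f / 8) * 9 + f % 8, which equals f + f / 8.
int paddedIndex(int f) {
    assert(f >= 0);
    return f + f / kFields;
}

// The store index exactly as written in the question, for load step k of thread tid.
int questionIndex(int tid, int k, int blockDim) {
    return tid + blockDim * k + blockDim * k / (kStride - 1) + tid / (kStride - 1);
}

// True when both formulas give the same slot for every thread and every load step.
bool formulasAgree(int blockDim) {
    for (int k = 0; k < kFields; ++k)
        for (int tid = 0; tid < blockDim; ++tid)
            if (questionIndex(tid, k, blockDim) != paddedIndex(tid + blockDim * k))
                return false;
    return true;
}

// Largest number of threads in one warp that hit the same bank when thread
// `lane` reads slot lane * stride + field. A result of 1 means conflict-free.
int worstReadConflict(int stride) {
    int worst = 0;
    for (int field = 0; field < kFields; ++field) {
        std::vector<int> hits(kBanks, 0);
        for (int lane = 0; lane < kWarp; ++lane) {
            int bank = (lane * stride + field) % kBanks;
            if (++hits[bank] > worst) worst = hits[bank];
        }
    }
    return worst;
}
```

<p>Under these assumptions, <code>formulasAgree(128)</code> holds but <code>formulasAgree(20)</code> does not: splitting the integer division into <code>blockDim.x * k / 8 + threadIdx.x / 8</code> only matches <code>f / 8</code> when <code>blockDim.x</code> is a multiple of 8. <code>worstReadConflict(9)</code> is 1, while the unpadded stride gives <code>worstReadConflict(8) == 8</code>, which is the bank conflict the padding slot is there to avoid.</p>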
 
