  1. CUDA kernel error when increasing thread number
<p>I am developing a CUDA ray-plane intersection kernel.</p>
<p>Suppose my plane (face) struct is:</p>
<pre><code>typedef struct _Face {
    int ID;
    int matID;
    int V1ID;
    int V2ID;
    int V3ID;
    float V1[3];
    float V2[3];
    float V3[3];
    float reflect[3];
    float emmision[3];
    float in[3];
    float out[3];
    int intersects[RAYS];
} Face;
</code></pre>
<p>I pasted the whole struct so you can get an idea of its size. <strong>RAYS</strong> equals <strong>625</strong> in the current configuration. In the following code, assume the faces array holds, e.g., 1270 elements (generally, thousands).</p>
<p>Until today I launched my kernel in a very naive way:</p>
<pre><code>const int tpb = 64; //threads per block
dim3 grid = (n +tpb-1)/tpb; // n - face count in array
dim3 block = tpb;
//.. some memory allocation etc.
theKernel&lt;&lt;&lt;grid,block&gt;&gt;&gt;(dev_ptr, n);
</code></pre>
<p>and inside the kernel I had a loop:</p>
<pre><code>__global__ void theKernel(Face* faces, int faceCount) {
    int offset = threadIdx.x + blockIdx.x*blockDim.x;
    if(offset &gt;= faceCount) return;
    Face f = faces[offset];
    //..some initialization
    int RAY = -1;
    for(float alpha=0.0f; alpha&lt;=PI; alpha+= alpha_step ){
        for(float beta=0.0f; beta&lt;=PI; beta+= beta_step ){
            RAY++;
            //..calculation per ray in (alpha,beta) direction ...
            faces[offset].intersects[RAY] = ...; //some assignment
</code></pre>
<p>This is about it. I looped through all the directions and updated the <strong>faces</strong> array. It worked correctly, but was hardly any faster than the CPU code.</p>
<p>So today I tried to <em>optimize</em> the code and launch the kernel with a much larger number of threads. Instead of having <strong>1 thread per face</strong> I want <strong>1 thread per face's ray</strong> (meaning 625 threads work for 1 face). The modifications were simple:</p>
<pre><code>dim3 grid = (n*RAYS +tpb-1)/tpb; // before launching; RAYS = 625, n = face count
</code></pre>
<p>and the kernel itself:</p>
<pre><code>__global__ void theKernel(Face *faces, int faceCount){
    int threadNum = threadIdx.x + blockIdx.x*blockDim.x;
    int offset = threadNum/RAYS; //RAYS is a global #define
    int rayNum = threadNum - offset*RAYS;
    if(offset &gt;= faceCount || rayNum != 0) return;
    Face f = faces[offset];
    //initialization and the rest.. again ..
</code></pre>
<p>And this code does <strong>not</strong> work at all. Why? Theoretically, only the first thread (of the 625 per face) should do any work, so why does this result in bad (hardly any) computation?</p>
<p>Kind regards, e.</p>
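<p>To rule out the index arithmetic as the culprit, the mapping from a flat thread number to a (face, ray) pair can be replayed on the host. The following is a minimal sketch in plain C, not the kernel itself: <code>FaceRay</code>, <code>blocks_needed</code>, <code>decode_thread</code> and <code>count_active_threads</code> are hypothetical helper names introduced only for this illustration, and <code>RAYS</code> is assumed to be 625 as in the configuration described above. It checks only the arithmetic of the grid size and of the early-return guard; it says nothing about memory usage or launch errors on the device.</p>

```c
#include <assert.h>

/* RAYS mirrors the question's global #define (assumed 625 rays per face). */
#define RAYS 625

/* Hypothetical helper type: a (face, ray) pair decoded from a flat thread index. */
typedef struct {
    int face; /* index into the faces array ("offset" in the kernel) */
    int ray;  /* ray index within that face, 0 .. RAYS-1 ("rayNum" in the kernel) */
} FaceRay;

/* Ceiling division used to size the grid: blocks needed to cover `total` threads. */
static int blocks_needed(long total, int tpb) {
    assert(tpb > 0);
    return (int)((total + tpb - 1) / tpb);
}

/* Same arithmetic as the kernel: threadNum / RAYS and threadNum - offset * RAYS. */
static FaceRay decode_thread(long threadNum) {
    FaceRay fr;
    fr.face = (int)(threadNum / RAYS);
    fr.ray = (int)(threadNum - (long)fr.face * RAYS);
    return fr;
}

/* Counts how many of the launched threads get past the kernel's early-return guard. */
static long count_active_threads(int faceCount, int tpb) {
    long launched = (long)blocks_needed((long)faceCount * RAYS, tpb) * tpb;
    long active = 0;
    for (long t = 0; t < launched; ++t) {
        FaceRay fr = decode_thread(t);
        if (fr.face >= faceCount || fr.ray != 0) {
            continue; /* the kernel would return here */
        }
        ++active;
    }
    return active;
}
```

<p>With 1270 faces and 64 threads per block, this arithmetic gives 12403 blocks, and exactly one thread per face passes the guard. If the host-side check agrees with expectations, the index computation is probably not where the kernel goes wrong.</p>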
 
