Well, here are some thoughts:

You perform ~640*480 iterations of `numpy.dot`, which itself processes 6*6*32 elements. Parallelizing the dot product itself is barely worth it: 192 parallel threads are not enough to fill the GPU, and a reduction on CUDA brings extra trouble. So, IMO, the best way to parallelize your task is to assign one element of the output array to each thread (see the sketch at the end of this answer).

Now about memory: the output array will live in global memory; there is not much choice there. For the input data, `A` looks like a good fit for texture memory, since adjacent threads access adjacent elements. Alternatively, you can manually "cache" it in shared memory, but in this case that does not look much more advantageous than simply using a texture. For `B`, shared memory is not a good fit, since it would cause bank conflicts: when you calculate the dot product, all threads in a half-warp access the same element of `B` (you could start the summation from different elements in different threads, but that, again, doesn't look promising). So the choice is either texture or constant memory. I vote for constant, since (a) constant memory is suited for data that is accessed by all threads on the device, and (b) you won't pollute the texture cache.

The above is just my guess; to actually achieve good performance, you had better try out different variants...

**Update regarding your naive implementation**

```
for (int Yi = 0; Yi < Ydims[0]; Yi++ )
```

Here, you access global memory on every iteration. That's a **huge** performance killer. Since you have 3 dimensions, you'd better replace your `int *Ydims` with an `int3 Ydims` (and do the same for `Xdims` and `outdims`).

```
out[out_indx] += X[X_indx]*Y[Y_indx];
```

Again, a very bad idea. Create a register variable and do all operations with it; write to the global array only once, at the end of the kernel.

These optimizations are the first thing you should do. The second thing is to make your `X` and `Y` 3D textures, so that access to them is cached. I guess that after this, CUDA would outperform the CPU.

For further optimizations, you'd better read the [CUDA C Best Practices Guide](http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CUDA_C_Best_Practices_Guide.pdf). It's a must-read, and you will get a much better idea of how to write efficient GPU code (right now your implementation is far too naive).
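For concreteness, here is a minimal sketch of how the suggestions above might fit together: one thread per output element, `int3` dimensions passed by value, accumulation in a register with a single write to global memory, and the small 6*6*32 filter in constant memory. The names `X`, `Y`, `out`, `Xdims`, and `outdims` follow your snippets, but the memory layout and filter-dimension split are assumptions, not your actual code.

```
#include <cuda_runtime.h>

#define FY 6    // filter height   (assumed split of the 6*6*32 filter)
#define FX 6    // filter width    (assumed)
#define FC 32   // filter channels (assumed)

// Every thread reads the same filter elements, so constant memory fits well.
__constant__ float c_Y[FY * FX * FC];

__global__ void correlate(const float* __restrict__ X,  // assumed padded input of size Xdims.x * Xdims.y * FC
                          float* out,                    // output of size outdims.x * outdims.y
                          int3 Xdims, int3 outdims)      // dimensions passed by value, not via int*
{
    // One thread computes exactly one element of the ~640*480 output.
    const int ox = blockIdx.x * blockDim.x + threadIdx.x;
    const int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= outdims.x || oy >= outdims.y) return;

    // Keep the running sum in a register; touch global memory only once at the end.
    float acc = 0.0f;
    for (int fy = 0; fy < FY; ++fy)
        for (int fx = 0; fx < FX; ++fx)
            for (int c = 0; c < FC; ++c) {
                // Assumed row-major layout X[y][x][c]; adjust to your real indexing.
                const int xi = ((oy + fy) * Xdims.x + (ox + fx)) * FC + c;
                const int yi = (fy * FX + fx) * FC + c;
                acc += X[xi] * c_Y[yi];
            }

    out[oy * outdims.x + ox] = acc;
}
```

On the host side you would copy the filter once with `cudaMemcpyToSymbol(c_Y, hostY, FY * FX * FC * sizeof(float))` and launch with a 2D grid (e.g. 16x16 blocks covering the 640*480 output). Replacing the plain global-memory reads of `X` with a texture fetch would be the next step, as described above.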