StackOverflow2013

<p>A few things:</p>
<ol>
<li>You left out the code that declares your global arrays on the device. It would help to see it.</li>
<li>Your algorithm is not thread-safe when multiple blocks are used. With multiple blocks, each block would redo the same work (giving you no gains), and the blocks would also likely at some point try to write to the same global memory locations, creating errors.</li>
<li>Your code is therefore correct only when a single block is used, but that makes it rather pointless: you are running a serial, or lightly threaded, operation on a parallel device. You cannot use all your available resources (multiple blocks on multiple SMs) without memory conflicts (see below).</li>
</ol>
<p>From a parallel standpoint, the two main issues with this code are items 1 and 2 in the list below:</p>
<ol>
<li><p><code>int i = (threadIdx.x)+2;</code> yields a starting index of <code>2</code> for a single thread, <code>2</code> and <code>3</code> for two threads in a <strong>single</strong> block, and so on. I doubt this is what you want, as the first two positions (<code>0</code>, <code>1</code>) are never addressed. (Remember, arrays start at index <code>0</code> in C.)</p></li>
<li><p>Further, if you launch <strong>multiple</strong> blocks (say 2 blocks with one thread each), you get duplicate indices (e.g. for 2 b x 1 t, the indices are b1t1: <code>2</code>, b2t1: <code>2</code>), because <code>threadIdx.x</code> restarts at <code>0</code> in every block. Using such an index to write to global memory creates conflicts and errors. Something like <code>int i = threadIdx.x + blockDim.x * blockIdx.x;</code> is the typical way to calculate your indices correctly and avoid this issue.</p></li>
<li><p>Your final expression <code>i += blockDim.x * gridDim.x;</code> is okay, because it adds the total number of threads in the grid to <code>i</code> and thus does not create additional clashing or overlap.</p></li>
<li>Why use the GPU to shuffle memory and do a trivial computation? You may not see much speedup versus a fast CPU once you factor in the time to copy your arrays to and from the device.</li>
</ol>
<p>Work on problems 1 and 2 if you wish, but beyond that, consider your overall goal and exactly what kind of algorithm you are trying to optimize, and come up with a more parallel-friendly solution. Alternatively, consider whether GPU computing really makes sense for your problem.</p>
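<p>As a minimal sketch of the indexing described above, here is a complete program built around a grid-stride loop. Since the original kernel body is not shown, the element-wise operation (<code>out[i] = 2 * in[i]</code>), the names <code>scaleKernel</code>, <code>in</code>, <code>out</code> and <code>n</code>, and the launch configuration of 4 blocks of 256 threads are all assumptions made for illustration, not taken from the question.</p>
<pre><code>#include &lt;cstdio&gt;
#include &lt;vector&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical element-wise operation (not from the question): out[i] = 2 * in[i].
__global__ void scaleKernel(const float *in, float *out, int n)
{
    // Unique global index: no two threads in the grid share a starting value.
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    // Grid-stride loop: advance by the total number of threads in the grid,
    // so each element in [0, n) is written by exactly one thread.
    for (; i &lt; n; i += blockDim.x * gridDim.x) {
        out[i] = 2.0f * in[i];
    }
}

int main()
{
    const int n = 1 &lt;&lt; 20;                  // illustrative array length
    const size_t bytes = n * sizeof(float);
    std::vector&lt;float&gt; hIn(n, 1.0f), hOut(n, 0.0f);

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&amp;dIn, bytes) != cudaSuccess ||
        cudaMalloc(&amp;dOut, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    // Several blocks on purpose: the global index keeps their writes disjoint.
    scaleKernel&lt;&lt;&lt;4, 256&gt;&gt;&gt;(dIn, dOut, n);
    cudaError_t err = cudaGetLastError();   // catches launch-configuration errors

    // This blocking copy waits for the kernel on the default stream to finish.
    if (err == cudaSuccess) {
        err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Count elements that were missed or written with a wrong value.
    int bad = 0;
    for (int i = 0; i &lt; n; ++i) {
        if (hOut[i] != 2.0f) ++bad;
    }
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
</code></pre>
<p>With the starting index computed from both <code>threadIdx.x</code> and <code>blockIdx.x</code>, changing the number of blocks only changes how the elements are divided among threads. If each element depended on earlier elements of the same array, this pattern alone would not be enough, and the algorithm itself would need restructuring.</p>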
