StackOverflow2013

<p>The <strong>same structure</strong> <em>theory</em> might not be achievable in CUDA, because the problem might not be parallelizable in its current form. That is basically due to the nature of the problem. On your device you cannot launch a <em>kernel</em> from within another <em>kernel</em>. That mechanism is called <a href="http://docs.nvidia.com/cuda/cuda-dynamic-parallelism/index.html" rel="nofollow"><code>Dynamic Parallelism</code></a>, and it is very recent: it was introduced with CUDA 5 and the Kepler architecture, and it requires a device of compute capability 3.5 or higher, so compute capability <code>1.1</code> does not support it. If you are interested, you would have to check which devices reach compute capability 3.5. Summing up, you <strong>won't</strong> be able to achieve this with the <strong>same structure</strong> <em>theory</em>. But that <strong>doesn't</strong> mean <strong>you cannot achieve</strong> it at all. Here are my recommendations for porting your program, or any other:</p> <ol> <li>Read the <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html" rel="nofollow">CUDA C Programming Guide</a> and the <a href="http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html" rel="nofollow">CUDA C Best Practices Guide</a> (assuming you use CUDA C).</li> <li>Restructure/rethink the original problem and see whether it can be parallelized.</li> <li>Perform a static analysis of your code (basically, read the code and use your programming knowledge to find what can be made faster).</li> <li>Perform a dynamic analysis of your code with tools. I would recommend <a href="http://www.valgrind.org" rel="nofollow">Valgrind</a>: it is widely used, it is free, it runs on many platforms, and it has several tools for inspecting different aspects of your program (for example, Callgrind profiles how much work each function does). I have used it and found it good.</li> <li>With the results of these two analyses, look for the problematic points in your program, e.g. the parts that take most of the execution time.</li> <li>Try to parallelize those points. As I said, the structure <strong>doesn't</strong> have to be the same.</li> </ol> <p>Note #1: As you are a newbie, reading the two guides in step 1 is mandatory; otherwise you would spend a lot of time debugging.</p> <p>Note #2: If you don't find any problematic points in your program, I would strongly doubt that CUDA could speed up your code. But that is an extreme case, I would say.</p>
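<p>To make "launching a kernel from within another kernel" concrete, here is a minimal sketch of Dynamic Parallelism. It assumes a device of compute capability 3.5 or higher and relocatable device code, e.g. <code>nvcc -arch=sm_35 -rdc=true dp.cu -lcudadevrt</code> (recent toolkits need a newer <code>-arch</code>). The task and the names <code>parentAdd</code>, <code>childAdd</code> and <code>addSegmentIdsDP</code> are hypothetical: each parent thread owns one fixed-length segment of an array and launches a child grid that adds the segment's 1-based index to it. This code will not compile for compute capability <code>1.1</code>.</p>

```cuda
// Sketch: Dynamic Parallelism (a kernel launching a kernel).
// Assumes compute capability >= 3.5 and relocatable device code, e.g.
//   nvcc -arch=sm_35 -rdc=true dp.cu -lcudadevrt
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Child kernel (hypothetical): adds `value` to each element of one segment.
__global__ void childAdd(int *seg, int len, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) seg[i] += value;
}

// Parent kernel (hypothetical): each thread owns one segment and launches
// a child grid for it from device code. This device-side launch is the
// part that needs Dynamic Parallelism.
__global__ void parentAdd(int *data, int segLen, int numSegs) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < numSegs) {
        const int threads = 128;
        int blocks = (segLen + threads - 1) / threads;
        childAdd<<<blocks, threads>>>(data + s * segLen, segLen, s + 1);
    }
}

// Host helper (hypothetical): copies `h` to the device, runs parentAdd and
// returns the result. `h.size()` must be a non-zero multiple of `segLen`.
std::vector<int> addSegmentIdsDP(const std::vector<int> &h, int segLen) {
    assert(segLen > 0 && !h.empty() &&
           h.size() % static_cast<size_t>(segLen) == 0);
    int numSegs = static_cast<int>(h.size() / segLen);
    size_t bytes = h.size() * sizeof(int);
    int *d = nullptr;
    check(cudaMalloc(&d, bytes));
    check(cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice));
    parentAdd<<<(numSegs + 31) / 32, 32>>>(d, segLen, numSegs);
    check(cudaGetLastError());
    // A parent grid only completes after all child grids it launched,
    // so one host-side synchronization covers both levels.
    check(cudaDeviceSynchronize());
    std::vector<int> out(h.size());
    check(cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d));
    return out;
}
```

<p>For example, calling <code>addSegmentIdsDP</code> on six zeros with <code>segLen = 3</code> should return 1, 1, 1, 2, 2, 2.</p>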
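<p>Here is a minimal sketch of restructuring instead of nesting, assuming a hypothetical task: add each segment's 1-based index to every element of an array split into fixed-length segments. A nested design would give each segment its own child launch, which compute capability <code>1.1</code> cannot do. Instead, one flat kernel launched from the host gives each thread one element and lets it compute its segment index as <code>i / segLen</code>. The names <code>addSegmentIdsFlat</code> and <code>addSegmentIds</code> are illustrative, not from any library, and the block avoids C++11 features so that it should also build with the older toolkits that still support compute capability 1.x.</p>

```cuda
// Sketch: per-segment update as one flat kernel launched from the host,
// with no device-side launches, so no Dynamic Parallelism is needed.
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Flat kernel (hypothetical): one thread per element. Each thread derives
// its segment index from its global index, so no child launches are needed.
__global__ void addSegmentIdsFlat(int *data, int segLen, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += i / segLen + 1;
}

// Host helper (hypothetical): adds the 1-based segment index to every
// element of `h` on the GPU and returns the result. A last, shorter
// segment is handled naturally by the index arithmetic.
std::vector<int> addSegmentIds(const std::vector<int> &h, int segLen) {
    assert(segLen > 0 && !h.empty());
    int n = static_cast<int>(h.size());
    const int threads = 256;  // within the 512-thread limit of CC 1.x
    int blocks = (n + threads - 1) / threads;
    // Compute capability 1.x limits gridDim.x to 65535 blocks.
    assert(blocks <= 65535);
    size_t bytes = h.size() * sizeof(int);
    int *d = 0;
    check(cudaMalloc((void **)&d, bytes));
    check(cudaMemcpy(d, &h[0], bytes, cudaMemcpyHostToDevice));
    addSegmentIdsFlat<<<blocks, threads>>>(d, segLen, n);
    check(cudaGetLastError());
    std::vector<int> out(h.size());
    // cudaMemcpy on the default stream waits for the kernel to finish.
    check(cudaMemcpy(&out[0], d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d));
    return out;
}
```

<p>For example, with an input of six 10s and <code>segLen = 3</code>, the result should be 11, 11, 11, 12, 12, 12. Trading nested launches for index arithmetic like this is one way in which the parallel structure can differ from the original one.</p>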
