Note that there are some explanatory texts on larger screens.

plurals
  1. POMemory layout mismatching between CPU and GPU code with CUDA
    primarykey
    data
    text
    <p>I'm experiencing a very weird situation. I have this template structures:</p> <pre><code>#ifdef __CUDACC__ #define __HOSTDEVICE __host__ __device__ #else #define __HOSTDEVICE #endif template &lt;typename T&gt; struct matrix { T* ptr; int col_size, row_size; int stride; // some host &amp; device methods }; struct dummy1 {}; struct dummy2 : dummy1 {}; template &lt;typename T&gt; struct a_functor : dummy2 { matriz&lt;T&gt; help_m; matrix&lt;T&gt; x, y; T *x_ptr, *y_ptr; int bsx, ind_thr; __HOSTDEVICE void operator()(T* __x, T* __y) { // functor code } }; </code></pre> <p>I've structured my code to separate cpp and cu files, so a_functor object is created in cpp file and used in a kernel function. The problem is that, executing operator() inside a kernel, I found some random behaviour I couldn't explain only looking at code. It was like my structs were sort of corrupted. So, calling a sizeof() on an a_functor object, I found:</p> <ul> <li><p>CPU code (.cpp and .cu outside kernel): 64 bytes</p></li> <li><p>GPU code (inside kernel): 68 bytes</p></li> </ul> <p>There was obviously some kind of mismatching that ruined the whole stuff. Going further, I tracked the distance between struct parameter pointers and struct itself - to try to inspect the produced memory layout - and here's what I found:</p> <pre><code>a_functor foo; // CPU (char*)(&amp;foo.help_m) - (char*)(&amp;foo) = 0 (char*)(&amp;foo.x) - (char*)(&amp;foo) = 16 (char*)(&amp;foo.y) - (char*)(&amp;foo) = 32 (char*)(&amp;foo.x_ptr) - (char*)(&amp;foo) = 48 (char*)(&amp;foo.y_ptr) - (char*)(&amp;foo) = 52 (char*)(&amp;foo.bsx) - (char*)(&amp;foo) = 56 (char*)(&amp;foo.ind_thr) - (char*)(&amp;foo) = 60 // GPU - inside a_functor::operator(), in-kernel (char*)(&amp;this-&gt;help_m) - (char*)(this) = 4 (char*)(&amp;this-&gt;x) - (char*)(this) = 20 (char*)(&amp;this-&gt;y) - (char*)(this) = 36 (char*)(&amp;this-&gt;x_ptr) - (char*)(this) = 52 (char*)(&amp;this-&gt;y_ptr) - (char*)(this) = 56 (char*)(&amp;this-&gt;bsx) - (char*)(this) = 60 (char*)(&amp;this-&gt;ind_thr) - (char*)(this) = 64 </code></pre> <p>I really can't understand why nvcc generated this memory layout for my struct (what are that 4 bytes supposed to be/do!?!). I thought it could be an alignment problem and I tryed to explicitly align a_functor, but I can't because it is passed by value in kernel</p> <pre><code>template &lt;typename T, typename Str&gt; __global__ void mykernel(Str foo, T* src, T*dst); </code></pre> <p>and when I try compile I get</p> <p>error: cannot pass a parameter with a too large explicit alignment to a <strong>global</strong> routine on win32 platforms</p> <p>So, to solve this strange situation (...and I do think that's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to kernel by pointer to avoid the aforementioned error. However, I'm really wondering: why that memory layout mismatching?! It really makes no sense...</p> <p>Further information: I'm using Visual Studio 2008, compiling with MSVC on Windows XP 32bit platform. I installed the latest CUDA Toolkit 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. COI built a simple app around the code you have posted. I tested this app on a linux 64 bit platform and a win7 32 bit platform, and was not able to reproduce a difference in the size of the object between the host code and kernel code in either case. In the 64 bit case, the object size was 96 bytes in both host and kernel case, and on the 32 bit win7 platform it was 64 bytes in both host and kernel case. Can you post a simple, self-contained, complete reproducer, along the lines of my test code [here](http://pastebin.com/GLwLmhNQ)?
      singulars
    2. CODear Robert, I posted a closer-to-reality version here: http://pastebin.com/qdPNa2Zs. Executing that code (mine & yours), I couldn't see the strange behaviour I have in my real code. But, even more strange, I tried to remove from real code the equivalent of that : dummy2 (i.e. derivation from a void struct). I needed it, but I'll do differently without pain. You know what? This apparently useless move solved my problem! I mean... maybe that 4bytes were "room" for a void base struct! The whole thing is still unexplicable, to me.
      singulars
    3. COPS My apologize if I forgot to add that derivation in the simplified code I posted above. Now that I know it was the source of all pains, I'm really sorry I didn't post it in the reduced version, but I corrected it accordingly.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload