
Normalize a bunch of vectors using Nvidia's Thrust library
<p>I just learned about Nvidia's Thrust library. To try it out, I wrote a small example which is supposed to normalize a bunch of vectors.</p>

<pre><code>#include &lt;cstdio&gt;

#include &lt;thrust/transform.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/host_vector.h&gt;

struct normalize_functor: public thrust::unary_function&lt;double4, double4&gt;
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector&lt;double4&gt; v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector&lt;double4&gt; v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither does this..
    thrust::host_vector&lt;double4&gt; result = v_d;

    for(int i=0; i&lt;v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}
</code></pre>

<p>The example above seems to work; however, I'm unable to copy the data back. I thought a simple assignment would invoke a cudaMemcpy. Copying the data from the host to the device works, but not the other way around. Why?</p>

<p>Secondly, I'm not sure if I'm doing this the right way. The documentation of <a href="http://wiki.thrust.googlecode.com/hg/html/group__modifying.html" rel="nofollow">for_each</a> says:</p>

<blockquote> <p>for_each applies the function object f to each element in the range [first, last); f's return value, if any, is ignored.</p> </blockquote>

<p>But the unary_function struct template expects two template arguments (one for the return value) and forces operator() to also return a value, which results in a warning when compiling. I don't see how I'm supposed to write a unary functor with no return value.</p>

<p>Next is the data arrangement. I just chose double4 since this will result in two fetch instructions, ld.v2.f64 and ld.f64, IIRC.
However, I'm wondering how Thrust fetches the data under the hood (and how many CUDA threads/blocks are created). If I instead chose a structure of four vectors, would it be able to fetch the data in a coalesced way?</p>

<p>Finally, Thrust provides tuples. What about an array of tuples? How would the data be arranged in that case?</p>

<p>I looked through the examples, but I haven't found one which explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example says something about "structure of arrays" instead of "arrays of structures", but I see no structures used in the example).</p>

<p><strong>Update</strong></p>

<p>I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.</p>

<pre><code>#include &lt;cstdio&gt;
#include &lt;cstdlib&gt;

#include &lt;thrust/transform.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/host_vector.h&gt;

struct normalize_functor
{
    __device__ __host__ void operator()(double4&amp; v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector&lt;double4&gt; v(n);

    for(int i=0; i&lt;n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector&lt;double4&gt; v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());
    v = v_d;

    return 0;
}
</code></pre>

<p>Profiling with computeprof shows me low occupancy and non-coalesced memory access:</p>

<pre><code>Kernel Occupancy Analysis

Kernel details: Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio            = 0.984375 ( 32256 / 32768 ) [24 registers per thread]
Shared Memory Ratio       = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM      = 3 / 8
Active threads per SM     = 1344 / 1536
Potential Occupancy       = 0.875 ( 42 / 48 )
Max achieved occupancy    = 0.583333 (on 9 SMs)
Min achieved occupancy    = 0.291667 (on 5 SMs)
Occupancy limiting factor = Block-Size

Memory Throughput Analysis for kernel launch_closure_by_value on device
GeForce GTX 470

Kernel requested global memory read throughput(GB/s):  29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s):       46.73
L1 cache read throughput(GB/s):                        100.40
L1 cache global hit ratio (%):                         48.15
Texture cache memory throughput(GB/s):                 0.00
Texture cache hit rate(%):                             0.00
L2 cache texture memory read throughput(GB/s):         0.00
L2 cache global memory read throughput(GB/s):          42.44
L2 cache global memory write throughput(GB/s):         46.73
L2 cache global memory throughput(GB/s):               89.17
L2 cache read hit ratio(%):                            88.86
L2 cache write hit ratio(%):                           3.09
Local memory bus traffic(%):                           0.00
Global memory excess load(%):                          31.18
Global memory excess store(%):                         62.50
Achieved global memory read throughput(GB/s):          4.73
Achieved global memory write throughput(GB/s):         45.29
Achieved global memory throughput(GB/s):               50.01
Peak global memory throughput(GB/s):                   133.92
</code></pre>

<p>I wonder how I can optimize this?</p>