
Normalize a bunch of vectors using Nvidia's Thrust library
<p>I just learned about Nvidia's Thrust library. To try it out, I wrote a small example which is supposed to normalize a bunch of vectors.</p>

<pre><code>#include &lt;cstdio&gt;

#include &lt;thrust/transform.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/host_vector.h&gt;

struct normalize_functor: public thrust::unary_function&lt;double4, double4&gt;
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector&lt;double4&gt; v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector&lt;double4&gt; v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither does this..
    thrust::host_vector&lt;double4&gt; result = v_d;

    for(int i=0; i&lt;v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}
</code></pre>

<p>The example above seems to work; however, I'm unable to copy the data back. I thought a simple assignment would invoke a cudaMemcpy. Copying the data from the host to the device works, but not the other way around. Why?</p>

<p>Secondly, I'm not sure if I'm doing this the right way. The documentation of <a href="http://wiki.thrust.googlecode.com/hg/html/group__modifying.html" rel="nofollow">for_each</a> says:</p>

<blockquote> <p>for_each applies the function object f to each element in the range [first, last); f's return value, if any, is ignored.</p> </blockquote>

<p>But the unary_function struct template expects two template arguments (one for the return value) and forces operator() to also return a value, which results in a warning when compiling. I don't see how I'm supposed to write a unary functor with no return value.</p>

<p>Next is the data arrangement. I just chose double4 since this will result in two fetch instructions, ld.v2.f64 and ld.f64, IIRC.
However, I'm wondering how Thrust fetches the data under the hood (and how many CUDA threads/blocks are created). If I instead chose a structure of four vectors, would it be able to fetch the data in a coalesced way?</p>

<p>Finally, Thrust provides tuples. What about an array of tuples? How would the data be arranged in that case?</p>

<p>I looked through the examples, but I haven't found one which explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example says something about "structure of arrays" instead of "arrays of structures", but I see no structures used in the example).</p>

<p><strong>Update</strong></p>

<p>I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.</p>

<pre><code>#include &lt;cstdio&gt;
#include &lt;cstdlib&gt;

#include &lt;thrust/transform.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/host_vector.h&gt;

struct normalize_functor
{
    __device__ __host__ void operator()(double4&amp; v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector&lt;double4&gt; v(n);

    for(int i=0; i&lt;n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector&lt;double4&gt; v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());
    v = v_d;

    return 0;
}
</code></pre>

<p>Profiling with computeprof shows me low occupancy and non-coalesced memory access:</p>

<pre><code>Kernel Occupancy Analysis

Kernel details: Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio            = 0.984375 ( 32256 / 32768 ) [24 registers per thread]
Shared Memory Ratio       = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM      = 3 / 8
Active threads per SM     = 1344 / 1536
Potential Occupancy       = 0.875 ( 42 / 48 )
Max achieved occupancy    = 0.583333 (on 9 SMs)
Min achieved occupancy    = 0.291667 (on 5 SMs)
Occupancy limiting factor = Block-Size

Memory Throughput Analysis for kernel launch_closure_by_value on device
GeForce GTX 470

Kernel requested global memory read throughput(GB/s):  29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s):       46.73
L1 cache read throughput(GB/s):                        100.40
L1 cache global hit ratio (%):                         48.15
Texture cache memory throughput(GB/s):                 0.00
Texture cache hit rate(%):                             0.00
L2 cache texture memory read throughput(GB/s):         0.00
L2 cache global memory read throughput(GB/s):          42.44
L2 cache global memory write throughput(GB/s):         46.73
L2 cache global memory throughput(GB/s):               89.17
L2 cache read hit ratio(%):                            88.86
L2 cache write hit ratio(%):                           3.09
Local memory bus traffic(%):                           0.00
Global memory excess load(%):                          31.18
Global memory excess store(%):                         62.50
Achieved global memory read throughput(GB/s):          4.73
Achieved global memory write throughput(GB/s):         45.29
Achieved global memory throughput(GB/s):               50.01
Peak global memory throughput(GB/s):                   133.92
</code></pre>

<p>I wonder how I can optimize this?</p>