StackOverflow2013

plurals

POFast(est) way to write a sequence of integers to global memory?
primarykey
Id
18403568
data
AcceptedAnswerId
0
AnswerCount
2
ClosedDate
CommentCount
13
CommunityOwnedDate
CreationDate
2013-08-23T13:01:39.313
FavoriteCount
5
LastActivityDate
2013-08-25T09:53:32.847
LastEditDate
2013-08-23T20:49:26.837
LastEditorUserId
2188453
OwnerUserId
2188453
ParentId
0
PostTypeId
1
Score
11
ViewCount
629
LastEditorDisplayName
text
Body
The task is very simple: write a sequence of integers to memory. Original code:
<pre><code>for (size_t i = 0; i < 1000*1000*1000; ++i) {
    data[i] = i;
}
</code></pre>
Parallelized code:
<pre><code>size_t stepsize = len / N;
#pragma omp parallel num_threads(N)
{
    int threadIdx = omp_get_thread_num();
    // Each thread fills one contiguous chunk; the last thread also takes the remainder.
    size_t istart = stepsize * threadIdx;
    size_t iend = (threadIdx == N - 1) ? len : istart + stepsize;
    #pragma simd
    for (size_t i = istart; i < iend; ++i)
        x[i] = i;
}
</code></pre>
The performance sucks: it takes 1.6 sec to write 1G <code>uint64</code> values (8 GB in total, which works out to 5 GB/sec). With a simple parallelization (<code>omp parallel</code>) of the above code the speed increases a bit, but performance still sucks: it takes 1.4 sec with 4 threads and 1.35 sec with 6 threads on an i7 3970. The theoretical memory bandwidth of my rig (i7 3970/64G DDR3-1600) is 51.2 GB/sec, so for the above example the achieved memory bandwidth is only about 1/10 of the theoretical bandwidth, even though the application is pretty much memory-bandwidth-bound.
Does anyone know how to improve the code? I have written a lot of memory-bound code on GPUs, and it is pretty easy for a GPU to take full advantage of its device memory bandwidth (e.g. 85%+ of the theoretical bandwidth).
EDIT: The code is compiled by Intel ICC 13.1 to a 64-bit binary, with maximum optimization (O3), the AVX code path, and auto-vectorization turned on.
UPDATE: I tried all the code below (thanks to Paul R) and nothing special happens; I believe the compiler is fully capable of doing this kind of SIMD/vectorization optimization. As for why I want to fill in the numbers, long story short: it is part of a high-performance heterogeneous computing algorithm. On the device side the algorithm is so efficient, and the multi-GPU set so fast, that the performance bottleneck turned out to be the CPU writing several sequences of numbers to memory.
Of course, knowing that the CPU sucks at filling in numbers, I could change my algorithm a bit. (In contrast, the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory: 238 GB/sec out of 288 GB/sec on GK110, versus a pathetic 5 GB/sec out of 51.2 GB/sec on the CPU.) What makes me wonder is why the CPU is so bad at filling a sequence of numbers here. As for the memory bandwidth of my rig, I believe the figure (51.2 GB/sec) is about correct: in my <code>memcpy()</code> test, the achieved bandwidth is about 80%+ of the theoretical bandwidth (>40GB/sec).
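The figures quoted in the question (1.6 sec for 1G 8-byte values, about 5 GB/sec) and the per-thread chunking can be reproduced with a small helper set like the one below. This is only a minimal sketch, not the asker's benchmark: the names `Range`, `thread_range`, `bandwidth_gb_per_s`, and `timed_fill` are hypothetical, the fill is single-threaded, and 1 GB is assumed to mean 10^9 bytes, as the question's arithmetic implies.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open index range [begin, end) assigned to one thread (hypothetical helper type).
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Same split as in the question: equal-sized steps, and the last thread
// also takes whatever remainder is left over.
Range thread_range(std::size_t len, std::size_t num_threads, std::size_t tid) {
    const std::size_t step = len / num_threads;
    const std::size_t begin = step * tid;
    const std::size_t end = (tid == num_threads - 1) ? len : begin + step;
    return Range{begin, end};
}

// Achieved bandwidth in GB/sec, assuming 1 GB = 1e9 bytes.
double bandwidth_gb_per_s(std::size_t elements, std::size_t bytes_per_element,
                          double seconds) {
    return static_cast<double>(elements) * static_cast<double>(bytes_per_element)
           / seconds / 1e9;
}

// Serially writes data[i] = i and returns the elapsed wall-clock time in seconds.
double timed_fill(std::vector<std::uint64_t>& data) {
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = i;
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

With the numbers from the question, `bandwidth_gb_per_s(1000000000, 8, 1.6)` comes out at roughly 5, against the 51.2 GB/sec theoretical figure. When timing a real buffer with `timed_fill`, it should be much larger than the last-level cache, otherwise the result reflects cache rather than main-memory bandwidth.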
Tags
<c++><c><memory><optimization>
Title
Fast(est) way to write a sequence of integers to global memory?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USuser2188453
UserOwnerUserId
1. USuser2188453
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POFast(est) way to write a sequence of integers to global memory?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POFast(est) way to write a sequence of integers to global memory?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POFast(est) way to write a sequence of integers to global memory?
 UserUserId
 USMohamad Ali Baydoun
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
