**Disclaimer**

Note that this answer contains more questions than answers. Also note that I'm guessing a lot, because I don't get huge parts of your question and source code.

**Reconstruction**

So I'm guessing that your global memory is an array of `Circle` structs. You seem to have optimized loading these circles by loading each of their floats separately into shared memory. This way you get contiguous access patterns instead of strided ones. Am I still correct here?

Now that you have cooperatively loaded `blockDim.x` circles into shared memory, you want to read a circle `c` from it for each thread. You seem to have tried three different ways:

1. loading `c` from strided shared memory
   (`c.prevX = smem[threadIdx.x + blockDim.x * 2];`, etc.)
2. loading `c` directly from shared memory
   (`c = *((Circle*)(smem + threadIdx.x * SMEM));`)
3. loading `c` directly from global memory
   (`c = cOut[j];`)

Still correct?

**Evaluation**

1. makes no sense if you load circles into shared memory the way I described above, so you probably tried a different loading pattern there, something along the lines of `[threadIdx.x * 8 + 0]`, as noted in your comment. This variant has the benefit of contiguous global access, but storing into smem causes bank conflicts.
2. is no better, because it hits bank conflicts when reading into registers.
3. is worse, because of strided global memory access.

**Answer**

Bank conflicts are easily resolved by inserting dummy values. Instead of indexing with `[threadIdx.x * 8 + 0]` you would use `[threadIdx.x * 9 + 0]`. Note that you are wasting a bit of shared memory (i.e. every ninth float) to spread the data out across the banks, and that you have to use the same padded layout when loading the data into shared memory in the first place. But notice that you are still doing a lot of work just to get these `Circle` structs into shared memory, which leads me to an

**Even better answer**

Just don't use an array of `Circle` structs in global memory. Invert your memory layout by using multiple arrays of float instead, one for each component of a `Circle`. You can then simply load into registers directly:

```
c.x = gmem_x[j];
c.y = gmem_y[j];
...
```

No shared memory at all, fewer registers thanks to less pointer arithmetic, contiguous global access patterns, and no bank conflicts. All of it for free!

Now you might think there is a downside when preparing the data on the host and getting the results back. My best (and final) guess is that it will still be much faster overall, because you will probably either launch the kernel every frame and visualize the results with a shader without ever transferring the data back to the host, or launch the kernel multiple times in a row before downloading the results. Correct?
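To make the padding trick under **Answer** concrete, here is a minimal CUDA sketch under stated assumptions: a `Circle` made of 8 floats, a padded stride of 9 floats per circle, and illustrative names (`FIELDS`, `STRIDE`, `kernelWithPaddedSmem`) that are not taken from your code. It only shows the cooperative, coalesced load into padded shared memory and the conflict-free per-thread read; the actual per-circle work is left out.

```
#define FIELDS 8                // floats per Circle (assumption)
#define STRIDE (FIELDS + 1)     // one dummy float of padding per circle

struct Circle { float x, y, prevX, prevY, vx, vy, radius, mass; };  // assumed layout

__global__ void kernelWithPaddedSmem(const Circle* __restrict__ cIn, int numCircles)
{
    extern __shared__ float smem[];   // blockDim.x * STRIDE floats, sized at launch

    const float* gIn = reinterpret_cast<const float*>(cIn);
    int blockBase   = blockIdx.x * blockDim.x;   // first circle handled by this block
    int chunkFloats = blockDim.x * FIELDS;       // floats this block loads

    // Cooperative, coalesced load: consecutive threads read consecutive
    // global floats, but each float is stored at its padded smem slot.
    for (int k = threadIdx.x; k < chunkFloats; k += blockDim.x) {
        int circle = k / FIELDS;
        int field  = k % FIELDS;
        if (blockBase + circle < numCircles)
            smem[circle * STRIDE + field] = gIn[blockBase * FIELDS + k];
    }
    __syncthreads();

    if (blockBase + threadIdx.x >= numCircles) return;

    // Conflict-free read: stride 9 is coprime to the 32 banks, so 32
    // consecutive threads hit 32 different banks.
    Circle c;
    c.x     = smem[threadIdx.x * STRIDE + 0];
    c.y     = smem[threadIdx.x * STRIDE + 1];
    c.prevX = smem[threadIdx.x * STRIDE + 2];
    c.prevY = smem[threadIdx.x * STRIDE + 3];
    // ... load the remaining fields and do the per-circle work with c ...
    (void)c;
}
```

The kernel would be launched with `blockDim.x * STRIDE * sizeof(float)` bytes of dynamic shared memory.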
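And a minimal sketch of the structure-of-arrays layout from **Even better answer**. The array names (`gmem_x`, `gmem_y`, ...), the output arrays, and the kernel name are assumptions for illustration; substitute whichever components your `Circle` actually has.

```
struct Circle { float x, y, prevX, prevY; };  // reduced, assumed set of components

__global__ void updateCircles(const float* __restrict__ gmem_x,
                              const float* __restrict__ gmem_y,
                              const float* __restrict__ gmem_prevX,
                              const float* __restrict__ gmem_prevY,
                              float*       out_x,
                              float*       out_y,
                              int          numCircles)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numCircles) return;

    // One fully coalesced load per component, straight into registers:
    // neighbouring threads read neighbouring addresses in each array.
    Circle c;
    c.x     = gmem_x[j];
    c.y     = gmem_y[j];
    c.prevX = gmem_prevX[j];
    c.prevY = gmem_prevY[j];

    // ... per-circle work on c ...

    // Results go back component-wise in the same coalesced pattern.
    out_x[j] = c.x;
    out_y[j] = c.y;
}
```

On the host side this simply means one `cudaMalloc`/`cudaMemcpy` per component array instead of one for the whole struct array.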