StackOverflow2013


plurals

PO
primarykey
Id
11550371
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
7
CommunityOwnedDate
CreationDate
2012-07-18T21:07:50.093
FavoriteCount
0
LastActivityDate
2012-07-18T21:32:53.500
LastEditDate
2012-07-18T21:32:53.500
LastEditorUserId
442006
OwnerUserId
442006
ParentId
11549036
PostTypeId
2
Score
6
ViewCount
0
LastEditorDisplayName
text
Body
There are some cases where you can get programs to run at close to full speed on the GPU with very little porting work from a plain CPU version, and this might be one of them.

If it's possible for you to have a function like this:
<pre><code>void process_single_video_frame(void* part_of_frame)
{
  // initialize variables
  ...
  intermediate_result_1 = function1(part_of_frame);
  intermediate_result_2 = function2(intermediate_result_1);
  intermediate_result_3 = function3(intermediate_result_2);
  store_results(intermediate_result_3);
}
</code></pre>
and you can process many <code>part_of_frame</code>s at the same time (say, a few thousand), and <code>function1()</code>, <code>function2()</code> and <code>function3()</code> go through pretty much the same code paths (that is, the program flow does not depend heavily on the contents of the frame), then local memory may do all the work for you.

Local memory is per-thread memory that physically resides in global memory. It differs from global memory in a subtle yet profound way: it is interleaved so that adjacent threads access adjacent 32-bit words. As a result, memory access is fully coalesced whenever all the threads read from the same index of their local arrays.

The flow of your program would be that you start out by copying <code>part_of_frame</code> to a local array and prepare other local arrays for intermediate results. You then pass pointers to the local arrays between the various functions in your code.

Some pseudocode:
<pre><code>const int size_of_one_frame_part = 1000;

// Declared before the kernel so the call below compiles.
__device__ void function1(int* dst, int* src)
{
  ...
}

__global__ void my_kernel(int* all_parts_of_frames)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Per-thread array; the compiler places it in local memory.
  int my_local_array[size_of_one_frame_part];
  memcpy(my_local_array,
         all_parts_of_frames + i * size_of_one_frame_part,
         sizeof(my_local_array));

  int local_intermediate_1[100];
  function1(local_intermediate_1, my_local_array);
  ...
}
</code></pre>
In summary, this approach may let you use your CPU functions pretty much unchanged, because the parallelism does not come from creating parallelized versions of your functions, but from running the entire chain of functions in parallel, once per thread. This in turn is made possible by the hardware support for interleaving the memory in local arrays.

Notes:
<ul>
<li>The initial copy of the <code>part_of_frame</code> from global to local memory is not coalesced, but hopefully you will have enough calculations to hide that.</li>
<li>On devices of compute capability <= 1.3, there is only 16KiB of local memory available per thread, which may not be enough for your <code>part_of_frame</code> and the other intermediate data. But on compute capability >= 2.0, this has been expanded to 512KiB, which should be plenty.</li>
</ul>
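A complete version of the pattern above might look roughly like the following. This is only a sketch under stated assumptions: the sizes `kPartSize` and `kNumParts`, the bodies of `function1` to `function3`, and the helpers `process_part` and `run_and_verify` are hypothetical stand-ins, not code from the question. The point it illustrates is that the same function chain, marked `__host__ __device__`, can run unchanged on the CPU and once per thread on the GPU. Whether the per-thread arrays actually end up in local memory is decided by the compiler.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this sketch.
constexpr int kPartSize = 64;    // ints per frame part
constexpr int kNumParts = 4096;  // one GPU thread per part

// Hypothetical stand-ins for the existing CPU functions.
// __host__ __device__ lets the same source compile for both sides.
__host__ __device__ void function1(int* dst, const int* src) {
    for (int k = 0; k < kPartSize; ++k) dst[k] = src[k] * 2 + 1;
}

__host__ __device__ void function2(int* dst, const int* src) {
    int acc = 0;  // running sum over the part
    for (int k = 0; k < kPartSize; ++k) { acc += src[k]; dst[k] = acc; }
}

__host__ __device__ int function3(const int* src) {
    int best = src[0];  // maximum element
    for (int k = 1; k < kPartSize; ++k) if (src[k] > best) best = src[k];
    return best;
}

// The whole chain for one part. On the device, these per-thread
// arrays are candidates for local memory.
__host__ __device__ int process_part(const int* part) {
    int work[kPartSize];
    int tmp1[kPartSize];
    int tmp2[kPartSize];
    for (int k = 0; k < kPartSize; ++k) work[k] = part[k];  // uncoalesced copy
    function1(tmp1, work);
    function2(tmp2, tmp1);
    return function3(tmp2);
}

__global__ void my_kernel(const int* all_parts, int* results, int num_parts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_parts) return;  // guard the last, partially filled block
    results[i] = process_part(all_parts + i * kPartSize);
}

// Runs the chain on the GPU and on the CPU and returns the number of
// mismatching results, or -1 if a CUDA call failed.
int run_and_verify() {
    std::vector<int> input(static_cast<size_t>(kNumParts) * kPartSize);
    for (size_t n = 0; n < input.size(); ++n)
        input[n] = static_cast<int>((n * 31) % 17);

    const size_t in_bytes = input.size() * sizeof(int);
    const size_t out_bytes = kNumParts * sizeof(int);
    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, in_bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, input.data(), in_bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (kNumParts + threads - 1) / threads;
    my_kernel<<<blocks, threads>>>(d_in, d_out, kNumParts);

    std::vector<int> gpu(kNumParts);
    cudaError_t err =
        cudaMemcpy(gpu.data(), d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kNumParts; ++i)
        if (gpu[i] != process_part(input.data() + i * kPartSize)) ++mismatches;
    return mismatches;
}

int main() {
    int m = run_and_verify();
    if (m < 0) { printf("CUDA error\n"); return 1; }
    printf("%d mismatches\n", m);
    return m == 0 ? 0 : 1;
}
```

Compiled with `nvcc`, the program compares the GPU results against the same functions run on the host. Adding `-Xptxas -v` to the compile line reports the per-thread stack and spill usage, which is one way to check how much local memory the kernel really uses.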
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POProper way to write kernel functions in CUDA?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USRoger Dahl
UserOwnerUserId
1. USRoger Dahl
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POProper way to write kernel functions in CUDA?
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
