
How to optimize OpenCL code for accessing neighbors?
<p><strong>Edit</strong>: Results of the proposed solutions are added at the end of the question.</p>
<p>I'm starting to program with OpenCL, and I have created a naive implementation of my problem.</p>
<p>The theory is: I have a 3D grid of elements, where each element holds a bunch of information (around 200 bytes). Every step, every element accesses its neighbors' information and accumulates it to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.</p>
<p>My OpenCL implementation is: I create a one-dimensional OpenCL buffer and fill it with structs representing the elements, which have an "int neighbors[ 6 ] " where I store the indices of the neighbors in the buffer. I launch a kernel that consults the neighbors and accumulates their information into element variables that are not read in this step, and then I launch another kernel that uses these variables to update the elements. These kernels use __global variables only.</p>
<p>Sample code:</p>
<pre><code>typedef struct{
    float4 var1;
    float4 var2;
    float4 nextStepVar1;
    int neighbors[8];
    int var3;
    int nextStepVar2;
    bool var4;
} Element;

__kernel void step1(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id &gt;= nelements){
        return;
    }
    Element elem = elements[id];

    for (int i=0; i &lt; 6; ++i){
        if (elem.neighbors[i] != -1){
            //Gather information of the neighbor and accumulate it in elem.nextStepVars
        }
    }
    elements[id] = elem;
}

__kernel void step2(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id &gt;= nelements){
        return;
    }
    Element elem = elements[id];

    //update elem variables by using elem.nextStepVariables
    //restart elem.nextStepVariables

    //write the updated element back to global memory
    elements[id] = elem;
}
</code></pre>
<p>Right now, my OpenCL implementation takes basically the same time as my C++ implementation.</p>
<p>So, the question is: How would you (the experts :P) address this problem? I have read about 3D images, which could store the information and change the neighborhood access pattern by switching to a 3D NDRange. I have also read about __local memory: first load the whole neighborhood of a workgroup, synchronize with a barrier and then use it, so that accesses to global memory are reduced.</p>
<p>Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?</p>
<p>Thanks.</p>
<p><strong>Edit</strong>: The third and fifth optimizations proposed by <a href="https://stackoverflow.com/users/1470092/huseyin-tugrul-buyukisik">Huseyin Tugrul</a> were already in the code. As mentioned <a href="https://stackoverflow.com/questions/8994219/i-need-help-understanding-data-alignment-in-opencls-buffers">here</a>, structs need to satisfy some alignment restrictions to behave properly, so it is worth understanding them to avoid headaches.</p>
<p><strong>Edit 1</strong>: Applying the seventh optimization proposed by <a href="https://stackoverflow.com/users/1470092/huseyin-tugrul-buyukisik">Huseyin Tugrul</a>, performance increased from 7 fps to 60 fps. In a more general experiment, the performance gain was about x8.</p>
<p><strong>Edit 2</strong>: Applying the first optimization proposed by <a href="https://stackoverflow.com/users/1470092/huseyin-tugrul-buyukisik">Huseyin Tugrul</a>, performance increased by about x1.2. I think that the real gain is higher, but it is hidden by another bottleneck that is not yet solved.</p>
<p><strong>Edit 3</strong>: Applying the 8th and 9th optimizations proposed by <a href="https://stackoverflow.com/users/1470092/huseyin-tugrul-buyukisik">Huseyin Tugrul</a> didn't change performance, because my kernels have little code that can take advantage of them. They are worth trying in other kernels, though.</p>
<p><strong>Edit 4</strong>: Passing invariant arguments (such as n_elements or workgroupsize) to the kernels as #defines instead of kernel args, as mentioned <a href="http://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf" rel="nofollow noreferrer">here</a>, increased performance by around x1.33. As explained in the document, this is because of the aggressive optimizations that the compiler can do when it knows the values at compile time.</p>
<p><strong>Edit 5</strong>: Applying the second optimization proposed by <a href="https://stackoverflow.com/users/1470092/huseyin-tugrul-buyukisik">Huseyin Tugrul</a>, but using 1 bit per neighbor and bitwise operations to check whether a neighbor is present (so, if (neighbors &amp; 1) != 0, the top neighbor is present; if (neighbors &amp; 2) != 0, the bottom neighbor is present; if (neighbors &amp; 4) != 0, the right neighbor is present, etc.), increased performance by a factor of x1.11. I think this was mostly because of the reduction in data transfer, because data movement was, and still is, my bottleneck. Soon I will try to get rid of the dummy variables used to add padding to my structs.</p>
<p><strong>Edit 6</strong>: By eliminating the structs that I was using and creating separate buffers for each property, I eliminated the padding variables, saving space, and was able to optimize the global memory access and local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing, despite the added programming complexity and the loss of readability :P.</p>
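<p>To make the 1-bit-per-neighbor encoding from Edit 5 concrete, here is a minimal sketch in plain C, so that it can be compiled and checked on the host. It is not the original code: the flag names, the bit order after the first three (top, bottom, right), the regular row-major grid and the helper names <code>neighbor_mask</code>, <code>has_neighbor</code> and <code>cell_index</code> are all assumptions made for illustration. The same expressions should carry over to an OpenCL C kernel with <code>uchar</code> in place of <code>uint8_t</code>.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignment: one bit per face neighbor of a 3D cell.
 * Only the first three (top = 1, bottom = 2, right = 4) come from the
 * question; the remaining ones are assumed. */
enum {
    NB_TOP    = 1 << 0,
    NB_BOTTOM = 1 << 1,
    NB_RIGHT  = 1 << 2,
    NB_LEFT   = 1 << 3,
    NB_FRONT  = 1 << 4,
    NB_BACK   = 1 << 5
};

/* Assumed layout: regular nx * ny * nz grid stored row-major in a 1D buffer. */
static int cell_index(int x, int y, int z, int nx, int ny)
{
    return x + nx * (y + ny * z);
}

/* Build the presence mask for cell (x, y, z): a neighbor exists unless
 * the cell lies on the corresponding face of the grid. */
static uint8_t neighbor_mask(int x, int y, int z, int nx, int ny, int nz)
{
    uint8_t mask = 0;
    if (y + 1 < ny) mask |= NB_TOP;
    if (y > 0)      mask |= NB_BOTTOM;
    if (x + 1 < nx) mask |= NB_RIGHT;
    if (x > 0)      mask |= NB_LEFT;
    if (z + 1 < nz) mask |= NB_FRONT;
    if (z > 0)      mask |= NB_BACK;
    return mask;
}

/* Test one flag. The parentheses are required: != binds tighter than &,
 * so "mask & flag != 0" would be parsed as "mask & (flag != 0)". */
static bool has_neighbor(uint8_t mask, unsigned flag)
{
    return (mask & flag) != 0;
}
```

<p>For example, <code>neighbor_mask(0, 0, 0, 3, 3, 3)</code> sets only the top, right and front bits, while an interior cell such as (1, 1, 1) gets all six bits (0x3F). With this encoding the six stored <code>int</code> indices shrink to a single byte per element, and the neighbor's position in the buffer is recomputed from the coordinates, for instance with <code>cell_index(x + 1, y, z, nx, ny)</code> for the right neighbor.</p>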