Note that there are some explanatory texts on larger screens.

plurals
  1. POSlow random memory access in OpenCL (GPU)
    primarykey
    data
    text
    <p>Basically, I'm writing an OpenCL kernel which accesses global memory in a random/unpredictable manner (currently-unoptimized path tracing component of a ray tracer), and this is almost completely negating the GPU's parallelized performance advantage over the CPU (for reference, I'm running on an i7-2630QM CPU, GTX 560m GPU - performance figures below). For ease of tweaking/testing, I've written a "test" kernel which simulates this memory access pattern; it essentially provides the GPU with a large array of triangle coordinates and a list of indeces to process - for each index, it will run a ray-triangle intersection on that triangle and the 63 after, mimicking iteration through objects in an octree.</p> <p>I've tried a whole slew of optimizations, including: coalesced memory access, using read-only texture memory in place of "global", loop unrolling, tweaking work group sizes and thread distribution, local memory and barriers, and manually inlining functions. These all have provided, at best, incremental performance improvements. Sorting the indeces prior to running the kernel does speed it up significantly, but for octree traversal this would require an on-GPU re-sort every iteration, and combined with other factors, leaves me doubting it will be of much help there.</p> <p>I'm trying to figure out if there's some major hole that can be mended - misuse of data types, unseen optimizations, drivers too old (using OpenCL 1.0, which doesn't allow 1d textures), etc - or if I'm expecting too much of a performance increase, given the hardware I'm working with (various optimizations on the ray tracing side of things remain to be done, but I'd like to get this more general issue squared away before delving into that). Greatly appreciate any insight or suggestions in advance.</p> <p>Performance figures (seconds) for 409,600 blocks of 64 triangles (run as 409,600 threads):</p> <pre><code>CPU (Single Thread): Unsorted: 2.21 Sorted: 1.48 GPU: Sorted Unsorted Texture 0.07 0.15 Global 0.02 0.25 </code></pre> <p>Code:</p> <pre><code>#define IMG_WIDTH_MINUS_ONE 32767 #define IMG_HEIGHT_LOG_2 15 #define SUB(dest,v1,v2) \ dest[0]=v1[0]-v2[0]; \ dest[1]=v1[1]-v2[1]; \ dest[2]=v1[2]-v2[2]; #define EPSILON 0.00001 #define CROSS(dest,v1,v2) \ dest[0]=v1[1]*v2[2]-v1[2]*v2[1]; \ dest[1]=v1[2]*v2[0]-v1[0]*v2[2]; \ dest[2]=v1[0]*v2[1]-v1[1]*v2[0]; #define DOT(v1,v2) (v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]) __kernel void square( __global int4 *inputIndeces, __read_only image2d_t image, __global float* output, const unsigned int count) { int global_id = get_global_id(0); float r_orig[3]; float r_dir[3]; float4 trianglePoints[3]; int cpuStartIndex = inputIndeces[global_id].x; int outputIndex = inputIndeces[global_id].w; output[outputIndex] = 0.0; r_orig[0] = 0.0; r_orig[1] = 0.0; r_orig[2] = 500.0; r_dir[0] = 0.0; r_dir[1] = 0.0; local int counter; counter = 0; r_dir[2]= -1.0; float tvec[3], pvec[3], qvec[3], edgeA[3], edgeB[3]; float det, inv_det, t, u, v; #pragma unroll 64 for (int ind=cpuStartIndex;ind&lt;cpuStartIndex+64;++ind) { int tIndex = ind&lt;&lt;2; int2 coords[3]; coords[0] = (int2)(tIndex &amp; IMG_WIDTH_MINUS_ONE,tIndex &gt;&gt; IMG_HEIGHT_LOG_2); coords[1] = (int2)((tIndex + 1) &amp; IMG_WIDTH_MINUS_ONE,(tIndex + 1) &gt;&gt; IMG_HEIGHT_LOG_2); coords[2] = (int2)((tIndex + 2) &amp; IMG_WIDTH_MINUS_ONE,(tIndex + 2) &gt;&gt; IMG_HEIGHT_LOG_2); trianglePoints[0] = read_imagef(image, sampler, coords[0]); trianglePoints[1] = read_imagef(image, sampler, coords[1]); trianglePoints[2] = read_imagef(image, sampler, coords[2]); edgeA[0] = (trianglePoints[0].w - trianglePoints[0].x); edgeA[1] = (trianglePoints[1].x - trianglePoints[0].y); edgeA[2] = (trianglePoints[1].y - trianglePoints[0].z); edgeB[0] = (trianglePoints[1].z - trianglePoints[0].x); edgeB[1] = (trianglePoints[1].w - trianglePoints[0].y); edgeB[2] = (trianglePoints[2].x - trianglePoints[0].z); CROSS(pvec,r_dir,edgeB); det = DOT(edgeA, pvec); if (det &gt; -EPSILON &amp;&amp; det &lt; EPSILON) { continue; } inv_det = 1.0 / det; tvec[0] = r_orig[0] - trianglePoints[0].x; tvec[1] = r_orig[1] - trianglePoints[0].y; tvec[2] = r_orig[2] - trianglePoints[0].z; u = DOT(tvec, pvec) * inv_det; if (u &lt; 0.0 || u &gt; 1.0) { continue; } CROSS(qvec,tvec,edgeA); v = DOT(r_dir, qvec) * inv_det; if (v &lt; 0.0 || u + v &gt; 1.0) { continue; } t = DOT(edgeB, qvec) * inv_det; if (t &gt; 0.001) { ++counter; } else { continue; } } output[outputIndex] = (float)counter; } </code></pre>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload