StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Interesting problem.</p> <p>Your problem definition has basically enforced unpredictable access to in[x,y] - since any vector might be provided. Assuming the vector image tends to refer to local pixels, the very first optimisation would be to ensure that you traverse the memory in a suitable order to make the most of cache locality. This might mean scanning 32*32 blocks in the "for each pixel" loop so that in[x,y] hits the same pixels as often as possible in a short time.</p> <p>Most likely the performance of your algorithm is going to be bound by two things</p> <ol> <li>How fast you can load <code>vectors[x,y]</code> and <code>in[x,y]</code> from main memory</li> <li>How long it takes to do the multiplications and sums</li> </ol> <p>There are SSE instructions that can multiply several elements together at a time and then add them together (multiply and accumulate). What you should do is compute </p> <pre><code>af = (1 - xf) * ( 1 - yf ) bf = ( xf) * ( 1 - yf ) cf = (1 - xf) * ( yf ) df = ( xf) * ( yf ) </code></pre> <p>and then compute</p> <pre><code>a *= af b *= bf c *= cf d *= cf return (a + b + c + d) </code></pre> <p>There is a good chance that both of these steps could be accomplished with a surprisingly small number of SSE instructions (depending on your pixel representation).</p> <p>I think that caching intermediate values is very unlikely to be useful - it seems extremely unlikely that >1% of the vector requests will point to exactly the same place, and caching stuff will cost you much more in memory bandwidth than it will save.</p> <p>If you use the prefetch instructions on your cpu to prefetch <code>in[vectors[x+1, y]]</code> as you process <code>vectors[x,y]</code> you might improve memory performance, there is no way the CPU will be able to predict your random walk around memory otherwise.</p> <p>The final way to improve the performance of your algorithm is to handle chunks of input pixels at once, i.e <code>x[0..4], x[5..8]</code> - this lets you unroll the inner maths loops. However, you are so likely to be memory bound that this won't help.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload