<p>The first thing I would suspect is alignment. You may want to experiment with:</p>
<pre><code>__attribute__ ((aligned (16))) float ...[maxsize]; </code></pre>
<p>Or make sure that <code>maxsize</code> is a multiple of 16. That could definitely cause a 10% hit if in one configuration you're aligned and in another you're not. Vector operations can be extremely sensitive to this.</p>
<p>The next major issue you may have is a huge stack (assuming <code>maxsize</code> is fairly large). ARM can deal with numbers less than 4k much more efficiently than it can deal with numbers larger than 4k (because it can only deal with 12-bit immediate values). So depending on how the compiler has optimized it, pushing <code>amparray</code> way down on the stack could lead to more complicated math to access it.</p>
<p>When small twiddly things lead to big performance changes, I always recommend pulling up the assembly (Product&gt;Generate Output&gt;Assembly) and seeing what changes in the compiler output. I also highly recommend <a href="http://www.coranac.com/tonc/text/asm.htm" rel="nofollow">Whirlwind Tour of ARM Assembly</a> to get you started understanding what you're looking at. (Make sure you set the output to "For Archiving" so you see the optimized result.)</p>
<p>You should also do a few more things:</p>
<ul>
<li><p>Try rewriting this routine as simple C instead of using Accelerate. Yes, I know Accelerate is always faster, except it's not. All those function calls are quite expensive, and in my experience the compiler can often vectorize simple multiplication and addition better than Accelerate can. This is particularly true if your stride is 1, your vectors are not enormous, and you're on a 1-2 core device like an iPad. The moment you have code that handles a stride (if you don't need a stride), it's more complicated (slower) than the code you would have written by hand. In my experience, Accelerate does seem to be very good at ramps and transcendentals (cosines of big tables for example), but not nearly so good at simple vector and matrix math.</p></li>
<li><p>If this code <em>really</em> matters to you, I've found that hand-writing the assembly can definitely out-pace the compiler. I'm not even that good at ARM assembler, and I've been able to beat the compiler by 2x on simple matrix math (and the compiler crushed Accelerate). I'm particularly talking about your loop here, which seems to be doing just adds and multiplies. Hand-writing the assembly is a pain of course, and you then have to maintain a C version alongside the assembler, but when it really matters it's really fast.</p></li>
</ul>
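<p>As a rough illustration of the alignment and "simple C" suggestions above, here is a minimal sketch. Your actual routine isn't shown, so the loop shape (a multiply followed by an add) is a guess, and the names <code>MAXSIZE</code>, <code>amp_buf</code>, <code>is_aligned16</code> and <code>mul_add</code> are hypothetical. The attribute syntax assumes GCC or Clang.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size; chosen as a multiple of 16. */
#define MAXSIZE 1024

/* Hypothetical buffer with a requested 16-byte alignment (GCC/Clang syntax).
   It is declared at file scope here only to keep the sketch short. */
__attribute__ ((aligned (16))) float amp_buf[MAXSIZE];

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Plain C, stride 1: dst[i] = a[i] * b[i] + c.
   restrict promises the compiler that the arrays do not overlap,
   which makes the loop easier for it to auto-vectorize. */
void mul_add(const float *restrict a, const float *restrict b,
             float c, float *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] * b[i] + c;
    }
}
```

<p>Calling <code>is_aligned16(amp_buf)</code> in both configurations is a quick way to confirm whether alignment actually differs between them, and comparing the generated assembly for <code>mul_add</code> against the Accelerate version shows whether the compiler vectorized the loop.</p>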
 
