    <p>The first thing I would suspect is alignment. You may want to experiment with:</p> <pre><code>__attribute__ ((aligned (16))) float ...[maxsize]; </code></pre> <p>Or make sure that <code>maxsize</code> is a multiple of 16. That could definitely cause a 10% hit if in one configuration you're aligned and in another you're not. Vector operations can be extremely sensitive to this.</p> <p>The next major issue you may have is a huge stack (assuming <code>maxsize</code> is fairly large). ARM can deal with offsets less than 4k much more efficiently than it can deal with offsets larger than 4k (because it can only encode 12-bit immediate values). So depending on how the compiler has optimized it, pushing <code>amparray</code> way down on the stack could lead to more complicated math to access it.</p> <p>When small twiddly things lead to big performance changes, I always recommend pulling up the assembly (Product>Generate Output>Assembly) and seeing what changes in the compiler output. I also highly recommend <a href="http://www.coranac.com/tonc/text/asm.htm" rel="nofollow">Whirlwind Tour of ARM Assembly</a> to get you started understanding what you're looking at. (Make sure you set the output to "For Archiving" so you see the optimized result.)</p> <p>You should also do a few more things:</p> <ul> <li><p>Try rewriting this routine as simple C instead of using Accelerate. Yes, I know Accelerate is always faster, except it's not. All those function calls are quite expensive, and in my experience the compiler can often vectorize simple multiplication and addition better than Accelerate can. This is particularly true if your stride is 1, your vectors are not enormous, and you're on a 1-2 core device like an iPad. The moment you have code that handles a stride (if you don't need a stride), it's more complicated (slower) than the code you would have written by hand. 
In my experience, Accelerate does seem to be very good at ramps and transcendentals (cosines of big tables for example), but not nearly so good at simple vector and matrix math.</p></li> <li><p>If this code <em>really</em> matters to you, I've found that hand-writing the assembly can definitely out-pace the compiler. I'm not even that good at ARM assembler, and I've been able to beat the compiler by 2x on simple matrix math (and the compiler crushed Accelerate). I'm particularly talking about your loop here that seems to be doing just adds and multiplies. Handwriting the assembly is a pain of course, and you then have to maintain a C version for the assembler, but when it really matters it's really fast.</p></li> </ul>
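To make the alignment and "simple C" suggestions concrete, here is a minimal sketch. It is not the asker's code: `MAXSIZE`, `mul_add`, `is_aligned16`, and `demo_last` are hypothetical names, the multiply-add is only a stand-in for the real loop, and the attribute syntax assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXSIZE 256 /* hypothetical block size, a multiple of 16 */

/* Returns 1 if p is on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Stride-1 multiply-add in plain C: out[i] = a[i] * b[i] + c[i].
 * A loop this simple is a good candidate for compiler auto-vectorization. */
static void mul_add(const float *a, const float *b, const float *c,
                    float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}

/* Fills 16-byte aligned stack buffers, runs the loop, returns out[n - 1]. */
static float demo_last(size_t n)
{
    float a[MAXSIZE]   __attribute__((aligned(16)));
    float b[MAXSIZE]   __attribute__((aligned(16)));
    float c[MAXSIZE]   __attribute__((aligned(16)));
    float out[MAXSIZE] __attribute__((aligned(16)));

    assert(n > 0 && n <= MAXSIZE);
    assert(is_aligned16(a) && is_aligned16(b));
    assert(is_aligned16(c) && is_aligned16(out));

    for (size_t i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 2.0f;
        c[i] = 1.0f;
    }
    mul_add(a, b, c, out, n);
    return out[n - 1];
}
```

Calling `demo_last(4)` computes `3 * 2 + 1` for the last element. Whether a loop like this beats the Accelerate call on a given device is something to measure, and the generated assembly will show whether the compiler actually vectorized it.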
    1. CO Recently I changed my analysis blocks, and I realize now that I could decrease maxsize considerably, which may make this a useless exercise. Originally all of the code was in straight C, and it was easily 10x slower than after I replaced it with Accelerate (vdsp_vmma and vvsinf are vastly faster than C), though that was when the blocks were larger and the iterated array lengths and index were variable (neither is true now). The timing I have now is acceptable, and I'm not sure ASM would be worth the time. However, I am curious about the alignment: I have tried that attribute to no effect. Any thoughts on this?
    2. CO Thanks for the note on vdsp_vmma. I definitely agree about vvsinf; I had not profiled vmma. Alignment doesn't always win if the data is already aligned (which can happen by accident or optimization), or if the particular function doesn't rely on it too much. Sometimes these functions run a separate iteration to handle the leading unaligned values until they can get onto aligned data. Sometimes they just don't require aligned data.
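The "separate iteration for the leading unaligned values" idea can be sketched as follows. This is only an illustration of the general peeling technique, not how Accelerate is implemented; `scale_peeled` is a hypothetical helper, and the four-wide main loop stands in for what would be vector instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scales x[0..n) by k in place, peeling leading unaligned elements first. */
static void scale_peeled(float *x, size_t n, float k)
{
    size_t i = 0;

    /* Prologue: scalar steps until x + i reaches a 16-byte boundary. */
    while (i < n && ((uintptr_t)(x + i) & 15u) != 0) {
        x[i] *= k;
        i++;
    }

    /* Main body: x + i is now 16-byte aligned; handle 4 floats per step. */
    for (; i + 4 <= n; i += 4) {
        x[i]     *= k;
        x[i + 1] *= k;
        x[i + 2] *= k;
        x[i + 3] *= k;
    }

    /* Epilogue: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        x[i] *= k;
}
```

With this structure the caller's alignment attribute only saves the few prologue iterations, which may be one reason adding it shows no measurable effect.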
    3. CO Thank you. I'll give aligning another shot, and play with maxsize. The whole reason I went to Accelerate was that the block size was -huge-, and I was just spending too much time iterating. Now that I've moved to a smaller block, straight C may be faster without the overhead. However, I won't remove Accelerate now, as it makes the function more readable (again, not all of my code is shown, but the missing lines are all simple arithmetic).