<p>The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvttps2dq, which converts four packed single-precision floats to four packed 32-bit integers in a single instruction, truncating toward zero just as ANSI C's float-to-int conversion requires. You don't need assembly to do this; MSVC exposes compiler intrinsics for the relevant instructions -- <a href="http://msdn.microsoft.com/en-us/library/d2ch40w8(VS.80).aspx" rel="nofollow noreferrer">_mm_cvttps_epi32()</a> if my memory serves me correctly.</p>
<p>If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software-pipeline a little and process <em>sixteen</em> floats per loop iteration, eg:</p>
<pre><code>for (int i = 0; i &lt; HUGE_NUMBER; i += 16) {
    __m128 a = _mm_load_ps(float_array + i + 0);
    __m128 b = _mm_load_ps(float_array + i + 4);
    __m128 c = _mm_load_ps(float_array + i + 8);
    __m128 d = _mm_load_ps(float_array + i + 12);
    __m128i ia = _mm_cvttps_epi32(a);  // cvttps2dq: truncate four floats to four int32s
    __m128i ib = _mm_cvttps_epi32(b);
    __m128i ic = _mm_cvttps_epi32(c);
    __m128i id = _mm_cvttps_epi32(d);
    _mm_store_si128((__m128i *)(int_array + i + 0), ia);
    _mm_store_si128((__m128i *)(int_array + i + 4), ib);
    _mm_store_si128((__m128i *)(int_array + i + 8), ic);
    _mm_store_si128((__m128i *)(int_array + i + 12), id);
}
</code></pre>
<p>The reason for this is that the SSE instructions have a long latency, so if a load into xmm0 is followed immediately by a dependent operation on xmm0, you will have a stall. Keeping multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem, but in practice it doesn't.)</p>
<p>Failing this SSE juju, you can supply the /QIfist option to MSVC, which causes it to issue the single opcode <strong>fist</strong> instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU, without ensuring it is ANSI C's specific truncation. The Microsoft docs say /QIfist is deprecated because their floating-point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.</p>
<p>I'm assuming you're on the x86 architecture, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.</p>