<p>If you keep u &amp; v interleaved in one register and use 'pmaddwd' with precomputed constants instead of your shift-and-add approach, you can compress the conversion code to about a third of its size and get rid of most stalls at the same time:</p>
<pre><code>; xmm0 = y y y y y y y y
; xmm3 = u v u v u v u v
psubsw xmm3, [Const128]
psubsw xmm0, [Const16]
movdqa xmm4, xmm3
movdqa xmm5, xmm3
pmaddwd xmm3, [const_1]
pmaddwd xmm4, [const_2]
pmaddwd xmm5, [const_3]
psrad xmm3, 14
psrad xmm4, 14
psrad xmm5, 14
pshufb xmm3, [const_4] ; or pshuflw &amp; pshufhw
pshufb xmm4, [const_4]
pshufb xmm5, [const_4]
paddsw xmm3, xmm0
paddsw xmm4, xmm0
paddsw xmm5, xmm0
</code></pre>
<p>If you want it to work even faster, playing with PMADDUBSW should allow you to work on 16 pixels at a time with a small increase in complexity.</p>
<p>Most processors will benefit from a prefetchnta [esi+256] thrown inside the loop. This applies particularly to non-Intel processors, which are notorious for not having a well-working hardware prefetcher, but to a lesser extent to Intel ones too.</p>
<p>EDIT: the code that uses PMADDUBSW could look like this (correctness not guaranteed):</p>
<pre><code>const_a: times 4 db 1,3
         times 4 db 5,7
const_b: times 4 db 9,11
         times 4 db 13,15
const_c: times 8 dw 0x00ff
const_d: times 4 dd 0x00ffffff
const_uv_to_rgb_mul: ...
const_uv_to_rgb_add: ...

movdqa xmm4, [esi]
movdqa xmm0, xmm4
movdqa xmm1, xmm4
pshufb xmm0, [const_a]
pshufb xmm1, [const_b]
pand xmm4, [const_c]
; xmm0: uv0 uv0 uv0 uv0 uv2 uv2 uv2 uv2
; xmm1: uv4 uv4 uv4 uv4 ...
; xmm4: y0 0 y1 0 y2 0 y3 0 y4 0 y5 0 y6 0 y7 0
pmaddubsw xmm0, [const_uv_to_rgb_mul]
pmaddubsw xmm1, [const_uv_to_rgb_mul]
paddsw xmm0, [const_uv_to_rgb_add]
paddsw xmm1, [const_uv_to_rgb_add]
psraw xmm0, 6
psraw xmm1, 6
; r01 g01 b01 0 r23 g23 b23 0
pshufd xmm2, xmm0, 2+3*4+2*16+3*64
pshufd xmm0, xmm0, 0+1*4+0*16+1*64
pshufd xmm3, xmm1, 2+3*4+2*16+3*64
pshufd xmm1, xmm1, 0+1*4+0*16+1*64
; xmm0: r01 g01 b01 0 r01 g01 b01 0
; xmm2: r23 g23 b23 0 r23 g23 b23 0
; xmm1: r45 g45 b45 0 r45 g45 b45 0
; xmm3: r67 g67 b67 0 r67 g67 b67 0
paddsw xmm0, xmm4 ; add y
paddsw xmm1, xmm4
paddsw xmm2, xmm4
paddsw xmm3, xmm4
packuswb xmm0, xmm2 ; pack with saturation into 0-255 range
packuswb xmm1, xmm3
pand xmm0, [const_d] ; zero out the alpha byte
pand xmm1, [const_d]
movntdq [edi], xmm0
movntdq [edi+16], xmm1
</code></pre>
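<p>To check a SIMD version against something simple, a scalar model of the pmaddwd step can help. The C sketch below is only an illustration, not the answer's actual code. The names madd_pair, clamp_u8 and yuv_to_rgb_q14 are hypothetical, and the Q14 constants are an assumption (common JFIF-style chroma coefficients multiplied by 16384), since the values of const_1 to const_3 are not given above. It also clamps to 0-255 at the end, which the first listing leaves to later code.</p>

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb8;

/* Scalar model of one pmaddwd lane: two signed 16-bit products
   summed into a single 32-bit result. */
static int32_t madd_pair(int16_t a0, int16_t a1, int16_t c0, int16_t c1) {
    return (int32_t)a0 * c0 + (int32_t)a1 * c1;
}

/* Saturate to the 0-255 range, as packuswb would. */
static uint8_t clamp_u8(int32_t x) {
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

rgb8 yuv_to_rgb_q14(uint8_t y, uint8_t u, uint8_t v) {
    /* Same bias removal as psubsw with Const16 / Const128. */
    int16_t yy = (int16_t)(y - 16);
    int16_t uu = (int16_t)(u - 128);
    int16_t vv = (int16_t)(v - 128);

    /* ASSUMED constants, coefficient * 2^14 rounded to nearest:
       1.402 -> 22970, 0.344136 -> 5638, 0.714136 -> 11700, 1.772 -> 29032.
       All fit in a signed 16-bit word, which is why a shift of 14 works.
       The >> on a negative value relies on an arithmetic shift (like psrad),
       which mainstream compilers provide. */
    int32_t r = madd_pair(uu, vv, 0, 22970) >> 14;
    int32_t g = madd_pair(uu, vv, -5638, -11700) >> 14;
    int32_t b = madd_pair(uu, vv, 29032, 0) >> 14;

    /* Add luma to each chroma term, as paddsw does, then saturate. */
    rgb8 out = { clamp_u8(yy + r), clamp_u8(yy + g), clamp_u8(yy + b) };
    return out;
}
```

<p>Each (u, v) constant pair here corresponds to one 32-bit lane of a pmaddwd constant vector, so the same pairs repeated four times would fill a 128-bit register.</p>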