Note that there are some explanatory texts on larger screens.

plurals
  1. POFast bignum square computation
    primarykey
    data
    text
    <p>To speed up my bignum divisons I need to speed up operation <code>y = x^2</code> for bigints which are represented as dynamic arrays of unsigned DWORDs. To be clear:</p> <pre><code>DWORD x[n+1] = { LSW, ......, MSW }; </code></pre> <ul> <li>where n+1 is number of used DWORDs</li> <li>so value of number <code>x = x[0]+x[1]&lt;&lt;32 + ... x[N]&lt;&lt;32*(n)</code></li> </ul> <p>The question is: <strong>How do I compute <code>y = x^2</code> as fast as possible without precision loss?</strong> - Using <strong>C++</strong> and with integer arithmetics (32bit with Carry) at disposal.</p> <p>My current approach is applying multiplication <code>y = x*x</code> and avoid multiple multiplications.</p> <p>For example:</p> <pre><code>x = x[0] + x[1]&lt;&lt;32 + ... x[n]&lt;&lt;32*(n) </code></pre> <p>For simplicity, let me rewrite it:</p> <pre><code>x = x0+ x1 + x2 + ... + xn </code></pre> <p>where index represent the address inside the array, so:</p> <pre><code>y = x*x y = (x0 + x1 + x2 + ...xn)*(x0 + x1 + x2 + ...xn) y = x0*(x0 + x1 + x2 + ...xn) + x1*(x0 + x1 + x2 + ...xn) + x2*(x0 + x1 + x2 + ...xn) + ...xn*(x0 + x1 + x2 + ...xn) y0 = x0*x0 y1 = x1*x0 + x0*x1 y2 = x2*x0 + x1*x1 + x0*x2 y3 = x3*x0 + x2*x1 + x1*x2 ... y(2n-3) = xn(n-2)*x(n ) + x(n-1)*x(n-1) + x(n )*x(n-2) y(2n-2) = xn(n-1)*x(n ) + x(n )*x(n-1) y(2n-1) = xn(n )*x(n ) </code></pre> <p>After a closer look, it is clear that almost all <code>xi*xj</code> appears twice (not the first and last one) which means that <code>N*N</code> multiplications can be replaced by <code>(N+1)*(N/2)</code> multiplications. P.S. <code>32bit*32bit = 64bit</code> so the result of every <code>mul+add</code> operation is handled as <code>64+1 bit</code>.</p> <p>Is there a better way to compute this fast? All I found during searches were sqrts algorithms, not sqr...</p> <p><strong>Fast sqr</strong></p> <p>!!! Beware that all numbers in my code are MSW first,... not as in above test (there are LSW first for simplicity of equations, otherwise it would be an index mess).</p> <p>Current functional fsqr implementation</p> <pre><code>void arbnum::sqr(const arbnum &amp;x) { // O((N+1)*N/2) arbnum c; DWORD h, l; int N, nx, nc, i, i0, i1, k; c._alloc(x.siz + x.siz + 1); nx = x.siz - 1; nc = c.siz - 1; N = nx + nx; for (i=0; i&lt;=nc; i++) c.dat[i]=0; for (i=1; i&lt;N; i++) for (i0=0; (i0&lt;=nx) &amp;&amp; (i0&lt;=i); i0++) { i1 = i - i0; if (i0 &gt;= i1) break; if (i1 &gt; nx) continue; h = x.dat[nx-i0]; if (!h) continue; l = x.dat[nx-i1]; if (!l) continue; alu.mul(h, l, h, l); k = nc - i; if (k &gt;= 0) alu.add(c.dat[k], c.dat[k], l); k--; if (k&gt;=0) alu.adc(c.dat[k], c.dat[k],h); k--; for (; (alu.cy) &amp;&amp; (k&gt;=0); k--) alu.inc(c.dat[k]); } c.shl(1); for (i = 0; i &lt;= N; i += 2) { i0 = i&gt;&gt;1; h = x.dat[nx-i0]; if (!h) continue; alu.mul(h, l, h, h); k = nc - i; if (k &gt;= 0) alu.add(c.dat[k], c.dat[k],l); k--; if (k&gt;=0) alu.adc(c.dat[k], c.dat[k], h); k--; for (; (alu.cy) &amp;&amp; (k &gt;= 0); k--) alu.inc(c.dat[k]); } c.bits = c.siz&lt;&lt;5; c.exp = x.exp + x.exp + ((c.siz - x.siz - x.siz)&lt;&lt;5) + 1; c.sig = sig; *this = c; } </code></pre> <p><strong>Use of Karatsuba multiplication</strong></p> <p>(thanks to Calpis)</p> <p>I implemented Karatsuba multiplication but the results are massively slower even than by use of simple <code>O(N^2)</code> multiplication, probably because of that horrible recursion that I can't see any way to avoid. It's trade-off must be at really large numbers (bigger than hundreds of digits) ... but even then there are a lot of memory transfers. Is there a way to avoid recursion calls (non-recursive variant,... Almost all recursive algorithms can be done that way). Still, I will try to tweak things up and see what happens (avoid normalizations, etc..., also it could be some silly mistake in the code). Anyway, after solving Karatsuba for case <code>x*x</code> there is not much performance gain.</p> <p><strong>Optimized Karatsuba multiplication</strong></p> <p>Performance test for <code>y = x^2 looped 1000x times, 0.9 &lt; x &lt; 1 ~ 32*98 bits</code>:</p> <pre><code>x = 0.98765588997654321000000009876... | 98*32 bits sqr [ 213.989 ms ] ... O((N+1)*N/2) fast sqr mul1[ 363.472 ms ] ... O(N^2) classic multiplication mul2[ 349.384 ms ] ... O(3*(N^log2(3))) optimized Karatsuba multiplication mul3[ 9345.127 ms] ... O(3*(N^log2(3))) unoptimized Karatsuba multiplication x = 0.98765588997654321000... | 195*32 bits sqr [ 883.01 ms ] mul1[ 1427.02 ms ] mul2[ 1089.84 ms ] x = 0.98765588997654321000... | 389*32 bits sqr [ 3189.19 ms ] mul1[ 5553.23 ms ] mul2[ 3159.07 ms ] </code></pre> <p>After optimizations for Karatsuba, the code is massively faster than before. Still, for smaller numbers it is slightly less than half speed of my <code>O(N^2)</code> multiplication. For bigger numbers, it is faster with the ratio given by the complexities of Booth multiplications. The threshold for multiplication is around 32*98 bits and for sqr around 32*389 bits, so if the sum of input bits cross this threshold then Karatsuba multiplication will be used for speeding up multiplication and that goes similar for sqr too.</p> <p>BTW, optimizations included:</p> <ul> <li>Minimize heap trashing by too-big recursion argument</li> <li>Avoidance of any bignum aritmetics (+,-) 32-bit ALU with carry is used instead.</li> <li>Ignoring <code>0*y</code> or <code>x*0</code> or <code>0*0</code> cases</li> <li>Reformatting input <code>x,y</code> number sizes to power of two to avoid reallocating</li> <li>Implement modulo multiplication for <code>z1 = (x0 + x1)*(y0 + y1)</code> to minimize recursion</li> </ul> <p><strong>Modified Schönhage-Strassen multiplication to sqr implementation</strong></p> <p>I have tested use of <strong>FFT</strong> and <strong>NTT</strong> transforms to speed up sqr computation. The results are these:</p> <ol> <li><p><strong>FFT</strong></p> <p>Lose accuracy and therefore need high precision complex numbers. This actually slows things down considerably so no speedup is present. The result is not precise (can be wrongly rounded)so <strong>FFT</strong> is unusable (for now)</p></li> <li><p><strong>NTT</strong></p> <p><strong>NTT</strong> is finite field <strong>DFT</strong> and so no accuracy loss occurs. It need modular arithmetics on unsigned integers: <code>modpow, modmul, modadd</code> and <code>modsub</code>.</p> <p>I use <code>DWORD</code> (32bit unsigned integer numbers). The <strong>NTT</strong> input/otput vector size is limited because of overflow issues!!! For 32-bit modular arithmetics, <code>N</code> is limited to <code>(2^32)/(max(input[])^2)</code> so <code>bigint</code> must be divided to smaller chunks (I use <code>BYTES</code> so maximum size of <code>bigint</code> processed is</p> <pre><code>(2^32)/((2^8)^2) = 2^16 bytes = 2^14 DWORDs = 16384 DWORDs) </code></pre> <p>The <code>sqr</code> uses only <code>1xNTT + 1xINTT</code> instead of <code>2xNTT + 1xINTT</code> for multiplication but <strong>NTT</strong> usage is too slow and the threshold number size is too large for practical use in my implementation (for <code>mul</code> and also for <code>sqr</code>).</p> <p>Is possible that is even over the overflow limit so 64-bit modular arithmetics should be used which can slow things down even more. So <strong>NTT</strong> is for my purposes also unusable too.</p></li> </ol> <p><strong>Some measurements:</strong></p> <pre><code>a = 0.98765588997654321000 | 389*32 bits looped 1x times sqr1[ 3.177 ms ] fast sqr sqr2[ 720.419 ms ] NTT sqr mul1[ 5.588 ms ] simpe mul mul2[ 3.172 ms ] karatsuba mul mul3[ 1053.382 ms ] NTT mul </code></pre> <p><strong>My implementation:</strong></p> <pre><code>void arbnum::sqr_NTT(const arbnum &amp;x) { // O(N*log(N)*(log(log(N)))) - 1x NTT // Schönhage-Strassen sqr // To prevent NTT overflow: n &lt;= 48K * 8 bit -&gt; result siz &lt;= 12K * 32 bit -&gt; x.siz + y.siz &lt;= 12K!!! int i, j, k, n; int s = x.sig*x.sig, exp0 = x.exp + x.exp - ((x.siz+x.siz)&lt;&lt;5) + 2; i = x.siz; for (n = 1; n &lt; i; n&lt;&lt;=1) ; if (n + n &gt; 0x3000) { _error(_arbnum_error_TooBigNumber); zero(); return; } n &lt;&lt;= 3; DWORD *xx, *yy, q, qq; xx = new DWORD[n+n]; #ifdef _mmap_h if (xx) mmap_new(xx, (n+n) &lt;&lt; 2); #endif if (xx==NULL) { _error(_arbnum_error_NotEnoughMemory); zero(); return; } yy = xx + n; // Zero padding (and split DWORDs to BYTEs) for (i--, k=0; i &gt;= 0; i--) { q = x.dat[i]; xx[k] = q&amp;0xFF; k++; q&gt;&gt;=8; xx[k] = q&amp;0xFF; k++; q&gt;&gt;=8; xx[k] = q&amp;0xFF; k++; q&gt;&gt;=8; xx[k] = q&amp;0xFF; k++; } for (;k&lt;n;k++) xx[k] = 0; //NTT fourier_NTT ntt; ntt.NTT(yy,xx,n); // init NTT for n // Convolution for (i=0; i&lt;n; i++) yy[i] = modmul(yy[i], yy[i], ntt.p); //INTT ntt.INTT(xx, yy); //suma q=0; for (i = 0, j = 0; i&lt;n; i++) { qq = xx[i]; q += qq&amp;0xFF; yy[n-i-1] = q&amp;0xFF; q&gt;&gt;=8; qq&gt;&gt;=8; q+=qq; } // Merge WORDs to DWORDs and copy them to result _alloc(n&gt;&gt;2); for (i = 0, j = 0; i&lt;siz; i++) { q =(yy[j]&lt;&lt;24)&amp;0xFF000000; j++; q |=(yy[j]&lt;&lt;16)&amp;0x00FF0000; j++; q |=(yy[j]&lt;&lt; 8)&amp;0x0000FF00; j++; q |=(yy[j] )&amp;0x000000FF; j++; dat[i] = q; } #ifdef _mmap_h if (xx) mmap_del(xx); #endif delete xx; bits = siz&lt;&lt;5; sig = s; exp = exp0 + (siz&lt;&lt;5) - 1; // _normalize(); } </code></pre> <p><strong>Conclusion</strong></p> <p>For smaller numbers, it is the best option my fast <code>sqr</code> approach, and after threshold <strong>Karatsuba</strong> multiplication is better. But I still think there should be something trivial which we have overlooked. Has anyone other ideas?</p> <p><strong>NTT optimization</strong></p> <p>After massively-intense optimizations (mostly <strong>NTT</strong>): Stack Overflow question <a href="https://stackoverflow.com/questions/18577076">Modular arithmetics and NTT (finite field DFT) optimizations</a>.</p> <p>Some values have changed:</p> <pre><code>a = 0.98765588997654321000 | 1553*32bits looped 10x times mul2[ 28.585 ms ] Karatsuba mul mul3[ 26.311 ms ] NTT mul </code></pre> <p>So now <strong>NTT</strong> multiplication is finally faster than <strong>Karatsuba</strong> after about 1500*32-bit threshold.</p> <p><strong>Some measurements and bug spotted</strong></p> <pre><code>a = 0.99991970486 | 1553*32 bits looped: 10x sqr1[ 58.656 ms ] fast sqr sqr2[ 13.447 ms ] NTT sqr mul1[ 102.563 ms ] simpe mul mul2[ 28.916 ms ] Karatsuba mul Error mul3[ 19.470 ms ] NTT mul </code></pre> <p>I found out that my <strong>Karatsuba</strong> (over/under)flows the <strong>LSB</strong> of each <code>DWORD</code> segment of bignum. When I have researched, I will update the code...</p> <p>Also, after further <strong>NTT</strong> optimizations the thresholds changed, so for <strong>NTT sqr</strong> it is <code>310*32 bits = 9920 bits</code> of <strong>operand</strong>, and for <strong>NTT mul</strong> it is <code>1396*32 bits = 44672 bits</code> of <strong>result</strong> (sum of bits of operands).</p> <p><strong>Karatsuba code repaired thanks to @greybeard</strong></p> <pre><code>//--------------------------------------------------------------------------- void arbnum::_mul_karatsuba(DWORD *z, DWORD *x, DWORD *y, int n) { // Recursion for Karatsuba // z[2n] = x[n]*y[n]; // n=2^m int i; for (i=0; i&lt;n; i++) if (x[i]) { i=-1; break; } // x==0 ? if (i &lt; 0) for (i = 0; i&lt;n; i++) if (y[i]) { i = -1; break; } // y==0 ? if (i &gt;= 0) { for (i = 0; i &lt; n + n; i++) z[i]=0; return; } // 0.? = 0 if (n == 1) { alu.mul(z[0], z[1], x[0], y[0]); return; } if (n&lt; 1) return; int n2 = n&gt;&gt;1; _mul_karatsuba(z+n, x+n2, y+n2, n2); // z0 = x0.y0 _mul_karatsuba(z , x , y , n2); // z2 = x1.y1 DWORD *q = new DWORD[n&lt;&lt;1], *q0, *q1, *qq; BYTE cx,cy; if (q == NULL) { _error(_arbnum_error_NotEnoughMemory); return; } #define _add { alu.add(qq[i], q0[i], q1[i]); for (i--; i&gt;=0; i--) alu.adc(qq[i], q0[i], q1[i]); } // qq = q0 + q1 ...[i..0] #define _sub { alu.sub(qq[i], q0[i], q1[i]); for (i--; i&gt;=0; i--) alu.sbc(qq[i], q0[i], q1[i]); } // qq = q0 - q1 ...[i..0] qq = q; q0 = x + n2; q1 = x; i = n2 - 1; _add; cx = alu.cy; // =x0+x1 qq = q + n2; q0 = y + n2; q1 = y; i = n2 - 1; _add; cy = alu.cy; // =y0+y1 _mul_karatsuba(q + n, q + n2, q, n2); // =(x0+x1)(y0+y1) mod ((2^N)-1) if (cx) { qq = q + n; q0 = qq; q1 = q + n2; i = n2 - 1; _add; cx = alu.cy; }// += cx*(y0 + y1) &lt;&lt; n2 if (cy) { qq = q + n; q0 = qq; q1 = q; i = n2 -1; _add; cy = alu.cy; }// +=cy*(x0+x1)&lt;&lt;n2 qq = q + n; q0 = qq; q1 = z + n; i = n - 1; _sub; // -=z0 qq = q + n; q0 = qq; q1 = z; i = n - 1; _sub; // -=z2 qq = z + n2; q0 = qq; q1 = q + n; i = n - 1; _add; // z1=(x0+x1)(y0+y1)-z0-z2 DWORD ccc=0; if (alu.cy) ccc++; // Handle carry from last operation if (cx || cy) ccc++; // Handle carry from before last operation if (ccc) { i = n2 - 1; alu.add(z[i], z[i], ccc); for (i--; i&gt;=0; i--) if (alu.cy) alu.inc(z[i]); else break; } delete[] q; #undef _add #undef _sub } //--------------------------------------------------------------------------- void arbnum::mul_karatsuba(const arbnum &amp;x, const arbnum &amp;y) { // O(3*(N)^log2(3)) ~ O(3*(N^1.585)) // Karatsuba multiplication // int s = x.sig*y.sig; arbnum a, b; a = x; b = y; a.sig = +1; b.sig = +1; int i, n; for (n = 1; (n &lt; a.siz) || (n &lt; b.siz); n &lt;&lt;= 1) ; a._realloc(n); b._realloc(n); _alloc(n + n); for (i=0; i &lt; siz; i++) dat[i]=0; _mul_karatsuba(dat, a.dat, b.dat, n); bits = siz &lt;&lt; 5; sig = s; exp = a.exp + b.exp + ((siz-a.siz-b.siz)&lt;&lt;5) + 1; // _normalize(); } //--------------------------------------------------------------------------- </code></pre> <p><strong>My <code>arbnum</code> number representation:</strong></p> <pre><code>// dat is MSDW first ... LSDW last DWORD *dat; int siz,exp,sig,bits; </code></pre> <ul> <li><code>dat[siz]</code> is the mantisa. LSDW means least significant DWORD.</li> <li><code>exp</code> is the exponent of MSB of <code>dat[0]</code></li> <li><p>The first nonzero bit is present in the mantissa!!!</p> <pre><code>// |-----|---------------------------|---------------|------| // | sig | MSB mantisa LSB | exponent | bits | // |-----|---------------------------|---------------|------| // | +1 | 0.(0 ... 0) | 2^0 | 0 | +zero // | -1 | 0.(0 ... 0) | 2^0 | 0 | -zero // |-----|---------------------------|---------------|------| // | +1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | +number // | -1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | -number // |-----|---------------------------|---------------|------| // | +1 | 1.0 | 2^+0x7FFFFFFE | 1 | +infinity // | -1 | 1.0 | 2^+0x7FFFFFFE | 1 | -infinity // |-----|---------------------------|---------------|------| </code></pre></li> </ul>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload