Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>I suppose one should start with a higher level description of the problem.</p> <p>But as a secondary option, I'd suggest swapping the array indices to make it easier to code a blazingly fast SSE inner loop that combines four (possibly distinct) vectors:</p> <pre><code> double times[N+1][nsims], *tend = times[N+1]; double *j,*k,*i,*l; for (j=times[2];j&lt;tend;j+=nsims) for (k=j;k&lt;tend;k+=nsims) if (strcmp( )) ... /* One _can_ move this elsewhere, but why bother? */ for (i=times[2];i&lt;tend;i+=nsims) for (l=i;l&lt;tend;l+=nsims) { interm = efficient_sse_implementation(j,k,i,l, nsims); ... } </code></pre> <p>Miniscule optimization could also be achieved by writing different kernel for the cases where there are less than 4 distinct arrays. (In that case one memory operation per stride can be skipped.)</p> <p><strong>EDIT</strong></p> <p>The structure of pattern <code>for(j=2;j&lt;=N;j++) for (k=j;k&lt;=N;k++)</code> repeats twice in this case and that alone implies a possibility of much higher level optimization -- what is the operation performed? While struggling at that, this pattern still suggests another method: caching the 780 (or so) subproducts, but at the same time performing <em>loop blocking</em>. This approach should not have the same problem than what I commented to mr. gcbenison.</p> <pre><code> for (A=0;A&lt;50000;A+=100) { int k=0; for (i=2;i&lt;=N;i++) for (j=i;j&lt;=N;j++,k++) for (a=0;a&lt;100;a++) precalc[k][a]=times[i][A+a]*times[j][A+a]; for (i=0;i&lt;k;i++) // Now i loops from 0..779 or so for (j=0;j&lt;k;j++) { for (a=0;a&lt;100;a++) partial_product+=precalc[i][a]*precalc[j][a]; // accumulate also partial_product to moment } } </code></pre> <p>Disclaimer: <em>this</em> is untried, but there exists some block size (not necessarily 100) that is optimal (and it can be even worse than the previous thing). Also note, that this approach uses a lot of memory for the precalculated table. (Choosing block size of 100 costs 624000 bytes of memory, which sounds rather good. To get below 256k, the block length can be only 42).</p> <p><strong>EDIT 2</strong>: </p> <p>// Notice that the loop in EDIT_1 calculates both <code>P[2][a]*P[3][a]</code> and <code>P[3][a]*P[2][a]</code>.</p> <pre><code> for (i=0;i&lt;k;i++) // Now i loops from 0..779 or so, but... we can limit the for (j=i;j&lt;k;j++) { // calculation to the upper triangle of the 780^2 matrix for (a=0;a&lt;100;a++) partial_product+=precalc[i][a]*precalc[j][a]; moment[i]+=partial_product; moment[lower_triangle(i)]+=partial_product; // &lt;-- 50% speed increase } </code></pre> <p><strong>EDIT 3:</strong> And here's something to try:</p> <pre><code>gcc -O4 -DCACHELEVEL=2 -DPOPULATE=1 -DARRAY_OPT=1 &amp;&amp; time ./a.out </code></pre> <ul> <li><code>POPULATE</code> initializes the array (assuming that non-zero contents matters) </li> <li><code>ARRAY_OPT=1</code> switches the array indices to (perhaps) better order </li> <li><code>CACHELEVEL=2</code> or 3 toggles in caching of intermediate results. </li> <li><code>STRCMP</code> can be found in the source code to test memcmp vs. strcmp vs '1'</li> </ul> <p><strong>NOT TODO 1</strong>: LOOP_BLOCKING with cached values -- decreases performance<br> <strong>TODO 2</strong>: Upper triangle calculation only<br> <strong>TODO 3</strong>: Find out the meaning of <code>changed_times[n]</code> and <code>moments[0][p]</code><br> - as it stands out now, none of the computations are saved!</p> <pre><code>#include &lt;stddef.h&gt; #define N 40 #define nsims 8000 #if ARRAY_OPT #define TIMES(n,a) times[n][a] double times[N+1][nsims]; // [nsims]; #else #define TIMES(n,a) times[a][n] double times[nsims][N+1]; #endif #define STRCMP 1 // vs. // #define STRCMP1 strcmp(mmethod, "emp")==0 void init() { #ifdef POPULATE int i,a; for (i=0;i&lt;=N;i++) for (a=0;a&lt;nsims;a++) TIMES(i,a) = (double)((i^a)&amp;7) - 3.5; #endif } double moments[4000] = { 0 }; double cache1[nsims]; double cache2[nsims]; int main() { int j,k,i,l,a, pcount=0; init(); int changed_times[N+1]={0}; char *mmethod="emp"; double moment,interm; for(j=2;j&lt;=N;j++) { for(k=j;k&lt;=N;k++) { #if CACHELEVEL == 2 for (a=0;a&lt;nsims;a++) cache1[a]=TIMES(j,a)*TIMES(k,a); #endif moment=0; for(i=2;i&lt;=N;i++) { #if CACHELEVEL == 3 for (a=0;a&lt;nsims;a++) cache2[a]=TIMES(j,a)*TIMES(k,a)*TIMES(i,a); #else for (a=0;a&lt;nsims;a++) cache2[a]=cache1[a]*TIMES(i,a); #endif for(l=i;l&lt;=N;l++) { if(STRCMP) { for(a=0;a&lt;nsims;a++) { #if CACHELEVEL &gt;= 2 interm += (double) cache2[a]*TIMES(l,a); #else interm=interm + (double) TIMES(k,a) * TIMES(j,a) * TIMES(i,a) * TIMES(l,a); #endif } interm = (double) interm/(double)nsims; moment = moment + (interm*i*l); interm=0; } } } //if(!(changed_times[k]==0 // &amp;&amp; changed_times[j]==0 // &amp;&amp; changed_times[l]==0 // &amp;&amp; changed_times[i]==0)) //{ // moments[0][pcount]=(double) moment; // changed_times[k]++;changed_times[j]++; /* or what? */ // changed_times[l]++;changed_times[i]++; //} else { // moments[0][pcount]=moments[0][pcount]; //} pcount++; } } printf("%d %f\n",pcount, moment); } </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload