I changed the code to force `rcoords` and `particle_extended`, and discovered we were losing the lion's share of time within them directly:

```
COST CENTRE        MODULE  %time  %alloc

rcoords            Main     32.6    34.4
particle_extended  Main     21.5    27.2
**^                Main      9.8    12.7
```

The biggest single improvement to this code would clearly be to generate those two constant inputs in a better fashion.

Note that this is basically a lazy, streaming algorithm, and where you're losing time is the sunk cost of allocating at least two 24361803-element arrays all in one go, and then probably allocating at least once or twice more, or giving up sharing. The very best case for this code, I think, with a very good optimizer and a zillion rewrite rules, will be to roughly match the list version (which can also parallelize very easily).

I think dons is right that Ben & co. will be interested in this benchmark, but my overwhelming suspicion is that this is not a good use case for a strict array library, and I suspect that Matlab is hiding some clever optimizations behind its `ngrid` function (optimizations, I'll grant, which it might be useful to port to repa).

**Edit:**

Here's a quick and dirty way to parallelize the list code. Import `Control.Parallel.Strategies` and then write `numberInsideParticles` as:

```haskell
numberInsideParticles particles coords =
    length $ filter id
           $ withStrategy (parListChunk 2000 rseq)
           $ P.map (insideParticles particles) coords
```

This shows good speedup as we scale up cores (12s at one core down to 3.7s at 8), but the overhead of spark creation means that even at 8 cores we only match the single-core non-parallel version. I tried a few alternate strategies and got similar results. Again, I'm not sure how much better we can possibly do than a single-threaded list version here. Since the computations on each individual particle are so cheap, we're mainly stressing allocation, not computation. The big win on something like this, I imagine, would be vectorized computation more than anything else, and as far as I know that pretty much requires hand-coding.

Also note that the parallel version spends roughly 70% of its time in GC, while the one-core version spends about 1% of its time there (i.e. the allocation is, to the extent possible, effectively fused away).
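For reference, here is a minimal, self-contained sketch of that parallel list version. The `Particle` type and the `insideParticles` predicate below are hypothetical stand-ins (the real definitions come from the question's code, which isn't reproduced in this answer), and `P` is assumed to be a qualified import of `Prelude`, as in the original program where repa shadows `map`.

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)
import Prelude                -- keep the unqualified Prelude names in scope
import qualified Prelude as P -- the answer writes P.map; in the original program
                              -- Prelude is qualified because repa has its own map

-- Hypothetical stand-ins: the real Particle type and insideParticles
-- predicate are defined in the question's code, not shown here.
type Coord = (Double, Double, Double)

data Particle = Particle Coord Double   -- centre and radius (assumed)

-- Assumed semantics: a coordinate counts if it lies inside any particle.
insideParticles :: [Particle] -> Coord -> Bool
insideParticles particles (x, y, z) = any inside particles
  where
    inside (Particle (cx, cy, cz) r) =
      let dx = x - cx; dy = y - cy; dz = z - cz
      in  dx*dx + dy*dy + dz*dz <= r*r

-- The parallel version from the answer: map the predicate lazily over the
-- coordinate list, then evaluate the resulting Bool list in chunks of 2000
-- elements, one spark per chunk.
numberInsideParticles :: [Particle] -> [Coord] -> Int
numberInsideParticles particles coords =
    length $ filter id
           $ withStrategy (parListChunk 2000 rseq)
           $ P.map (insideParticles particles) coords

main :: IO ()
main = print (numberInsideParticles ps cs)
  where
    ps = [Particle (0, 0, 0) 1, Particle (5, 5, 5) 2]
    cs = [(x, y, z) | x <- [0, 0.5 .. 9], y <- [0, 0.5 .. 9], z <- [0, 0.5 .. 9]]
```

Compile with `-threaded` (e.g. `ghc -O2 -threaded`) and run with `+RTS -N8` to spread the sparks over eight cores; the chunk size of 2000 is just the value used above and may need tuning for other workloads.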
 
