Note that there are some explanatory texts on larger screens.

plurals
  1. PO5x Performance with Parallel.For... on a Dual Core?
    primarykey
    data
    text
    <p>I was doing some experimental calculations for fun, when I came across an interesting result:</p> <pre><code>Completed 1024x1024 pixels with 700 points in... For Loop (Inline): 19636ms For Loop: 12612ms Parallel.For Loop: 3835ms </code></pre> <p>Which is not what I expected.</p> <p>System: Windows 7 64, i3 2120 [dual core, 4 threads], Visual Studio 2010.</p> <p>Build : Optimization's on, Release mode [no debugger], 32 Bit.</p> <p>Of secondary interest is the disappointing 64 bit performance. While it's more inline of what I'd expect in terms of ratio's it accomplishes this by being slower across the board.</p> <pre><code>Completed 1024x1024 pixels with 700 points in... For Loop (Inline): 23409ms For Loop: 24373ms Parallel.For Loop: 6839ms </code></pre> <p>The calculation is simple: For the indices x &amp; y find the closest Vector3 and store it in 2D array.</p> <p>The question, if you dare, is to try to explain why the inline for loop is so slow. Bonus points for explaining the 64bit versions lack of performance.</p> <pre><code>using System; using System.Diagnostics; using System.Threading.Tasks; namespace TextureFromPoints { class Program { const int numPoints = 700; const int textureSize = 1024; static Random rnd = new Random(); static void Main(string[] args) { while (true) { Console.WriteLine("Starting"); Console.WriteLine(); var pointCloud = new Vector3[numPoints]; for (int i = 0; i &lt; numPoints; i++) pointCloud[i] = new Vector3(textureSize); var result1 = new Vector3[textureSize, textureSize]; var result2 = new Vector3[textureSize, textureSize]; var result3 = new Vector3[textureSize, textureSize]; var sw1 = Stopwatch.StartNew(); for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; var nearestV3Distance = nearestV3.DistanceToPoint(targetPos); for (int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; var currentV3Distance = currentV3.DistanceToPoint(targetPos); if (currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result1[x, y] = nearestV3; } sw1.Stop(); var sw2 = Stopwatch.StartNew(); for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result2, x, y); sw2.Stop(); var sw3 = Stopwatch.StartNew(); Parallel.For(0, textureSize, x =&gt; { for (int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result3, x, y); }); sw3.Stop(); Console.WriteLine("Completed {0}x{0} pixels with {1} points in...", textureSize, numPoints); Console.WriteLine("{0}: {1}ms", "For Loop (Inline)", sw1.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "For Loop", sw2.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "Parallel.For Loop", sw3.ElapsedMilliseconds); Console.WriteLine(); Console.Write("Verifying Data: "); Console.WriteLine(CheckResults(result1, result2) &amp;&amp; CheckResults(result1, result3) ? "Valid" : "Error"); Console.WriteLine(); Console.WriteLine(); Console.ReadLine(); } } private static bool CheckResults(Vector3[,] lhs, Vector3[,] rhs) { for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) if (!lhs[x, y].Equals(rhs[x, y])) return false; return true; } private static void Computation(Vector3[] pointCloud, Vector3[,] result, int x, int y) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; var nearestV3Distance = nearestV3.DistanceToPoint(targetPos); for (int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; var currentV3Distance = currentV3.DistanceToPoint(targetPos); if (currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result[x, y] = nearestV3; } struct Vector3 { public float x; public float y; public float z; public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; } public Vector3(float randomDistance) { this.x = (float)rnd.NextDouble() * randomDistance; this.y = (float)rnd.NextDouble() * randomDistance; this.z = (float)rnd.NextDouble() * randomDistance; } public static Vector3 operator -(Vector3 a, Vector3 b) { return new Vector3(a.x - b.x, a.y - b.y, a.z - b.z); } public float sqrMagnitude() { return x * x + y * y + z * z; } public float DistanceToPoint(Vector3 point) { return (this - point).sqrMagnitude(); } } } } </code></pre> <p>UPDATE: Thanks to the efforts of <strong>Drew Marsh</strong> we now have this super optimized version that inlines all the V3 operations.</p> <pre><code>using System; using System.Diagnostics; using System.Threading.Tasks; namespace TextureFromPoints { class RevisedProgram { const int numPoints = 700; const int textureSize = 1024; static Random rnd = new Random(); static void Main(string[] args) { while (true) { Console.WriteLine("Starting REVISED"); Console.WriteLine(); var pointCloud = new Vector3[numPoints]; for (int i = 0; i &lt; numPoints; i++) pointCloud[i] = new Vector3(textureSize); var result1 = new Vector3[textureSize, textureSize]; var result2 = new Vector3[textureSize, textureSize]; var result3 = new Vector3[textureSize, textureSize]; var sw1 = Inline(pointCloud, result1); var sw2 = NotInline(pointCloud, result2); var sw3 = Parallelized(pointCloud, result3); Console.WriteLine("Completed {0}x{0} pixels with {1} points in...", textureSize, numPoints); Console.WriteLine("{0}: {1}ms", "For Loop (Inline)", sw1.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "For Loop", sw2.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "Parallel.For Loop", sw3.ElapsedMilliseconds); Console.WriteLine(); Console.Write("Verifying Data: "); Console.WriteLine(CheckResults(result1, result2) &amp;&amp; CheckResults(result1, result3) ? "Valid" : "Error"); Console.WriteLine(); Console.WriteLine(); Console.ReadLine(); } } private static Stopwatch Parallelized(Vector3[] pointCloud, Vector3[,] result3) { var sw3 = Stopwatch.StartNew(); Parallel.For(0, textureSize, x =&gt; { for (int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result3, x, y); }); sw3.Stop(); return sw3; } private static Stopwatch NotInline(Vector3[] pointCloud, Vector3[,] result2) { var sw2 = Stopwatch.StartNew(); for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result2, x, y); sw2.Stop(); return sw2; } private static Stopwatch Inline(Vector3[] pointCloud, Vector3[,] result1) { var sw1 = Stopwatch.StartNew(); for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; Vector3 temp1 = new Vector3(nearestV3.x - targetPos.x, nearestV3.y - targetPos.y, nearestV3.z - targetPos.z); var nearestV3Distance = temp1.x * temp1.x + temp1.y * temp1.y + temp1.z * temp1.z; for (int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; Vector3 temp2 = new Vector3(currentV3.x - targetPos.x, currentV3.y - targetPos.y, currentV3.z - targetPos.z); var currentV3Distance = temp2.x * temp2.x + temp2.y * temp2.y + temp2.z * temp2.z; if (currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result1[x, y] = nearestV3; } sw1.Stop(); return sw1; } private static bool CheckResults(Vector3[,] lhs, Vector3[,] rhs) { for (int x = 0; x &lt; textureSize; x++) for (int y = 0; y &lt; textureSize; y++) if (!lhs[x, y].Equals(rhs[x, y])) return false; return true; } private static void Computation(Vector3[] pointCloud, Vector3[,] result, int x, int y) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; Vector3 temp1 = new Vector3(nearestV3.x - targetPos.x, nearestV3.y - targetPos.y, nearestV3.z - targetPos.z); var nearestV3Distance = temp1.x * temp1.x + temp1.y * temp1.y + temp1.z * temp1.z; for (int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; Vector3 temp2 = new Vector3(currentV3.x - targetPos.x, currentV3.y - targetPos.y, currentV3.z - targetPos.z); var currentV3Distance = temp2.x * temp2.x + temp2.y * temp2.y + temp2.z * temp2.z; if (currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result[x, y] = nearestV3; } struct Vector3 { public float x; public float y; public float z; public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; } public Vector3(float randomDistance) { this.x = (float)rnd.NextDouble() * randomDistance; this.y = (float)rnd.NextDouble() * randomDistance; this.z = (float)rnd.NextDouble() * randomDistance; } } } } </code></pre> <p>And it gives the following results:</p> <p>x86</p> <pre><code>Completed 1024x1024 pixels with 700 points in... For Loop (Inline): 3820ms For Loop: 3962ms Parallel.For Loop: 1681ms </code></pre> <p>x64</p> <pre><code>Completed 1024x1024 pixels with 700 points in... For Loop (Inline): 10978ms For Loop: 10924ms Parallel.For Loop: 3073ms </code></pre> <p>So the good news is that we can drastically increase the performance of this code - and get the single threaded version to be operating at a speed somewhat in keeping with its parallel cousin.</p> <p>The bad news is that this means ditching x64 entirely and manually in-lining all math.</p> <p>At this stage, I'm very disappointed in the performance of the compilers - I expected them to be much better.</p> <p><strong>Conclusion</strong></p> <p>This is fubar and sad... and while we don't really know why we can make an educated guess to it being caused by a stupid compiler/s. 24s to 3.8s simply by changing the compiler from x64 to x86 and doing some manual in-lining is not what I'd expect. However I've finished off the proof of concept I was writing, and thanks to a simple spacial hash I can compute a 1024 by 1024 image with 70,000 'points' in 0.7s - ~340000% faster than that of my original x64 scenario and with no threading or in-lining. As such I've accepted an answer - the immediate need is gone, though I'l be still looking into the issue.</p> <p>The code is available <a href="http://pastebin.com/dACaZ8dE" rel="nofollow">here</a> and <a href="http://pastebin.com/KWkfDmVT" rel="nofollow">here</a> - it generates a nice Voronoi diagram as a side effect :P</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload