Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p><strong>All data from 8 core i7, Win7, x64</strong></p> <p>It's surprising that you get 5x for sure. One problem with this test as you've written it is that you've put all three approaches in your Main method which is forcing gobblygook that the compiler has to create and keep synched to fulfill the needs of the closure used in the <code>Parallel.For</code> is getting in the way of the inline method. If you break out the work as follows you will see significantly faster performance in all three implementations... for x86 at least:</p> <p><strong>Before x86:</strong></p> <pre><code>For Loop (Inline): 24313ms For Loop: 25236ms Parallel.For Loop: 3840ms </code></pre> <p><strong>After x86:</strong></p> <pre><code>For Loop (Inline): 13007ms For Loop: 13013ms Parallel.For Loop: 2208ms </code></pre> <p>So, looking at my x86 Parallel.For results, you see it scales at about ~5.9x and each version is much quicker when isolated.</p> <p>Next, it's interesting to note that there's absolutely no gain in x64 after this same change. In fact, it ended just a little higher in each run on 2 of 3 tests consistently.</p> <p><strong>Before x64</strong></p> <pre><code>For Loop (Inline): 24222ms For Loop: 25197ms Parallel.For Loop: 3810ms </code></pre> <p><strong>After x64</strong></p> <pre><code>For Loop (Inline): 25302ms For Loop: 25209ms Parallel.For Loop: 3821ms </code></pre> <p>I don't have a direct answer why why x64 would be so bad other than the fact that people consistently come up with code like this that makes the x64 JIT look bad, so maybe someone else can chime in on that. </p> <p>That said I do have one other thing you might want to consider looking into in such an implementation: cache line invalidation. <a href="http://msdn.microsoft.com/en-us/magazine/cc872851.aspx" rel="noreferrer">There is an awesome MSDN article here</a> written by @StephenToub that explains what this is all about. The TL;DR; of it is that, because all your data is stored in one array and diff. cores with different local (L2) caches are going to modify parts of that array they have to synchronize the data with the other cores with whom they overlap. If the sections the diff. cores are working on are too close together you're going to end up with a lot of these synchronizations which can eat into your parallel gains. The article shows a technique where you actually allocate extra space in your working array sufficient enough to separate the actual sections containing the data you're going to work on so that when those cores work on the data they don't have to invalidate the other cores. of the for loop rather than being closer to 8x than that. I would bet if you put in the work to address any cache line invalidation that you could squeeze another 10%+ out of it. Just remember there's always some overhead in setting up and coordinating the parallel work so you'll never get 100% perfection.</p> <p>Here's the revised version of your program with each approach factored into separate methods:</p> <pre><code>using System; using System.Diagnostics; using System.Threading.Tasks; namespace TextureFromPoints { class RevisedProgram { const int numPoints = 700; const int textureSize = 1024; static Random rnd = new Random(); static void Main(string[] args) { while(true) { Console.WriteLine("Starting REVISED"); Console.WriteLine(); var pointCloud = new Vector3[numPoints]; for(int i = 0; i &lt; numPoints; i++) pointCloud[i] = new Vector3(textureSize); var result1 = new Vector3[textureSize, textureSize]; var result2 = new Vector3[textureSize, textureSize]; var result3 = new Vector3[textureSize, textureSize]; var sw1 = Inline(pointCloud, result1); var sw2 = NotInline(pointCloud, result2); var sw3 = Parallelized(pointCloud, result3); Console.WriteLine("Completed {0}x{0} pixels with {1} points in...", textureSize, numPoints); Console.WriteLine("{0}: {1}ms", "For Loop (Inline)", sw1.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "For Loop", sw2.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "Parallel.For Loop", sw3.ElapsedMilliseconds); Console.WriteLine(); Console.Write("Verifying Data: "); Console.WriteLine(CheckResults(result1, result2) &amp;&amp; CheckResults(result1, result3) ? "Valid" : "Error"); Console.WriteLine(); Console.WriteLine(); Console.ReadLine(); } } private static Stopwatch Parallelized(Vector3[] pointCloud, Vector3[,] result3) { var sw3 = Stopwatch.StartNew(); Parallel.For(0, textureSize, x =&gt; { for(int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result3, x, y); }); sw3.Stop(); return sw3; } private static Stopwatch NotInline(Vector3[] pointCloud, Vector3[,] result2) { var sw2 = Stopwatch.StartNew(); for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result2, x, y); sw2.Stop(); return sw2; } private static Stopwatch Inline(Vector3[] pointCloud, Vector3[,] result1) { var sw1 = Stopwatch.StartNew(); for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; var nearestV3Distance = nearestV3.DistanceToPoint(targetPos); for(int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; var currentV3Distance = currentV3.DistanceToPoint(targetPos); if(currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result1[x, y] = nearestV3; } sw1.Stop(); return sw1; } private static bool CheckResults(Vector3[,] lhs, Vector3[,] rhs) { for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) if(!lhs[x, y].Equals(rhs[x, y])) return false; return true; } private static void Computation(Vector3[] pointCloud, Vector3[,] result, int x, int y) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; var nearestV3Distance = nearestV3.DistanceToPoint(targetPos); for(int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; var currentV3Distance = currentV3.DistanceToPoint(targetPos); if(currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result[x, y] = nearestV3; } struct Vector3 { public float x; public float y; public float z; public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; } public Vector3(float randomDistance) { this.x = (float)rnd.NextDouble() * randomDistance; this.y = (float)rnd.NextDouble() * randomDistance; this.z = (float)rnd.NextDouble() * randomDistance; } public static Vector3 operator -(Vector3 a, Vector3 b) { return new Vector3(a.x - b.x, a.y - b.y, a.z - b.z); } public float sqrMagnitude() { return x * x + y * y + z * z; } public float DistanceToPoint(Vector3 point) { return (this - point).sqrMagnitude(); } } } } </code></pre> <h2>Update:</h2> <p>Based on what Feng Yuan pointed out about the methods not being inlined by the x64 JIT, you can change the program to do the calculations inline instead and get better performance out of the x64 version than the x86 version. This obviously sucks, but this is the kind of thing that I've seen the x64 JIT destroy before. Here's the new numbers:</p> <p><strong>After inlining x64:</strong></p> <pre><code>For Loop (Inline): 19032ms For Loop: 19209ms Parallel.For Loop: 3015ms </code></pre> <p>Inlined version of the code:</p> <pre><code>using System; using System.Diagnostics; using System.Threading.Tasks; namespace TextureFromPoints { class RevisedProgram { const int numPoints = 700; const int textureSize = 1024; static Random rnd = new Random(); static void Main(string[] args) { while(true) { Console.WriteLine("Starting REVISED"); Console.WriteLine(); var pointCloud = new Vector3[numPoints]; for(int i = 0; i &lt; numPoints; i++) pointCloud[i] = new Vector3(textureSize); var result1 = new Vector3[textureSize, textureSize]; var result2 = new Vector3[textureSize, textureSize]; var result3 = new Vector3[textureSize, textureSize]; var sw1 = Inline(pointCloud, result1); var sw2 = NotInline(pointCloud, result2); var sw3 = Parallelized(pointCloud, result3); Console.WriteLine("Completed {0}x{0} pixels with {1} points in...", textureSize, numPoints); Console.WriteLine("{0}: {1}ms", "For Loop (Inline)", sw1.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "For Loop", sw2.ElapsedMilliseconds); Console.WriteLine("{0}: {1}ms", "Parallel.For Loop", sw3.ElapsedMilliseconds); Console.WriteLine(); Console.Write("Verifying Data: "); Console.WriteLine(CheckResults(result1, result2) &amp;&amp; CheckResults(result1, result3) ? "Valid" : "Error"); Console.WriteLine(); Console.WriteLine(); Console.ReadLine(); } } private static Stopwatch Parallelized(Vector3[] pointCloud, Vector3[,] result3) { var sw3 = Stopwatch.StartNew(); Parallel.For(0, textureSize, x =&gt; { for(int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result3, x, y); }); sw3.Stop(); return sw3; } private static Stopwatch NotInline(Vector3[] pointCloud, Vector3[,] result2) { var sw2 = Stopwatch.StartNew(); for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) Computation(pointCloud, result2, x, y); sw2.Stop(); return sw2; } private static Stopwatch Inline(Vector3[] pointCloud, Vector3[,] result1) { var sw1 = Stopwatch.StartNew(); for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; Vector3 temp1 = nearestV3 - targetPos; var nearestV3Distance = temp1.x * temp1.x + temp1.y * temp1.y + temp1.z * temp1.z; for(int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; Vector3 temp2 = currentV3 - targetPos; var currentV3Distance = temp2.x * temp2.x + temp2.y * temp2.y + temp2.z * temp2.z; if(currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result1[x, y] = nearestV3; } sw1.Stop(); return sw1; } private static bool CheckResults(Vector3[,] lhs, Vector3[,] rhs) { for(int x = 0; x &lt; textureSize; x++) for(int y = 0; y &lt; textureSize; y++) if(!lhs[x, y].Equals(rhs[x, y])) return false; return true; } private static void Computation(Vector3[] pointCloud, Vector3[,] result, int x, int y) { var targetPos = new Vector3(x, y, 0); var nearestV3 = pointCloud[0]; Vector3 temp1 = nearestV3 - targetPos; var nearestV3Distance = temp1.x * temp1.x + temp1.y * temp1.y + temp1.z * temp1.z; for(int i = 1; i &lt; numPoints; i++) { var currentV3 = pointCloud[i]; Vector3 temp2 = currentV3 - targetPos; var currentV3Distance = temp2.x * temp2.x + temp2.y * temp2.y + temp2.z * temp2.z; if(currentV3Distance &lt; nearestV3Distance) { nearestV3 = currentV3; nearestV3Distance = currentV3Distance; } } result[x, y] = nearestV3; } private static float DistanceToPoint(Vector3 vector, Vector3 point) { Vector3 final = vector - point; return final.x * final.x + final.y * final.y + final.z * final.z; } struct Vector3 { public float x; public float y; public float z; public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; } public Vector3(float randomDistance) { this.x = (float)rnd.NextDouble() * randomDistance; this.y = (float)rnd.NextDouble() * randomDistance; this.z = (float)rnd.NextDouble() * randomDistance; } public static Vector3 operator -(Vector3 a, Vector3 b) { return new Vector3(a.x - b.x, a.y - b.y, a.z - b.z); } } } } </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload