Note that there are some explanatory texts on larger screens.

plurals
  1. POWhy reading a volatile and writing to a field member is not scalable in Java?
    primarykey
    data
    text
    <p>Observe the following program written in Java (complete runnable version follows, but the important part of the program is in the snippet a little bit further below):</p> <pre><code>import java.util.ArrayList; /** A not easy to explain benchmark. */ class MultiVolatileJavaExperiment { public static void main(String[] args) { (new MultiVolatileJavaExperiment()).mainMethod(args); } int size = Integer.parseInt(System.getProperty("size")); int par = Integer.parseInt(System.getProperty("par")); public void mainMethod(String[] args) { int times = 0; if (args.length == 0) times = 1; else times = Integer.parseInt(args[0]); ArrayList &lt; Long &gt; measurements = new ArrayList &lt; Long &gt; (); for (int i = 0; i &lt; times; i++) { long start = System.currentTimeMillis(); run(); long end = System.currentTimeMillis(); long time = (end - start); System.out.println(i + ") Running time: " + time + " ms"); measurements.add(time); } System.out.println("&gt;&gt;&gt;"); System.out.println("&gt;&gt;&gt; All running times: " + measurements); System.out.println("&gt;&gt;&gt;"); } public void run() { int sz = size / par; ArrayList &lt; Thread &gt; threads = new ArrayList &lt; Thread &gt; (); for (int i = 0; i &lt; par; i++) { threads.add(new Reader(sz)); threads.get(i).start(); } for (int i = 0; i &lt; par; i++) { try { threads.get(i).join(); } catch (Exception e) {} } } final class Foo { int x = 0; } final class Reader extends Thread { volatile Foo vfoo = new Foo(); Foo bar = null; int sz; public Reader(int _sz) { sz = _sz; } public void run() { int i = 0; while (i &lt; sz) { vfoo.x = 1; // with the following line commented // the scalability is almost linear bar = vfoo; // &lt;- makes benchmark 2x slower for 2 processors - why? i++; } } } } </code></pre> <p><strong>Explanation</strong>: The program is actually very simple. It loads integers <code>size</code> and <code>par</code> from the system properties (passed to jvm with the <code>-D</code> flag) - these are the input length and the number of threads to use later. It then parses the first command line argument which says how many time to repeat the program (we want to be sure that the JIT has done its work and have more reliable measurements).</p> <p>The <code>run</code> method is called in each repetition. This method simply starts <code>par</code> threads, each of which will do a loop with <code>size / par</code> iterations. The thread body is defined in the <code>Reader</code> class. Each repetition of the loop reads a volatile member <code>vfoo</code> and assigns <code>1</code> to its public field. After that, <code>vfoo</code> is read once again and assigned to a <em>non-volatile</em> field <code>bar</code>.</p> <p>Notice how most of the time the program is executing the loop body, so the <code>run</code> in the thread is the focus of this benchmark:</p> <pre><code> final class Reader extends Thread { volatile Foo vfoo = new Foo(); Foo bar = null; int sz; public Reader(int _sz) { sz = _sz; } public void run() { int i = 0; while (i &lt; sz) { vfoo.x = 1; // with the following line commented // the scalability is almost linear bar = vfoo; // &lt;- makes benchmark 2x slower for 2 processors - why? i++; } } } </code></pre> <p><strong>Observations</strong>: Running <code>java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10</code> on an </p> <pre><code>Ubuntu Server 10.04.3 LTS 8 core Intel(R) Xeon(R) CPU X5355 @2.66GHz ~20GB ram java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) </code></pre> <p>I get the following times:</p> <pre><code>&gt;&gt;&gt; All running times: [821, 750, 1011, 750, 758, 755, 1219, 751, 751, 1012] </code></pre> <p>Now, setting <code>-Dpar=2</code>, I get:</p> <pre><code>&gt;&gt;&gt; All running times: [1618, 380, 1476, 1245, 1390, 1391, 1445, 1393, 1511, 1508] </code></pre> <p>Apparently, this doesn't scale for some reason - I would have expected the second output to be twice as fast (although it does seem to be in one of the early iterations - <code>380ms</code>).</p> <p>Interestingly, commenting out the line <code>bar = vfoo</code> (which isn't even supposed to be a volatile write), yields the following times for <code>-Dpar</code> set to <code>1,2,4,8</code>.</p> <pre><code>&gt;&gt;&gt; All running times: [762, 563, 563, 563, 563, 563, 570, 566, 563, 563] &gt;&gt;&gt; All running times: [387, 287, 285, 284, 283, 281, 282, 282, 281, 282] &gt;&gt;&gt; All running times: [204, 146, 143, 142, 141, 141, 141, 141, 141, 141] &gt;&gt;&gt; All running times: [120, 78, 74, 74, 81, 75, 73, 73, 72, 71] </code></pre> <p>It scales perfectly.</p> <p><strong>Analysis</strong>: First of all, there are no garbage collection cycles occuring here (I've added <code>-verbose:gc</code> as well to check this).</p> <p>I get similar results on my iMac.</p> <p>Each thread is writing to its own field, and different <code>Foo</code> object instances belonging to different threads don't appear to be ending up in the same cachelines - adding more members into <code>Foo</code> to increase its size doesn't change the measurements. Each thread object instance has more than enough fields to fill up the L1 cache line. So this probably isn't a memory issue.</p> <p>My next thought was that the <code>JIT</code> might be doing something weird, because the early iterations usually <em>do</em> scale as expected in the uncommented version, so I checked this out by printing the assembly (see <a href="https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly">this post on how to do that</a>).</p> <pre><code>java -Xmx512m -Xms512m -server -XX:CompileCommand=print,*Reader.run MultiVolatileJavaExperiment -Dsize=500000000 -Dpar=1 10 </code></pre> <p>and I get these 2 outputs for the 2 versions for the Jitted method <code>run</code> in <code>Reader</code>. The commented (properly scalable) version:</p> <pre><code>[Verified Entry Point] 0xf36c9fac: mov %eax,-0x3000(%esp) 0xf36c9fb3: push %ebp 0xf36c9fb4: sub $0x8,%esp 0xf36c9fba: mov 0x68(%ecx),%ebx 0xf36c9fbd: test %ebx,%ebx 0xf36c9fbf: jle 0xf36c9fec 0xf36c9fc1: xor %ebx,%ebx 0xf36c9fc3: nopw 0x0(%eax,%eax,1) 0xf36c9fcc: xchg %ax,%ax 0xf36c9fd0: mov 0x6c(%ecx),%ebp 0xf36c9fd3: test %ebp,%ebp 0xf36c9fd5: je 0xf36c9ff7 0xf36c9fd7: movl $0x1,0x8(%ebp) --------------------------------------------- 0xf36c9fde: mov 0x68(%ecx),%ebp 0xf36c9fe1: inc %ebx ; OopMap{ecx=Oop off=66} ;*goto ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@21 (line 83) --------------------------------------------- 0xf36c9fe2: test %edi,0xf7725000 ; {poll} 0xf36c9fe8: cmp %ebp,%ebx 0xf36c9fea: jl 0xf36c9fd0 0xf36c9fec: add $0x8,%esp 0xf36c9fef: pop %ebp 0xf36c9ff0: test %eax,0xf7725000 ; {poll_return} 0xf36c9ff6: ret 0xf36c9ff7: mov $0xfffffff6,%ecx 0xf36c9ffc: xchg %ax,%ax 0xf36c9fff: call 0xf36a56a0 ; OopMap{off=100} ;*putfield x ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79) ; {runtime_call} 0xf36ca004: call 0xf6f877a0 ; {runtime_call} </code></pre> <p>The uncommented <code>bar = vfoo</code> (non-scalable, slower) version:</p> <pre><code>[Verified Entry Point] 0xf3771aac: mov %eax,-0x3000(%esp) 0xf3771ab3: push %ebp 0xf3771ab4: sub $0x8,%esp 0xf3771aba: mov 0x68(%ecx),%ebx 0xf3771abd: test %ebx,%ebx 0xf3771abf: jle 0xf3771afe 0xf3771ac1: xor %ebx,%ebx 0xf3771ac3: nopw 0x0(%eax,%eax,1) 0xf3771acc: xchg %ax,%ax 0xf3771ad0: mov 0x6c(%ecx),%ebp 0xf3771ad3: test %ebp,%ebp 0xf3771ad5: je 0xf3771b09 0xf3771ad7: movl $0x1,0x8(%ebp) ------------------------------------------------- 0xf3771ade: mov 0x6c(%ecx),%ebp 0xf3771ae1: mov %ebp,0x70(%ecx) 0xf3771ae4: mov 0x68(%ecx),%edi 0xf3771ae7: inc %ebx 0xf3771ae8: mov %ecx,%eax 0xf3771aea: shr $0x9,%eax 0xf3771aed: movb $0x0,-0x3113c300(%eax) ; OopMap{ecx=Oop off=84} ;*goto ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@29 (line 83) ----------------------------------------------- 0xf3771af4: test %edi,0xf77ce000 ; {poll} 0xf3771afa: cmp %edi,%ebx 0xf3771afc: jl 0xf3771ad0 0xf3771afe: add $0x8,%esp 0xf3771b01: pop %ebp 0xf3771b02: test %eax,0xf77ce000 ; {poll_return} 0xf3771b08: ret 0xf3771b09: mov $0xfffffff6,%ecx 0xf3771b0e: nop 0xf3771b0f: call 0xf374e6a0 ; OopMap{off=116} ;*putfield x ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79) ; {runtime_call} 0xf3771b14: call 0xf70307a0 ; {runtime_call} </code></pre> <p>The differences in the two versions are within <code>---------</code>. I expected to find synchronization instructions in the assembly which might account for the performance issue - while few extra <code>shift</code>, <code>mov</code> and <code>inc</code> instructions might affect absolute performance numbers, I don't see how they could affect scalability.</p> <p>So, I suspect that this is some sort of a memory issue related to storing to a field in the class. On the other hand, I'm also inclined to believe that the JIT does something funny, because in one iteration the measured time <em>is</em> twice as fast, as it should be.</p> <p>Can anyone explain what is going on here? Please be precise and include references that support your claims.</p> <p>Thank you!</p> <p>EDIT:</p> <p>Here's the bytecode for the fast (scalable) version:</p> <pre><code>public void run(); LineNumberTable: line 77: 0 line 78: 2 line 79: 10 line 83: 18 line 85: 24 Code: Stack=2, Locals=2, Args_size=1 0: iconst_0 1: istore_1 2: iload_1 3: aload_0 4: getfield #7; //Field sz:I 7: if_icmpge 24 10: aload_0 11: getfield #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo; 14: iconst_1 15: putfield #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I 18: iinc 1, 1 21: goto 2 24: return LineNumberTable: line 77: 0 line 78: 2 line 79: 10 line 83: 18 line 85: 24 StackMapTable: number_of_entries = 2 frame_type = 252 /* append */ offset_delta = 2 locals = [ int ] frame_type = 21 /* same */ </code></pre> <p>The slow (non-scalable) version with <code>bar = vfoo</code>:</p> <pre><code>public void run(); LineNumberTable: line 77: 0 line 78: 2 line 79: 10 line 82: 18 line 83: 26 line 85: 32 Code: Stack=2, Locals=2, Args_size=1 0: iconst_0 1: istore_1 2: iload_1 3: aload_0 4: getfield #7; //Field sz:I 7: if_icmpge 32 10: aload_0 11: getfield #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo; 14: iconst_1 15: putfield #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I 18: aload_0 19: aload_0 20: getfield #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo; 23: putfield #6; //Field bar:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo; 26: iinc 1, 1 29: goto 2 32: return LineNumberTable: line 77: 0 line 78: 2 line 79: 10 line 82: 18 line 83: 26 line 85: 32 StackMapTable: number_of_entries = 2 frame_type = 252 /* append */ offset_delta = 2 locals = [ int ] frame_type = 29 /* same */ </code></pre> <p>The more I am experimenting with this, it seems to me that this has nothing to do with volatiles at all - it has something to do with writing to object fields. My hunch is that this is somehow a memory contention issue - something with caches and false sharing, although there is no explicit synchronization at all.</p> <p>EDIT 2:</p> <p>Interestingly, changing the program like this:</p> <pre><code>final class Holder { public Foo bar = null; } final class Reader extends Thread { volatile Foo vfoo = new Foo(); Holder holder = null; int sz; public Reader(int _sz) { sz = _sz; } public void run() { int i = 0; holder = new Holder(); while (i &lt; sz) { vfoo.x = 1; holder.bar = vfoo; i++; } } } </code></pre> <p>resolves the scaling issue. Apparently, the <code>Holder</code> object above gets created after the thread is started, and is probably allocated in a different segment of memory, which is then being modified concurrently, as opposed to modifying the field <code>bar</code> in the thread object, which is somehow "close" in memory between different thread instances.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload