
# Array allocation and access on the Java Virtual Machine and memory contention

Observe the following definition of a thread subclass (the entire runnable Java source file is included at the end of the question for your convenience):

```java
final class Worker extends Thread {
    Foo[] array = new Foo[1024];
    int sz;

    public Worker(int _sz) {
        sz = _sz;
    }

    public void run() {
        //Foo[] arr = new Foo[1024];
        Foo[] arr = array;
        loop(arr);
    }

    public void loop(Foo[] arr) {
        int i = 0;
        int pos = 512;
        Foo v = new Foo();
        while (i < sz) {
            if (i % 2 == 0) {
                arr[pos] = v;
                pos += 1;
            } else {
                pos -= 1;
                v = arr[pos];
            }
            i++;
        }
    }
}
```

**Explanation**: The program starts `-Dpar` such threads and sets the `sz` of each thread to `-Dsize / -Dpar`, where `-Dsize` and `-Dpar` are set through the command line when running the program. Each thread object has a field `array`, which is initialized with a fresh 1024-element array. The idea is to divide the same total amount of work among a varying number of threads - we expect the program to scale.

Each thread is then started and the time needed for all the threads to complete is measured. We do multiple measurements to counter any JIT-related effects, as shown below. Each thread runs a loop. Within the loop, the thread writes the element at position 512 in the array in even iterations and reads the same element back in odd iterations. Only local variables are modified otherwise.

The full program is below.

**Analysis**:

Tested with `-verbose:gc` - there is no garbage collection occurring during the run of this program.

Run command:

```
java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 org.scalapool.bench.MultiStackJavaExperiment 7
```

CASE 1: Running times for 1, 2, 4, 8 threads, in that order (7 repetitions each):

```
>>> All running times: [2149, 2227, 1974, 1948, 1803, 2283, 1878]
>>> All running times: [1140, 1124, 2022, 1141, 2028, 2004, 2136]
>>> All running times: [867, 1022, 1457, 1342, 1436, 966, 1531]
>>> All running times: [915, 864, 1245, 1243, 948, 790, 1007]
```

My thought was that the nonlinear scaling is due to memory contention. Incidentally, early repetitions actually do better - this might be because in different repetitions the arrays are allocated in different memory areas.

CASE 2: Next, I comment out the `Foo[] arr = array` line in the `run` method of the thread and allocate a new array in the `run` method itself: `Foo[] arr = new Foo[1024]`. Measurements:

```
>>> All running times: [2053, 1966, 2089, 1937, 2046, 1909, 2011]
>>> All running times: [1048, 1178, 1100, 1194, 1367, 1271, 1207]
>>> All running times: [578, 508, 589, 571, 617, 643, 645]
>>> All running times: [330, 299, 300, 322, 331, 324, 575]
```

This time, everything scales pretty much as expected. I wouldn't have imagined that the allocation site of the array plays any role whatsoever, but obviously it does somehow. My thought was that the arrays were previously allocated so close to each other that some memory contention started happening.
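One way to check how close together those arrays actually end up (a sketch I have not run as part of the measurements above; it assumes the OpenJDK JOL tool, dependency `org.openjdk.jol:jol-core`, whose `VM.current().addressOf` reports an object's current address) is to allocate several arrays from a single thread, the way `main` constructs the `Worker` objects, and print where they land:

```java
import org.openjdk.jol.vm.VM;

class ArrayAddresses {
    static final class Foo { int x = 0; }

    public static void main(String[] args) {
        // Allocate several arrays back to back from one thread,
        // mimicking what happens when main() constructs the Workers.
        Foo[][] arrays = new Foo[4][];
        for (int i = 0; i < arrays.length; i++) {
            arrays[i] = new Foo[1024];
        }
        // Print the current address of each array. A moving collector can
        // relocate objects, so these values are only a snapshot.
        for (int i = 0; i < arrays.length; i++) {
            System.out.println("arrays[" + i + "] at "
                + VM.current().addressOf(arrays[i]));
        }
    }
}
```

If consecutive addresses differ by roughly the size of one array, the arrays sit back to back in memory.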
CASE 3: To verify this assumption, I've uncommented the line `Foo[] arr = array` again, but this time initialized the `array` field to `new Foo[32000]`, to ensure that the locations in memory being written to are sufficiently far from each other. So, here we're again using the array allocated during the creation of the thread object; the only difference from CASE 1 is that the array is bigger.

```
>>> All running times: [2113, 1983, 2430, 2485, 2333, 2359, 2463]
>>> All running times: [1172, 1106, 1163, 1181, 1142, 1169, 1188]
>>> All running times: [578, 677, 614, 604, 583, 637, 597]
>>> All running times: [343, 327, 320, 330, 353, 320, 320]
```

So, memory contention seems to be the cause of this.

The platform information:

```
Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU X5355 @ 2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
```

**Question**: This is obviously a memory-contention issue. But why is it happening?

1. Is escape analysis kicking in? If so, does that mean the entire array is allocated on the stack when it is created in the `run` method in CASE 2? What are the exact conditions for this runtime optimization? Surely the array is not allocated on the stack if it has a million elements? (A possible way to probe this is sketched after this list.)

2. Even if the array is allocated on the stack rather than on the heap, two array accesses by different threads should be separated by at least 512 * 4 bytes = 2 KB even in CASE 1, wherever the arrays are! That's definitely larger than any L1 cache line. If these effects are due to false sharing, how can writes to several totally independent cache lines affect performance this much? (One assumption here is that each array occupies a contiguous block of memory on the JVM, allocated when the array is created. I'm not sure this is valid. Another assumption is that array writes don't go all the way to main memory, but to L1 cache instead, as the Intel Xeon has a ccNUMA architecture - correct me if I'm wrong.) A minimal false-sharing probe is sketched after the full program at the end.

3. Is it possible that each thread has its own local heap area where it independently allocates new objects, and that this causes lower contention when the array is allocated in the thread? If so, how is that area of the heap garbage collected if references get shared? (The sketch after this list also tries to isolate this effect.)

4. Why would increasing the array size to ~32000 elements improve scalability (decrease memory contention)? What exactly in the memory hierarchy is the cause of this?
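For questions 1 and 3, a probe I have not run as part of the measurements above would be to repeat CASE 2 with the relevant HotSpot mechanism switched off - `-XX:-DoEscapeAnalysis` disables escape analysis, and `-XX:-UseTLAB` disables thread-local allocation buffers:

```
java -Xmx512m -Xms512m -server -XX:-DoEscapeAnalysis -Dsize=500000000 -Dpar=8 org.scalapool.bench.MultiStackJavaExperiment 7
java -Xmx512m -Xms512m -server -XX:-UseTLAB -Dsize=500000000 -Dpar=8 org.scalapool.bench.MultiStackJavaExperiment 7
```

If either flag makes CASE 2 degrade to CASE 1's times, the corresponding mechanism is implicated.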
Please be precise and support your claims with references.

Thank you!

---

The entire runnable Java program:

```java
import java.util.ArrayList;

class MultiStackJavaExperiment {

    final class Foo {
        int x = 0;
    }

    final class Worker extends Thread {
        Foo[] array = new Foo[1024];
        int sz;

        public Worker(int _sz) {
            sz = _sz;
        }

        public void run() {
            Foo[] arr = new Foo[1024];
            //Foo[] arr = array;
            loop(arr);
        }

        public void loop(Foo[] arr) {
            int i = 0;
            int pos = 512;
            Foo v = new Foo();
            while (i < sz) {
                if (i % 2 == 0) {
                    arr[pos] = v;
                    pos += 1;
                } else {
                    pos -= 1;
                    v = arr[pos];
                }
                i++;
            }
        }
    }

    public static void main(String[] args) {
        (new MultiStackJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Worker(sz));
            threads.get(i).start();
        }

        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }
}
```
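Finally, the minimal false-sharing probe referenced in question 2 (a sketch of my own, not part of the measurements above; the class name, the iteration count, and the assumed 64-byte cache line are mine):

```java
class FalseSharingProbe {
    static final long[] data = new long[1024];
    static final long ITERS = 100000000L;

    public static void main(String[] args) throws InterruptedException {
        // gap 1: the two written slots are 8 bytes apart (same cache line);
        // gap 16: they are 128 bytes apart (guaranteed separate cache lines).
        int[] gaps = {1, 16};
        for (int g = 0; g < gaps.length; g++) {
            long start = System.currentTimeMillis();
            Thread a = writer(512);
            Thread b = writer(512 + gaps[g]);
            a.start();
            b.start();
            a.join();
            b.join();
            System.out.println("gap " + gaps[g] + ": "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }

    // Each writer repeatedly increments a single, fixed slot of the shared array.
    static Thread writer(final int index) {
        return new Thread() {
            public void run() {
                for (long i = 0; i < ITERS; i++) {
                    data[index]++;
                }
            }
        };
    }
}
```

A `long[]` is used so the distance between the two written slots is exact (8 bytes per element) regardless of reference size; if false sharing is what hurts, the `gap 1` run should be markedly slower than the `gap 16` run.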