Note that there are some explanatory texts on larger screens.

plurals
  1. POHung JVM consuming 100% CPU
    primarykey
    data
    text
    <p>We have a JAVA server running on Sun JRE 6u20 on Linux 32-bit (CentOS). We use the Server Hotspot with CMS collector with the following options (I've only provided the relevant ones): </p> <pre><code>-Xmx896m -Xss128k -XX:NewSize=384M -XX:MaxPermSize=96m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC </code></pre> <p>Sometimes, after running for a while, the JVM seems to slip into a hung state, whereby even though we don't make any requests to the application, the CPU continues to spin at 100% (we have 8 logical CPUs, so it looks like only one CPU does the spinning). In this state the JVM doesn't respond to SIGHUP signals (kill -3) and we can't connect to it normally with jstack. We CAN connect with "jstack -F", but the output is dodgy (we can see lots of NullPointerExceptions from JStack apparently because it wasn't able to 'walk' some stacks). So the "jstack -F" output seems to be useless.</p> <p>We have run a stack dump from "gdb" though, and we were able to match the thread id that spins the CPU (we found that using "top" with a per-thread view - "H" option) with a thread stack that appears in the gdb result and this is how it looks like:</p> <pre><code>Thread 443 (Thread 0x7e5b90 (LWP 26310)): #0 0x0115ebd3 in CompactibleFreeListSpace::block_size(HeapWord const*) const () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #1 0x01160ff9 in CompactibleFreeListSpace::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #2 0x0123456c in Generation::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #3 0x01229b2c in GenCollectedHeap::prepare_for_compaction() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #4 0x0122a7fc in GenMarkSweep::invoke_at_safepoint(int, ReferenceProcessor*, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #5 0x01186024 in CMSCollector::do_compaction_work(bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #6 0x011859ee in CMSCollector::acquire_control_and_collect(bool, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #7 0x01185705 in ConcurrentMarkSweepGeneration::collect(bool, bool, unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #8 0x01227f53 in GenCollectedHeap::do_collection(bool, bool, unsigned int, bool, int) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #9 0x0115c7b5 in GenCollectorPolicy::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #10 0x0122859c in GenCollectedHeap::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #11 0x0158a8ce in VM_GenCollectForAllocation::doit() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #12 0x015987e6 in VM_Operation::evaluate() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #13 0x01597c93 in VMThread::evaluate_operation(VM_Operation*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #14 0x01597f0f in VMThread::loop() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #15 0x015979f0 in VMThread::run() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #16 0x0145c24e in java_start(Thread*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so #17 0x00ccd46b in start_thread () from /lib/libpthread.so.0 #18 0x00bc2dbe in clone () from /lib/libc.so.6 </code></pre> <p>It seems that a JVM thread is spinning while doing some CMS related work. We have checked the memory usage on the box, there seems to be enough memory available and the system is not swapping. Has anyone come across such a situation? Does it look like a JVM bug?</p> <p><strong>UPDATE</strong></p> <p>I've obtained some more information about this problem (it happened again on a server that has been running for more than 7 days). When the JVM entered the "hung" state it stayed like that for 2 hours until the server was manually restarted. We have obtained a core dump of the process and the gc log. We tried to get a heap dump as well, but "jmap" failed. We tried to use jmap -F but then only a 4Mb file was written before the program aborted with an exception (something about the a memory location not being accessible). So far I think the most interesting information comes from the gc log. It seems that the GC logging stopped as well (possibly at the time when the VM thread went into the long loop):</p> <pre><code>657501.199: [Full GC (System) 657501.199: [CMS: 400352K-&gt;313412K(524288K), 2.4024120 secs] 660634K-&gt;313412K(878208K), [CMS Perm : 29455K-&gt;29320K(68568K)], 2.4026470 secs] [Times: user=2.39 sys=0.01, real=2.40 secs] 657513.941: [GC 657513.941: [ParNew: 314624K-&gt;13999K(353920K), 0.0228180 secs] 628036K-&gt;327412K(878208K), 0.0230510 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 657523.772: [GC 657523.772: [ParNew: 328623K-&gt;17110K(353920K), 0.0244910 secs] 642036K-&gt;330523K(878208K), 0.0247140 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 657535.473: [GC 657535.473: [ParNew: 331734K-&gt;20282K(353920K), 0.0259480 secs] 645147K-&gt;333695K(878208K), 0.0261670 secs] [Times: user=0.11 sys=0.00, real=0.02 secs] .... .... 688346.765: [GC [1 CMS-initial-mark: 485248K(524288K)] 515694K(878208K), 0.0343730 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 688346.800: [CMS-concurrent-mark-start] 688347.964: [CMS-concurrent-mark: 1.083/1.164 secs] [Times: user=2.52 sys=0.09, real=1.16 secs] 688347.964: [CMS-concurrent-preclean-start] 688347.969: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 688347.969: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 688352.986: [CMS-concurrent-abortable-preclean: 2.351/5.017 secs] [Times: user=3.83 sys=0.38, real=5.01 secs] 688352.987: [GC[YG occupancy: 297806 K (353920 K)]688352.987: [Rescan (parallel) , 0.1815250 secs]688353.169: [weak refs processing, 0.0312660 secs] [1 CMS-remark: 485248K(524288K)] 783055K(878208K), 0.2131580 secs] [Times: user=1.13 sys =0.00, real=0.22 secs] 688353.201: [CMS-concurrent-sweep-start] 688353.903: [CMS-concurrent-sweep: 0.660/0.702 secs] [Times: user=0.91 sys=0.07, real=0.70 secs] 688353.903: [CMS-concurrent-reset-start] 688353.912: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 688354.243: [GC 688354.243: [ParNew: 344928K-&gt;30151K(353920K), 0.0305020 secs] 681955K-&gt;368044K(878208K), 0.0308880 secs] [Times: user=0.15 sys=0.00, real=0.03 secs] .... .... 688943.029: [GC 688943.029: [ParNew: 336531K-&gt;17143K(353920K), 0.0237360 secs] 813250K-&gt;494327K(878208K), 0.0241260 secs] [Times: user=0.10 sys=0.00, real=0.03 secs] 688950.620: [GC 688950.620: [ParNew: 331767K-&gt;22442K(353920K), 0.0344110 secs] 808951K-&gt;499996K(878208K), 0.0347690 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] 688956.596: [GC 688956.596: [ParNew: 337064K-&gt;37809K(353920K), 0.0488170 secs] 814618K-&gt;515896K(878208K), 0.0491550 secs] [Times: user=0.18 sys=0.04, real=0.05 secs] 688961.470: [GC 688961.471: [ParNew (promotion failed): 352433K-&gt;332183K(353920K), 0.1862520 secs]688961.657: [CMS </code></pre> <p>I suspect this problem has something to do with the last line in the log (I've added some "...." in order to skip some lines that were not interesting). The fact that the server stayed in the hung state for 2 hours (probably trying to GC and compact the old generation) seems quite strange to me. Also, the gc log stops suddenly with that message and nothing else gets printed any more, probably because the VM Thread gets into some sort of infinite loop (or something that takes 2+ hours).</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. COIf you are running in low memory conditions that is not impossible to observe. But do update the VM, there were some GC big fixes in u21/22, iirc. Also MaxPermSize=96m is very low, I dunno what you run, though.
      singulars
    2. COI'm fairly sure that the VM is not running in low in memory. From the GC log Every time there is a Full GC, the OldGen gets back to an average of 370Mb, which means that there is about 27% free space in it. The young generation (384Mb) is pretty much emptied at each cleanup. So I think there's plenty of memory. The problem seems to happen just before the Full GC is run (as you can see from the logs) when the OldGen is almost full and the YoungGen collector can't promote some objects. Probably the VM realises that OldGen is too fragmented, so it triggers a compaction (which lasts for 2+ hours).
      singulars
    3. COObserving a similar pattern with server JRE 6u26 on Windows Server 2008. ParNew promotion failed followed by that fragment of a CMS collection entry: <<911188.231: [GC 911188.231: [ParNew (promotion failed): 750602K->753650K(811008K), 3.7030115 secs]911191.934: [CMS>> The old gen had 161582K available at that point, which is comparable to the levels at which CMS collections were triggered previously in the run. Wondering if you ever had any further insights into your problem, Bogdan?
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload