GC Bomb - CMS GC starves high priority system threads

Not game related, but it might interest you just the same.

My company makes video encoding stuff. We have our own capture cards and a fancy app that does video encoding and broadcasting using various codecs. The UI is written in Java. It runs on Windows, and the same UI code provides a remote interface to our encoders on Linux and Mac. My tests were done on 1.5.0_01, since 1.5.0_03 or _04 will crash hard in jvm.dll if you let it run for 30 hours or so.

Recently while testing we found that the hardware was dropping frames of video because the software wasn’t emptying the hardware’s FIFO fast enough. This happened when the Concurrent Mark Sweep collector did a full sweep. There is a driver thread that runs at the highest windows priority (31) that is responsible for emptying the hardware FIFO. That thread is running in the System process, quite separate from the UI application.

After much experimenting I found that the ‘sweep’ phase of the CMS collector would run for a bit over 200 ms. During that time there was a lot of thread context switching and it managed to starve the high priority driver thread for long enough (100ms) that the hardware FIFO overflowed.

Incidentally, my code during a test does nothing but query some stats from another process and display them on a Swing UI. It has one indeterminate progress bar and several JLabels that update once a second. As far as I know I generate no new objects that should live more than one second during this process. Most of the time the app is idle. A polling thread wakes up every second and collects some stats via JNI.

I switched to the OpenGL pipeline and the problem went away. Checking the GC activity showed that with the default graphics pipeline on Windows, or with -Dsun.java2d.noddraw=true, the CMS sweep was, as I said, 200ms+. Simply switching to the OpenGL pipeline and changing nothing else in the application, the time taken for the CMS sweep phase was reduced by a factor of 10, though it happened a bit more frequently. Lowering the max heap size (to 20MB) also made the problem go away, by increasing the frequency and shortening the duration of the CMS sweep phase.

Because the OpenGL pipeline was not stable on all the hardware we needed to support, we searched for another solution and found it with the -XX:+CMSIncrementalMode option (used in addition to the -XX:+UseConcMarkSweepGC that we were already using).
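
For the record, the launch line ended up looking something like this (the class name and classpath here are placeholders, not our real ones):

"c:\Program Files\Java\jre1.5.0_01\bin\java" -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -cp . EncoderUI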

What I find most disturbing about this is that the GC thread could possibly starve a thread in the system process that is set to the highest possible priority. I have many questions:

What priority does the GC thread run at?
What about the finalizer thread?
Why do any objects accumulate in the tenured generation at all? (Every object I make is small and lives for at most one second. The young generation should easily deal with them.)
My thread does little more than while(true) { label.setText("frame "+x); Thread.sleep(1000); }, though with a few more JLabels, a hidden JTable, and a progress bar animating.
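
Spelled out, the whole "application" is essentially this (a minimal sketch only - the class name is made up, and I've left out the JNI stats call, the extra labels and the hidden table):

import java.awt.BorderLayout;
import javax.swing.*;

public class StatsUiSketch {
    public static void main(String[] args) throws Exception {
        JFrame frame = new JFrame("Encoder stats");
        final JLabel label = new JLabel("frame 0");
        JProgressBar bar = new JProgressBar();
        bar.setIndeterminate(true);                 // the animated, indeterminate progress bar
        frame.getContentPane().add(label, BorderLayout.CENTER);
        frame.getContentPane().add(bar, BorderLayout.SOUTH);
        frame.pack();
        frame.setVisible(true);

        int x = 0;
        while (true) {                              // the once-a-second polling loop
            final String text = "frame " + (x++);
            SwingUtilities.invokeLater(new Runnable() {   // the real code just calls setText directly
                public void run() { label.setText(text); }
            });
            Thread.sleep(1000);
        }
    }
}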

Next I have to figure out why jvm.dll always crashes at the same spot after running for a couple days… though for now 1.5.0_01 will save me from that.

I noticed that something in 1.5 starves the entire system making the mouse judder - unheard of in Windows XP before. But no-one saw fit to investigate the problem despite it being very severe. Was hoping Azeem or Jeff would pick it up…

Cas :slight_smile:

That’s very interesting, Cas. It just so happens that many years ago, when I was programming a time-sensitive thing on Windows NT, I had a process that ran at the highest priority in a user-mode thread, and I found that an animated mouse pointer would block my thread and totally hose my code. The solution was to write a driver and run my code in an interrupt (it needed to do time-sensitive access to the serial port to talk to high-end video controllers and respond to their requests in a timely manner).

So this led me to believe that the code controlling the mouse pointer runs at an extremely high priority, likely as part of an ISR, even when it is doing trivial things like making a stupid pink dinosaur walk. The fact that the JVM might somehow block such high-priority code is very disturbing.

My first guess was that finalizers involved with freeing DirectDraw surfaces were making the video driver steal CPU in some high-priority driver code. But I experienced the same problem with -Dsun.java2d.noddraw=true.

I should have also stated that I’m running this stuff on a dual processor machine. As I understand it the CMS collector uses a single thread, my application thread is mostly blocked anyway, AWT/Swing will have a timer thread to animate the indeterminate progress bar, the driver thread will be emptying the hardware FIFO, and there will be an encoder process that is pushing the video frames through some codec (MPEG2 or WM9) on one or two threads.

During the encoding process the CPU usage is around 21%, plenty of head room - until the GC kicks in and kills everything.

The interesting thing for me here is that this occurs during Eclipse development as well - which hasn’t got a whiff of AWT in it, it’s all SWT. But most telling of all is that no such juddering and jerkiness occurs whatsoever under 1.4.2.

The thing is, this is such an obviously big issue, and so immediately, annoyingly noticeable, that I’m not sure why no-one at Sun has noticed it and set about fixing it.

Cas :slight_smile:

I’m still waiting for Cas’s detailed write-up about messing with his system, so we can get some idea of what caused that problem.

From what I’ve heard about it, it looks like something in the DirectX pipeline.

I have no idea how to pinpoint the bit that’s causing the lock, but Scott’s discovery that enabling the OpenGL pipeline helps is intriguing. I suspected DirectX but had up till now discounted it because the problem manifested in Eclipse which doesn’t use AWT. Also, no-one seems to have reported the problem on Linux…

Cas :slight_smile:

Ah… but I think the OpenGL pipeline thing is a red herring. I emailed Chris Campbell about that and he explained that the OpenGL pipeline simply generates less garbage because it hands more operations off to OpenGL. My problem occurs even with -Dsun.java2d.noddraw=true, which should mean that only GDI calls are used.

Chris wrote: [quote]I think the reason you see more frequent/smaller GCs with the OGL pipeline is that it is able to perform many operations directly via OpenGL with few intermediate operations. For example, if your application does a lot of bilinear image scaling, the OGL pipeline can handle this directly, whereas the DDraw or GDI pipelines can only handle this using our software loops. Those software loops might require conversion from one image format to another, and therefore temporary intermediate images may be created frequently. More intermediate images means increased burden on GC.

Note that this is just one example of a situation where your app might create more garbage with one pipeline versus another. Have you used the -Dsun.java2d.trace flag?
http://java.sun.com/j2se/1.5.0/docs/guide/2d/flags.html#trace

I would suggest that you try running your app with this flag, first with OGL, then with the other pipelines, and then compare the logs. If you see lots of primitives with “General” in their name, then that’s a good indication that an operation is using intermediate surfaces, and possibly creating more garbage.
[/quote]
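
For reference, the kind of comparison run he’s suggesting looks roughly like this (MyApp is just a placeholder for whatever class you’re testing; -Dsun.java2d.opengl=true is the switch that enables the OGL pipeline, so the first run uses OGL and the second falls back to the default pipeline):

"c:\Program Files\Java\jre1.5.0_03\bin\java" -Dsun.java2d.opengl=true -Dsun.java2d.trace=count -cp . MyApp
"c:\Program Files\Java\jre1.5.0_03\bin\java" -Dsun.java2d.trace=count -cp . MyApp
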
I think the sweep phase of the collection might still be blocking high-priority threads in an odd way, but simply for shorter times due to the characteristics of the garbage generated. And of course even GDI calls go through the graphics driver in some form eventually - but I’m afraid I don’t know enough about things at that level.

I also recently found this: http://java.sun.com/performance/reference/whitepapers/5.0_performance.html#2.10
which mentions:

[quote]The Concurrent Low Pause Garbage Collector, also known as CMS, has had several improvements for J2SE 5.0. Garbage collection pauses are shorter because of better use of parallelism during collection periods during which application threads are stopped.
[/quote]
This paper: http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html#0.0.0.%20The%20Concurrent%20Low%20Pause%20Collector|outline
was also quite helpful, but still leaves some unanswered questions. Such as when do the finalizers run? Is there any way to get profiling info about how much time is taken by the finalizer thread?

Here is an example of a collection that starves a priority 31 thread in the System process for 100ms or more:

1526.756: [GC 1526.756: [ParNew: 3968K->0K(4032K), 0.0159318 secs] 16703K->12797K(21260K), 0.0160274 secs]
1538.213: [GC 1538.214: [ParNew: 3968K->0K(4032K), 0.0105305 secs] 16765K->12860K(21260K), 0.0106204 secs]
1549.802: [GC 1549.802: [ParNew: 3968K->0K(4032K), 0.0151089 secs] 16828K->12923K(21260K), 0.0152276 secs]
1561.191: [GC 1561.191: [ParNew: 3968K->0K(4032K), 0.0156131 secs] 16891K->12984K(21260K), 0.0157176 secs]
1572.565: [GC 1572.565: [ParNew: 3968K->0K(4032K), 0.0154554 secs] 16952K->13046K(21260K), 0.0155757 secs]
1583.973: [GC 1583.973: [ParNew: 3968K->0K(4032K), 0.0122170 secs] 17014K->13107K(21260K), 0.0123115 secs]
1595.531: [GC 1595.531: [ParNew: 3968K->0K(4032K), 0.0162585 secs] 17075K->13168K(21260K), 0.0163687 secs]
1606.968: [GC 1606.968: [ParNew: 3968K->0K(4032K), 0.0158047 secs] 17136K->13229K(21260K), 0.0159176 secs]
1606.984: [GC [1 CMS-initial-mark: 13229K(17228K)] 13229K(21260K), 0.0008632 secs]
1606.985: [CMS-concurrent-mark-start]
1607.077: [CMS-concurrent-mark: 0.092/0.092 secs]
1607.077: [CMS-concurrent-preclean-start]
1607.077: [CMS-concurrent-preclean: 0.000/0.000 secs]
1607.092: [GC[YG occupancy: 86 K (4032 K)]1607.092: [Rescan (parallel) , 0.0009427 secs]1607.093: [weak refs processing, 0.0109421 secs] [1 CMS-remark: 13229K(17228K)] 13315K(21260K), 0.0161694 secs]
1607.123: [CMS-concurrent-sweep-start]
1607.308: [CMS-concurrent-sweep: 0.184/0.184 secs]
1607.308: [CMS-concurrent-reset-start]
1607.321: [CMS-concurrent-reset: 0.013/0.013 secs]
Encoding stopped on instance 0
1862.099: [GC 1862.099: [ParNew: 3968K->0K(4032K), 0.0171288 secs] 14714K->10755K(21944K), 0.0172192 secs]
2160.773: [GC 2160.773: [ParNew: 3968K->0K(4032K), 0.0169685 secs] 14723K->10756K(21944K), 0.0170589 secs]

The “encoding stopped” message is from my app realizing that we aborted a video capture because the capture hardware was forced to drop frames.
I’m not sure what the typical timeslice is on WinXP, but I would expect that in 184ms a priority 31 thread in the System process would get a couple of shots at the CPU. It doesn’t need to do much to set up a DMA operation to empty the FIFO of our hardware, so it is quite surprising that the hardware FIFO overflowed. The FIFO can buffer 100ms of video frames, so a 184ms sweep is nearly twice as long as the driver thread can afford to be starved.

I’ve had eclipse juddering so much under 1.5 that the Windows event queue filled up and beeped at me when I’ve moved the mouse - gnarly.

What is at issue here is not thread priorities! The JVM process runs at Normal priority - I can see this in the task manager - and I think, but can’t remember for sure, that none of the threads it can produce can interrupt the “ring 0” stuff like the mouse pointer, which is a realtime priority.

So what is actually happening, I believe, is that either new() or GC may be holding a kernel mode lock somewhere that has everything else waiting for it to finish. The actual throughput of the system is not changed particularly but responsiveness is through the floor as the JVM holds this lock preventing extremely high priority tasks from executing. I suspect it’s some kernel mode call that’s doing it because userland tasks are completely at the mercy of the kernel and don’t normally have the ability to block the kernel unless they perform kernel operations. This is one reason why OpenGL drivers tend to be far stabler than D3D ones, because GL mostly lives in userland whereas DirectX is nastily mixed up into the kernel. Well, that’s how it was last time I read up on it.

The GL pipeline is a total red herring - didn’t have any effect on Eclipse, as I suspected - AWT doesn’t have anything to do with it. However Eclipse generates a hell of a lot of garbage.

There is one other possibility: that JIT compilation is also holding on to this kernel mode lock. Eclipse does an awful lot of recompiling, and the problem is worse when I run Eclipse with the -server switch.

Cas :slight_smile:

Well, a process priority and the thread priorities within the process are slightly different things. The thread priority of the GC could be set high even though the base priority of the process is normal. But I tend to agree with you that it is some interaction with the kernel that is causing this. What sort of system-wide lock would be held by the JVM, though?

I went so far as to wonder if it was the interaction with the kernel memory manager. When getting more RAM for the Java heap, or releasing RAM when it shrinks the heap, there will be page tables and stuff that need to be updated in kernel mode. But I just can’t see there being a significant amount of kernel interaction caused by something like that.

Cas, have you tried -XX:+UseConcMarkSweepGC with -XX:+CMSIncrementalMode to fix the issues you see with Eclipse? (I have never seen similar issues when I run Eclipse on my XP box.)
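
(In case it helps, the flags can be fed through the Eclipse launcher with -vmargs, e.g.:

eclipse.exe -vmargs -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode

or you can put the same thing in the shortcut you start Eclipse from.)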

I too suspect it’s something to do with a kernel mode memory lock of some sort. Those GC switches just make Eclipse bomb out before it finishes loading :confused: And the mouse still jerks while it’s loading.

Cas :slight_smile:

I’m absolutely certain it’s GC now - I had an OOME today, and as the heap got completely full and the GC went into overdrive the system was almost brought to a standstill. I couldn’t even get at the Task Manager, and the mouse juddered like an old 2CV struggling up a hill.

Cas :slight_smile:

Moved from the other thread…

The GC activity is significantly different using the same options with Mustang b43.

Check this out - I’ve been trying to put together a test case.
Here it is running on 1.5.0_03:

C:\DevJava\BugReports>"c:\Program Files\Java\jre1.5.0_03\bin\java" -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps  -cp . JVMCrash
0.000: [GC 0.000: [ParNew: 3967K->0K(4032K), 0.0200582 secs] 3967K->1066K(16320K), 0.0205862 secs]
10.541: [GC 10.541: [ParNew: 3968K->0K(4032K), 0.0104868 secs] 5034K->1395K(16320K), 0.0109198 secs]
32.334: [GC 32.335: [ParNew: 3968K->0K(4032K), 0.0151047 secs] 5363K->1522K(16320K), 0.0155206 secs]
54.148: [GC 54.149: [ParNew: 3968K->0K(4032K), 0.0140430 secs] 5490K->1663K(16320K), 0.0145384 secs]
75.734: [GC 75.735: [ParNew: 3968K->0K(4032K), 0.0071965 secs] 5631K->1823K(16320K), 0.0077193 secs]
97.746: [GC 97.747: [ParNew: 3968K->0K(4032K), 0.0114971 secs] 5791K->1940K(16320K), 0.0119415 secs]
119.838: [GC 119.839: [ParNew: 3968K->0K(4032K), 0.0141531 secs] 5908K->2056K(16320K), 0.0146701 secs]
141.901: [GC 141.901: [ParNew: 3968K->0K(4032K), 0.0155667 secs] 6024K->2171K(16320K), 0.0161009 secs]
163.354: [GC 163.355: [ParNew: 3968K->0K(4032K), 0.0158057 secs] 6139K->2378K(16320K), 0.0163267 secs]

Here is the exact same code running on Mustang b43:

C:\DevJava\BugReports>java  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps  -cp . JVMCrash
0.000: [GC 0.000: [ParNew: 3968K->0K(4032K), 0.0174371 secs] 3968K->953K(16320K), 0.0181594 secs]
28.058: [GC 28.058: [ParNew: 3968K->0K(4032K), 0.0151061 secs] 4921K->1110K(16320K), 0.0154193 secs]
69.326: [GC 69.327: [ParNew: 3968K->0K(4032K), 0.0135772 secs] 5078K->1113K(16320K), 0.0139347 secs]
110.623: [GC 110.624: [ParNew: 3968K->0K(4032K), 0.0148840 secs] 5081K->1114K(16320K), 0.0152321 secs]
151.814: [GC 151.814: [ParNew: 3968K->0K(4032K), 0.0131889 secs] 5082K->1115K(16320K), 0.0135931 secs]
193.205: [GC 193.206: [ParNew: 3968K->0K(4032K), 0.0138703 secs] 5083K->1117K(16320K), 0.0142879 secs]
234.612: [GC 234.613: [ParNew: 3968K->0K(4032K), 0.0150744 secs] 5085K->1118K(16320K), 0.0155725 secs]
273.818: [GC 273.819: [ParNew: 3968K->0K(4032K), 0.0131767 secs] 5086K->1120K(16320K), 0.0137323 secs]
312.053: [GC 312.054: [ParNew: 3968K->0K(4032K), 0.0141546 secs] 5088K->1122K(16320K), 0.0146340 secs]
350.398: [GC 350.398: [ParNew: 3968K->0K(4032K), 0.0151758 secs] 5090K->1123K(16320K), 0.0156526 secs]

Notice the frequency of the collections, and the rate at which the second set of numbers around the “->” grows.
This test generates no objects that should live longer than about one second - at least not in the test code itself. I think some long-lived objects are accumulating in the JRE code… maybe some weak map or something?
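
(To put numbers on it: in the 1.5.0_03 log above the retained figure climbs from about 1066K to 2378K in under three minutes - roughly 8K per second - while in the Mustang log it creeps up by barely 13K over more than five minutes. That is where the factor-of-100-plus difference mentioned below comes from.)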


Jeff,

The example above only shows collections of the eden space. They are not a problem in terms of the time they take.

The thing to note is the frequency of collections and the amount of garbage promoted to the next generation. I may not be 100% accurate with my guesses as to what the output means but I think I have the general idea right.

The ###->### represents the before and after occupancy of some region. The first set of numbers seems to correspond to the eden space - once it is consumed the surviving objects are moved out, so there is nothing left and the usage grows from 0 again.

The second set of #-># appears to be the occupancy of the whole heap before and after the collection - so whatever survives each young collection is counted here, and we can see it steadily accumulating.
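
To make that concrete, take one line from the 1.5 log above:

32.334: [GC 32.335: [ParNew: 3968K->0K(4032K), 0.0151047 secs] 5363K->1522K(16320K), 0.0155206 secs]

The 3968K->0K(4032K) part is the young generation being emptied (out of a 4032K capacity), and the 5363K->1522K(16320K) part is the whole heap before and after the collection, so the 1522K left standing is whatever survived or was promoted.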

The main thing to point out is that taking the same code and running it on 5.0 and 6.0 you see a huge difference in the number of objects that survive. It’s a factor of 100 or more larger for the 1.5.0_03 VM compared to Mustang b43.

Because the longer-lived objects accumulate over a hundred times faster in 1.5, this triggers bigger sweeps of the next generation.

This looks like:

1237.204: [GC 1237.204: [ParNew: 3968K->0K(4032K), 0.0165371 secs] 11910K->8058K(16320K), 0.0168901 secs]
1258.767: [GC 1258.767: [ParNew: 3968K->0K(4032K), 0.0171561 secs] 12026K->8170K(16320K), 0.0177815 secs]
1280.768: [GC 1280.768: [ParNew: 3968K->0K(4032K), 0.0165995 secs] 12138K->8285K(16320K), 0.0170013 secs]
1280.786: [GC [1 CMS-initial-mark: 8285K(12288K)] 8285K(16320K), 0.0004433 secs]

1280.786: [CMS-concurrent-mark-start]
1280.839: [CMS-concurrent-mark: 0.053/0.053 secs]
1280.840: [CMS-concurrent-preclean-start]
1280.840: [CMS-concurrent-preclean: 0.000/0.000 secs]
1280.845: [GC[YG occupancy: 83 K (4032 K)]1280.846: [Rescan (parallel) , 0.0005382 secs]1280.847: [weak refs processing, 0.0123378 secs] [1 CMS-remark: 8285K(12288K)] 8368K(16320K), 0.0182686 secs]
1280.878: [CMS-concurrent-sweep-start]
[b]1281.052: [CMS-concurrent-sweep: 0.171/0.171 secs][/b]
1281.052: [CMS-concurrent-reset-start]
1281.056: [CMS-concurrent-reset: 0.004/0.004 secs]
1302.538: [GC 1302.539: [ParNew: 3968K->0K(4032K), 0.0135049 secs] 8193K->4340K(16320K), 0.0139591 secs]
1324.754: [GC 1324.755: [ParNew: 3968K->0K(4032K), 0.0174210 secs] 8308K->4459K(16320K), 0.0183872 secs]

It is the CMS sweep phase that runs solid on all processors for 170-250 ms that causes very bad things to happen.

Guess what! I found what is causing the >100x difference in long-lived objects between Java 5 and Java 6. It is an indeterminate JProgressBar, running with the Win XP look and feel. Apparently the Java 5 implementation has some serious performance issues. I’ve taken various GC measurements as well as the output of -Dsun.java2d.trace=count and passed it on to Chris C of Java2D accelerated pipeline fame. He was helping me track down some of the issues that I observed with the garbage and pausing that varied so widely based on the rendering pipeline used.
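
For what it’s worth, the reproduction comes down to something like this (my reconstruction of the test case, not the exact JVMCrash source; run it with the same -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps flags as in the runs above and watch the retained numbers climb on 1.5):

import javax.swing.*;

public class ProgressBarGcTest {
    public static void main(String[] args) throws Exception {
        // the Windows XP look and feel is the important ingredient
        UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                JFrame frame = new JFrame("Indeterminate progress bar test");
                JProgressBar bar = new JProgressBar();
                bar.setIndeterminate(true);   // the animation alone generates the garbage
                frame.getContentPane().add(bar);
                frame.setSize(300, 80);
                frame.setVisible(true);
            }
        });
        Thread.sleep(Long.MAX_VALUE);         // keep the VM alive while the bar animates
    }
}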

The big mysteries that remain:
1- Why does painting a silly progress bar generate ANY long-lived objects? Since the progress bar repaints several times a second, what could hold references that last more than a fraction of a second? The ParNew collections happen several seconds apart; I suppose whatever drawing surface happens to be ‘current’ must get promoted to the next generation, even though it becomes garbage an instant later.
2- Why does the CMS sweep lock out ultra-high priority threads in other processes, even if it does want all the CPU it can get? It should be running at a “normal” priority.

I MIGHT have a suggestion for #1.

Without looking at the code, this is just a wild guess but…

There are a few things that can force a short-lived object to persist past eden and into the next generation(s).

Chief among these is attaching anything to the object that requires special GC handling. This includes finalizers, and all the Reference types (weak, soft, phantom).

Maybe there is some object being rapidly created and disposed that, in the 1.5 library, has one of these attached to it?
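
A contrived illustration of what I mean (nothing to do with the actual JRE code, just the mechanism):

import java.lang.ref.WeakReference;

public class SpecialHandlingChurn {
    // even an empty finalize() means the object must be queued for the
    // finalizer thread before its memory can actually be reclaimed
    static class Finalizable {
        protected void finalize() { }
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            new Finalizable();                                  // garbage, but needs special handling
            Object payload = new byte[8 * 1024];
            WeakReference<Object> ref = new WeakReference<Object>(payload);
            payload = null;                                     // referent is now only weakly reachable;
                                                                // the Reference object itself shows up as
                                                                // "[weak refs processing ...]" work in the GC log
            if (ref.get() != null) {
                // still alive until the collector gets around to clearing it
            }
            Thread.sleep(1);
        }
    }
}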

JK

The actual time spent GC’ing isn’t really the issue though, it’s the fact that the whole system stops - completely - while it’s happening, not just the VM!

Cas :slight_smile:

Cas, could you switch Eclipse to the Metal look and feel? IIRC there is a version without SWT.
Could you write an application that generates the same shudder as Eclipse? How does JRockit behave? How does the Linux VM do on the same computer? Try setting Eclipse to low priority. IIRC the only things that run at really high priority are the recompiler and GC threads, but that could change the behaviour somehow.

Hey, this isn’t just Eclipse, it’s all stuff running under Java 5. Eclipse just generates plenty of garbage. I don’t have much of a problem with my games, for example, because I generate very little garbage and mostly the gameplay hides these machine pauses anyway.

Cas :slight_smile:

Well, it seems that the weak refs are what do it then - notice this message in the CMS phase: “[weak refs processing, 0.0123378 secs]”.

I had suspected weak refs might cause objects to linger around a bit; it makes sense really.

Armed with this knowledge of how to generate garbage that gets promoted to the older generation using weak references, perhaps now it won’t be very hard to write a very simple test case to hand over to Sun so they can see the effect of the full-sweep GC on the rest of the system.
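
One possible shape for that test case (just a sketch under the assumptions in this thread - a MAX_PRIORITY Java thread is obviously not the same thing as a priority-31 driver thread in the System process, so this only shows delays inside the VM lining up with the CMS phases in the -XX:+PrintGCDetails output; I’ve used a rotating buffer rather than weak references simply because it reliably promotes garbage):

import java.util.ArrayList;
import java.util.List;

public class CmsStarvationTest {
    public static void main(String[] args) throws Exception {
        // "driver" stand-in: wakes every 10 ms and reports any wake-up that
        // arrives more than 50 ms late
        Thread watcher = new Thread(new Runnable() {
            public void run() {
                long last = System.currentTimeMillis();
                while (true) {
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                    long now = System.currentTimeMillis();
                    if (now - last > 50) {
                        System.out.println("wake-up delayed " + (now - last) + " ms");
                    }
                    last = now;
                }
            }
        });
        watcher.setPriority(Thread.MAX_PRIORITY);
        watcher.setDaemon(true);
        watcher.start();

        // garbage generator: a rotating buffer of small arrays, each living a
        // few seconds - long enough to be promoted, so the tenured generation
        // keeps filling with garbage and CMS has regular sweeps to do
        List<byte[]> buffer = new ArrayList<byte[]>();
        int index = 0;
        while (true) {
            byte[] block = new byte[4 * 1024];
            if (buffer.size() < 5000) {
                buffer.add(block);
            } else {
                buffer.set(index, block);           // the replaced entry becomes tenured garbage
                index = (index + 1) % buffer.size();
            }
            Thread.sleep(1);                        // keep allocation moderate so the CPU stays mostly idle
        }
    }
}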

This issue is very important to resolve for client-side Java. I imagine Sun does a lot of their GC tests server-side with huge heaps where they just want the GC to go as fast as possible and they aren’t paying any attention to how it affects other processes.