GC Bomb - CMS GC starves high priority system threads

Ah… but I think the OpenGL pipeline thing is a red herring. I emailed Chris Campbell about that and he explained that the OpenGL pipeline simply generates less garbage because it hands more operations off to OpenGL. My problem occurs even with -Dsun.java2d.noddraw=true, which should mean that only GDI calls are used.

Chris wrote: [quote]I think the reason you see more frequent/smaller GCs with the OGL pipeline is that it is able to perform many operations directly via OpenGL with few intermediate operations. For example, if your application does a lot of bilinear image scaling, the OGL pipeline can handle this directly, whereas the DDraw or GDI pipelines can only handle this using our software loops. Those software loops might require conversion from one image format to another, and therefore temporary intermediate images may be created frequently. More intermediate images means increased burden on GC.

Note that this is just one example of a situation where your app might create more garbage with one pipeline versus another. Have you used the -Dsun.java2d.trace flag?
http://java.sun.com/j2se/1.5.0/docs/guide/2d/flags.html#trace

I would suggest that you try running your app with this flag, first with OGL, then with the other pipelines, and then compare the logs. If you see lots of primitives with “General” in their name, then that’s a good indication that an operation is using intermediate surfaces, and possibly creating more garbage.
[/quote]
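(For reference, the trace flag just goes on the command line like any other system property; a hypothetical invocation, with MyApp standing in for whatever you want to profile, would look something like this:

java -Dsun.java2d.trace=count -Dsun.java2d.noddraw=true -cp . MyApp

The count option prints a summary of which primitives were used when the VM exits, which makes the “General” loops easy to spot.)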
I think the sweep phase of the collection might still be blocking high-priority threads in an odd way, but simply for shorter times due to the characteristics of the garbage generated. And of course even GDI calls go through the graphics driver in some form eventually - but I’m afraid I don’t know enough about things at that level.

I also recently found this: http://java.sun.com/performance/reference/whitepapers/5.0_performance.html#2.10
which mentions:

[quote]The Concurrent Low Pause Garbage Collector, also known as CMS, has had several improvements for J2SE 5.0. Garbage collection pauses are shorter because of better use of parallelism during collection periods during which application threads are stopped.
[/quote]
This paper: http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html#0.0.0.%20The%20Concurrent%20Low%20Pause%20Collector|outline
was also quite helpful, but it still leaves some unanswered questions, such as: when do the finalizers run? Is there any way to get profiling info about how much time the finalizer thread takes?

Here is an example of a collection that starves a priority 31 thread in the System process for 100ms or more:

1526.756: [GC 1526.756: [ParNew: 3968K->0K(4032K), 0.0159318 secs] 16703K->12797K(21260K), 0.0160274 secs]
1538.213: [GC 1538.214: [ParNew: 3968K->0K(4032K), 0.0105305 secs] 16765K->12860K(21260K), 0.0106204 secs]
1549.802: [GC 1549.802: [ParNew: 3968K->0K(4032K), 0.0151089 secs] 16828K->12923K(21260K), 0.0152276 secs]
1561.191: [GC 1561.191: [ParNew: 3968K->0K(4032K), 0.0156131 secs] 16891K->12984K(21260K), 0.0157176 secs]
1572.565: [GC 1572.565: [ParNew: 3968K->0K(4032K), 0.0154554 secs] 16952K->13046K(21260K), 0.0155757 secs]
1583.973: [GC 1583.973: [ParNew: 3968K->0K(4032K), 0.0122170 secs] 17014K->13107K(21260K), 0.0123115 secs]
1595.531: [GC 1595.531: [ParNew: 3968K->0K(4032K), 0.0162585 secs] 17075K->13168K(21260K), 0.0163687 secs]
1606.968: [GC 1606.968: [ParNew: 3968K->0K(4032K), 0.0158047 secs] 17136K->13229K(21260K), 0.0159176 secs]
1606.984: [GC [1 CMS-initial-mark: 13229K(17228K)] 13229K(21260K), 0.0008632 secs]
1606.985: [CMS-concurrent-mark-start]
1607.077: [CMS-concurrent-mark: 0.092/0.092 secs]
1607.077: [CMS-concurrent-preclean-start]
1607.077: [CMS-concurrent-preclean: 0.000/0.000 secs]
1607.092: [GC[YG occupancy: 86 K (4032 K)]1607.092: [Rescan (parallel) , 0.0009427 secs]1607.093: [weak refs processing, 0.0109421 secs] [1 CMS-remark: 13229K(17228K)] 13315K(21260K), 0.0161694 secs]
1607.123: [CMS-concurrent-sweep-start]
1607.308: [CMS-concurrent-sweep: 0.184/0.184 secs]
1607.308: [CMS-concurrent-reset-start]
1607.321: [CMS-concurrent-reset: 0.013/0.013 secs]
Encoding stopped on instance 0
1862.099: [GC 1862.099: [ParNew: 3968K->0K(4032K), 0.0171288 secs] 14714K->10755K(21944K), 0.0172192 secs]
2160.773: [GC 2160.773: [ParNew: 3968K->0K(4032K), 0.0169685 secs] 14723K->10756K(21944K), 0.0170589 secs]

The “encoding stopped” message is from my app realizing that a video capture was aborted because the capture hardware was forced to drop frames.
I’m not sure what the typical timeslice is on WinXP, but I would expect that in 184ms a priority 31 thread in the System process would get a couple of shots at the CPU. It doesn’t need to do much to set up a DMA operation to empty our hardware’s FIFO, so it is quite surprising that the FIFO overflowed; it can buffer 100ms of video frames.

I’ve had Eclipse juddering so much under 1.5 that the Windows event queue filled up and beeped at me when I moved the mouse - gnarly.

What is at issue here is not thread priorities! The JVM process runs at Normal priority - I can see this in the task manager - and I think, but can’t remember for sure, that none of the threads it can produce can interrupt the “ring 0” stuff like the mouse pointer, which is a realtime priority.

So what is actually happening, I believe, is that either new() or GC may be holding a kernel mode lock somewhere that has everything else waiting for it to finish. The actual throughput of the system is not changed particularly, but responsiveness goes through the floor as the JVM holds this lock, preventing extremely high priority tasks from executing. I suspect it’s some kernel mode call that’s doing it, because userland tasks are completely at the mercy of the kernel and don’t normally have the ability to block it unless they perform kernel operations. This is one reason why OpenGL drivers tend to be far more stable than D3D ones: GL mostly lives in userland, whereas DirectX is nastily mixed up in the kernel. Well, that’s how it was last time I read up on it.

The GL pipeline is a total red herring - it didn’t have any effect on Eclipse, as I suspected, since AWT doesn’t have anything to do with it. However, Eclipse generates a hell of a lot of garbage.

One other possibility is that JIT compilation is also holding on to this kernel mode lock. Eclipse does an awful lot of recompiling, and the problem is worse when I run it with the -server switch.

Cas :slight_smile:

Well, a process priority and the thread priorities within that process are slightly different things. The GC thread’s priority could be set high even though the base priority of the process is normal. But I tend to agree with you that some interaction with the kernel is causing this. What sort of system-wide lock would the JVM be holding, though?

I went so far as to wonder if it was interaction with the kernel memory manager. When the JVM gets more RAM for the Java heap, or releases RAM as it shrinks the heap, there will be page tables and the like that need to be updated in kernel mode. But I just can’t see something like that causing a significant amount of kernel interaction.

Cas, have you tried -XX:+UseConcMarkSweepGC with -XX:+CMSIncrementalMode to fix the issues you see with Eclipse? (I have never seen similar issues when I run Eclipse on my XP box.)

I too suspect it’s something to do with a kernel mode memory lock of some sort. Those switches just make Eclipse bomb out before it finishes loading :confused: And the mouse still jerks while it’s loading.

Cas :slight_smile:

I’m absolutely certain it’s GC now - I had an OOME today, and as the heap gets completely filled and the GC goes into overdrive the system is almost brought to a standstill. I can’t even get at the Task Manager, and the mouse judders like an old 2CV struggling up a hill.

Cas :slight_smile:

Moved from the other thread…

The GC activity is significantly different using the same options with Mustang b43.

Check this out - I’ve been trying to put together a test case…
Here it is running on 1.5.0_03

C:\DevJava\BugReports>"c:\Program Files\Java\jre1.5.0_03\bin\java" -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps  -cp . JVMCrash
0.000: [GC 0.000: [ParNew: 3967K->0K(4032K), 0.0200582 secs] 3967K->1066K(16320K), 0.0205862 secs]
10.541: [GC 10.541: [ParNew: 3968K->0K(4032K), 0.0104868 secs] 5034K->1395K(16320K), 0.0109198 secs]
32.334: [GC 32.335: [ParNew: 3968K->0K(4032K), 0.0151047 secs] 5363K->1522K(16320K), 0.0155206 secs]
54.148: [GC 54.149: [ParNew: 3968K->0K(4032K), 0.0140430 secs] 5490K->1663K(16320K), 0.0145384 secs]
75.734: [GC 75.735: [ParNew: 3968K->0K(4032K), 0.0071965 secs] 5631K->1823K(16320K), 0.0077193 secs]
97.746: [GC 97.747: [ParNew: 3968K->0K(4032K), 0.0114971 secs] 5791K->1940K(16320K), 0.0119415 secs]
119.838: [GC 119.839: [ParNew: 3968K->0K(4032K), 0.0141531 secs] 5908K->2056K(16320K), 0.0146701 secs]
141.901: [GC 141.901: [ParNew: 3968K->0K(4032K), 0.0155667 secs] 6024K->2171K(16320K), 0.0161009 secs]
163.354: [GC 163.355: [ParNew: 3968K->0K(4032K), 0.0158057 secs] 6139K->2378K(16320K), 0.0163267 secs]

Here is the exact same code running on Mustang b43:

C:\DevJava\BugReports>java  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps  -cp . JVMCrash
0.000: [GC 0.000: [ParNew: 3968K->0K(4032K), 0.0174371 secs] 3968K->953K(16320K), 0.0181594 secs]
28.058: [GC 28.058: [ParNew: 3968K->0K(4032K), 0.0151061 secs] 4921K->1110K(16320K), 0.0154193 secs]
69.326: [GC 69.327: [ParNew: 3968K->0K(4032K), 0.0135772 secs] 5078K->1113K(16320K), 0.0139347 secs]
110.623: [GC 110.624: [ParNew: 3968K->0K(4032K), 0.0148840 secs] 5081K->1114K(16320K), 0.0152321 secs]
151.814: [GC 151.814: [ParNew: 3968K->0K(4032K), 0.0131889 secs] 5082K->1115K(16320K), 0.0135931 secs]
193.205: [GC 193.206: [ParNew: 3968K->0K(4032K), 0.0138703 secs] 5083K->1117K(16320K), 0.0142879 secs]
234.612: [GC 234.613: [ParNew: 3968K->0K(4032K), 0.0150744 secs] 5085K->1118K(16320K), 0.0155725 secs]
273.818: [GC 273.819: [ParNew: 3968K->0K(4032K), 0.0131767 secs] 5086K->1120K(16320K), 0.0137323 secs]
312.053: [GC 312.054: [ParNew: 3968K->0K(4032K), 0.0141546 secs] 5088K->1122K(16320K), 0.0146340 secs]
350.398: [GC 350.398: [ParNew: 3968K->0K(4032K), 0.0151758 secs] 5090K->1123K(16320K), 0.0156526 secs]

Notice the frequency of the collections, and the rate at which the second set of numbers around the “->” grows.
This test generates no objects that should live longer than about one second - at least not in the test code itself. I think some long-lived objects are accumulating in the JRE code… maybe some weak map or something?


Jeff,

The example above only shows collections of the eden space. Those are not a problem in terms of the time they take.

The thing to note is the frequency of collections and the amount of garbage promoted to the next generation. I may not be 100% accurate with my guesses as to what the output means but I think I have the general idea right.

The ###->### represents the before and after occupancy of a memory area. The first set of numbers corresponds to the eden (young) space: once it is consumed the surviving objects are moved out, so there is nothing left and the usage grows from 0 again.

The second set of #-># appears to be the whole heap. Since the young space is emptied by each collection, what remains afterwards is essentially the next (tenured) generation, and here we can see it accumulating objects.
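For example, taking the last 1.5.0_03 line above and reading it field by field (my interpretation of the format, so treat it as approximate):

163.354: [GC 163.355: [ParNew: 3968K->0K(4032K), 0.0158057 secs] 6139K->2378K(16320K), 0.0163267 secs]

ParNew: 3968K->0K(4032K)  - young space: 3968K used before the collection, 0K after, 4032K capacity
6139K->2378K(16320K)      - whole heap: 6139K before, 2378K after, 16320K capacity

The 2378K left behind after the young space is emptied is what has survived into the older generation, and it creeps upward with every collection.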

The main thing to point out is that taking the same code and running it on 5.0 and 6.0 you see a huge difference in the number of objects that survive. It is a factor of 100 or more larger for the 1.5.0_03 VM compared to Mustang b43.

Because the longer-lived objects accumulate over a hundred times faster in 1.5, this triggers bigger sweeps of the next generation.

This looks like:

1237.204: [GC 1237.204: [ParNew: 3968K->0K(4032K), 0.0165371 secs] 11910K->8058K(16320K), 0.0168901 secs]
1258.767: [GC 1258.767: [ParNew: 3968K->0K(4032K), 0.0171561 secs] 12026K->8170K(16320K), 0.0177815 secs]
1280.768: [GC 1280.768: [ParNew: 3968K->0K(4032K), 0.0165995 secs] 12138K->8285K(16320K), 0.0170013 secs]
1280.786: [GC [1 CMS-initial-mark: 8285K(12288K)] 8285K(16320K), 0.0004433 secs]

1280.786: [CMS-concurrent-mark-start]
1280.839: [CMS-concurrent-mark: 0.053/0.053 secs]
1280.840: [CMS-concurrent-preclean-start]
1280.840: [CMS-concurrent-preclean: 0.000/0.000 secs]
1280.845: [GC[YG occupancy: 83 K (4032 K)]1280.846: [Rescan (parallel) , 0.0005382 secs]1280.847: [weak refs processing, 0.0123378 secs] [1 CMS-remark: 8285K(12288K)] 8368K(16320K), 0.0182686 secs]
1280.878: [CMS-concurrent-sweep-start]
[b]1281.052: [CMS-concurrent-sweep: 0.171/0.171 secs][/b]
1281.052: [CMS-concurrent-reset-start]
1281.056: [CMS-concurrent-reset: 0.004/0.004 secs]
1302.538: [GC 1302.539: [ParNew: 3968K->0K(4032K), 0.0135049 secs] 8193K->4340K(16320K), 0.0139591 secs]
1324.754: [GC 1324.755: [ParNew: 3968K->0K(4032K), 0.0174210 secs] 8308K->4459K(16320K), 0.0183872 secs]

It is the CMS sweep phase, which runs solid on all processors for 170-250 ms, that causes the very bad things to happen.

Guess what! I found what is causing the >100x difference in long-lived objects between Java 5 and Java 6. It is an indeterminate JProgressBar, running with the Win XP look and feel. Apparently the Java 5 implementation has some serious performance issues. I’ve taken various GC measurements as well as the output of -Dsun.java2d.trace=count and passed it on to Chris C of Java2D accelerated pipeline fame. He was helping me track down some of the issues that I observed with the garbage and pausing that varied so widely based on the rendering pipeline used.
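For reference, the core of the test boils down to something like this (a from-memory sketch rather than the exact JVMCrash source):

import javax.swing.*;

public class ProgressBarGC {
    public static void main(String[] args) throws Exception {
        // The Windows XP look and feel is the key ingredient (assuming this runs
        // on XP, where the system L&F is the Windows one)
        UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                JFrame frame = new JFrame("GC test");
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                JProgressBar bar = new JProgressBar();
                bar.setIndeterminate(true); // the indeterminate animation repaints constantly
                frame.getContentPane().add(bar);
                frame.pack();
                frame.setVisible(true);
            }
        });
        // keep the VM alive and watch the GC log output
        Thread.sleep(Long.MAX_VALUE);
    }
}

Run it with the same -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps flags as above and compare 1.5.0_03 against Mustang.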

The big mysteries that remain:
1- Why does painting a silly progress bar generate ANY long-lived objects? Since the progress bar repaints several times a second, what could hold references for more than a fraction of a second? The ParNew collections happen several seconds apart; I suppose that whatever drawing surface happens to be ‘current’ must get promoted to the next generation, even though it becomes garbage an instant later.
2- Why does the CMS sweep lock out ultra-high priority threads in other processes, even if it does want all the CPU it can get? It should be running at a “normal” priority.

I MIGHT have a suggestion for #1.

Without looking at the code, this is just a wild guess but…

There are a few things that can force a short-lived object to persist past the eden and into the next generation(s).

The chief among these is attaching anything to it that requires special GC handling. This includes finalizers and all the Reference types (weak, soft, phantom).

Maybe there is some object being rapidly created and disposed that, in the 1.5 library, has one of these attached to it?
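Something like this, just to illustrate the mechanism (a hypothetical class, not anything from the JRE):

// A short-lived object that nonetheless needs special GC handling.
class Scratch {
    byte[] data = new byte[1024];

    // Even an empty finalizer means the object has to be queued for finalization
    // when it becomes unreachable, and can only be reclaimed by a *later*
    // collection - plenty of time to get promoted out of eden in the meantime.
    protected void finalize() {
    }
}

The same sort of special handling applies to anything registered with a Reference (weak, soft, phantom): the collector has extra bookkeeping to do before the memory can simply be reused.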

JK

The actual time spent GC’ing isn’t really the issue, though; it’s the fact that the whole system stops - completely - while it’s happening, not just the VM!

Cas :slight_smile:

Cas, could you change Eclipse to the Metal look and feel? IIRC there is a version without SWT. Could you write an application that generates the same shudder as Eclipse? How does JRockit behave? How does a Linux VM behave on the same computer? Try setting Eclipse to low priority. IIRC the only things that run at really high priority are the recompiler and GC threads, but that behaviour could have changed somehow.

Hey, this isn’t just Eclipse, it’s all stuff running under Java 5. Eclipse just generates plenty of garbage. I don’t have much of a problem with my games, for example, because I generate very little garbage and mostly the gameplay hides these machine pauses anyway.

Cas :slight_smile:

Well, it seems that the weak refs are what do it then; notice this message in the CMS remark phase: “[weak refs processing, 0.0123378 secs]”

I had suspected weak refs might cause objects to linger around a bit; it makes sense really.

Armed with this knowledge of how to generate garbage that gets promoted to the older generation using weak references, perhaps now it won’t be very hard to write a very simple test case to hand over to Sun so they can see the effect of the full-sweep GC on the rest of the system.
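Something as dumb as this might do it (just a sketch; whether it actually reproduces the system-wide stalls still has to be verified):

import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class WeakRefChurn {
    public static void main(String[] args) throws Exception {
        List<WeakReference<byte[]>> refs = new ArrayList<WeakReference<byte[]>>();
        while (true) {
            // allocate short-lived garbage that carries a Reference, so the collector
            // has extra work to do and some of it ends up promoted
            refs.add(new WeakReference<byte[]>(new byte[1024]));
            if (refs.size() > 10000) {
                refs.clear();
            }
            Thread.sleep(1); // pace it like a GUI repaint loop rather than a tight spin
        }
    }
}

Run it with -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails and watch what happens to the rest of the machine when the CMS sweep kicks in.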

This issue is very important to resolve for client-side Java. I imagine Sun does a lot of their GC tests server-side with huge heaps where they just want the GC to go as fast as possible and they aren’t paying any attention to how it affects other processes.

The garbage that’s being generated by the JProgressBar in the Windows L&F is likely to be the image cache that components in the Windows L&F use when rendering.

I’m not entirely sure how it works in Swing, but it tries to cache some of the pieces of the components depending on their state. Think of it this way: the L&F code asks Windows to render a button to an offscreen image, and then uses that image to display the button. This image is cached. There are several states of the button (rolled over, disabled, etc.), so there are multiple images per component. But when a button is resized, some of those images become invalid and have to be thrown away…

Since the progress bar is constantly changing, they’re probably generating and throwing away a lot of images.

Now those images’ native resources are disposed of using a reference queue which is processed on a special Disposer thread (you can see it in the stack trace as Java2D Disposer). This is done to avoid the use of finalizers and speed up the disposal. Note that this disposer thread is a high priority thread, which may cause the starvation of other threads if plenty of garbage is generated and the thread is busy getting rid of it.
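The general shape of the mechanism is something like this (just a sketch of the reference-queue pattern, not the real sun.java2d.Disposer code):

import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.HashMap;
import java.util.Map;

class DisposerSketch implements Runnable {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
    private final Map<Reference<?>, Runnable> records = new HashMap<Reference<?>, Runnable>();

    // Associate a cleanup action with the target; when the target dies,
    // its reference shows up on the queue and the action is run.
    synchronized void register(Object target, Runnable disposeAction) {
        records.put(new PhantomReference<Object>(target, queue), disposeAction);
    }

    public void run() {
        while (true) {
            try {
                Reference<?> ref = queue.remove(); // blocks until something has been collected
                Runnable action;
                synchronized (this) {
                    action = records.remove(ref);
                }
                if (action != null) {
                    action.run(); // free the native resource
                }
                ref.clear();
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

The thread this runs on is created as a high-priority daemon, which (per the above) is what lets it starve everything else when a lot of images die at once.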

[added] Also, if those images were accelerated (that is, cached in vram, which they’re likely to be), then the disposal may involve DirectDraw/Direct3D, which means some global system-wide locks may be taken by DirectX, which could worsen the situation…

I’ve pointed the Swing folks at this thread; maybe they can shed more light on why the JProgressBar’s caching scheme is so suboptimal in this case.

OK, I’ve tried several microbenchmarks to hammer the GC, including SoftReferences, WeakReferences, finalizers, and even creating BufferedImages, calling getGraphics on them and failing to dispose of it.

In all cases the CPU shoots up to 100% as expected, but the jerking does not occur. This means that it’s not a trivial GC issue; it’s the interaction of the GC with some other part of the system that is causing the problem. It might possibly be something to do with the JIT.

All this Swing and AWT stuff is a red herring; as I say, Eclipse makes no use of any of it whatsoever.

Cas :slight_smile:

I think the thread priority of the Java2D Disposer needs to be reworked. It appears to be having a catastrophic effect on the system. I’ve selected the CMS collector to reduce pause times, but I’m guessing that during the sweep phase of the CMS collector the Java2D Disposer is signalled and very nasty stuff happens that brings down the system for 100+ms.

Why must the Java2D Disposer run at a high priority? Better yet, can I hack things to lower the priority? Hmmm.
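One hack that might work, if the security manager allows it (purely untested guesswork, and the exact thread name may differ between JREs):

public class DisposerPriorityHack {
    // Walk up to the root thread group, find the thread whose name contains
    // "Disposer", and knock its priority back down to normal.
    public static void lowerDisposerPriority() {
        ThreadGroup root = Thread.currentThread().getThreadGroup();
        while (root.getParent() != null) {
            root = root.getParent();
        }
        Thread[] threads = new Thread[root.activeCount() * 2];
        int count = root.enumerate(threads, true);
        for (int i = 0; i < count; i++) {
            if (threads[i] != null && threads[i].getName().indexOf("Disposer") != -1) {
                threads[i].setPriority(Thread.NORM_PRIORITY);
            }
        }
    }
}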

Cas, I think your experiment might have failed because it did not create work for this Java2D disposer.

I had hoped to - I created tons of RGB BufferedImages and got the graphics out of each one and rendered a black rectangle into it.
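Roughly this shape of thing (from memory, not the exact code):

import java.awt.Color;
import java.awt.Graphics;
import java.awt.image.BufferedImage;

public class ImageChurn {
    public static void main(String[] args) {
        while (true) {
            // churn out images and Graphics contexts, never disposing them,
            // to give the GC (and hopefully the Java2D Disposer) plenty to do
            BufferedImage img = new BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB);
            Graphics g = img.getGraphics();
            g.setColor(Color.BLACK);
            g.fillRect(0, 0, 256, 256);
            // deliberately no g.dispose()
        }
    }
}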

But it’s still a red herring. SWT doesn’t use the Java2D disposer, yet Eclipse is one of the worst sufferers of the jerking.

Cas :slight_smile:

[quote]Why must the Java2D Disposer run at a high priority? Better yet, can I hack things to lower the priority? Hmmm.
[/quote]
Because otherwise a thread which creates tons of images will starve the collector and we’ll get an OOME.
This is a bug that the introduction of the Disposer addressed.

If you can build the Mustang workspace, you can easily experiment with the Disposer.
The file is j2se/src/share/classes/sun/java2d/Disposer.java.
You don’t need to rebuild the whole workspace, though: you can just hack it, compile the file to some directory, and prepend it to the bootclasspath.
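In other words, something along these lines (the exact paths depend on your workspace layout, so adjust as needed):

javac -sourcepath j2se/src/share/classes -d patched j2se/src/share/classes/sun/java2d/Disposer.java
java -Xbootclasspath/p:patched -cp . YourApp

where YourApp is whatever you are testing with; the patched class then takes precedence over the one in rt.jar.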

Dmitri

PROBLEM SOLVED

(I’d appreciate it if you could escalate this for me, Trembo/Jeff.)

I got suspicious about the whole thread priority issue so I looked at javaw.exe through PView on NT, and lo and behold, there were several time-critical, priority 15 threads running!!! DUH!!! I’m not sure what nincompoop decided that messing around with the thread priority mappings between JDK major releases was in any way wise at all.

Now it occurs to me that the very same problem manifested itself when my games stopped responding to mouse and keyboard input on some systems. I used to run the main loop at Thread.MAX_PRIORITY, but on Java 5.0 the main loop was mapped to time-critical priority - so DirectX never got to poll the bloody keyboard and mouse!

The fix is simple: remap the offending Java priorities to more reasonable NT ones:

-XX:JavaPriority10_To_OSPriority=10 -XX:JavaPriority9_To_OSPriority=9
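For Eclipse that just means putting them after -vmargs, either on the command line or in the shortcut - something like this, assuming the standard launcher passes everything after -vmargs straight through to the VM:

eclipse.exe -vmargs -XX:JavaPriority10_To_OSPriority=10 -XX:JavaPriority9_To_OSPriority=9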

Please patch this in the next minor update of 1.5.0 as it has such terrible and wide-ranging mysterious consequences.

Eclipse now runs better than it has ever, ever done - it feels utterly smooth and native now. I’m flabbergasted at the transformation from powerful-but-clunky-and-jerky IDE into butter-smooth ultraIDE.

Cas :slight_smile:

After a little research what I find interesting is that priority 15 is in a different priority class!

On Win32:

NORMAL is priority 8.

For a Win32 process the thread priorities can only be adjusted +/- 2 within a priority class. So theoretically if java.exe is running as a normal priority application the max thread priority would usually be 10 anyway. That would correspond to setting the thread to a relative priority of HIGHEST within a NORMAL priority class.

It seems, though, that the JRE is using the TIME_CRITICAL setting to rail the priority up to the highest possible for non-realtime processes (15). (You need to have admin rights to go above 15 and run as a realtime process.) Interestingly enough, Java at priority 15 was able to starve our driver thread at priority 31… that either says something about the quality of the Windows scheduler, or means there are icky things happening with driver calls and/or system-wide locks in the Java threads.

The first thing to suspect, of course, whenever an application causes stuff to starve or freeze is the graphics drivers. They are notorious for being very unfriendly to the system. We are having a hell of a time finding systems that don’t lock up after a few hours. Even with noddraw=true the graphics drivers will occasionally get stuck in an infinite loop. What is quite cool is that I got a real error message from Windows the last time this happened: it actually said in plain English (after a reboot) that the nVidia driver (nv_disp) was stuck in an infinite loop.

See http://emea.windowsitpro.com/Windows/Article/ArticleID/302/302.html