Taming Java GC to prevent stutter, an in-depth post on memory management

Riven is working on just that: support for structs that can live on the stack.

The thing about the blog post is that it’s not providing enough information to be useful to anyone beyond the author. Here’s why.

The JRE version isn’t specified; we don’t know whether it’s 32-bit or 64-bit, or whether any parameters have been set. We don’t know if the game is multithreaded or not. All of this basic information is important.

My experience is that the majority of the time, people blaming unexpected pauses on the GC simply haven’t really examined the problem, and in fact it’s not the GC at all. The target audience of the post will have no clue how to identify whether the GC is performing a stop-the-world event that causes the observed behavior. So the post should specify the method used (or link to one) to identify the problem. Anyone that doesn’t perform this step is simply randomly modifying code in an unreasonable attempt to correct a problem which may in fact not exist.
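For instance, a low-effort way to confirm (or rule out) stop-the-world pauses is to turn on GC logging and compare the stopped times against the observed hitches. The flags below are the pre-JDK-9 ones (this thread’s era); the game class name is a placeholder:

```shell
# Sketch, assuming a JDK 7/8-era HotSpot; on JDK 9+ use -Xlog:gc* instead.
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -XX:+PrintGCApplicationStoppedTime \
     YourGame
```

If the “Total time for which application threads were stopped” lines never line up with the stutters you see, the GC probably isn’t the culprit.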

Now the post claims to see stop-the-world events taking up to 30 ms. That’s a really long pause for a game runtime and well beyond my expectations. Modern GCs are quite good at not requiring long stop-the-world events; we’ve come a long way since semi-space collectors. See here for an overview of the G1 collector, for instance. There are tons of resources on the behavior of the various GCs. To see such a long pause, I’d guess that the app might be spamming short-lived small objects (which are not deduced to be non-escaping), which leads to heap fragmentation, until along comes a larger allocation which can’t be serviced without compacting. If the game was otherwise performing fine, then the first step should have been to tweak GC parameters rather than modifying the code: low risk and a small time requirement even if it doesn’t pay off.
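By “tweak GC parameters” I mean low-risk knobs like the following. The flags are standard HotSpot options, but the numbers here are purely illustrative; the right values depend entirely on the app:

```shell
# Sketch: fix the heap size, give the young generation room so short-lived
# objects die cheaply, and/or state a pause-time goal for the collector.
java -Xms512m -Xmx512m \
     -Xmn256m \
     -XX:MaxGCPauseMillis=10 \
     YourGame
```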

Your methodology is potentially flawed: it ignores escape analysis AND you’re running in the IDE. You need at least a nod to escape analysis. You have to ensure that allocating methods have run at least CompileThreshold times (or be dumping compilations) to know that you’re seeing allocations that will actually happen after warm-up. The issue with the IDE is that it can prevent a method from being compiled (details vary) and therefore explode time requirements and allocations.
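One way to check the warm-up point outside the IDE is to watch the JIT log directly; both flags are standard HotSpot options (the threshold shown is the server VM default, adjust as needed):

```shell
# -XX:+PrintCompilation logs each method as it gets compiled;
# only measure allocations after the hot methods have shown up here.
java -XX:+PrintCompilation -XX:CompileThreshold=10000 YourGame
```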

As for the correction steps taken: nuking the “Direction” class makes sense. It was more work to do it that way in the first place and it doesn’t have any upsides. Beyond the burden it causes the GC, it introduces unnecessary indirection and significantly increases random memory reads/writes… all pretty undesirable and big time sinks. Likewise for the Color class… a silly code design choice. As for the for-each statements, it seems very unlikely they have a meaningful impact after warm-up.

Hm… I agree. theagentd had 7ms pauses with a GB of garbage… so 30 ms would mean even more garbage. I honestly don’t believe the author anymore … :persecutioncomplex:

Don’t misunderstand me. The author could be correct about the STW event and the “fix”, even though the method as stated was flawed, and the STW pauses may well have been reduced. What I’m saying is that I don’t have the information to take it as a given, and the post needs to be beefed up to be useful to its target audience. Small memory allocations in Java are undesirable, and it’s a reasonable idea to avoid them if the amount of work is roughly the same.

Do both of them have the same system specs ?

On the subject of escape analysis… I am becoming sceptical that it is working at all. theagentd’s quite right in that all the allocations should be stack-allocated as they’re going nowhere. Why are the heap allocs not being replaced with stack allocs? What’s the magic to print out the appropriate debug information?

Cas :slight_smile:

TheAgentD has at least one awesome system fwiw… his 7ms would definitely take over 20ms on my system:

http://www.headline-benchmark.com/results/0f004cfc-1942-4647-97cd-1a3970ade933

… although maybe he’s traded it in for a netbook :slight_smile:

It’s virtually impossible to keep up with changes; you’d pretty much have to keep up-to-date reading the dev mailing list, and looking through the code is a complete nightmare. It was working reasonably well the last time I explicitly looked.

Ideally, use a debug build (any takers to build and maintain one?). Then you could use: -XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations (well, assuming they are still there… not a given).

Production options that come to mind:
+PrintAssembly to examine specific methods
+BCEATraceLevel: set to 3 (it looks like) for max info spewing
+MaxBCEAEstimateLevel: number of nested calls to inspect.
+MaxBCEAEstimateSize: max bytecode size to be considered.

The source is: /hotspot/src/share/vm/ci/bcEscapeAnalyzer.cpp
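Pulling those together, a possible invocation might look like this. Flag availability and defaults vary by build, so treat this as a sketch rather than a recipe (the last two values are raised from their usual defaults of 5 and 150; the class name is a placeholder):

```shell
# Sketch: crank up bytecode escape analysis trace output and search depth.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:BCEATraceLevel=3 \
     -XX:MaxBCEAEstimateLevel=9 \
     -XX:MaxBCEAEstimateSize=300 \
     YourGame
```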

Some of those complex collision detection functions can be a nightmare for the GC, to the point where I modified some of them to use local floating-point x, y, z values, static reusable Vector3 objects, and manual math where applicable. A complete nightmare to write and an even bigger nightmare to maintain, and it makes for some of the ugliest code; however, it’s for the sake of performance… I’m even looking at OpenCL alternatives for some of those functions.
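The pattern being described is roughly this (a hypothetical sketch, not the actual engine code): the same math written once with a throwaway temporary per call, and once with plain locals that produce no garbage at all.

```java
// Sketch of the allocation-avoidance pattern: compute a squared distance
// with plain local floats instead of allocating temporaries per call.
public class NoAllocMath {

    // Allocating version: creates a temporary array (garbage) on every call.
    static float distSqAlloc(float ax, float ay, float az,
                             float bx, float by, float bz) {
        float[] d = new float[] { ax - bx, ay - by, az - bz }; // per-call garbage
        return d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
    }

    // Allocation-free version: identical math, locals only, nothing for the GC.
    static float distSqLocal(float ax, float ay, float az,
                             float bx, float by, float bz) {
        float dx = ax - bx, dy = ay - by, dz = az - bz;
        return dx * dx + dy * dy + dz * dz;
    }

    public static void main(String[] args) {
        System.out.println(distSqAlloc(1, 2, 3, 4, 6, 8));
        System.out.println(distSqLocal(1, 2, 3, 4, 6, 8));
    }
}
```

Ugly to scale up across a whole collision module, as noted, but the locals version gives the GC literally nothing to do.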

From my limited experience with the sun/oracle vm, I believe you are correct. There’s plenty of talk on the internet about escape analysis and the JVM, but I don’t think there’s much of anything actually implemented in the JVM. (i.e. it’s mostly speculative hype :cranky:).
I’d been annoyed lately that I had previously written a good chunk of my collision detection code to take floats, but this thread is a good reminder that it was probably a good way to design it.

I’ve experimented with OpenCL, and on my computer (3 GHz Athlon II, GTS 450) it’s often faster to send large chunks of calculations to the GPU than to wait around for the CPU to run them (of course there’s the extra work of having to write an OpenCL program to model your calculations, but it’s often worth it for performance-critical sections of code).
Especially calculations involving trig and vector or matrix operations, where Java’s strict IEEE 754 compliance tends to hurt performance without benefiting game-oriented data that can accept non-exact floating-point results.

There’s a basic tutorial on LWJGL.org’s wiki about OpenCL: http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL

[quote=“TeamworkGuy2,post:22,topic:48253”]
Here is some code that is actually triggering the escape analysis to enable scalar replacement:

(after warming the VM… waiting for the performance to peak, averaging the results of a few runs…)

-XX:-DoEscapeAnalysis (disabled)


testVec3: 3676ms
testLocals3: 652ms

-XX:+DoEscapeAnalysis (enabled)


testVec3: 641ms
testLocals3: 644ms


	private static void testVec3(float start) {
		Vec3 vec3 = new Vec3(start, start, start);
		for(int i = 0; i < iterations; i++) {
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
		}
	}

	private static void testLocals3(float start) {
		float vx = start;
		float vy = start;
		float vz = start;

		for(int i = 0; i < iterations; i++) {
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
		}
	}

So, yeah, it works… but it’s rare to the point of being damn lucky to see it in a real-world case.
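For reference, the toggle used in the numbers above is just a pair of runs like this (the benchmark class name is a placeholder):

```shell
# Sketch: run the same benchmark with escape analysis off and on.
java -XX:-DoEscapeAnalysis VecBench   # EA off
java -XX:+DoEscapeAnalysis VecBench   # EA on (the default in modern HotSpot)
```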

[quote=“theagentd,post:9,topic:48253”]

Not quite 7ms (your frame time was 7ms), but maybe it was a poor sample set. It’s actually not so surprising that the GC can clean this up within one millisecond, because it really doesn’t have to do much. As you know, the GC doesn’t collect any garbage; it merely traces ‘live’ object references, moves groups of reachable objects to a new region in the heap, and flags everything else ‘free’ (a region at a time, since the Garbage-First collector). Your use case seems to be the ideal case for the latest collector. Having said that, escape analysis would give you your 1ms back :slight_smile:

Well good to know, I tend to make a fool of myself when I mention my experiences without researching first :stuck_out_tongue:

I notice (huge) performance improvements when pooling big stuff like ByteBuffers.
So in this situation it could be handy, or am I wrong?

Right, bad example output. >_> The real problem isn’t the stop-the-world pauses, but the interference the garbage collector causes in my multithreaded engine. Although the stop-the-world time is pretty low, the garbage collector threads take up quite a bit of time on multiple CPU cores. Here’s some more interesting output:


Frame time: 7.473 ms
Render time: 0.91 ms
[GC 100099K->13611K(259072K), 0.0005243 secs]
-Spike: 22.331825
[GC 98091K->13667K(257024K), 0.0005799 secs]
-[GC 96099K->13611K(273920K), 0.0074180 secs]
-Spike: 44.528187
[GC 112939K->13747K(293888K), 0.0004801 secs]
[GC 133043K->13795K(317952K), 0.0005644 secs]
[GC 157155K->13795K(346624K), 0.0004909 secs]
[GC 185827K->13843K(381440K), 0.0005881 secs]
[GC 220691K->13843K(422912K), 0.0005573 secs]
[GC 262163K->13899K(472576K), 0.0005559 secs]

Frame time screenshot:

A normal frame takes 7.473 ms to render (CPU time). The first spike is the 22.331825 ms spike, the second one is the 44.528187 ms spike as seen in the graph and the output. Also note the 7.4ms GC spike. The frame time is also suffering from much more micro-stuttering than usual. The graph should be pretty much flat (except for the GC spike), not this jumpy.

Special case reuse of objects will be manageable for most people so yeah this is a more than reasonable thing to do.

Ages ago, in one of our “what features does Java need” threads, I talked about contracts, and one of the examples I gave was one for the compiler: @NoReference, which would allow a programmer to explicitly state where a reference cannot escape. This suggestion requires the verifier to make some trivial checks, and then EA doesn’t have to deduce anything about the marked reference. There are non-escaping objects that the deduction will never catch (because it would take too much time and/or memory) but that the programmer can know cannot escape. Win.
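To make the idea concrete, such a contract might look like this. To be clear, no such annotation exists in any JDK; this is purely a sketch of the proposal:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: a programmer's promise, checkable by the
// verifier, that the annotated parameter never escapes the method.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.PARAMETER)
@interface NoReference {}

class Example {
    // With the promise in place, EA would not need to deduce anything
    // about 'v'; scalar replacement could be applied unconditionally.
    static float lengthSq(@NoReference float[] v) {
        return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    }
}
```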

AIUI from the Excelsior guys, escape analysis is linear-time to calculate, so I don’t think there’s a case where it’ll take “too long” to detect escapes, which means I think that directive might be redundant. The real problem is just that the HotSpot implementation either doesn’t work right or needs more aggressive tuning parameters, e.g. increasing the bytecode depth; perhaps the default is unrealistically shallow. (I’m thinking here that they’ve optimised the server VM for its most usual use case, which is EJB deployments, where they probably don’t care quite so much about small 20ms GC spikes as we game devs do.)

BTW, anything with a finalize() method on it, or that is referenced by any form of *Reference, causes GC on such objects to “slow down” by like a factor of “100” (random high figure selected from thin air), as such objects fall outside the usual usage pattern of Java objects. This means DirectByteBuffers in particular, but if you’re constructing and forgetting DirectByteBuffers every frame, you’re not using them as intended anyway.
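The intended usage is the reuse pattern mentioned earlier in the thread: allocate one direct buffer up front and reset it each frame, rather than constructing and forgetting one per frame. A minimal sketch (class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: one long-lived direct buffer, reset per frame, so the GC never
// has to deal with a stream of Reference-tracked DirectByteBuffers.
class FrameBuffer {
    private final ByteBuffer buf =
        ByteBuffer.allocateDirect(1024).order(ByteOrder.nativeOrder());

    // Resets position/limit for a fresh frame; no new allocation happens.
    ByteBuffer beginFrame() {
        buf.clear();
        return buf;
    }
}
```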

Cas :slight_smile:

[quote=“princec,post:29,topic:48253”]
Further musing on this… they have probably run some profiling on some “test cases” and determined that the extra compilation time spent doing escape analysis eventually overtakes the total time spent in GC beyond a certain point, and beyond that it’s simply more sensible to let the latest GC do its thing. Just speculation, mind.

FWIW I’ve had recent experience of the G1GC in Battledroid, and it cut my framerates in half. So there we go.

Cas :slight_smile:

The problem is with runtime compiling. Consider a reference passed to an instance method. If it’s polymorphic, you have to check that none of the various methods cause the reference to escape, you have to track it in the deoptimization framework, and you have to reverify each time a new type is loaded which might override the given method (and might be called at the callsite in question). All of these things are possible, but they don’t happen ATM. Each extra case you add increases the burden on the runtime compiler and on what the deoptimization framework tracks. Life’s a lot easier if you’re AOT.

Tossed this together since at least a couple of people seem interested: http://www.java-gaming.org/topics/escape-analysis/32920/view.html

[quote=“princec,post:29,topic:48253”]
The problem with direct ByteBuffers has been resolved, by making the allocating thread help with deallocations. Hopefully it’ll be backported to JDK 7/8 soon.

This will have an effect on allocation performance, so princec’s advice remains important.