Taming Java GC to prevent stutter: an in-depth post on memory management

Some of those complex collision detection functions can be a nightmare for the GC, to the point where I modified some of them to use local floating-point x, y, z values, static reusable Vector3 objects, and manual math where applicable. It was a complete nightmare to write and is an even bigger nightmare to maintain, and it makes for some of the ugliest code, but such is the price of performance… I’m even looking into OpenCL alternatives for some of those functions.
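Roughly the style I mean, sketched here with made-up names rather than my actual code, assuming a mutable Vector3 with public x, y, z fields:

class SweepTest {
	// Static scratch vector, reused across calls so nothing is allocated
	// per test. Note: this makes the method non-reentrant and not thread-safe.
	private static final Vector3 TMP = new Vector3();

	// Sphere-vs-AABB test written with local floats and manual math.
	static boolean intersects(Vector3 aMin, Vector3 aMax, Vector3 center, float r) {
		// Closest point on the box to the sphere center, via local floats.
		float cx = Math.max(aMin.x, Math.min(center.x, aMax.x));
		float cy = Math.max(aMin.y, Math.min(center.y, aMax.y));
		float cz = Math.max(aMin.z, Math.min(center.z, aMax.z));
		// Reuse the scratch vector instead of allocating a difference vector.
		TMP.x = cx - center.x;
		TMP.y = cy - center.y;
		TMP.z = cz - center.z;
		return TMP.x * TMP.x + TMP.y * TMP.y + TMP.z * TMP.z <= r * r;
	}
}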

From my limited experience with the Sun/Oracle VM, I believe you are correct. There’s plenty of talk on the internet about escape analysis and the JVM, but I don’t think much of it is actually implemented in the JVM (i.e. it’s mostly speculative hype :cranky:).
I’d been annoyed lately that I had previously written a good chunk of my collision detection code to take floats, but this thread is a good reminder that it was probably a good way to design it.

I’ve experimented with OpenCL, and on my computer (3 GHz Athlon II, GTS 450) it’s often faster to send large chunks of calculations to the GPU than to wait around for the CPU to run them (of course there’s the extra work of having to write an OpenCL program to model your calculations, but it’s often worth it for performance-critical sections of code).
This is especially true for calculations involving trig and vector or matrix operations, where Java’s strict IEEE 754 compliance tends to hurt performance without benefiting game-oriented data that can tolerate inexact floating-point results.

There’s a basic tutorial on LWJGL.org’s wiki about OpenCL: http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
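To give a flavour, the device-side code for that kind of batched math is just a small OpenCL C kernel; here’s one embedded as a Java string (purely illustrative, with invented names; the host-side setup with LWJGL’s CL binding is what the tutorial above covers):

class Kernels {
	// OpenCL C source: offsets and scales n floats in parallel, one
	// work-item per element. Compiled at runtime by the OpenCL driver.
	static final String SCALE_ADD =
		"kernel void scaleAdd(global float* v, const float s, const int n) {\n" +
		"    int i = get_global_id(0);\n" +
		"    if (i < n) v[i] = (v[i] + 1.0f) * s;\n" +
		"}\n";
}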

[quote=“TeamworkGuy2,post:22,topic:48253”]
Here is some code that actually triggers escape analysis and gets scalar replacement:

(after warming the VM… waiting for the performance to peak, averaging the results of a few runs…)

-XX:-DoEscapeAnalysis (disabled):

testVec3:    3676 ms
testLocals3:  652 ms

-XX:+DoEscapeAnalysis (enabled):

testVec3:     641 ms
testLocals3:  644 ms


	private static void testVec3(float start) {
		Vec3 vec3 = new Vec3(start, start, start);
		for(int i = 0; i < iterations; i++) {
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
			vec3.add(new Vec3(1, 2, 3)).mul(new Vec3(0.75f, 0.75f, 0.75f));
		}
	}

	private static void testLocals3(float start) {
		float vx = start;
		float vy = start;
		float vz = start;

		for(int i = 0; i < iterations; i++) {
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
			{
				vx += 1;
				vy += 2;
				vz += 3;

				vx *= 0.75f;
				vy *= 0.75f;
				vz *= 0.75f;
			}
		}
	}
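For completeness, here is a minimal Vec3 consistent with the benchmark above; it assumes, as the chained calls suggest, that add and mul mutate in place and return this, and that iterations is a static field elsewhere in the class:

class Vec3 {
	float x, y, z;

	Vec3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

	// Mutate in place and return this, so the chains create no intermediate
	// results; only the temporary argument Vec3s are candidates for
	// scalar replacement.
	Vec3 add(Vec3 o) { x += o.x; y += o.y; z += o.z; return this; }

	Vec3 mul(Vec3 o) { x *= o.x; y *= o.y; z *= o.z; return this; }
}

(In a real benchmark you’d also want to consume vec3 and the local floats after each loop, so the JIT can’t dead-code-eliminate the work.)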

So, yeah, it works… but it’s rare to the point that you’d be damn lucky to see it in a real-world case.

[quote=“theagentd,post:9,topic:48253”]

Not quite 7 ms (your frame time was 7 ms), but maybe it was a poor sample set. It’s actually not so surprising that the GC can clean this up within a millisecond, because it really doesn’t have to do much. As you know, the GC doesn’t collect any garbage; it merely traces ‘live’ object references, moves groups of reachable objects to a new region in the heap, and flags everything else ‘free’ (one heap region at a time, since the Garbage-First collector). Your use case seems to be the ideal case for the latest collector. Having said that, escape analysis would give you your 1 ms back :slight_smile:

Well, good to know. I tend to make a fool of myself when I mention my experiences without researching first :stuck_out_tongue:

I notice (huge) performance improvements when pooling big stuff like ByteBuffers.
So in this situation pooling could be handy, or am I wrong?
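What I mean is something along these lines, as a sketch (illustrative names, single-threaded use only, one fixed capacity for all buffers):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

class BufferPool {
	private final ArrayDeque<ByteBuffer> free = new ArrayDeque<ByteBuffer>();
	private final int capacity;

	BufferPool(int capacity) { this.capacity = capacity; }

	// Hand out a pooled buffer, allocating a new direct one only when empty.
	ByteBuffer acquire() {
		ByteBuffer b = free.poll();
		return b != null ? b : ByteBuffer.allocateDirect(capacity);
	}

	void release(ByteBuffer b) {
		b.clear(); // reset position/limit before the next user
		free.push(b);
	}
}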

Right, bad example output. >_> The real problem isn’t the stop-the-world pauses but the interference garbage collection causes in my multithreaded engine. Although the stop-the-world time is pretty low, the garbage collector’s threads take up quite a bit of time on multiple CPU cores. Here’s some more interesting output:


Frame time: 7.473 ms
Render time: 0.91 ms
[GC 100099K->13611K(259072K), 0.0005243 secs]
-Spike: 22.331825
[GC 98091K->13667K(257024K), 0.0005799 secs]
-[GC 96099K->13611K(273920K), 0.0074180 secs]
-Spike: 44.528187
[GC 112939K->13747K(293888K), 0.0004801 secs]
[GC 133043K->13795K(317952K), 0.0005644 secs]
[GC 157155K->13795K(346624K), 0.0004909 secs]
[GC 185827K->13843K(381440K), 0.0005881 secs]
[GC 220691K->13843K(422912K), 0.0005573 secs]
[GC 262163K->13899K(472576K), 0.0005559 secs]

Frame time screenshot:

A normal frame takes 7.473 ms to render (CPU time). The first spike is the 22.331825 ms one and the second is the 44.528187 ms one, as seen in both the graph and the output. Also note the 7.4 ms GC pause. The frame time is also suffering from much more micro-stuttering than usual; the graph should be pretty much flat (except for the GC spike), not this jumpy.
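(For reference, the [GC …] lines above are the standard -verbose:gc log format, so output like this can be reproduced by launching with that flag, e.g.

java -verbose:gc -jar game.jar

where game.jar is a placeholder; -XX:+PrintGCDetails can be added for a per-space breakdown.)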

Special-case reuse of objects is manageable for most people, so yeah, this is a more than reasonable thing to do.

Ages ago, in one of our “what features does Java need” threads, I talked about contracts, and one of the examples I gave was one for the compiler: @NoReference, which would allow a programmer to explicitly state that a reference cannot escape. This suggestion only requires the verifier to make some trivial checks, and then EA doesn’t have to deduce anything about the marked reference. There are non-escaping objects the deduction will never catch (because it would take too much time and/or memory) but that the programmer can know cannot escape. Win.
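Purely hypothetical, since no such annotation exists in any JDK, but the contract I mean would look something like:

import java.lang.annotation.*;

// Hypothetical annotation: the programmer promises, and the verifier
// trivially checks, that the marked parameter is never stored to a field,
// an array, or any other escaping location.
@Target(ElementType.PARAMETER)
@Retention(RetentionPolicy.CLASS)
@interface NoReference {}

class Collider {
	// EA doesn't need to deduce anything here: the annotation guarantees
	// 'tmp' cannot escape, so the JIT could scalar-replace it outright.
	boolean test(@NoReference Vec3 tmp) {
		// ... use tmp strictly locally ...
		return false;
	}
}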

AIUI from the Excelsior guys, escape analysis is linear-time to compute, so I don’t think there’s a case where it’ll take “too long” to detect escapes. That makes me think the directive might be redundant, and that the real problem is simply that the HotSpot implementation either doesn’t work right or needs more aggressive tuning parameters, e.g. a larger bytecode depth; perhaps the default is unrealistically shallow. (I suspect they’ve optimised the server VM for its most common use case, EJB deployments, where nobody cares as much about small 20 ms GC spikes as we game devs do.)

BTW, anything with a finalize() method, or anything referenced by any form of *Reference, causes GC of such objects to “slow down” by, say, a factor of 100 (random high figure selected from thin air), as they fall outside the usual usage pattern of Java objects. This means DirectByteBuffers in particular, but if you’re constructing and forgetting DirectByteBuffers every frame you’re not using them as intended anyway.
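The intended pattern is more like this sketch (illustrative names): allocate the direct buffer once, then reuse it every frame:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class FrameUploader {
	// Allocated once; DirectByteBuffers are expensive to create and,
	// being Reference-tracked, expensive for the GC to reclaim.
	private final ByteBuffer scratch =
			ByteBuffer.allocateDirect(1 << 20).order(ByteOrder.nativeOrder());

	void uploadFrame(float[] vertices) {
		scratch.clear(); // reset position/limit for reuse
		scratch.asFloatBuffer().put(vertices);
		// ... pass 'scratch' to the native binding here ...
	}
}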

Cas :slight_smile:

[quote=“princec,post:29,topic:48253”]
Further musing on this… they have probably run some profiling on some “test cases” and determined that, beyond a certain point, the extra compilation time spent doing escape analysis overtakes the total time spent in GC, past which it’s simply more sensible to let the latest GC do its thing. Just speculation, mind.

FWIW I’ve had recent experience of the G1GC in Battledroid, and it cut my framerate in half. So there we go.

Cas :slight_smile:

The problem is with runtime compilation. Consider a reference passed to an instance method: if the call is polymorphic, you have to check that none of the possible target methods lets the reference escape, you have to track the result in the deoptimization framework, and you have to reverify each time a new type is loaded that might override the given method (and might be called at the callsite in question). All of these things are possible, but they don’t happen at the moment. Each extra case you add increases the burden on the runtime compiler and on what the deoptimization framework must track. Life’s a lot easier if you’re AOT.
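To illustrate the polymorphic case (class names invented for the example):

abstract class Shape {
	abstract void accumulate(Vec3 v);
}

class Sphere extends Shape {
	void accumulate(Vec3 v) { /* uses v locally; v does not escape */ }
}

class Mesh extends Shape {
	static Vec3 lastSeen;
	void accumulate(Vec3 v) { lastSeen = v; } // field store: v escapes
}

class Demo {
	static void update(Shape s) {
		Vec3 v = new Vec3(1, 2, 3);
		// While only Sphere is loaded, the JIT can prove 'v' never escapes
		// and scalar-replace it. Once Mesh is loaded, that proof is invalid
		// and any compiled code relying on it must be deoptimized.
		s.accumulate(v);
	}
}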

Tossed this together since at least a couple of people seem interested: http://www.java-gaming.org/topics/escape-analysis/32920/view.html

[quote=“princec,post:29,topic:48253”]
The problem with direct ByteBuffers has been resolved by making the allocating thread help with deallocations. Hopefully it’ll be backported to JDK 7/8 soon.

This will have an effect on allocation performance, so princec’s advice remains important.