Taming Java GC to prevent stutter, an in-depth post on memory management

Spurred on by this post, here is an in-depth post on memory management in Airships.

Excerpt:

The game is written in Java, where unused data is automatically deleted in a process called garbage collection. This is nice, but comes with a big drawback: you can’t control when garbage collection (GC) happens, and it can take too long.

The game runs at 60 frames per second, giving each frame a bit more than 16 milliseconds to advance the game state and redraw the screen. If GC kicks in during a frame and takes 30 milliseconds, the game stutters noticeably.

So this is something that needs fixing.

Read the rest
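(One way to confirm that GC is actually eating into the frame budget is the standard `GarbageCollectorMXBean` API; a minimal sketch, where the churn loop is just a hypothetical stand-in for a frame's allocations:)

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcBudgetCheck {
    public static void main(String[] args) {
        long before = totalGcMillis();

        // Hypothetical garbage generator standing in for a frame's allocations.
        for (int i = 0; i < 1_000_000; i++) {
            String s = new String("frame-garbage-" + i);
            if (s.isEmpty()) System.out.println(s); // keep the allocation live
        }

        System.out.println("GC time during loop: " + (totalGcMillis() - before) + " ms");
    }

    // Sum of accumulated collection time across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if unsupported
            if (t > 0) total += t;
        }
        return total;
    }
}
```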

From a quick skim this all looks questionable. About the only reason I’d use object pooling would be to simulate structs, given that same-sized objects allocated before the first GC end up laid out in address order.

Are the modern JVM and CPU so efficient that object pooling is almost obsolete, and home-brew object pooling is more likely to cause performance issues than resolve them?

“Questionable” implies the OP and the blog post are expressing a subjective viewpoint, whereas s/he has cold, hard evidence of a bottleneck and a means of measuring whether the solution resolves or mitigates it (or even makes it worse). Object pooling (or any other performance ‘improvement’) has its place if there is a real performance bottleneck due to GC, especially on mobile devices.

What do you mean by simulate structures btw?

On the other hand, this is also valid: the more complex the code, the more risk you add, and of course pooling can lead to nasty memory leaks if implemented poorly.

I know the rule is “premature optimization is the root of all evil”, but I think this post is actually quite nice and describes a well-targeted (if possibly not that necessary) optimization.

It would be nice, however, to measure the improvement in terms of reduction in GC time spent, but that’s pretty hard to do.

The JVM is much better than it used to be: it can detect non-escaping objects and apply scalar replacement. Object pooling (we’re assuming HotSpot here… and ignoring Android) is difficult to get right, and most people will slow things down for all their extra effort, in some cases drastically (in a relative sense).
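A minimal illustration of what that means (class and numbers here are made up for illustration): in a hot loop like this, HotSpot’s escape analysis can see the temporary `Point` never escapes the method, so after JIT compilation it can be scalar-replaced by plain locals and no heap allocation happens at all.

```java
public class ScalarReplacementDemo {
    // Simple value-like class; instances below never escape sumDistances().
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int manhattan() { return Math.abs(x) + Math.abs(y); }
    }

    // After warm-up, the Point allocation is a candidate for elimination:
    // the JIT can replace it with two local ints (scalar replacement).
    static long sumDistances(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, -i); // does not escape
            sum += p.manhattan();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumDistances(1000)); // 999000
    }
}
```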

“Questionable” here means: not credible. There’s no evidence for the cause of the problem, nor that the solution actually addresses it. Great that the apparent problem went away for the OP.

Yeah, I’m not using object pooling at all here, I just found three places where I was allocating objects for no good reason. And yeah, I measured before and after my modifications to make sure I was actually catching the right things. And yes, it did definitely remove noticeable spikes in frame times. :slight_smile:

As lots of people say, the GC is very good these days, and you don’t often have to worry about this kind of stuff outside of Android - but in this case, the game was allocating so many objects there were actually noticeable GC pauses messing up the feel of the game.

That should probably have been tuneable out rather than having to make code changes, I think.

Cas :slight_smile:

I’m temporarily using this abomination of a function to do some collision detection, and it generates over a gigabyte of garbage each second. Because of this I get an average of 2-3 garbage collections per second, most of them taking more than 7 ms. The funny thing is that the fantastically amazing HotSpot should be able to get rid of every single allocation of Vector3s thanks to escape analysis, but it’s not. For VM options, I only use -server.


	// Note: mad(dst, v, s) is assumed to be a multiply-add helper defined
	// elsewhere: dst += v * s.
	void solveCollision(Triangle triangle){
		
		Vector3 center = new Vector3(position).add(0, height*0.5f, 0);

		Vector3 v0 = triangle.transformV0(center, diameterOverHeight);
		Vector3 v1 = triangle.transformV1(center, diameterOverHeight);
		Vector3 v2 = triangle.transformV2(center, diameterOverHeight);
		
		Vector3 point = closestPointTriangle(new Vector3(), v0, v1, v2);
		
		
		float length = point.len();
		mad(delta, point, -0.1f * Math.max(0, (wallCollisionRadius - length) / length));
		//mad(position, point, -0.1f * Math.max(0, (wallCollisionRadius - length) / length));
		//mad(velocity, point, -1f * Math.max(0, (wallCollisionRadius - length) / length));
	}
	
	public static Vector3 closestPointTriangle(Vector3 p, Vector3 a, Vector3 b, Vector3 c) {
	   // Optimized version as found in the book "Real-Time Collision Detection"
	   // b - a
	   Vector3 ab = new Vector3(b).sub(a);
	   
	   // c - a
	   Vector3 ac = new Vector3(c).sub(a);
	   
	   // p - a
	   Vector3 ap = new Vector3(p).sub(a);
	   
	   float d1 = ab.dot(ap);
	   float d2 = ac.dot(ap);
	   
	   if (d1 <= 0.0f && d2 <= 0.0f) {
	      return a;   
	   }
	   
	   // p - b
	   Vector3 bp = new Vector3(p).sub(b);
	   
	   float d3 = ab.dot(bp);
	   float d4 = ac.dot(bp);
	   
	   if (d3 >= 0.0f && d4 <= d3) {
	      return b;   
	   }
	   
	   float vc = d1 * d4 - d3 * d2;
	   
	   if (vc <= 0.0f && d1 >= 0.0f && d3 <= 0.0f) {
	      float v = d1 / (d1 - d3);
	      
	      // a + (ab * v)
	      return mad(new Vector3(a), ab, v);
	   }
	   
	   // p - c
	   Vector3 cp = new Vector3(p).sub(c);
	   
	   float d5 = ab.dot(cp);
	   float d6 = ac.dot(cp);
	   
	   if (d6 >= 0.0f && d5 <= d6) {
	      return c;   
	   }
	   
	   float vb = d5 * d2 - d1 * d6;
	   
	   if (vb <= 0.0f && d2 >= 0.0f && d6 <= 0.0f) {
	      float w = d2 / (d2 - d6);
	      
	      // a + (ac * w)
	      return mad(new Vector3(a), ac, w);
	   }
	   
	   float va = d3 * d6 - d5 * d4;
	   
	   if (va <= 0.0f && (d4 - d3) >= 0.0f && (d5 - d6) >= 0.0f) {
	      float w = (d4 - d3) / ((d4 - d3) + (d5 - d6));
	      
	      // b + (c - b) * w 
	      return mad(new Vector3(b), new Vector3(c).sub(b), w);
	   }
	   
	   float denom = 1.0f / (va + vb + vc);
	   float vn = vb * denom;
	   float wn = vc * denom;
	   
	   // a + ab * vn + ac * wn
	   // this can be done with one line
	   
	   Vector3 abvn = new Vector3(ab).scl(vn);
	   Vector3 acwn = new Vector3(ac).scl(wn);
	   
	   // return result
	   return new Vector3(a).add(abvn).add(acwn);
	}
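(For reference, the standard way to get rid of this kind of garbage without full object pooling is to reuse a few preallocated scratch vectors. A sketch of the pattern with a minimal stand-in `Vec3`, assuming the code only ever runs on one thread; the real code would reuse its own Vector3 class:)

```java
public class ScratchVectors {
    // Minimal stand-in for a Vector3-style class.
    static final class Vec3 {
        float x, y, z;
        Vec3 set(float x, float y, float z) { this.x = x; this.y = y; this.z = z; return this; }
        Vec3 set(Vec3 v) { return set(v.x, v.y, v.z); }
        Vec3 sub(Vec3 v) { x -= v.x; y -= v.y; z -= v.z; return this; }
        float dot(Vec3 v) { return x * v.x + y * v.y + z * v.z; }
    }

    // Scratch vectors: allocated once, reused on every call.
    // Only safe if this code runs on a single thread.
    private final Vec3 tmpAb = new Vec3();
    private final Vec3 tmpAp = new Vec3();

    // Allocation-free equivalent of the snippet's "ab.dot(ap)":
    // no 'new' anywhere in the hot path.
    float d1(Vec3 p, Vec3 a, Vec3 b) {
        tmpAb.set(b).sub(a); // b - a
        tmpAp.set(p).sub(a); // p - a
        return tmpAb.dot(tmpAp);
    }

    public static void main(String[] args) {
        ScratchVectors sv = new ScratchVectors();
        Vec3 p = new Vec3().set(1, 0, 0);
        Vec3 a = new Vec3().set(0, 0, 0);
        Vec3 b = new Vec3().set(2, 0, 0);
        System.out.println(sv.d1(p, a, b)); // 2.0
    }
}
```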

Gigabytes per second, the fuck.

you should be chained into a room just with an atari 2600 or c64 for a month >_>

The funny thing is that I’m not even joking.

Two collections per second, each reclaiming around 950 MB. However, other than those two functions I do not generate any garbage whatsoever in my entire engine, excluding strings for printing out debugging information.

I wonder how useful a Java agent (or something similar) would be that transforms method parameters and local variables of a marked type containing only primitive fields (e.g. Vectors, Colors) into those primitive fields wherever the object is referenced. Basically, it would inline all instance fields and methods of those types.

You can keep the OOP niceties in the source, but the bytecode is raw primitives everywhere.
Do you guys think that would be useful? I’ll have to test whether it even helps small-object-heavy code performance much first, though.

Obviously it could be done manually, but why bother when you can have an automatic tool :point:
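(Doing it manually looks roughly like this: store the primitive fields directly, e.g. one flat float[] instead of a Vector3[]. A sketch; the class and names are made up for illustration:)

```java
public class FlatVectors {
    // Flattened layout: xyz triplets packed into one float[].
    // Element i lives at [3*i, 3*i+1, 3*i+2]; no per-vector objects exist.
    static final int STRIDE = 3;
    final float[] data;

    FlatVectors(int count) {
        data = new float[count * STRIDE];
    }

    void set(int i, float x, float y, float z) {
        int o = i * STRIDE;
        data[o] = x; data[o + 1] = y; data[o + 2] = z;
    }

    // Dot product of elements i and j, computed straight off the primitives.
    float dot(int i, int j) {
        int a = i * STRIDE, b = j * STRIDE;
        return data[a] * data[b] + data[a + 1] * data[b + 1] + data[a + 2] * data[b + 2];
    }

    public static void main(String[] args) {
        FlatVectors fv = new FlatVectors(2);
        fv.set(0, 1, 2, 3);
        fv.set(1, 4, 5, 6);
        System.out.println(fv.dot(0, 1)); // 32.0
    }
}
```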

I too have had multiple Gb/s going on before, crazy.

Riven is working on just that: support for structs that can live on the stack.

The thing about the blog post is that it’s not providing enough information to be useful to anyone beyond the author. Here’s why.

The JRE version isn’t specified; we don’t know whether it’s 32-bit or 64-bit, or whether any VM parameters have been set. We don’t know if the game is multithreaded or not. All of this basic information is important.

My experience is that the majority of the time, people blaming unexpected pauses on the GC simply haven’t really examined the problem, and in fact it’s not the GC at all. The target audience of the post will have no clue how to identify whether the GC is performing a stop-the-world event causing the observed behavior, so the post should specify (or link to) the method used to identify the problem. Anyone who doesn’t perform this step is simply randomly modifying code in an unreasonable attempt to correct a problem which may in fact not exist.

Now, the post claims to see stop-the-world events taking up to 30 ms. That’s a really long pause for a game runtime and well beyond my expectations. Modern GCs are quite good at not requiring long stop-the-world events; we’ve come a long way since semi-space collectors. See here for an overview of the G1 collector, for instance. There are tons of resources on the behavior of the various GCs. To see such a long pause, I’d guess that the app might be spamming short-lifetime small objects (which are not deduced to be non-escaping), leading to heap fragmentation, until along comes a larger allocation which can’t be serviced without compacting. If the game was otherwise performing fine, then the first step should have been to tweak GC parameters rather than modifying the code: low risk and a small time requirement, even if it doesn’t pay off.
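(Confirming the stop-the-world pauses first can be as simple as turning on GC logging; these are the Java 7/8-era HotSpot flags relevant to the thread, and `game.jar` is just a placeholder:)

```shell
# Print per-collection details and application stopped time (HotSpot, Java 7/8)
java -server \
     -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -jar game.jar
```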

Your methodology is potentially flawed: it ignores escape analysis AND you’re running in the IDE. You have to ensure that allocating methods have run at least CompileThreshold times, or be dumping compilations, to know that you’re seeing allocations that will actually happen after warm-up. The issue with the IDE is that it can prevent a method from being compiled (details vary) and therefore explode time requirements and allocation counts.

For the correction steps taken: nuking the “Direction” class makes sense. It was more work to do it that way in the first place and it doesn’t have any upsides. Beyond the burden it places on the GC, it introduces unnecessary indirection and significantly increases random memory reads/writes… all pretty undesirable and big time sinks. Likewise for the Color class… a silly code design choice. As for the for-each statements: it seems very unlikely they have a meaningful impact after warm-up.

Hm… I agree. theagentd had 7ms pauses with a GB of garbage… so 30 ms would mean even more garbage. I honestly don’t believe the author anymore … :persecutioncomplex:

Don’t misunderstand me. The author could be correct about the STW event and the “fix”, even though the method as stated was flawed, and the STW pauses that were occurring may well have been reduced. What I’m saying is that I don’t have the information to take it as a given, and the post needs to be beefed up to be useful to its target audience. Small memory allocations in Java are undesirable, and it’s a reasonable idea to avoid them if the amount of work is roughly the same.

Do both of them have the same system specs ?

On the subject of escape analysis… I am becoming sceptical that it is working at all. theagentd’s quite right in that all the allocations should be stack-allocated as they’re going nowhere. Why are the heap allocs not being replaced with stack allocs? What’s the magic to print out the appropriate debug information?

Cas :slight_smile:

TheAgentD has at least one awesome system fwiw… his 7ms would definitely take over 20ms on my system:

http://www.headline-benchmark.com/results/0f004cfc-1942-4647-97cd-1a3970ade933

… although maybe he’s traded it in for a netbook :slight_smile:

It’s virtually impossible to keep up with the changes. You’d pretty much have to stay up to date by reading the dev mailing list. Looking through the code is a complete nightmare. It was working reasonably well the last time I explicitly looked.

Ideally, use a debug build (any takers to build and maintain one?). Then you could use: -XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations (well, assuming those flags are still there… not a given)

Production options that come to mind:
-XX:+PrintAssembly: examine specific methods
-XX:BCEATraceLevel: set to 3 (it looks like) for max info spewing
-XX:MaxBCEAEstimateLevel: number of nested calls to inspect
-XX:MaxBCEAEstimateSize: max bytecode size to be considered
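(Putting those together, a diagnostic invocation might look like this; `game.jar` is a placeholder, the estimate values are arbitrary examples, and flag availability varies by HotSpot build:)

```shell
# Trace bytecode escape analysis at maximum verbosity
java -XX:+UnlockDiagnosticVMOptions \
     -XX:BCEATraceLevel=3 \
     -XX:MaxBCEAEstimateLevel=8 \
     -XX:MaxBCEAEstimateSize=300 \
     -jar game.jar
```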

The source is: /hotspot/src/share/vm/ci/bcEscapeAnalyzer.cpp