Understanding -XX:+PrintCompilation

I was running some micro-benchmarks comparing 3x3 matrix and 4x4 matrix operations with different ways of storing and fetching the data. I had three variations:

  1. Arrays of Matrix3 or Matrix4 objects (depending on the type being tested)
  2. Matrices packed into double[] data, and read into local Matrix3 or Matrix4 to compute everything
  3. Matrices packed into float[] data

The Matrix3 and Matrix4 classes both store their components in double fields, e.g. m00 for the 1st row and 1st column.
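For context, a minimal sketch of the Matrix3 shape described above (the field layout matches the description; the `mul` name and out-parameter convention are my illustration, not necessarily the actual code):

```java
// Hypothetical sketch of the Matrix3 layout: nine double fields, one per
// component, and a fully unrolled row-by-column multiply. Matrix4 is the
// same idea with sixteen fields and a correspondingly larger method body.
public class Matrix3 {
    public double m00, m01, m02,
                  m10, m11, m12,
                  m20, m21, m22;

    // out = this * r, written into a caller-supplied matrix
    public Matrix3 mul(Matrix3 r, Matrix3 out) {
        out.m00 = m00 * r.m00 + m01 * r.m10 + m02 * r.m20;
        out.m01 = m00 * r.m01 + m01 * r.m11 + m02 * r.m21;
        out.m02 = m00 * r.m02 + m01 * r.m12 + m02 * r.m22;
        out.m10 = m10 * r.m00 + m11 * r.m10 + m12 * r.m20;
        out.m11 = m10 * r.m01 + m11 * r.m11 + m12 * r.m21;
        out.m12 = m10 * r.m02 + m11 * r.m12 + m12 * r.m22;
        out.m20 = m20 * r.m00 + m21 * r.m10 + m22 * r.m20;
        out.m21 = m20 * r.m01 + m21 * r.m11 + m22 * r.m21;
        out.m22 = m20 * r.m02 + m21 * r.m12 + m22 * r.m22;
        return out;
    }
}
```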

Obviously the Matrix3 multiplications were faster than Matrix4, but I noticed that as I continued to run test iterations, the Matrix3 tests would improve drastically after the first iteration, just as if the code had been JIT compiled. The Matrix4 tests did not exhibit this behavior. I then enabled the flag -XX:+PrintCompilation to see when things were compiled, and a couple of things stood out:

  1. The tests claimed that both Matrix3 and Matrix4 multiplication methods were compiled
  2. The compilation happened at the very start of the test cases, before the drastic improvements with Matrix3, and the methods were never recompiled later on.

Here is a pastebin of the log output from Matrix3: http://pastebin.java-gaming.org/099d10734

This prints out 3 test iterations. Each iteration tests the above 3 variations by performing multiplications over 10,000 matrices, 10,000 times. So each test iteration does 100,000,000 multiplications per variation, or 300,000,000 actual invocations in total. The timings are the total time it took to do the 100,000,000 operations for each variation.
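One iteration boils down to a nested timing loop along these lines (a sketch with illustrative names and a cheap stand-in operation; the real test multiplies the actual Matrix objects with the 10,000 x 10,000 counts):

```java
// Sketch of one timed variation: `passes` full sweeps over an array, with
// the whole sweep timed as a unit. The inner operation here is a stand-in
// for one matrix multiply; results are written back so they are observable.
public class IterationSketch {
    public static long timeIterationMs(double[] src, double[] dst, int passes) {
        long start = System.nanoTime();
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < src.length; i++) {
                dst[i] = src[i] * src[i]; // stand-in for one matrix multiply
            }
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }
}
```

The per-variation timings above come from this kind of whole-sweep measurement, so anything the JIT does to the inner loop shows up directly in the totals.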

As you can see, the first iteration has everything running pretty slowly, but at the same performance level. The start of iteration 2 is slightly faster, but you can see something changes part way through, and it speeds up further by iteration 3. By the end, performance has improved 10-fold. Unfortunately, the log lines claiming to have compiled Matrix3’s methods appear much earlier.

I won’t show a pastebin for Matrix4’s run, but its output looks very similar, except that there are no performance changes across iterations. Whatever caused the performance improvements for Matrix3 never kicked in, even though Matrix4’s methods were apparently JIT compiled.


So for those of you who thought tl;dr, I’m wondering what is actually causing the JVM to optimize the Matrix3 code faster than the Matrix4 code. Is it possible that method size is causing a problem? The multiplication methods are unrolled, so Matrix4’s version is much larger. Are there any restrictions on how many arguments a method can take that might affect inlining or optimization? Theoretically, neither Matrix3’s nor Matrix4’s multiply method should be inlined, since they’re both over the 35-bytecode inline limit.
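On the inlining question: the 35-bytecode figure is HotSpot’s MaxInlineSize, which applies to cold call sites; methods that are hot enough are inlined up to the much larger FreqInlineSize limit, so the unrolled multiplies are not automatically excluded. A sketch of how to read the actual thresholds at runtime (HotSpot-specific API; this is a way to check, not something from the original test):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Query the running HotSpot JVM for its inlining thresholds.
// MaxInlineSize caps inlining of cold methods (default 35 bytecodes);
// FreqInlineSize is the larger cap applied to hot methods.
public class InlineLimits {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hs =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("MaxInlineSize  = " + hs.getVMOption("MaxInlineSize").getValue());
        System.out.println("FreqInlineSize = " + hs.getVMOption("FreqInlineSize").getValue());
    }
}
```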

Could the delay in performance just be a delay before the JVM started using the compiled method, even though it was compiled right away? Or perhaps System.out and whatever prints the compilations aren’t synchronized?

Thanks

Try using -XX:+PrintOptoAssembly and see what it generates.

Even if it generates nearly the same assembly, it could be that the 3x3 matrix fits better into a cache line than the 4x4, which is nearly twice the size.

I should post a quick summary of the performance results to demonstrate my confusion:

[tr][td]Iteration[/td][td]Storage type[/td][td]Matrix3[/td][td]Matrix4[/td][/tr]
[tr][td]1[/td][td]Matrix[][/td][td]~1100ms[/td][td]~5300ms[/td][/tr]
[tr][td]1[/td][td]double[][/td][td]~800ms[/td][td]~5600ms[/td][/tr]
[tr][td]1[/td][td]float[][/td][td]~1200ms[/td][td]~6000ms[/td][/tr]

[tr][td]2[/td][td]Matrix[][/td][td]~500ms[/td][td]~4800ms[/td][/tr]
[tr][td]2[/td][td]double[][/td][td]~70ms[/td][td]~5200ms[/td][/tr]
[tr][td]2[/td][td]float[][/td][td]~70ms[/td][td]~6000ms[/td][/tr]

[tr][td]3[/td][td]Matrix[][/td][td]~100ms[/td][td]~5000ms[/td][/tr]
[tr][td]3[/td][td]double[][/td][td]~70ms[/td][td]~5300ms[/td][/tr]
[tr][td]3[/td][td]float[][/td][td]~70ms[/td][td]~5700ms[/td][/tr]

As you can see, the Matrix3 operations get optimized quite effectively, but the Matrix4 operations do not. The compilations supposedly occur for both Matrix3 and Matrix4 at the very start of iteration 1, during the Matrix[] array tests.

That requires a debug JVM as I understand it. ATM I don’t have one of those, but I will try and get one to build for Mac (mayhaps I’ll get OpenJDK 7 installed then, too).

I’ve written up a standalone test that combines both Matrix3 and Matrix4 into the same benchmark: http://www.java-gaming.org/?action=pastebin&id=31

Writing it gave me the opportunity to improve upon the benchmark, and I discovered the problem. The line:


   3851   6%      com.ferox.math.Matrix3Test::main @ 205 (375 bytes)

occurs right when I notice the performance boost in my original run (the % marks an OSR, on-stack-replacement, compilation of the loop inside main). I think the JVM was able to notice that my inner loops were not doing “anything” with their results and optimized them away. In the standalone test I accumulate the total results, so the JVM was no longer able to make that optimization in the main() method. Performance then remained the same across all iterations for both Matrix3 and Matrix4.
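The trap reduces to this contrast (illustrative code, not the actual benchmark): in broken(), no result of the inner loop is ever observed, so the JIT is free to eliminate the whole body; in fixed(), every result feeds an accumulator that the caller must consume.

```java
// Dead-code-elimination pitfall in micro-benchmarks, in miniature.
public class DcePitfall {
    // Every multiply result is discarded, so once this is JIT compiled
    // the entire nested loop can legally be removed.
    public static void broken(double[] data, int passes) {
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < data.length; i++) {
                double r = data[i] * data[i]; // result never used: dead code
            }
        }
    }

    // Every result is accumulated and returned, so the work stays live.
    public static double fixed(double[] data, int passes) {
        double total = 0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < data.length; i++) {
                total += data[i] * data[i]; // result accumulated: stays live
            }
        }
        return total; // caller must consume (e.g. print) this value
    }
}
```

Printing or otherwise consuming the returned total at the end of main() is what keeps the measured work observable.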

So I’ve solved it, but anyone who wants to is welcome to run the benchmark. I’d be curious to see the output if someone could run it with -XX:+PrintOptoAssembly for me.