I was running some micro-benchmarks comparing 3x3 matrix and 4x4 matrix operations with different ways of storing and fetching the data. I had three variations:
- Arrays of Matrix3 or Matrix4 objects (depending on the type being tested)
- Matrices packed into double[] data, and read into local Matrix3 or Matrix4 to compute everything
- Matrices packed into float[] data
The Matrix3 and Matrix4 all store their components in double fields, e.g. m00 for the 1st row and 1st column.
Obviously the Matrix3 multiplications were faster than Matrix4, but I noticed that as I continued to run test iterations, the Matrix3 class would drastically improve performance after the first iteration, just as if it had been JIT compiled. The Matrix4 tests did not exhibit this behavior. I then enabled the flag -XX:+PrintCompilation to see when things were compiled and a couple of things stood out:
- The tests claimed that both Matrix3 and Matrix4 multiplication methods were compiled
- The compilation happened at the very start of the test cases, before the drastic improvements with Matrix3 and it was never recompiled later on.
Here is a pastebin of the log output from Matrix3: http://pastebin.java-gaming.org/099d10734
This prints out 3 test iterations. Each iteration tests the above 3 variations by performing multiplications over 10,000 matrices, 10,000 times. So each test iteration does 300,000,000 actual invokations. The timings are the total time it took to do the 100,000,000 operations for each variation.
As you can see, the first iteration has everything running pretty slowly but at the same performance level. The start of iteration 2 is slightly faster, but you can see something changes part way through and it speeds up by iteration 3. By the end performance has improved 10-fold. Unfortunately, the lines claiming to have compiled Matrix3’s methods happen way earlier.
I won’t show a paste bin for Matrix4’s run, but its output looks very similar except that there are no performance changes for each iteration. Whatever caused the performance improvements for Matrix3 never kicked in, even though Matrix4’s methods were apparently JIT compiled.
So for those of you who thought tl;dr, I’m wondering what is actually causing the JVM to optimize the Matrix3 code faster than the Matrix4 code. Is it possible that method size is causing a problem. The multiplication methods are unrolled so Matrix4’s version is much larger. Are there any restrictions to how many arguments a method can take that might affect inlining or optimization. Theoretically, neither Matrix3 nor Matrix4’s multiply methods should be inlined since they’re both over the 35 byte inline limit.
Could the delay in performance just be a delay before the JVM started using the compiled method, even though it was compiled right away? Or perhaps System.out and whatever prints the compilations aren’t synchronized?
Thanks