The normalization optimization is a pretty good idea. It’s safe to assume that multiplies are much cheaper than divides when it comes to floats and doubles, so calculating 1/length and multiplying all 3 components by that value is faster than dividing them by length. LibGDX does this. This applies to all normalization JOML does.
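To make the reciprocal trick concrete, here is a minimal sketch. The class and method names are made up for illustration and are not JOML's or LibGDX's actual code; the point is only that the single divide happens once and the three components are then scaled by multiplication.

```java
final class NormalizeSketch {
    // Hypothetical helper, not part of JOML or LibGDX.
    // Computes 1/length once and multiplies, instead of dividing three times.
    static float[] normalize(float x, float y, float z) {
        // One square root and one divide in total.
        float invLength = (float) (1.0 / Math.sqrt(x * x + y * y + z * z));
        // Three multiplies replace three divides.
        // Note: a zero-length input yields non-finite components, same as dividing would.
        return new float[] { x * invLength, y * invLength, z * invLength };
    }
}
```

The result can differ from the divide version in the last bit, since the reciprocal is rounded before it is used, which is usually acceptable for normalization.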
Concerning a*b*c != d*b*c: This is because Java strictly follows the evaluation order. a*b*c = (a*b)*c while d*b*c = (d*b)*c, so in this case the compiler can’t first calculate b*c and then reuse the result for both calculations. Floating-point multiplication is not associative, so regrouping could change the result. The lesson to take home is that if you want a value to be guaranteed to only be computed once, you should manually calculate it in a temporary variable.
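A small sketch of what that means in practice. The names here are invented for the example; the first method leaves the grouping to Java's left-to-right rule, the second hoists b*c into a temporary so it is computed exactly once.

```java
final class SharedProductSketch {
    // Evaluated as (a*b)*c and (d*b)*c: b*c never appears as a subexpression,
    // so there is nothing for the compiler to reuse.
    static float[] leftToRight(float a, float d, float b, float c) {
        return new float[] { a * b * c, d * b * c };
    }

    // b*c is computed once by hand and shared by both results.
    static float[] hoisted(float a, float d, float b, float c) {
        float bc = b * c;
        return new float[] { a * bc, d * bc };
    }
}
```

The two versions are not guaranteed to give identical floats. For example, with a = b = 1e30f and c = 1e-30f, (a*b)*c overflows to infinity in float arithmetic while a*(b*c) stays finite, which is exactly why the compiler is not allowed to regroup on its own.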
Concerning the branch to avoid local variables, it sounds like a bad idea. I tried commenting out one or the other in my skeleton animation test and there was no difference in performance. It just makes the code harder to read and maintain IMO.
Also, I believe mul4x3 can be optimized quite a bit.
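For a 4x3 (affine) matrix the bottom row is fixed: m03, m13 and m23 are 0.0 and m33 is 1.0, for both this and right, so every term involving them can be dropped or simplified.
Full multiplication: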
m00 * right.m00 + m10 * right.m01 + m20 * right.m02 + m30 * right.m03,
m01 * right.m00 + m11 * right.m01 + m21 * right.m02 + m31 * right.m03,
m02 * right.m00 + m12 * right.m01 + m22 * right.m02 + m32 * right.m03,
m03 * right.m00 + m13 * right.m01 + m23 * right.m02 + m33 * right.m03,
m00 * right.m10 + m10 * right.m11 + m20 * right.m12 + m30 * right.m13,
m01 * right.m10 + m11 * right.m11 + m21 * right.m12 + m31 * right.m13,
m02 * right.m10 + m12 * right.m11 + m22 * right.m12 + m32 * right.m13,
m03 * right.m10 + m13 * right.m11 + m23 * right.m12 + m33 * right.m13,
m00 * right.m20 + m10 * right.m21 + m20 * right.m22 + m30 * right.m23,
m01 * right.m20 + m11 * right.m21 + m21 * right.m22 + m31 * right.m23,
m02 * right.m20 + m12 * right.m21 + m22 * right.m22 + m32 * right.m23,
m03 * right.m20 + m13 * right.m21 + m23 * right.m22 + m33 * right.m23,
m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30 * right.m33,
m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31 * right.m33,
m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32 * right.m33,
m03 * right.m30 + m13 * right.m31 + m23 * right.m32 + m33 * right.m33
Optimized:
m00 * right.m00 + m10 * right.m01 + m20 * right.m02,
m01 * right.m00 + m11 * right.m01 + m21 * right.m02,
m02 * right.m00 + m12 * right.m01 + m22 * right.m02,
0,
m00 * right.m10 + m10 * right.m11 + m20 * right.m12,
m01 * right.m10 + m11 * right.m11 + m21 * right.m12,
m02 * right.m10 + m12 * right.m11 + m22 * right.m12,
0,
m00 * right.m20 + m10 * right.m21 + m20 * right.m22,
m01 * right.m20 + m11 * right.m21 + m21 * right.m22,
m02 * right.m20 + m12 * right.m21 + m22 * right.m22,
0,
m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30,
m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31,
m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32,
1
Performance (translationRotateScale + matrix multiplication):
LibGDX: 12 466k bones.
JOML mul(): 28 450k bones.
JOML optimized mul4x3(): 38 260k bones.
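Operation count: 112 vs 63 mul/adds. The full multiplication is 16 elements x (4 multiplies + 3 adds) = 112. The 4x3 version is 9 elements x (3 multiplies + 2 adds) plus 3 translation elements x (3 multiplies + 3 adds) = 45 + 18 = 63.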
// Assumes both this and right are affine: m03 = m13 = m23 = 0 and m33 = 1.
// All arguments are evaluated before dest.set() runs, so dest may be this or right.
public Matrix4f mul4x3(Matrix4f right, Matrix4f dest) {
    dest.set(m00 * right.m00 + m10 * right.m01 + m20 * right.m02,
             m01 * right.m00 + m11 * right.m01 + m21 * right.m02,
             m02 * right.m00 + m12 * right.m01 + m22 * right.m02,
             0,
             m00 * right.m10 + m10 * right.m11 + m20 * right.m12,
             m01 * right.m10 + m11 * right.m11 + m21 * right.m12,
             m02 * right.m10 + m12 * right.m11 + m22 * right.m12,
             0,
             m00 * right.m20 + m10 * right.m21 + m20 * right.m22,
             m01 * right.m20 + m11 * right.m21 + m21 * right.m22,
             m02 * right.m20 + m12 * right.m21 + m22 * right.m22,
             0,
             m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30,
             m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31,
             m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32,
             1);
    return this;
}
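If anyone wants to sanity-check the shortcut without pulling in JOML, here is a rough standalone sketch. It is not JOML's Matrix4f: it is a hypothetical helper that stores a column-major 4x4 matrix in a float[16] (element mCR at index C*4 + R, matching the mCR naming above) and implements both the full product and the 4x3 product so they can be compared on affine inputs.

```java
final class AffineMulSketch {
    // Full 4x4 product, column-major: dest(c,row) = sum over k of l(k,row) * r(c,k).
    static float[] mul(float[] l, float[] r) {
        float[] d = new float[16];
        for (int c = 0; c < 4; c++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += l[k * 4 + row] * r[c * 4 + k];
                }
                d[c * 4 + row] = sum;
            }
        }
        return d;
    }

    // 4x3 product: assumes the bottom row of both inputs is (0, 0, 0, 1).
    static float[] mul4x3(float[] l, float[] r) {
        float[] d = new float[16]; // bottom-row zeros come from default initialization
        for (int c = 0; c < 4; c++) {
            for (int row = 0; row < 3; row++) {
                float sum = l[row] * r[c * 4]
                          + l[4 + row] * r[c * 4 + 1]
                          + l[8 + row] * r[c * 4 + 2];
                if (c == 3) {
                    sum += l[12 + row]; // translation column: r.m33 is 1, so just add l.m3row
                }
                d[c * 4 + row] = sum;
            }
        }
        d[15] = 1f; // m33
        return d;
    }
}
```

Feeding both methods the same pair of affine matrices should give matching results element by element. If either input has a bottom row other than (0, 0, 0, 1), for example a projection matrix, mul4x3 silently gives the wrong answer, so it only belongs on the model/bone transform path.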