Java OpenGL Math Library (JOML)

The normalization optimization is a pretty good idea. It’s safe to assume that multiplies are much cheaper than divides when it comes to floats and doubles, so calculating 1/length and multiplying all 3 components by that value is faster than dividing them by length. LibGDX does this. This applies to all normalization JOML does.
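A minimal sketch of the reciprocal trick, using a plain float[3] instead of JOML's vector classes (the class and method names here are made up for illustration). It does one division and three multiplications instead of three divisions; a zero-length input is not handled.

```java
final class NormalizeSketch {
    /** Normalizes v in place: one division, then three multiplications. */
    static float[] normalize(float[] v) {
        float invLength = 1.0f / (float) Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        v[0] *= invLength;
        v[1] *= invLength;
        v[2] *= invLength;
        return v;
    }

    public static void main(String[] args) {
        float[] v = normalize(new float[] {3f, 0f, 4f});
        System.out.println(v[0] + ", " + v[1] + ", " + v[2]);
    }
}
```

Note that multiplying by the reciprocal can differ from a true division in the last bit, which is usually acceptable for normalization.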

Concerning a*b*c != d*b*c: This is because Java strictly follows the evaluation order. a*b*c = (a*b)*c while d*b*c = (d*b)*c, so in this case the compiler can’t first calculate b*c and then reuse the result for both calculations (floating-point multiplication is not associative, so regrouping could change the result). The lesson to take home is that if you want a value to be guaranteed to only be computed once, you should manually calculate it in a temporary variable.
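To make the grouping issue concrete, here is a small sketch with made-up names. The values are deliberately extreme so the difference is deterministic: a*b overflows to infinity, while b*c stays near 1. With ordinary values the two groupings would typically differ only in the last bits, if at all.

```java
final class SharedProductSketch {
    /** Evaluates a * b * c the way Java groups it: (a * b) * c. */
    static float leftToRight(float a, float b, float c) {
        float ab = a * b;
        return ab * c;
    }

    /** Evaluates a * (b * c), where bc was computed once by the caller. */
    static float withSharedFactor(float a, float bc) {
        return a * bc;
    }

    public static void main(String[] args) {
        float a = 1e30f, d = 2e30f, b = 1e30f, c = 1e-30f;
        float bc = b * c; // computed exactly once, by hand
        System.out.println(leftToRight(a, b, c));    // a * b overflows first
        System.out.println(withSharedFactor(a, bc)); // stays finite
        System.out.println(withSharedFactor(d, bc)); // reuses bc
    }
}
```

Since the two groupings can give different results, the compiler is not allowed to pick the cheaper one for you; the temporary variable makes the choice explicit.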

Concerning the branch to avoid local variables, it sounds like a bad idea. I tried commenting out one or the other in my skeleton animation test and there was no difference in performance. It just makes the code harder to read and maintain IMO.

Also, I believe mul4x3 can be optimized quite a bit.

Full 4x4 multiplication (in the original post, factors equal to 0.0 were marked red and factors equal to 1.0 blue):
m00 * right.m00 + m10 * right.m01 + m20 * right.m02 + m30 * right.m03,
m01 * right.m00 + m11 * right.m01 + m21 * right.m02 + m31 * right.m03,
m02 * right.m00 + m12 * right.m01 + m22 * right.m02 + m32 * right.m03,
m03 * right.m00 + m13 * right.m01 + m23 * right.m02 + m33 * right.m03,
m00 * right.m10 + m10 * right.m11 + m20 * right.m12 + m30 * right.m13,
m01 * right.m10 + m11 * right.m11 + m21 * right.m12 + m31 * right.m13,
m02 * right.m10 + m12 * right.m11 + m22 * right.m12 + m32 * right.m13,
m03 * right.m10 + m13 * right.m11 + m23 * right.m12 + m33 * right.m13,
m00 * right.m20 + m10 * right.m21 + m20 * right.m22 + m30 * right.m23,
m01 * right.m20 + m11 * right.m21 + m21 * right.m22 + m31 * right.m23,
m02 * right.m20 + m12 * right.m21 + m22 * right.m22 + m32 * right.m23,
m03 * right.m20 + m13 * right.m21 + m23 * right.m22 + m33 * right.m23,
m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30 * right.m33,
m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31 * right.m33,
m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32 * right.m33,
m03 * right.m30 + m13 * right.m31 + m23 * right.m32 + m33 * right.m33

Optimized:
m00 * right.m00 + m10 * right.m01 + m20 * right.m02,
m01 * right.m00 + m11 * right.m01 + m21 * right.m02,
m02 * right.m00 + m12 * right.m01 + m22 * right.m02,
0,
m00 * right.m10 + m10 * right.m11 + m20 * right.m12,
m01 * right.m10 + m11 * right.m11 + m21 * right.m12,
m02 * right.m10 + m12 * right.m11 + m22 * right.m12,
0,
m00 * right.m20 + m10 * right.m21 + m20 * right.m22,
m01 * right.m20 + m11 * right.m21 + m21 * right.m22,
m02 * right.m20 + m12 * right.m21 + m22 * right.m22,
0,
m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30,
m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31,
m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32,
1

Performance (translationRotateScale + matrix multiplication):

LibGDX: 12 466k bones.
JOML mul(): 28 450k bones.
JOML optimized mul4x3(): 38 260k bones.

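112 vs 63 mul/adds (16 × 7 for the full product; 9 × 5 + 3 × 6 for the optimized one, since the constant last row costs nothing)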


    /**
     * Multiplies this matrix by right and stores the result in dest,
     * assuming that both matrices have (0, 0, 0, 1) as their last row.
     */
    public Matrix4f mul4x3(Matrix4f right, Matrix4f dest) {
        dest.set(m00 * right.m00 + m10 * right.m01 + m20 * right.m02,
                 m01 * right.m00 + m11 * right.m01 + m21 * right.m02,
                 m02 * right.m00 + m12 * right.m01 + m22 * right.m02,
                 0,
                 m00 * right.m10 + m10 * right.m11 + m20 * right.m12,
                 m01 * right.m10 + m11 * right.m11 + m21 * right.m12,
                 m02 * right.m10 + m12 * right.m11 + m22 * right.m12,
                 0,
                 m00 * right.m20 + m10 * right.m21 + m20 * right.m22,
                 m01 * right.m20 + m11 * right.m21 + m21 * right.m22,
                 m02 * right.m20 + m12 * right.m21 + m22 * right.m22,
                 0,
                 m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30,
                 m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31,
                 m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32,
                 1);
        return this;
    }

If you read the JavaDocs, they say that only the right matrix is assumed to be 4x3. I did this because it is likely that you would first build a projection matrix and then concatenate a 4x3-only matrix onto it. If that is not the case, and you never concatenate your full projection-view matrix using mul4x3, we can change that so that the left matrix is also assumed to be 4x3.

EDIT: I introduced a new Matrix4.mul4x3r, which behaves like the current/old mul4x3 by only assuming that the last row of ‘right’ is (0, 0, 0, 1). The new version of Matrix4.mul4x3 has your proposed changes, by assuming that both ‘this’ and ‘right’ have (0, 0, 0, 1) as last row.
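The difference between the two contracts can be sketched on plain float[16] arrays in column-major order. This is only an illustration with made-up method names (mulRightAffine, mulBothAffine), not JOML's implementation: the first assumes only that 'right' has (0, 0, 0, 1) as its last row, the second assumes it for both operands.

```java
final class Mul4x3Sketch {
    // float[16], column-major: element (column c, row r) lives at m[c * 4 + r],
    // so JOML's m30 (column 3, row 0) is index 12.

    /** Full 4x4 product left * right, no assumptions. */
    static float[] mulFull(float[] l, float[] r) {
        float[] d = new float[16];
        for (int c = 0; c < 4; c++) {
            for (int row = 0; row < 4; row++) {
                float s = 0f;
                for (int k = 0; k < 4; k++) {
                    s += l[k * 4 + row] * r[c * 4 + k];
                }
                d[c * 4 + row] = s;
            }
        }
        return d;
    }

    /** Computes the first `rows` rows, assuming right's last row is (0, 0, 0, 1). */
    private static float[] mulUpperRows(float[] l, float[] r, int rows) {
        float[] d = new float[16];
        for (int c = 0; c < 4; c++) {
            for (int row = 0; row < rows; row++) {
                float s = 0f;
                for (int k = 0; k < 3; k++) {
                    s += l[k * 4 + row] * r[c * 4 + k];
                }
                // right's last row is 0 in columns 0..2 and 1 in column 3
                d[c * 4 + row] = (c == 3) ? s + l[12 + row] : s;
            }
        }
        return d;
    }

    /** Only right is assumed affine; left may be a projection matrix. */
    static float[] mulRightAffine(float[] l, float[] r) {
        return mulUpperRows(l, r, 4);
    }

    /** Both operands are assumed to have (0, 0, 0, 1) as last row. */
    static float[] mulBothAffine(float[] l, float[] r) {
        float[] d = mulUpperRows(l, r, 3);
        d[15] = 1f; // last row of the result is (0, 0, 0, 1)
        return d;
    }

    static boolean same(float[] a, float[] b) {
        for (int i = 0; i < 16; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        float[] bone = {1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, 0, 10, 11, 12, 1};
        float[] bind = {2, 0, 1, 0, 0, 3, 0, 0, 1, 0, 2, 0, 5, 6, 7, 1};
        System.out.println(same(mulFull(bone, bind), mulBothAffine(bone, bind)));
    }
}
```

Using the both-affine version with a projection matrix on the left would silently give a wrong result, which is why keeping the right-only variant around makes sense.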

Awesome! The usefulness of a 4x3 multiplication lies in skeleton animation, where the inverse bindpose matrix and the bone transformation matrix are both “matrices without projection”, and calculating bones for a single skeleton is probably heavier than all projection and view matrices you calculate each frame, so it is much more important to optimize it.

Yes, of course. I totally agree.
So basically, it’s always: “Tell me what the hot path of your code is, and then let’s optimize JOML on that.” :slight_smile:

Also, thanks @Roquen, for your input. At least I saw that multiplying by the reciprocal could in theory be better than dividing all the time. :wink:
I implemented that.
Additionally, I finally got around to changing the contract of each method to now return ‘dest’, or in general “the parameter being manipulated.”
And then I finally removed those ‘with()’ methods, since with the introduction of the recent changes, those are even more unnecessary.
But you have to admit, even if they provided no functional benefit to JOML, they did satisfy some sociological aspect of allowing people to bond together and vent their wrath on them. :wink:
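For anyone wondering what the “return ‘dest’” contract buys you, here is a tiny sketch. Vec3Sketch is a made-up class for illustration, not JOML's Vector3f: each operation writes into dest and returns dest, so chained calls continue on the result and leave the receiver untouched.

```java
final class Vec3Sketch {
    float x, y, z;

    Vec3Sketch(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** Stores this + other in dest and returns dest (the object being written). */
    Vec3Sketch add(Vec3Sketch other, Vec3Sketch dest) {
        dest.x = x + other.x;
        dest.y = y + other.y;
        dest.z = z + other.z;
        return dest;
    }

    /** Stores this * s in dest and returns dest. */
    Vec3Sketch scale(float s, Vec3Sketch dest) {
        dest.x = x * s;
        dest.y = y * s;
        dest.z = z * s;
        return dest;
    }

    public static void main(String[] args) {
        Vec3Sketch a = new Vec3Sketch(1f, 2f, 3f);
        Vec3Sketch b = new Vec3Sketch(4f, 5f, 6f);
        Vec3Sketch result = new Vec3Sketch(0f, 0f, 0f);
        // Because add() returns dest, the chained scale() works on the sum, not on a.
        a.add(b, result).scale(2f, result);
        System.out.println(result.x + ", " + result.y + ", " + result.z);
    }
}
```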

So, if you use the Maven infrastructure (Maven, Gradle, Ivy, …) then please now use 1.4.3-SNAPSHOT.

I cannot medal-slap you hard enough, KaiHH.

For what it’s worth I’ll be switching from gdx vecmath to JOML too for my game libraries (largely as a result of @theagentd working on our engine code). So keep up the good work :wink:

Cas :slight_smile:

Glad to hear!
Please keep feeding me with info on what your use cases are; either on GitHub or here - though personally I prefer having a traceable issue on GitHub over a post on JGO.
The more people make use of JOML, the better it can become.

By the way, I forgot to mention it earlier, but the 48 -> 68 FPS increase I got from JOML included calculating all physics, model matrices and camera matrices at double precision. The physics were noticeably slower (10-20% longer CPU time) due to this. Double precision did help a lot with motion blur inaccuracies during slow-motion (difference between two positions scaled up by a large number) and also reduced model and camera jitter a lot. This led to less motion blur and anti-aliasing reprojection artifacts in the scene.

EDIT: Skeleton animation was of course done at float precision.

How are you doing motion blur? It sounds a bit odd that you would ever have precision issues with it. I just use the last frame’s MVP matrices, which are always exact. (If an object wasn’t visible in the previous frame I use the current matrices, which is not a problem for temporal AA and hardly a problem for motion blur.) The actual motion vectors are scaled from the difference vectors using frame time and camera shutter time.

I had problems with motion vectors stuttering during slow motion. With time running at 1/20th speed, the difference in positions between frames was minimal, and the motion vectors are scaled up by 1.0/timeStep to compensate. This amplifies floating point errors in the object and view matrices enough that seemingly stationary objects get motion vectors long enough for motion blur to kick in. Running at a high resolution further increased the floating point errors. Computing object physics, object matrices and camera matrices with double precision and only converting them to floats when uploading them to OpenGL solves this.
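A rough illustration of the precision problem, with hypothetical numbers rather than anything measured in the engine: far from the origin, a small per-frame step is below float resolution and disappears entirely, while the same integration in double, narrowed to float only at the end (the “upload”), keeps it.

```java
final class PrecisionSketch {
    /** Advances a position in float precision for the given number of steps. */
    static float integrateFloat(float start, float step, int steps) {
        float p = start;
        for (int i = 0; i < steps; i++) {
            p += step;
        }
        return p;
    }

    /** Same integration in double precision, narrowed to float only at the very end. */
    static float integrateDoubleThenNarrow(double start, double step, int steps) {
        double p = start;
        for (int i = 0; i < steps; i++) {
            p += step;
        }
        return (float) p;
    }

    public static void main(String[] args) {
        // Hypothetical values: an object 10 km from the origin moving very slowly.
        System.out.println(integrateFloat(10000f, 1e-5f, 1000));
        System.out.println(integrateDoubleThenNarrow(10000.0, 1e-5, 1000));
    }
}
```

In the float version every single step is rounded away, so the object never moves; the double version accumulates the movement and only loses precision once, at the final conversion.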

This isn’t motion blur anymore. Camera-based motion blur is based on shutter speed. For Hardland I chose the shutter speed to be 1/100 s. So at a 60 fps framerate I scale motion vectors by (1/100) / (1/60) = 0.6. If the framerate dips lower, the scaling value just gets smaller. For realistic motion blur you can never have a shutter time longer than the frame time.
If you crank up motion blur with slow motion it might be a cool effect, but it’s not actually a motion blur effect anymore.
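The shutter-based scaling above boils down to one division. A small sketch with made-up names; the clamp to 1.0 is an assumption added here to express “the shutter can’t be open longer than a frame”, it is not taken from any particular engine.

```java
final class MotionBlurSketch {
    /** Fraction of the per-frame motion vector covered while the shutter is open. */
    static double motionVectorScale(double shutterSeconds, double frameSeconds) {
        // Assumption: clamp so the shutter is never open longer than one frame.
        return Math.min(1.0, shutterSeconds / frameSeconds);
    }

    public static void main(String[] args) {
        System.out.println(motionVectorScale(1.0 / 100.0, 1.0 / 60.0)); // about 0.6
        System.out.println(motionVectorScale(1.0 / 100.0, 1.0 / 30.0)); // about 0.3
    }
}
```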

JOML 1.5.0 (and its siblings ‘mini’ and ‘2d’) is now on Maven Central, containing all the recent changes.
It can be used with Maven, Gradle and Ivy right away now.
(it takes some hours for the search.maven.org/ online site to pick the new version up)

You’re right, the motion blur length depends on how long the shutter is open, which obviously can’t be longer than the frame time of the virtual “camera” you have, but I find it more important to keep the motion blur independent of framerate (like you do too). The slow-mo is a stylish effect, yeah, so there’s really no point in trying to justify it. I also noticed reduced jittering of objects and the camera (which follows one of the objects). Simply put, the matrices had much higher precision, limited only by the 32-bit float precision they were uploaded to OpenGL in, instead of suffering precision issues during calculations.

I’ll be back from vacation soon so I’ll be able to be extra annoying (and maybe I’ll complete that write-up that I started). I just took a closer look at the updated slerp. Remember how I said this is a function you never need to call? (Its only use should be for reference implementations for error validation.) The current version of slerp validates my point. (I’m being vague on purpose… do the math. Only trig is needed here.)

I took a look at the new slerp() implementation for the first time and compared it with the LibGDX implementation. Although they seem to be very similar, I think the LibGDX one is more… “elegant”. It basically sets itself up for nlerp first, and if the dot product is below a certain threshold it modifies the coefficients to do a proper slerp instead. That only requires one if-statement and keeps the code minimal IMO.

http://www.java-gaming.org/?action=pastebin&id=1326
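For readers who can’t open the link, the structure described above (set up nlerp coefficients first, switch to slerp coefficients when the dot product is below a threshold) might look roughly like this. It is a sketch written from that description, not the code behind the link and not LibGDX’s actual source; the threshold value and the final renormalization are assumptions.

```java
final class SlerpSketch {
    /** Hypothetical cut-off: above this |dot| the rotations are close enough for nlerp. */
    static final float NLERP_THRESHOLD = 0.9f;

    /** Quaternions are float[4] as {x, y, z, w} and are assumed to be unit length. */
    static float[] slerp(float[] q0, float[] q1, float t) {
        float dot = q0[0] * q1[0] + q0[1] * q1[1] + q0[2] * q1[2] + q0[3] * q1[3];
        float absDot = Math.abs(dot);

        // Start out with plain nlerp coefficients.
        float s0 = 1f - t;
        float s1 = t;

        // Far enough apart: replace them with proper slerp coefficients.
        if (absDot < NLERP_THRESHOLD) {
            double angle = Math.acos(absDot);
            double invSin = 1.0 / Math.sin(angle);
            s0 = (float) (Math.sin((1.0 - t) * angle) * invSin);
            s1 = (float) (Math.sin(t * angle) * invSin);
        }

        // q and -q are the same rotation; flip to take the shorter arc.
        if (dot < 0f) {
            s1 = -s1;
        }

        float x = s0 * q0[0] + s1 * q1[0];
        float y = s0 * q0[1] + s1 * q1[1];
        float z = s0 * q0[2] + s1 * q1[2];
        float w = s0 * q0[3] + s1 * q1[3];

        // Renormalize; required for the nlerp branch, harmless for the slerp branch.
        float invLen = 1f / (float) Math.sqrt(x * x + y * y + z * z + w * w);
        return new float[] {x * invLen, y * invLen, z * invLen, w * invLen};
    }

    public static void main(String[] args) {
        float[] identity = {0f, 0f, 0f, 1f};
        float[] quarterTurnZ = {0f, 0f, (float) Math.sin(Math.PI / 4), (float) Math.cos(Math.PI / 4)};
        float[] r = slerp(identity, quarterTurnZ, 0.5f);
        System.out.println(r[2] + ", " + r[3]); // roughly sin(pi/8), cos(pi/8)
    }
}
```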

I tried to apply some basic trig identities to optimize it, but I didn’t think it was possible to derive sin(a*x) from cos(x) without going through angles. At first I couldn’t figure out how to derive sin(a*x) and sin((1-a)*x), but then I read your excellent response and managed to figure it out in minutes! Just kidding, your response was as useful as usual. Seems like at best you can replace the sin(x) calculation with sqrt(1 - cos(x)*cos(x)), which should be a little faster.
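The sqrt replacement mentioned here is small enough to show in full. A sketch with a made-up helper name; it is only valid for angles on [0, pi], where the sine is non-negative, which is the case for the angle between two quaternions.

```java
final class SinFromCosSketch {
    /** sin(x) for x on [0, pi], computed from cos(x) without a trig call. */
    static double sinFromCos(double cosX) {
        // Clamp at 0 in case rounding pushed cosX * cosX slightly above 1.
        return Math.sqrt(Math.max(0.0, 1.0 - cosX * cosX));
    }

    public static void main(String[] args) {
        double x = 1.0;
        System.out.println(sinFromCos(Math.cos(x)) + " vs " + Math.sin(x));
    }
}
```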

I’m writing up the derivation of slerp in the other thread…all the important parts are semi stubbed out there. What happens to sin((1-a)*x) should be obvious.

Forget that part… how does this implementation of slerp help show that slerp isn’t useful (other than for ground truth testing)?

ease of use ?

anyway, here’s my slerp version which is somewhat ported from GLM, which actually looks almost like theagentd’s version.

http://pastebin.java-gaming.org/e8d638d2f3813

I am being illogical about the slerp derivation since I think that it’s obvious. In the other thread I am building from scratch, so I’ll give some pointers for starting from the common equation. If you’re only interested in minimal paths (which you always are), add to that the fact that q and -q represent the same rotation; then the dot is on [0, 1], so the relative angle is on [0, Pi/2]. Not negative, because the direction comes from the relative information. So the cos of the angle is the dot, and the sine comes from the root. Use all of that for atan. For the parameterized angles, expand the sum of angles and complete the derivation.
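One possible reading of these hints, as a sketch with made-up names (this is an interpretation, not the write-up from the other thread): take c = |dot|, get the sine from the root, the angle from atan2, and expand sin((1-a)*x) = sin(x)*cos(a*x) - cos(x)*sin(a*x) so that only sin(a*x) and cos(a*x) are needed. The nearly parallel case (sine close to 0) is assumed to be handled by the caller, for example with nlerp.

```java
final class SlerpWeightsSketch {
    /**
     * Returns {w0, w1} such that the interpolated quaternion is w0 * q0 + w1 * q1
     * (with w1 negated by the caller when dot < 0).
     */
    static double[] weights(double dot, double a) {
        double c = Math.abs(dot);          // q and -q are the same rotation, so c is on [0, 1]
        double s = Math.sqrt(1.0 - c * c); // sine from the root, non-negative on [0, pi/2]
        double x = Math.atan2(s, c);       // relative angle on [0, pi/2]
        double sinAx = Math.sin(a * x);
        double cosAx = Math.cos(a * x);
        // sin((1 - a) * x) / sin(x) = cos(a * x) - cos(x) * sin(a * x) / sin(x)
        double w1 = sinAx / s;
        double w0 = cosAx - c * w1;
        return new double[] {w0, w1};
    }

    public static void main(String[] args) {
        double[] w = weights(0.5, 0.3);
        System.out.println(w[0] + ", " + w[1]);
    }
}
```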

If by that you mean performance doesn’t matter and I am lazy… that is always reasonable.

yes, or clarity, ease of reading.

but you’re right, that’s if you do not have too many of them around - sometimes 0.001 ms is not much time :slight_smile:

Edit: which other thread?