Java OpenGL Math Library (JOML)

theagentd · June 25, 2015, 7:58pm

Radians is what Java uses. Anything else is a visualization. If you want to use degrees with Java, you should use Math.toRadians(degrees) IMO.
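As a minimal sketch of the conversion mentioned above: the class and method names here are made up for illustration, and only Math.toRadians() and Math.tan() are real JDK calls.

```java
final class AngleDemo {
    /** Converts a field of view given in degrees to the radians that java.lang.Math expects. */
    static float fovRadians(float fovDegrees) {
        return (float) Math.toRadians(fovDegrees);
    }

    public static void main(String[] args) {
        float fov = fovRadians(60.0f);
        // Math.tan, like the other trigonometric methods in java.lang.Math, takes radians.
        double halfHeight = Math.tan(fov / 2.0);
        System.out.println("fov = " + fov + " rad, tan(fov/2) = " + halfHeight);
    }
}
```

Keeping degrees only at the edge of the program (user input, config files) and converting once means everything downstream can assume radians.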

Well, either have OpenGL programmers feel comfortable or Java programmers feel comfortable. In my opinion, Java programmers is the bigger group here, especially since most tutorials nowadays completely skip legacy OpenGL.

Roquen · June 26, 2015, 1:29pm

If anyone feels like implementing any quaternion functions…then please start a thread. Because virtually every paper on the subject is full of shit and I’ve never seen a publicly available library implement anything remotely useful. For real…I’m totally not joking.

KaiHH · June 26, 2015, 1:57pm

Thanks for your feedback!
As said earlier, I would like anyone with feature requests, enhancements or bugs to post them as issues on GitHub. I would like to close that whole topic on JGO now.
Thanks again to everyone who provided valuable and constructive input to the development of JOML!

KaiHH · July 8, 2015, 8:26am

Just one note: JOML now uses radians consistently everywhere. See this issue.

theagentd · July 8, 2015, 1:34pm

Awesome!

gouessej · July 9, 2015, 1:17pm

Hi

It would be interesting to run those tests on your library to compare its performance to others:
http://lessthanoptimal.github.io/Java-Matrix-Benchmark/

There are already numerous libraries similar to JOML.

KaiHH · July 9, 2015, 1:27pm

It looks like those are all general-purpose linear algebra libraries, with functions to solve linear systems of equations, doing QR- and LU-decompositions and such, which JOML does not feature.
JOML is a special-purpose library for 4x4 and 3x3 single and double-precision floating point matrices with a limited set of functions operating on them that are generally useful in 3D applications.
In that regard, JOML is rather comparable to javax.vecmath or the math classes provided in libGDX.

gouessej · July 9, 2015, 8:46pm

In my humble opinion, a benchmark would still be welcome.

KaiHH · July 9, 2015, 9:06pm

Well, I would certainly be happy if someone conducts one.

But of course I do some testing myself all the time and so far I expect every (non-trivial) method in JOML to beat the counterpart in libGDX by some factor > 2 (which is not hard actually) and sometimes even by orders of magnitude.

The latter is especially the case for methods in libGDX’s Frustum class, because most methods in JOML are very cache- and inline-friendly and have low register pressure and contain no method invocations themselves.

Especially the now heavily optimized, inlined and unrolled Matrix4.isPointInsideFrustum() and isSphereInsideFrustum() methods can handle about 50 million (!) invocations in under 6 milliseconds, both when the point/sphere is inside the frustum and when it is not.

With isAabInsideFrustum() the numbers are about 100 milliseconds for 5 million invocations for boxes that intersect the frustum and about 40 milliseconds for boxes that do not.
Currently, that method is a modified implementation of “2.4 Basic intersection test” from this site.

theagentd · July 9, 2015, 11:09pm

I’m trying to write a small benchmark, but I’m missing the create-matrix-from-translation+orientation+scale function I need for my skeleton animation. Without it I’ll have to manually construct the matrix which will be much slower.

theagentd · July 10, 2015, 3:16am

Some initial benchmarks:

| Test | LibGDX | JOML |
|---|---|---|
| Construct matrix from translation+quat+scale | 77 248k bps | 65 133k bps |
| Full bone construction (mul) | 12 660k bps | 25 146k bps |
| Full bone construction (mul4x3) | 12 660k bps | 29 842k bps |
| Construct matrix + invert | 11 431k bps | 15 610k bps |

*bps = Bones per second

The lack of an optimized function to create a matrix from a translation, orientation and scale makes LibGDX slightly faster at that test, but once you throw in a multiply with a bind-pose matrix to construct a “full” bone JOML wins easily, being ~2x faster. The optimized mul4x3() function of JOML gets us even further, giving us 2.36x better performance than LibGDX. That’s really surprising considering LibGDX actually has native code for accelerating stuff like this. Seems like the overhead outweighs the gains there. In the matrix inversion test, JOML is 1.37x faster. My guess is that most of these gains come from the fact that LibGDX matrices store their values in an array while JOML uses normal variables. That’s one less cache miss and apparently less overhead when accessing each matrix element.

The results were identical between the two libraries as far as I could tell.

I’m not a big fan of JOML’s way of visualizing floats in toString() methods. It always shows them in the power of 10 form, which is a bit confusing to get an overview of at a glance.

Gonna compare the frustum culling code you have with the one I’ve written myself and see if there are any improvements there tomorrow. I suspect there are.

Further suggestions:

That matrix construction function would also be useful for getting up to par with LibGDX in the first test.
I see you made a multiply-and-add function (fma())! Awesome! It only takes in two vectors though, so one that takes in a float for the multiplier would be nice, fma(Vector3, float).
The arguments of fma() could be given better names. They’re currently (v1, v2), which says nothing about what they do. May I suggest “add” and “multiplier” for example?
Scalar version of Vector*.add() and sub() would be nice too. I have at least one place in my code that does that.
Many functions in quaternion also implicitly normalize the quaternion. The weirdest one is invert(), but many others do too. I believe the user should be in charge of normalizing quaternions and providing normalized inputs to functions that need it.
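To illustrate why normalization matters for invert(), here is a small self-contained sketch. It is not JOML code: the class and the float[]{x, y, z, w} layout are assumptions made for this example. The conjugate is only the inverse of a unit quaternion; for a general quaternion the conjugate has to be divided by the squared length.

```java
final class QuatInvertSketch {
    /** Conjugate of q = {x, y, z, w}; equals the inverse only if q has unit length. */
    static float[] conjugate(float[] q) {
        return new float[] { -q[0], -q[1], -q[2], q[3] };
    }

    /** General inverse: conjugate divided by the squared length. Does not normalize q. */
    static float[] invert(float[] q) {
        float inv = 1.0f / (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]);
        return new float[] { -q[0] * inv, -q[1] * inv, -q[2] * inv, q[3] * inv };
    }

    /** Hamilton product a * b. */
    static float[] mul(float[] a, float[] b) {
        return new float[] {
            a[3] * b[0] + a[0] * b[3] + a[1] * b[2] - a[2] * b[1],
            a[3] * b[1] - a[0] * b[2] + a[1] * b[3] + a[2] * b[0],
            a[3] * b[2] + a[0] * b[1] - a[1] * b[0] + a[2] * b[3],
            a[3] * b[3] - a[0] * b[0] - a[1] * b[1] - a[2] * b[2]
        };
    }
}
```

With q = {1, 2, 3, 4}, mul(q, invert(q)) gives the identity quaternion, while mul(q, conjugate(q)) gives {0, 0, 0, 30}, so a caller who knows the input is already unit length can use the cheaper conjugate.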

Questions:

A number of quaternion functions seem to internally use doubles. Is there a reason for that?

Riven · July 10, 2015, 7:34am

Hotspot regularly hoists fields into local variables. Once that has happened, it will optimize localvar access aggressively, converting lots of raw memory-operations into purely register-operations. This ‘hoisting’ is not performed as often (in Hotspot) for array operations, as their access patterns are harder to analyze and predict. My guess is that that causes the majority of the difference between conceptually similar operations in LibGDX and JOML. The additional indirection is most likely hidden behind main-memory/cache latency, which is orders of magnitude greater.

TL;DR: try to get Hotspot to keep intermediary results in registers by using instance fields and (therefore) local variables, as long as the cache-thrashing caused by objects does not become the bottleneck.
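A rough sketch of the two storage styles being compared, reduced to 2x2 to keep it short. Both classes are hypothetical and are not taken from LibGDX or JOML; they only show the same arithmetic written against fields versus an array. Whether HotSpot actually treats them differently depends on the JVM and would need to be measured.

```java
/** Hypothetical matrix with elements in plain fields (mCR = column C, row R). */
final class FieldMat2 {
    float m00, m01, m10, m11;

    FieldMat2(float m00, float m01, float m10, float m11) {
        this.m00 = m00; this.m01 = m01; this.m10 = m10; this.m11 = m11;
    }

    /** dest = this * r. Results go to locals first, so dest may alias this or r. */
    FieldMat2 mul(FieldMat2 r, FieldMat2 dest) {
        float n00 = m00 * r.m00 + m10 * r.m01;
        float n01 = m01 * r.m00 + m11 * r.m01;
        float n10 = m00 * r.m10 + m10 * r.m11;
        float n11 = m01 * r.m10 + m11 * r.m11;
        dest.m00 = n00; dest.m01 = n01; dest.m10 = n10; dest.m11 = n11;
        return dest;
    }
}

/** Hypothetical matrix with the same column-major elements behind an array reference. */
final class ArrayMat2 {
    final float[] m;

    ArrayMat2(float m00, float m01, float m10, float m11) {
        this.m = new float[] { m00, m01, m10, m11 };
    }

    /** dest = this * r. Every element access goes through the array (extra indirection). */
    ArrayMat2 mul(ArrayMat2 r, ArrayMat2 dest) {
        float n0 = m[0] * r.m[0] + m[2] * r.m[1];
        float n1 = m[1] * r.m[0] + m[3] * r.m[1];
        float n2 = m[0] * r.m[2] + m[2] * r.m[3];
        float n3 = m[1] * r.m[2] + m[3] * r.m[3];
        dest.m[0] = n0; dest.m[1] = n1; dest.m[2] = n2; dest.m[3] = n3;
        return dest;
    }
}
```

The arithmetic is identical in both versions, which matches the observation that any speed difference comes from how the elements are stored and accessed rather than from the math.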

KaiHH · July 10, 2015, 8:34am

Matrix4.translationRotateScale() is in.

It’s the same as doing translation().rotate().scale(), just in a single method where I built the method by doing the whole three transformations first and then reducing operations that were known to produce ones or zeroes.
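The reduction described above can be sketched roughly as follows. This is not the JOML source: the class, the method name composeTrs and the column-major float[16] return value are assumptions for illustration, and the quaternion is assumed to be unit length. Multiplying T * R * S out by hand leaves only the rotation columns scaled per axis plus the translation column.

```java
final class TrsSketch {
    /**
     * Builds translation * rotation * scale directly as a column-major 4x4 matrix.
     * (qx, qy, qz, qw) is assumed to be a unit quaternion.
     */
    static float[] composeTrs(float tx, float ty, float tz,
                              float qx, float qy, float qz, float qw,
                              float sx, float sy, float sz) {
        float xx = qx * qx, yy = qy * qy, zz = qz * qz;
        float xy = qx * qy, xz = qx * qz, yz = qy * qz;
        float wx = qw * qx, wy = qw * qy, wz = qw * qz;
        float[] m = new float[16];
        // Column 0: rotated X axis, scaled by sx.
        m[0] = (1.0f - 2.0f * (yy + zz)) * sx;
        m[1] = 2.0f * (xy + wz) * sx;
        m[2] = 2.0f * (xz - wy) * sx;
        // Column 1: rotated Y axis, scaled by sy.
        m[4] = 2.0f * (xy - wz) * sy;
        m[5] = (1.0f - 2.0f * (xx + zz)) * sy;
        m[6] = 2.0f * (yz + wx) * sy;
        // Column 2: rotated Z axis, scaled by sz.
        m[8] = 2.0f * (xz + wy) * sz;
        m[9] = 2.0f * (yz - wx) * sz;
        m[10] = (1.0f - 2.0f * (xx + yy)) * sz;
        // Column 3: translation. The bottom row stays (0, 0, 0, 1).
        m[12] = tx;
        m[13] = ty;
        m[14] = tz;
        m[15] = 1.0f;
        return m;
    }
}
```

Compared with three general 4x4 multiplications, this needs no multiplications by known zeroes or ones, which is where the saving comes from.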

(edit: thanks for suggesting to add this little method! It’s about 60x faster than doing the three steps manually. 90 million invocations with some translation, rotation and scaling now only take 40 milliseconds… compared to 2.4 seconds with the manual approach.)

And as Riven pointed out, the only reason why JOML is slightly faster is likely due to that field/register optimization. JOML doesn’t do anything special or clever there, because there is simply no other way that you can implement 4x4 matrix multiplication or inversion differently than what JOML or libGDX or any other library does, on the arithmetic side.

However, I do believe that with the is*InsideFrustum() methods, JOML took the fastest possible road (not needing to build a Frustum class or a Plane class or normalizing the plane ‘normals’, or building the planes out of NDC-unprojected points, etc.), while still being generic (coping with arbitrary matrices) and not making use of temporal coherency (that would be the next step in optimization, also proposed by the paper I implemented the algorithm from).
But that optimization falls more in the realm of a real game engine.
So, make sure to use the latest HEAD when benchmarking JOML on these functions.

Roquen · July 10, 2015, 8:47am

Implicitly normalizing quaternions is a bad idea.

KaiHH · July 10, 2015, 12:28pm

The groundwork for temporal coherency caching of AABB-frustum intersection tests is in now!

There is an additional method isAabInsideFrustumMasked() which takes a bitmask of the 6 possible planes to check against the box.
It linearly scales in the number of active planes in that bitmask.

Now, the semantics of isAabInsideFrustum() have changed: instead of returning just true or false, it returns the index of the first tested plane that culled the box.
This index can now serve as the plane mask (applied via 1<<index) into isAabInsideFrustumMasked() which will then only check if that plane “still” culls the box.
Ogre also does something like this and it has the potential to dramatically speed up AABB-frustum culling when applied in a game.
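A minimal sketch of how such a plane mask could be used for temporal coherency. This is not the JOML implementation: the class, the method names, the plane layout {a, b, c, d} and the -1 return value for "not culled" are all assumptions made for this example.

```java
final class MaskedCullSketch {
    /** Bitmask with all six frustum planes enabled. */
    static final int ALL_PLANES = 0x3F;

    /**
     * Tests an axis-aligned box against the planes selected by mask.
     * planes[i] = {a, b, c, d}, where a*x + b*y + c*z + d >= 0 means "inside".
     * box = {minX, minY, minZ, maxX, maxY, maxZ}.
     * Returns the index of the first selected plane that culls the box, or -1.
     */
    static int cullingPlane(float[][] planes, int mask, float[] box) {
        for (int i = 0; i < planes.length; i++) {
            if ((mask & (1 << i)) == 0) continue; // plane not selected
            float[] p = planes[i];
            // Pick the box corner that lies farthest along the plane normal.
            float x = p[0] > 0 ? box[3] : box[0];
            float y = p[1] > 0 ? box[4] : box[1];
            float z = p[2] > 0 ? box[5] : box[2];
            // If even that corner is behind the plane, the whole box is outside.
            if (p[0] * x + p[1] * y + p[2] * z + p[3] < 0) return i;
        }
        return -1;
    }

    /** Retests only the plane that culled the box last frame before doing the full test. */
    static int cullingPlaneCached(float[][] planes, int lastPlane, float[] box) {
        if (lastPlane >= 0 && cullingPlane(planes, 1 << lastPlane, box) == lastPlane) {
            return lastPlane; // still culled by the same plane: one plane test instead of six
        }
        return cullingPlane(planes, ALL_PLANES, box);
    }
}
```

A caller would store the returned index per object and pass it back in on the next frame; objects that stay outside the same plane then cost a single plane test.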

theagentd · July 10, 2015, 1:15pm

Riven:

Hotspot regularly hoists fields into local variables. Once that has happened, it will optimize localvar access aggressively, converting lots of raw memory-operations into purely register-operations. This ‘hoisting’ is not performed as often (in Hotspot) for array operations, as their access patterns are harder to analyze and predict. My guess is that that causes the majority of the difference between conceptually similar operations in LibGDX and JOML. The additional indirection is most likely hidden behind main-memory/cache latency, which is orders of magnitude greater.

TL;DR: try to get Hotspot to keep intermediary results in registers by using instance fields and (therefore) local variables, as long as the cache-thrashing caused by objects does not become the bottleneck.

I intentionally work on a public static array of 4096 translations, scales, etc to get around that, since that’s more like what I do in practice. That being said, it seemed much easier for JOML to get stack allocation than LibGDX. Before I made them static variables, JOML could easily be 10x faster than LibGDX.

I’m fairly sure that because LibGDX’s matrices rely on an internal 16-element array to hold their elements, they get some overhead from that. Hotspot can most likely optimize JOML’s function better.

It’d be nice if you could give me some feedback on my other suggestions and questions as well.

theagentd · July 10, 2015, 2:17pm

Just took a minute to try out the new translationRotateScale() method.

The name is inconsistent. translationRotationScale() would be more in line with the other method names that set the matrix to a given value.
The arguments are weird. [icode]float tx, float ty, float tz, Quaternionf quat, float sx, float sy, float sz[/icode] should be [icode]float tx, float ty, float tz, float qx, float qy, float qz, float qw, float sx, float sy, float sz[/icode], and you should add another convenience method that takes in Vector3f and Quaternionf arguments instead: [icode]Vector3f translation, Quaternionf rotation, Vector3f scale[/icode].

KaiHH · July 10, 2015, 2:26pm

translationRotateScale was chosen because it reflects the intermediate operations translation().rotate().scale() very nicely, and thus can be thought of as a condensed form of calling these three methods, which I find quite nice to read.
And if more condensed methods for other combinations are added in the future, they can follow that scheme, too. So people switching from the “chain” of intermediate operations to the “condensed” form just need to erase those dots.
I agree that the method should use all-primitives and all-objects overloads, though.

theagentd · July 10, 2015, 2:52pm

New results using the new translationRotateScale()!

| Test | LibGDX | JOML | Speedup |
|---|---|---|---|
| Construct matrix from translation+quat+scale | 78 310k bps | 86 339k bps | 10.3% |
| Full bone construction (mul4x3) | 12 171k bps | 32 110k bps | 163.8% |
| Construct matrix + invert | 11 386k bps | 16 633k bps | 46.1% |

LibGDX’s matrix multiply was slower today… >___> Anyway, the gains are real. Looking at the source code of LibGDX’s equivalent of translationRotateScale(), the math done is identical. The only difference is that LibGDX keeps the matrix elements in an array while JOML uses simple fields, and that alone apparently gives JOML a 10.3% boost in performance. Real nice.

theagentd · July 10, 2015, 4:53pm

Culling results:

| Test | My culler | % visible | JOML | % visible |
|---|---|---|---|---|
| Point culling | 67 010k | 1.243% | 68 326k | 1.243% |
| Sphere culling | 65 962k | 1.6414% | 66 992k | 10.3498% |
| AABB culling | 40 193k | 1.483% | 48 327k | 98.517006% |

Well, it’s a tiny bit faster than mine, but also extremely prone to false positives for sphere and AABB culling.

I’m starting to doubt the usefulness of this. Many of the values can be precomputed for better performance in a separate Culler class. The improved performance comes from unrolling the plane loop (I store each plane as a small Plane object), and you can get even further if you precompute the planes I think. In addition, you often want to do a distance based test first to eliminate 90% of all points, and when I enable that my culler actually wins easily. I also believe that to correctly cull volumes (spheres/AABBs) you need to have normalized planes or you’ll get both false positives and negatives.
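To make the normalization point concrete, here is a small sketch of a sphere-plane test. The class and method names are hypothetical and this is neither my culler nor JOML’s code. The plane equation value a*x + b*y + c*z + d is only a true distance when (a, b, c) has unit length, so comparing it against the radius without normalizing gives wrong answers for any plane whose normal is longer or shorter than 1.

```java
final class SpherePlaneSketch {
    /** Scales plane {a, b, c, d} so that its normal (a, b, c) has unit length. */
    static float[] normalize(float[] p) {
        float invLen = 1.0f / (float) Math.sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
        return new float[] { p[0] * invLen, p[1] * invLen, p[2] * invLen, p[3] * invLen };
    }

    /**
     * Returns true if the sphere lies entirely on the negative side of the plane.
     * Only correct if the plane has been normalized first.
     */
    static boolean isOutside(float[] plane, float cx, float cy, float cz, float radius) {
        float signedDistance = plane[0] * cx + plane[1] * cy + plane[2] * cz + plane[3];
        return signedDistance < -radius;
    }
}
```

For example, the plane {2, 0, 0, 2} describes x >= -1 with a normal of length 2. A sphere of radius 1 centered at x = -1.75 overlaps the boundary, yet the unnormalized test reports it as outside (2 * -1.75 + 2 = -1.5 < -1), while the normalized test correctly keeps it (-0.75 is not below -1).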

EDIT: Spheres had a diameter of 2 and AABBs a side of 2.