Java OpenGL Math Library (JOML)

It’s a JIT: https://github.com/JOML-CI/JOML/blob/native-interpreter/native/src/codegen.dasc

The idea is that your regular Java code (chains of JOML methods) produces a native function in memory at runtime.

Here’s the pure java test case: https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/NativeMatrix4f.java#L108
In the current impl, the Sequence object is essentially a function pointer to the native function: you set its arguments and call it.
NativeMatrix is a builder for an opcode buffer that gets compiled to SSE code when you call terminate().

Consider that an animation hierarchy is most often a static tree. A breadth-first walk yields a flat linear ordering of memory chunks. Each logical node can be a quaternion, a vector and the number of children, which, although an integer, can be stored as a single-precision float. It’s cheaper to transform one vector directly than to go through a full linear-algebra matrix. So no struct library is really needed in this case. There are a number of variants on this theme.
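A minimal sketch of what such a flat, breadth-first layout could look like in plain Java (the field grouping, names and the class itself are my own illustration, nothing from JOML):

final class FlatSkeleton {
    // Per node: quaternion (4 floats), translation (3 floats) and the
    // child count stored as a float (8 floats = 32 bytes per node).
    static final int STRIDE = 8;

    final float[] data;

    FlatSkeleton(int nodeCount) {
        data = new float[nodeCount * STRIDE];
    }

    // Transform one vector directly by the node's quaternion and add its
    // translation; cheaper than expanding the quaternion to a matrix first.
    void transform(int node, float[] v) {
        int o = node * STRIDE;
        float qx = data[o], qy = data[o + 1], qz = data[o + 2], qw = data[o + 3];
        float x = v[0], y = v[1], z = v[2];
        // v' = v + qw*t + (q x t), with t = 2*(q x v)
        float tx = 2.0f * (qy * z - qz * y);
        float ty = 2.0f * (qz * x - qx * z);
        float tz = 2.0f * (qx * y - qy * x);
        v[0] = x + qw * tx + (qy * tz - qz * ty) + data[o + 4];
        v[1] = y + qw * ty + (qz * tx - qx * tz) + data[o + 5];
        v[2] = z + qw * tz + (qx * ty - qy * tx) + data[o + 6];
    }

    int childCount(int node) {
        return (int) data[node * STRIDE + 7];
    }
}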

Oh, and I meant to point out that even if you go with a structure where you need to transform multiple vectors, the actual matrix doesn’t need to be stored anywhere…it’s just temporarily expanded in registers.

I’m not entirely sure what you’re talking about, but I’m not using MappedObjects for my skinning. I experimented with it for a physics simulation which had a much higher number of cache misses, and the gains were only on AMD processors which have a much worse memory controller.

Has anyone considered using OpenCL with a CPU device for this type of thing? What Kai is doing is basically compiling and running a simple OpenCL kernel.

I was already wondering that, too. If you go down the rabbit hole that deep, why not use C or OpenCL in the first place?

Yes, OpenCL would be very cool! If anyone wants to contribute an OpenCL backend for JOML, that’d be awesome!
But for the time being I really enjoy x86 assembly programming, even though I had never done any assembly programming before.
And it is really fun! All those segmentation faults. :slight_smile:

Just changed the code generator to use unconditional jumps with DynASM dynamic labels (I just have to say, DynASM is an AMAZING tool!) and for the same 100 bulked operations with 10,000 executions I now get a speedup of over 600%!!!
(edit: if the JVM has a bad day, it’s even over 760% sometimes!)
Exactly the same resulting matrix, just 600% faster. :smiley:
The code generator does not emit linear code anymore; it emits each used matrix function only once and uses unconditional jmps to generated labels/addresses to get back to the next statement.
I am sure there is much more that can be improved (like what @Roquen rightly pointed out), and I will sometime end up at even a 1000% speedup.

EDIT: A bit on how it’s done:
I used a slightly different method compared to what this awesome tutorial (edited!, was wrong link) did. They emit an immediate jump to a dynamic label address, because they emit code for every loop separately (they use that technique for nested loops).
This I could not do, sadly.
I needed a way to do what call/ret does, namely storing the address of a label in a register and then unconditionally jumping to it.
However, some Stack Overflow thread pointed out that doing push eax followed by ret would not be optimal, because it messes with the CPU’s return-address prediction.
Therefore I use rdx to store the address/label of the next operation in the batch, as that was the next register that was free (rcx is used for the “arguments” ByteBuffer), and then let DynASM generate a new free label.
Each matrix operation function is generated only once and always ends with just “jmp rdx”.
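For anyone who has not seen this kind of scheme before, here is a very rough Java analogue of what the opcode buffer plus one-time-emitted operations boil down to. The opcode names and handlers below are made up for illustration; in the real generated code the dispatch loop disappears entirely, because every operation block simply ends in “jmp rdx” to the next dynamic label:

import java.nio.FloatBuffer;

final class SequenceSketch {
    // Hypothetical opcodes; the real opcode buffer is built by NativeMatrix4f.
    static final int OP_MUL = 0, OP_TRANSPOSE = 1;

    // Each operation exists exactly once (like the one-time emitted SSE blocks)
    // and reads its actual arguments from the "arguments" buffer.
    static void call(int[] opcodes, FloatBuffer arguments) {
        for (int pc = 0; pc < opcodes.length; pc++) {
            switch (opcodes[pc]) {
                case OP_MUL:       mul(arguments);       break;
                case OP_TRANSPOSE: transpose(arguments); break;
                default: throw new IllegalStateException("unknown opcode");
            }
        }
    }

    static void mul(FloatBuffer args)       { /* load operands, multiply, store */ }
    static void transpose(FloatBuffer args) { /* load operand, transpose, store */ }
}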

EDIT2: @Spasi, I do not deserve that star: the code generator had a bug actually, and now it’s back to “just” 260% faster, but actually correct. :slight_smile:

EDIT3: However, since the code size does not grow as fast as before, doing a batch of 1000 matrix multiplications is 360% faster, and 10,000 also.

Now with the dynamic label jumping (aka threaded code) this is closer to an embedded Forth than ever.

Yes. And thanks for that “threaded code” link! Did not know of that code generation “scheme”.
I am still new to native code, and that technique seemed to be the most natural thing to do to get small and fast code. :slight_smile:
I also think people can find a lot of resemblances to it in many other languages and runtime environments.

It’d be interesting if you could make something I could test in the same way I tested the earlier stuff. I’d just need native implementations of mul (preferably mul4x3) and translationRotateScale().

I’m working on making the jitted ‘matrix construction templates’ (as I call them) work for transforming FloatBuffers containing many matrices at once. Would that be a relevant use case, @theagentd? That’s what I got out of your prior example with queues, and it seems like a use that would actually benefit from this tech (one JNI call to transform N matrices per unique op sequence).

A working implementation of mul is already in.
(edit: I am adding a translationRotateScale as well, though I doubt it will be any faster than Java’s, but we’ll see. Just need to figure out how to vectorize that best. :))

I keep a basic test always up to date:
https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/NativeMatrix4f.java
(the main method at the end)

Although the API is currently very far from beautiful: it needs some boilerplate code and is easily used very wrongly. :slight_smile:
One thing I had recently was completely freezing my PC by doing 10 million invocations of NativeMatrix4f.mul(), which was quite bad, since that created a very large opcode buffer and then tried to generate executable code from it… :smiley:
It is safe to do around/at most 10,000 operations on a matrix before terminate()'ing it to JIT the native code.

Do you need a pre-compiled Win64 library?
(edit: oh, and I think I am only going to support Windows x64 during the development phase; it’s just so much pain to spool up a Linux VM :slight_smile: )

Ehhhhh, I think I’ll wait until it’s a bit more ready then. X___X

To describe my entire use-case a bit more: Each bone is represented by a translation, an orientation and a scale. A bone’s parameters are interpolated using slerp/lerp from permanently loaded animation bones, which have the same parameters. I also have some other functions that allow the user of the engine to override the parameters of a bone to a fixed user-specified value.

After a bone’s parameters have been interpolated/updated, the bone is transformed by its parent.


parentOrientation.transform(translation).scl(parentScale).add(parentTranslation); // rotate translation by parentOrientation, then scale by parentScale, then add parentTranslation
orientation.mulLeft(parentOrientation);
scale.scl(parentScale);

After that, the bones are all converted to matrices, multiplied with the (precomputed) inverse bind-pose matrices of the model and finally stored in a ByteBuffer. This is by far the most expensive part, actually.
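For reference, that last step is roughly the following when written against JOML’s plain Matrix4f API (translationRotateScale/mul/get); the per-bone field names and the method shape are just my illustration, not the actual engine code:

import java.nio.FloatBuffer;
import org.joml.Matrix4f;
import org.joml.Quaternionf;
import org.joml.Vector3f;

final class SkinningUpload {
    // Build each bone's world matrix from translation/orientation/scale,
    // multiply with the precomputed inverse bind-pose matrix and write the
    // 16 floats (column-major) into the buffer that gets uploaded.
    static void writeSkinningMatrices(Vector3f[] translation, Quaternionf[] orientation, Vector3f[] scale,
                                      Matrix4f[] inverseBindPose, FloatBuffer dest) {
        Matrix4f tmp = new Matrix4f();
        for (int i = 0; i < inverseBindPose.length; i++) {
            tmp.translationRotateScale(translation[i], orientation[i], scale[i])
               .mul(inverseBindPose[i])
               .get(16 * i, dest);
        }
    }
}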

@theagentd: Your first proposal of an API, with your use case of the queues and the matrices being created on that queue, was actually pretty good and I did it this way now. Feels nice.

You can now do something like this:


Sequence seq = new Sequence();
NativeMatrix4f m1 = new NativeMatrix4f(seq);
NativeMatrix4f m2 = new NativeMatrix4f(seq);
{
    m1.identity().rotateZ((float) Math.PI / 2.0f);
    m2.identity().rotateX((float) Math.PI).transpose();
    m1.mul(m2);
    ...
}
seq.call();
// Store the NativeMatrix4f 'm1' into a Matrix4f:
Matrix4f m3 = new Matrix4f();
m1.get(m3);

The sequence shown above is of course absurdly simple, but you get the point. :slight_smile:
I am about to translate the first (simple) Matrix4f methods to the NativeMatrix4f. Those provide the same flexibility that the existing Matrix4f has: Operate on “this” or store the operation result in a “dest” matrix.
You can build pretty complex transformations with that, which execute when you call “call()” on the sequence.
Every NativeMatrix4f holds its NIO Buffer of at least 4*16 bytes. This NIO Buffer can of course be shared with many NativeMatrix4f instances, each of them using their own offset into the buffer.

EDIT: The API begins to solidify (if one can say that this early). Once a single Sequence instance has been terminated (either by explicitly calling terminate() or implicitly by the first call()), it can now be reused (as intended in the first place) for multiple invocations of the same sequence of operations that are already stored in it and jitted. This is necessary so that the actual arguments to the sequenced operations can be updated via the builder classes (NativeMatrix4f, NativeVector4f, …) while the client code looks the same in both cases (terminated and not terminated; i.e. terminating an already terminated Sequence does nothing).
There are validations in place that check whether you always call the same sequence of methods on that Sequence, because breaking the sequence can have unpredictable consequences (most likely a process crash).
This is because once the sequence is terminated, its operations are jitted into a native function, which then simply expects the actual arguments in a given order that was determined when the sequence was written the first time.
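To make the reuse idea concrete, here is a small sketch built only from the calls shown above; how the arguments get re-bound internally is of course up to the implementation:

Sequence seq = new Sequence();
NativeMatrix4f m = new NativeMatrix4f(seq);
Matrix4f result = new Matrix4f();
for (int frame = 0; frame < 3; frame++) {
    // The exact same sequence of builder calls every iteration; after the
    // first call() the operations are already jitted, so subsequent
    // iterations only update the arguments of the native function.
    m.identity().rotateZ((float) Math.PI / 2.0f * frame);
    seq.call();    // first iteration terminates implicitly, then executes
    m.get(result); // copy the native result back into a plain Matrix4f
}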

The reason I proposed that was for the simplicity of it. It might not have as insanely good performance as possible alternatives, but it does seem extremely easy to implement. The only real addition you have to make is to create a Sequence and make the matrices “native”. That’s convenient enough for people to be able to use it in every-day scenarios.

Okay, having investigated the code HotSpot generates, I must say it is doing a pretty darn good job. Like @Riven said, HotSpot keeps fields in registers, and that even over many method invocations. It looks like it knows that the same object is used again in a subsequent method and uses barrier checks to see if that is still the case; if it is not, it spills the values into main memory.
HotSpot also seems to “evolve” the native code it generates for a method.
For a simple test case I did with Matrix4f.mul(), it generated that method four times, each time with slightly more optimized instructions, keeping more things in registers as opposed to the stack or main memory in general.
So far, JOML only outperforms HotSpot for the first 10,000 or so invocations.
After that, JOML is on par with and sometimes even slower than HotSpot, and it looks like this is because HotSpot keeps all the matrix fields in XMM registers, which is amazing.
However, JOML can also make use of that: if for example it sees a sequence of two matrix operations that operate on the same source and destination matrices, it need not spill the computed matrix columns to main memory again, but can keep them in XMM registers.
Let’s see how that will improve, since I do believe that memory latency is currently the bottleneck.
After that is eliminated, in the optimal case the only differences between HotSpot and JOML would be that HotSpot still only emits scalar instructions whereas JOML uses vector instructions, and that JOML has the JNI overhead. So in theory, JOML could be at most 4x as fast as HotSpot, minus the JNI overhead.

Your experiments are very interesting. While I don’t make 3D games, it’s nice to see how code can be heavily optimised.

I’ve read that Java’s JIT compiler HotSpot uses SIMD (single instruction, multiple data) in array copies and fills.

I haven’t heard of HotSpot using SIMD in anything else, so your changes are a very cutting-edge optimisation. It will be interesting to see if there will actually be a significant speed-up.

Yepp. I am beginning to doubt that there will be really significant benefits from that, probably 2x as fast. And the amount of work needed is now becoming a bit disproportionate to the actual gains. But so far it has been an interesting venture, and I’ve ordered “Modern X86 Assembly Language Programming” for future projects.

So the bottom line I take from all this is probably: Whoever thinks that Java is slow,… think again! :slight_smile:

I went through similar discoveries when working on LibStruct.

Main lessons learnt would be:

  1. retaining data in a register is crucial for performance.
  2. memory access (even L1 cache) is slow.
  3. most diehard optimizations will be hidden/thwarted by memory latency.

The only way to do vector math really quickly is to write dedicated ASM that takes #1 into account and works on bulk data, with loop unrolling in ASM and minimizing the number of jumps, conditional or not. Also note that SSE processes 128 bits in one go (or 256 bits with AVX), but due to latency my typical measured speedup was a factor of 1.7-2.2 in real-world scenarios; in micro-benchmarks the gains were greater, but… who cares.

Usually it’s more productive to create/generate HotSpot-friendly bytecode than to generate ASM, as your ASM is bound to lack all kinds of peephole optimizations (like complex reordering of instructions) or maybe even basic stuff like memory pre-fetching. Let HotSpot do its magic, even if that means the lack of SIMD.