I posted it above.
http://www.java-gaming.org/?action=pastebin&id=1307
It’s only faster when the distance test is used first, and when the level is significantly bigger than the frustum, so that the distance test actually culls something.
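For illustration, a minimal sketch of that ordering (all names hypothetical, not the actual pastebin code): the cheap squared-distance rejection runs first, so the six-plane sphere test is only paid for by objects near the camera.
[icode]
// Hypothetical types/names; only the test ordering matters here.
interface Frustum { boolean testSphere(float x, float y, float z, float r); }

static boolean isVisible(float cx, float cy, float cz, float radius,
                         float eyeX, float eyeY, float eyeZ,
                         float maxDist, Frustum frustum) {
    float dx = cx - eyeX, dy = cy - eyeY, dz = cz - eyeZ;
    float r = maxDist + radius;
    if (dx * dx + dy * dy + dz * dz > r * r)
        return false; // distance-culled; the plane tests never run
    return frustum.testSphere(cx, cy, cz, radius);
}
[/icode]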
Quaternion functions that implicitly normalize something:
I firmly think that normalizing should be the user’s responsibility. In practice all functions expect normalized inputs, so it’s hardly difficult for the user to keep track of. Those extra unnecessary normalizations do cost some performance, after all.
Why? Do you like having uninterpolated or distorted bone animation?
By no implicit normalization I am only referring to operations on quaternions. Conversions in and out need to do the correct thing. ‘unitInvert’ just needs to be properly named, but the conjugate is an axiomatic function.
The difference function is another poorly named function, and no normalization should be performed. Of more general interest would be cmul: A*·B and mulc: A·B* (multiplying by the conjugate on the left or on the right), which cover this and are much more efficient.
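For concreteness, a sketch of cmul (not JOML API; assumes JOML-style public x/y/z/w fields and an (x, y, z, w) constructor): it fuses conjugate(A)·B into one expression, which for unit A is exactly the “difference” rotation inverse(A)·B, with no separate invert or normalize step.
[icode]
import org.joml.Quaternionf;

// conjugate(a) * b in one fused expression (sketch).
static Quaternionf cmul(Quaternionf a, Quaternionf b) {
    return new Quaternionf(
        a.w * b.x - a.x * b.w - a.y * b.z + a.z * b.y,
        a.w * b.y + a.x * b.z - a.y * b.w - a.z * b.x,
        a.w * b.z - a.x * b.y + a.y * b.x - a.z * b.w,
        a.w * b.w + a.x * b.x + a.y * b.y + a.z * b.z);
}
[/icode]
mulc is the mirror image, conjugating the right operand instead.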
There are a large number of useful functions I could suggest. Off the top of my head:
Cayley (stereographic) & inverse projections
unit log
pure exp
orientation change: get Q that rotates unit vector A into unit vector B (this reduces further if a matrix is actually desired; see the sketch after this list)
unit square root
unit square
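For the orientation-change item above, a minimal sketch (assuming JOML-style Vector3f/Quaternionf types; the antiparallel case A ≈ -B is deliberately left out): build q = normalize([cross(A, B), 1 + dot(A, B)]), which encodes the half-angle for free.
[icode]
import org.joml.Quaternionf;
import org.joml.Vector3f;

// Quaternion rotating unit vector a onto unit vector b (sketch).
// Degenerate for a ≈ -b; a real implementation needs a fallback axis.
static Quaternionf rotationTo(Vector3f a, Vector3f b) {
    Vector3f c = new Vector3f();
    a.cross(b, c);             // axis, weighted by sin(angle)
    float w = 1.0f + a.dot(b); // 1 + cos(angle) = 2*cos^2(angle/2)
    return new Quaternionf(c.x, c.y, c.z, w).normalize();
}
[/icode]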
Having a function called div is IMO awkward since there are right and left versions, but since most people only care about unit quaternions I probably wouldn’t include either. Those that do care will most of the time be able to carry through the derivation of the total desired operation.
Remember that there’s absolutely nothing special about unit quaternions. Non-unit quaternions have uses.
On slerp: implementations like this have zero use cases. I’m too lazy to have this conversation from my cell phone though. Basically slerp is just linearly parameterizing an arc length (2D), and the angle is no more than PI/2 for random input. It’s always reducible to lerp; the only question is the method of reduction. If you want a quick answer, try tweeting @rygorous.
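For what it’s worth, the reduction being alluded to is commonly called nlerp; a sketch, assuming JOML-style public x/y/z/w fields and unit inputs: flip one input into the same hemisphere so the arc is at most PI/2, lerp componentwise, renormalize.
[icode]
import org.joml.Quaternionf;

static Quaternionf nlerp(Quaternionf a, Quaternionf b, float t) {
    float dot = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    float s = dot < 0.0f ? -1.0f : 1.0f; // pick the shorter arc
    return new Quaternionf(
        a.x + t * (s * b.x - a.x),
        a.y + t * (s * b.y - a.y),
        a.z + t * (s * b.z - a.z),
        a.w + t * (s * b.w - a.w)).normalize();
}
[/icode]
It traces the same arc as slerp but is not constant-speed; correcting (or ignoring) that is the “method of reduction” question.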
Ok but only the unit quaternions represent rotations.
Not true. All quaternions other than zero represent a rotation.
We’ve decided to make the move to JOML for Insomnia after the release of Demo 7.
Glad to hear that! If there is anything we can do to make JOML easier or faster for you, please let us know!
Additionally, I am investigating a possible performance improvement for JOML using native SSE code. See this enhancement issue: https://github.com/JOML-CI/JOML/issues/30
I’d be happy to hear about any suggestions you have on this.
By the way: org.joml:joml:1.4.0
and org.joml:joml-mini:1.4.0
are now on Maven Central (the first release on Maven Central actually).
Intermediary snapshot releases will still be available on Sonatype’s snapshot repository:
https://oss.sonatype.org/content/repositories/snapshots and the next joml release on Central will likely be 1.5.0 in two weeks.
There are much bigger gains to be had than SSE atm, notably lowering memory stalls. These two go hand in hand: the latter needs to be addressed before the former has what it needs. And as mentioned previously, there’s quite a bit of reworking that can be performed.
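To make the memory-stall point concrete, a sketch (hypothetical class, not JOML API): thousands of individually allocated matrix objects end up scattered across the heap, so iterating them stalls on cache misses; packing them back to back in one float[] keeps accesses linear, which is also the layout any future SSE batch code would want.
[icode]
// Contiguous, column-major storage for n 4x4 matrices (sketch).
final class MatrixBatch {
    final float[] data;                  // 16 floats per matrix, back to back
    MatrixBatch(int n) { data = new float[16 * n]; }
    float get(int i, int col, int row) { // element (col,row) of matrix i
        return data[16 * i + 4 * col + row];
    }
}
[/icode]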
@Roquen, I really like your comments.
Btw, aren’t you supposed to be on vacation by now? So why are you on your laptop/mobile?
Enjoy the beach/sun/mountains/culture or wherever it is you are…
Now, I like your comments because they always fall into one of the following categories:
a) this is bad
b) that should not be done this way
c) people would never want to use this
Every time I read one of your comments, I can file it under a), b) or c).
That makes it really easy for me, thanks!
But seriously now: I would happily implement any suggestions you have, if they:
Reference (page 4):
http://www.cs.ucr.edu/~vbz/resources/quatut.pdf
[quote]If N(q) = 1, then q = [v̂ sin Ω, cos Ω] acts to rotate around unit axis v̂ by 2Ω[/quote]
[quote]we can henceforth assume q is a unit quaternion[/quote]
[quote]it has to be a unit quaternion[/quote]
Edit: Sorry for going off topic.
Split off the quaternion topic since it’s not JOML-specific. I am on vacation, cell only, so more terse than usual.
I can back up all my commentary… ask for specifics if my reasoning isn’t clear.
Just one info for high-performance people like @theagentd.
I implemented a simple runtime JIT code generator using DynASM, which currently supports Matrix4f.mul(Vector4f), making use of SSE instructions (movaps, shufps, mulps and addps).
The astonishing result: even with JNI overhead, this function is 8% faster than the corresponding scalar Java code on an i3-2120…
(I did a benchmark with 100 million invocations.) The resulting numbers were the same, but the JNI/SSE version was faster.
So JOML is going SSE and will get an optional acceleration JNI library!
The real wins should be in bulk operations within one JNI call - any numbers on that?
Yepp. That’s what I am after in the long run, and why I am using DynASM runtime code-generation. But I am really amazed at how much faster even a single non-batched operation is.
I will update here when I get to implement the batching soon.
If you go that route, remember you’ll be needing to provide precompiled binaries for x86 and amd64, for Windows, Linux and MacOS.
Cas
Don’t forget ARMv6 and ARMv7, as this might be valuable on Android :point:
Something that would maximize the throughput would be the ability to queue up get()s as well. In my case I will only be doing 3 functions:
[icode]
matrix.translationRotateScale(…).mul(…).get(directBuffer);
[/icode]
This tiny code snippet might be run 100 000 times per frame though. That’s still 100 000 JNI calls. If I could simply queue up the get()s as well, we could get away with 1 JNI call at the end for each thread instead. Maybe it should be possible for the user to create the queue and then pass it in as an argument to one or more NativeMatrix4 or whatever it’ll be called. So it’d basically be something like this:
[icode]
// Initialization:
for (int i = 0; i < numThreads; i++) {
    queues[i] = new Queue();
    matrices[i] = new NativeMatrix4f(queues[i]);
}

// Usage:
NativeMatrix4f matrix = matrices[threadID];
for (int i = ...; ...) {
    matrix.translationRotateScale(...).mul(...).get(directBuffer);
}
queues[threadID].execute();

// At the end:
for (int i = 0; i < numThreads; i++) {
    queues[i].dispose();
}
[/icode]
EDIT: I have a feeling that for the arguments to translationRotateScale() to work they’d also need to use special classes?
Thanks for your hints about your usage scenarios!
These are really helpful for me, since I can then anticipate a possible solution to satisfy those.
The more I play with this the more I want the “NativeMatrix4f” class to be as lightweight and close-to-the-metal as possible.
My current favourite solution is like this:
Having hand-written (surely suboptimal) SSE code for 4x4 matrix multiplication in place, I did a benchmark with a bulk operation of 100 matrix multiplications.
The bulk was executed 1000 times and compared to a loop of 100,000 iterations with classic Matrix4f.mul(Matrix4f).
The result was: 12045.817 µs classic JOML versus 2639.805 µs JNI/SSE.
That’s a speedup of almost 4.6x, i.e. almost 360% faster!
There is, however, a very delicate limit on the size of the generated code if it is to stay in the L1 instruction cache, it seems.
If we do more than those 100 bulk matrix operations, the speedup drops to around 150%, because I currently emit the code of all 100 matrix multiplications linearly in memory (which becomes quite big), instead of what I should be doing: emitting the code of each operation just once and then, for each queued operation, unconditionally jumping into the relevant code.
I will do that now.
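For the curious, a sketch of what that dispatch scheme could look like from the Java side (all names hypothetical): each operation kind has its SSE kernel emitted exactly once, and the queue only records opcodes plus operand addresses, which the native side walks in a single JNI call, jumping into the shared kernel per operation.
[icode]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class OperationQueue {
    static final byte OP_MUL_4X4 = 1; // one shared, pre-emitted SSE kernel
    private final ByteBuffer ops = ByteBuffer
            .allocateDirect(64 * 1024).order(ByteOrder.nativeOrder());

    void mul4x4(long leftAddr, long rightAddr) {
        ops.put(OP_MUL_4X4).putLong(leftAddr).putLong(rightAddr);
    }
    void execute() {
        nativeExecute(ops, ops.position()); // native loop dispatches per opcode
        ops.clear();
    }
    private static native void nativeExecute(ByteBuffer ops, int length);
}
[/icode]
The generated code then stays tiny and hot in L1 no matter how many operations are queued.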
Hmm. I was just going to say that unless there’s a significant gain from adding this kind of complexity, it would not be worth it. I’ve tried libraries like Riven’s MappedObject for better memory locality (although not for WSW), and even though the performance gains were in some cases very significant (sometimes 3x), it was simply not worth it due to the added complexity of doing pretty much anything. A 4.5x faster matrix multiplication method would however be nice to have, but the added complexity has to be optional. I wouldn’t mind spending some time implementing native matrices for my skeleton animation, for example, but I’m not going to want to do the same for every single view matrix calculation I have. As an additional high-performance alternative for performance-critical parts like bone calculations, it does however sound worth it.
How does the performance of 4x3 “matrix multiplications” look with SSE? Would other methods be possible to optimize with such instructions as well? Specifically, the performance of translationRotateScale() would be interesting for me, but there might be others that are of use to other people as well. Might not be worth looking into too much though, but if you got the matrix multiplications 350% faster, translationRotateScale() may end up being the bottleneck for me in the end.
Your previous post was a bit confusing to me. If I understood it right, you’re saying that I could create a small native function by giving JOML an opcode buffer, which is then somehow compiled to a native function??? So… software compute shaders for Java? O_o
Oh, and one more thing. There’s an annoying quirk when working with GLSL. A mat4x3 is a pretty bad GLSL type, as it is treated as 4 vec3s by the compiler, which when used for uniforms has the same cost as 4 vec4s. A mat3x4, however, is 3 vec4s, which only counts as 3 uniform slots. For this reason, I actually store my 4x3 skinning matrices transposed in a texture buffer and load them with three texture lookups each, like this:
[icode]
// Inside the skinning shader: reconstruct one weighted 4x3 bone matrix
// from three vec4 rows stored in a texture buffer.
int index = offset + boneIndex * 3;
return transpose(mat3x4(
    texelFetch(sampler, index + 0),
    texelFetch(sampler, index + 1),
    texelFetch(sampler, index + 2)) * weight);
[/icode]
The thing is that the transpose() function in GLSL is completely optimized away by all compilers, so this kind of packing has no overhead in the shader.
What I’m getting at is that it’d be useful to have functions that can store a matrix in a buffer as both a 4x4 and a 4x3 matrix, AND transposed versions of those as well.
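Until such methods exist, a sketch of the transposed 4x3 store (assuming JOML’s public m00..m33 fields, where mCR is column C, row R): write the three rows of the upper 4x3 part as three vec4s, 12 floats instead of 16, matching the mat3x4 layout above.
[icode]
import java.nio.FloatBuffer;
import org.joml.Matrix4f;

// Hypothetical helper: store m as a transposed 4x3, i.e. a GLSL mat3x4.
static void getTransposed4x3(Matrix4f m, FloatBuffer buf) {
    buf.put(m.m00).put(m.m10).put(m.m20).put(m.m30)  // row 0
       .put(m.m01).put(m.m11).put(m.m21).put(m.m31)  // row 1
       .put(m.m02).put(m.m12).put(m.m22).put(m.m32); // row 2
}
[/icode]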