I posted it above.
http://www.java-gaming.org/?action=pastebin&id=1307
It’s only faster when the distance test is used first, and when the level is significantly bigger than the frustum, so that the distance test actually culls something.
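For illustration, a minimal sketch of that ordering (all names hypothetical, not the actual pastebin code): the cheap squared-distance rejection runs first, so the six-plane sphere test is only paid for by objects near the camera.
[icode]
// Hypothetical types/names; only the test ordering matters here.
interface Frustum { boolean testSphere(float x, float y, float z, float r); }

static boolean isVisible(float cx, float cy, float cz, float radius,
                         float eyeX, float eyeY, float eyeZ,
                         float maxDist, Frustum frustum) {
    float dx = cx - eyeX, dy = cy - eyeY, dz = cz - eyeZ;
    float r = maxDist + radius;
    if (dx * dx + dy * dy + dz * dz > r * r)
        return false; // distance-culled; the plane tests never run
    return frustum.testSphere(cx, cy, cz, radius);
}
[/icode]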
Quaternion functions that implicitly normalize something:
I firmly think that normalizing should be the user’s responsibility. In practice all functions expect normalized inputs, so it’s hardly difficult for the user to keep track of. Those extra unnecessary normalizations do cost some performance, after all.
Why? Do you like having uninterpolated or distorted bone animation?
By no implicit normalization I am only referring to operations on quaternions. Conversions in and out need to do the correct thing. ‘unitInvert’ just needs to be properly named, but the conjugate is an axiomatic function.
The difference function is another poorly named function, and no normalization should be performed. Of more general interest would be cmul: A*·B and mulc: A·B* (multiplying by the conjugate on the left or on the right), which cover this and are much more efficient.
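For concreteness, a sketch of cmul (not JOML API; assumes JOML-style public x/y/z/w fields and an (x, y, z, w) constructor): it fuses conjugate(A)·B into one expression, which for unit A is exactly the “difference” rotation inverse(A)·B, with no separate invert or normalize step.
[icode]
import org.joml.Quaternionf;

// conjugate(a) * b in one fused expression (sketch).
static Quaternionf cmul(Quaternionf a, Quaternionf b) {
    return new Quaternionf(
        a.w * b.x - a.x * b.w - a.y * b.z + a.z * b.y,
        a.w * b.y + a.x * b.z - a.y * b.w - a.z * b.x,
        a.w * b.z - a.x * b.y + a.y * b.x - a.z * b.w,
        a.w * b.w + a.x * b.x + a.y * b.y + a.z * b.z);
}
[/icode]
mulc is the mirror image, conjugating the right operand instead.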
There are a large number of useful functions I could suggest. Off the top of my head:
Cayley (stereographic) & inverse projections
unit log
pure exp
orientation change: get Q that rotates unit vector A into unit vector B (this reduces further if a matrix is actually desired; see the sketch after this list)
unit square root
unit square
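For the orientation-change item above, a minimal sketch (assuming JOML-style Vector3f/Quaternionf types; the antiparallel case A ≈ -B is deliberately left out): build q = normalize([cross(A, B), 1 + dot(A, B)]), which encodes the half-angle for free.
[icode]
import org.joml.Quaternionf;
import org.joml.Vector3f;

// Quaternion rotating unit vector a onto unit vector b (sketch).
// Degenerate for a ≈ -b; a real implementation needs a fallback axis.
static Quaternionf rotationTo(Vector3f a, Vector3f b) {
    Vector3f c = new Vector3f();
    a.cross(b, c);             // axis, weighted by sin(angle)
    float w = 1.0f + a.dot(b); // 1 + cos(angle) = 2*cos^2(angle/2)
    return new Quaternionf(c.x, c.y, c.z, w).normalize();
}
[/icode]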
Having a function called div is IMO awkward since there are right and left versions, but since most people only care about unit quaternions I probably wouldn’t include either. Those that do care will most of the time be able to carry through the derivation of the total desired operation.
Remember that there’s absolutely nothing special about unit quaternions. Non-unit quaternions have uses.
On slerp: implementations like this have zero use cases. I’m too lazy to have this conversation from my cell phone though. Basically slerp is just linearly parameterizing an arc length (2D), and the angle is no more than PI/2 for random input. It’s always reducible to lerp; the only question is the method of reduction. If you want a quick answer, try tweeting @rygorous.
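For what it’s worth, the reduction being alluded to is commonly called nlerp; a sketch, assuming JOML-style public x/y/z/w fields and unit inputs: flip one input into the same hemisphere so the arc is at most PI/2, lerp componentwise, renormalize.
[icode]
import org.joml.Quaternionf;

static Quaternionf nlerp(Quaternionf a, Quaternionf b, float t) {
    float dot = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    float s = dot < 0.0f ? -1.0f : 1.0f; // pick the shorter arc
    return new Quaternionf(
        a.x + t * (s * b.x - a.x),
        a.y + t * (s * b.y - a.y),
        a.z + t * (s * b.z - a.z),
        a.w + t * (s * b.w - a.w)).normalize();
}
[/icode]
It traces the same arc as slerp but is not constant-speed; correcting (or ignoring) that is the “method of reduction” question.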
Ok but only the unit quaternions represent rotations.
Not true. All quaternions other than zero represent a rotation.
We’ve decided to make the move to JOML for Insomnia after the release of Demo 7.
Glad to hear that! If there is anything we can do to make JOML easier or faster for you, please let us know!
Additionally, I am investigating a possible performance improvement for JOML using native SSE code. See this enhancement issue: https://github.com/JOML-CI/JOML/issues/30
I’d be happy to hear about any suggestions you have on this.
By the way: org.joml:joml:1.4.0
and org.joml:joml-mini:1.4.0
are now on Maven Central (the first release on Maven Central actually).
Intermediary snapshot releases will still be available on Sonatype’s snapshot repository:
https://oss.sonatype.org/content/repositories/snapshots and the next joml release on Central will likely be 1.5.0 in two weeks.
There are much bigger gains to be had than SSE atm, notably lowering memory stalls. These two go hand in hand: the latter needs to be addressed before the former has what it needs. And as mentioned previously, there’s quite a bit of reworking that can be performed.
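To make the memory-stall point concrete, a sketch (hypothetical class, not JOML API): thousands of individually allocated matrix objects end up scattered across the heap, so iterating them stalls on cache misses; packing them back to back in one float[] keeps accesses linear, which is also the layout any future SSE batch code would want.
[icode]
// Contiguous, column-major storage for n 4x4 matrices (sketch).
final class MatrixBatch {
    final float[] data;                  // 16 floats per matrix, back to back
    MatrixBatch(int n) { data = new float[16 * n]; }
    float get(int i, int col, int row) { // element (col,row) of matrix i
        return data[16 * i + 4 * col + row];
    }
}
[/icode]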
@Roquen, I really like your comments.
Btw, aren’t you supposed to be on vacation by now? So why are you on your laptop/mobile?
Enjoy the beach/sun/mountains/culture or wherever it is you are…
Now, I like your comments because they always fall into one of the following categories:
a) this is bad
b) that should not be done this way
c) people would never want to use this
Every time I read one of your comments, I can file it under a), b) or c).
That makes it really easy for me, thanks!
But seriously now: I would happily implement any suggestions you have, if they:
Reference (page 4):
http://www.cs.ucr.edu/~vbz/resources/quatut.pdf
[quote]If N(q) = 1, then q = [v̂ sin Ω, cos Ω] acts to rotate around unit axis v̂ by 2Ω[/quote]
[quote]we can henceforth assume q is a unit quaternion[/quote]
[quote]it has to be a unit quaternion[/quote]
Edit: Sorry for going off topic.
Split off the quaternion topic since it’s not JOML-specific. I am on vacation, cell only, so more terse than usual.
I can back up all my commentary… ask for specifics if my reasoning isn’t clear.
Just one info for high-performance people like @theagentd.
I implemented a simple runtime JIT code generator using DynASM, which currently supports Matrix4f.mul(Vector4f), making use of SSE instructions (movaps, shufps, mulps and addps).
The astonishing result: even with JNI overhead, this function is 8% faster than the corresponding scalar Java code on an i3-2120…
(I did a benchmark with 100 million invocations.) The resulting numbers were the same, but the JNI/SSE version was faster.
So JOML is going SSE and will get an optional acceleration JNI library!
The real wins should be in bulk operations within one JNI call - any numbers on that?
Yepp. That’s what I am after in the long run, and why I am using DynASM runtime code-generation. But I am really amazed at how much faster even a single non-batched operation is.
I will update here when I get to implement the batching soon.
If you go that route, remember you’ll be needing to provide precompiled binaries for x86 and amd64, for Windows, Linux and MacOS.
Cas
Don’t forget ARMv6 and ARMv7, as this might be valuable on Android :point:
Something that would maximize the throughput would be the ability to queue up get()s as well. In my case I will only be doing 3 functions:
[icode]
matrix.translationRotateScale(…).mul(…).get(directBuffer);
[/icode]
This tiny code snippet might be run 100 000 times per frame though. That’s still 100 000 JNI calls. If I could simply queue up the get()s as well, we could get away with 1 JNI call at the end for each thread instead. Maybe it should be possible for the user to create the queue and then pass it in as an argument to one or more NativeMatrix4 or whatever it’ll be called. So it’d basically be something like this:
[icode]
// Initialization:
for (int i = 0; i < numThreads; i++) {
    queues[i] = new Queue();
    matrices[i] = new NativeMatrix4f(queues[i]);
}

// Usage:
NativeMatrix4f matrix = matrices[threadID];
for (int i = ...; ...) {
    matrix.translationRotateScale(...).mul(...).get(directBuffer);
}
queues[threadID].execute();

// At the end:
for (int i = 0; i < numThreads; i++) {
    queues[i].dispose();
}
[/icode]
EDIT: I have a feeling that for the arguments to translationRotateScale() to work they’d also need to use special classes?
Thanks for your hints about your usage scenarios!
These are really helpful for me, since I can then anticipate a possible solution to satisfy those.
The more I play with this the more I want the “NativeMatrix4f” class to be as lightweight and close-to-the-metal as possible.
My current favourite solution is like this:
Having hand-written (surely suboptimal) SSE code for 4x4 matrix multiplication in place, I did a benchmark with a bulk operation of 100 matrix multiplications.
The bulk was executed 1000 times and compared to a loop of 100,000 iterations with classic Matrix4f.mul(Matrix4f).
The result was: 12045.817 µs classic JOML versus 2639.805 µs JNI/SSE.
That’s a speedup of almost 4.6x, i.e. almost 360% faster!
There is, however, a very delicate limit on the size of the generated code if it is to stay in the L1 instruction cache, it seems.
If we do more than those 100 bulk matrix operations, the speedup drops to around 150%, because I currently emit the code of all 100 matrix multiplications linearly in memory (which becomes quite big), instead of what I should be doing: emitting the code of each operation just once and then, for each queued operation, unconditionally jumping into the relevant code.
I will do that now.
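For the curious, a sketch of what that dispatch scheme could look like from the Java side (all names hypothetical): each operation kind has its SSE kernel emitted exactly once, and the queue only records opcodes plus operand addresses, which the native side walks in a single JNI call, jumping into the shared kernel per operation.
[icode]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class OperationQueue {
    static final byte OP_MUL_4X4 = 1; // one shared, pre-emitted SSE kernel
    private final ByteBuffer ops = ByteBuffer
            .allocateDirect(64 * 1024).order(ByteOrder.nativeOrder());

    void mul4x4(long leftAddr, long rightAddr) {
        ops.put(OP_MUL_4X4).putLong(leftAddr).putLong(rightAddr);
    }
    void execute() {
        nativeExecute(ops, ops.position()); // native loop dispatches per opcode
        ops.clear();
    }
    private static native void nativeExecute(ByteBuffer ops, int length);
}
[/icode]
The generated code then stays tiny and hot in L1 no matter how many operations are queued.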
Hmm. I was just going to say that unless there’s a significant gain from adding this kind of complexity, it would not be worth it. I’ve tried libraries like Riven’s MappedObject for better memory locality (although not for WSW), and even though the performance gains were in some cases very significant (sometimes 3x), it was simply not worth it due to the added complexity of doing pretty much anything. A 4.5x faster matrix multiplication method would however be nice to have, but the added complexity has to be optional. I wouldn’t mind spending some time implementing native matrices for my skeleton animation, for example, but I’m not going to want to do the same for every single view matrix calculation I have. As an additional high-performance alternative for performance-critical parts like bone calculations, it does however sound worth it.
How does the performance of 4x3 “matrix multiplications” look with SSE? Would other methods be possible to optimize with such instructions as well? Specifically, the performance of translationRotateScale() would be interesting for me, but there might be others that are of use to other people as well. Might not be worth looking into too much though, but if you got the matrix multiplications 350% faster, translationRotateScale() may end up being the bottleneck for me in the end.
Your previous post was a bit confusing to me. If I understood it right, you’re saying that I could create a small native function by giving JOML an opcode buffer, which is then somehow compiled to a native function??? So… software compute shaders for Java? O_o
Oh, and one more thing. There’s an annoying quirk when working with GLSL. A mat4x3 is a pretty bad GLSL type, as it is treated as 4 vec3s by the compiler, which when used for uniforms has the same cost as 4 vec4s. A mat3x4, however, is 3 vec4s, which only counts as 3 uniform slots. For this reason, I actually store my 4x3 skinning matrices transposed in a texture buffer and load them with three texture lookups each, like this:
[icode]
// Inside the skinning shader: reconstruct one weighted 4x3 bone matrix
// from three vec4 rows stored in a texture buffer.
int index = offset + boneIndex * 3;
return transpose(mat3x4(
    texelFetch(sampler, index + 0),
    texelFetch(sampler, index + 1),
    texelFetch(sampler, index + 2)) * weight);
[/icode]
The thing is that the transpose() function in GLSL is completely optimized away by all compilers, so this kind of packing has no overhead in the shader.
What I’m getting at is that it’d be useful to have functions that can store a matrix in a buffer as both a 4x4 and a 4x3 matrix, AND transposed versions of those as well.
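Until such methods exist, a sketch of the transposed 4x3 store (assuming JOML’s public m00..m33 fields, where mCR is column C, row R): write the three rows of the upper 4x3 part as three vec4s, 12 floats instead of 16, matching the mat3x4 layout above.
[icode]
import java.nio.FloatBuffer;
import org.joml.Matrix4f;

// Hypothetical helper: store m as a transposed 4x3, i.e. a GLSL mat3x4.
static void getTransposed4x3(Matrix4f m, FloatBuffer buf) {
    buf.put(m.m00).put(m.m10).put(m.m20).put(m.m30)  // row 0
       .put(m.m01).put(m.m11).put(m.m21).put(m.m31)  // row 1
       .put(m.m02).put(m.m12).put(m.m22).put(m.m32); // row 2
}
[/icode]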