Java OpenGL Math Library (JOML)

KaiHH · July 20, 2015, 2:28pm

Okay, org.joml:joml:1.4.1-SNAPSHOT and org.joml:joml-mini:1.4.1-SNAPSHOT contain the changes to Vector3f and Vector4f. There is now also a getTransposed in Matrix3f and Matrix4f.
Fetch it from https://oss.sonatype.org/content/repositories/snapshots if you are using Maven, or just build it from the sources.

cylab · July 20, 2015, 3:11pm

If you now put in some static versions for use with collections and arrays, that would also be great:


// Create Buffers from Vectors and return the new Buffer
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors)

// Add Vectors to a Buffer and return it with the position after the added vectors (+last stride)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer, int position)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer, int position, int stride)

// same with a collection of Vectors
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer, int position)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer, int position, int stride)

// Create Arrays of Vectors from a Buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer)
//  position and stride are for the buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position)
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, int stride)

// Create ArrayLists of Vectors from a Buffer
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer)
//  position and stride are for the buffer
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer, int position)
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer, int position, int stride)

// Copy Vectors from a Buffer to an existing Array, until it's filled, starting at index 0
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, Vector3f[] vectors)
//  position and stride are for the buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, Vector3f[] vectors)
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, int stride, Vector3f[] vectors)

// Copy Vectors from a Buffer to an existing Collection and return this
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, T vectors)
//  position and stride are for the buffer
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, int position, T vectors)
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, int position, int stride, T vectors)

// Same with matrix etc.
...

Also prevent Feature Creep

KaiHH · July 20, 2015, 4:01pm

JOML was designed to be a 3D math library, easily usable with Java/OpenGL bindings such as JOGL or LWJGL, which make use of Java NIO to efficiently interface with native OpenGL.
The last part of the above sentence is also the only reason why JOML even provides NIO FloatBuffer/ByteBuffer conversion methods.
If it wasn’t for NIO being used by Java/OpenGL bindings, then even those conversion methods would not be in JOML.

Therefore I am afraid you are going to have to build those utility methods yourself, as JOML is unlikely to interface with the whole Java Collections API just to convert JOML classes to vectors/lists/arrays/maps/sets and whatnot. That has little to do with a 3D OpenGL math library anymore and more closely resembles an Apache Commons-style library.

So, if you need such functionality for your projects, you can create a joml-util (or such) library containing those methods.
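A minimal sketch of what two of the requested helpers could look like in such a separate utility library. Everything here is an assumption for illustration: the class name VectorBuffers is made up, and Vec3 is a tiny stand-in for org.joml.Vector3f so the code compiles on its own. It uses heap buffers for simplicity; code that hands the buffer to an OpenGL binding would normally want a direct buffer instead.

```java
import java.nio.FloatBuffer;
import java.util.Collection;

/** Hypothetical helper class; not part of JOML. */
final class VectorBuffers {

    /** Minimal stand-in for org.joml.Vector3f so the sketch compiles on its own. */
    static final class Vec3 {
        final float x, y, z;
        Vec3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    private VectorBuffers() {}

    /** Packs the vectors tightly into a new heap buffer and returns it ready for reading. */
    static FloatBuffer toBuffer(Collection<Vec3> vectors) {
        FloatBuffer buffer = FloatBuffer.allocate(vectors.size() * 3);
        for (Vec3 v : vectors) {
            buffer.put(v.x).put(v.y).put(v.z);
        }
        buffer.flip(); // position = 0, limit = number of floats written
        return buffer;
    }

    /**
     * Writes the vectors into an existing buffer using absolute puts, so the
     * buffer's own position is left untouched. 'stride' is the distance in
     * floats between the starts of consecutive vectors (at least 3).
     */
    static FloatBuffer toBuffer(Collection<Vec3> vectors, FloatBuffer buffer, int position, int stride) {
        int index = position;
        for (Vec3 v : vectors) {
            buffer.put(index, v.x).put(index + 1, v.y).put(index + 2, v.z);
            index += stride;
        }
        return buffer;
    }
}
```

With a stride of 4, for example, each vector is followed by one untouched padding float, which is one way to interleave the data with another attribute.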

KaiHH · July 20, 2015, 6:24pm

Had to take today and tomorrow off from work. So I used the time to read about the x86 instruction encoding rules, which are nicely explained here and here, and debugged DynASM’s Lua sources.
My plan is still to build a small x86 and x86-64 assembler for Java as a first step towards a nice “high-level” assembly language for Java. “High-level” in the sense that one can use “variables” with automatic register allocation and use the SSE/AVX instructions via intrinsic functions.
The x86 encoding rules are wicked and “historically evolved,” but they can still be schematically laid out in the form of a table, like this awesome site does.
Now I am working with jsoup to auto-generate an encoding algorithm from that table, which will use pattern matching to find the correct encoding rule for e.g. “mov eax, ebx”, because “mov” has various possible encodings based on the operand kinds.
DynASM does the same in their Lua sources, only differently.

KaiHH · July 21, 2015, 12:00pm

@theagentd: NativeMatrix4f.translationRotateScale() is in. Once you finished your WSW Demo 7, you might want to have a look at it.
It times at around 486% faster in 20,000 invocations with cold HotSpot and stabilizes at around 260% faster in 20,000 invocations with warmed HotSpot.
Now I will take care of doing a linux version of all that.
Okay, the linux x64 version works, and it should be even faster than the win64 version, since linux’s calling convention does not seem to have the non-volatile registers that must be saved to the stack on win64.

theagentd · July 21, 2015, 2:02pm

I’m not entirely sure how to try it out. I could extend the benchmarks I’ve made so far to also test NativeMatrix4f I guess, but I’m not sure how the code works anymore? @__@

KaiHH · July 21, 2015, 2:03pm

Here are functioning JUnit testcases for all so far implemented native methods: https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/test/NativeMatrix4fTest.java
Prebuilt win64 and linux64 shared libraries are here: https://github.com/JOML-CI/JOML/tree/native-interpreter/native

It works the way you laid out in your initial API design.
Generally, what one wants to do with NativeMatrix4f is to batch as many invocations on it as possible via a given Sequence (passed as a constructor argument), and at the end call Sequence.call() to execute that sequence. The .get() methods are also delayed/batched.
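To illustrate the batching idea only, here is a pure-Java sketch of a deferred sequence. The names DeferredSequence and DeferredVec2 are hypothetical and this is not how the native-interpreter branch is implemented (that one encodes opcodes and arguments into buffers and runs them in native code); it just shows that nothing, including get(), takes effect until call() runs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, pure-Java stand-in for the batching idea; not JOML's native Sequence. */
final class DeferredSequence {
    private final List<Runnable> operations = new ArrayList<>();

    /** Records an operation; nothing is executed yet. */
    void record(Runnable operation) {
        operations.add(operation);
    }

    /** Executes every recorded operation in order, then clears the batch. */
    void call() {
        for (Runnable operation : operations) {
            operation.run();
        }
        operations.clear();
    }
}

/** Hypothetical value type whose methods only enqueue work on the sequence. */
final class DeferredVec2 {
    private final DeferredSequence sequence;
    private float x, y;

    DeferredVec2(DeferredSequence sequence) {
        this.sequence = sequence;
    }

    DeferredVec2 set(float newX, float newY) {
        sequence.record(() -> { x = newX; y = newY; });
        return this;
    }

    DeferredVec2 scale(float factor) {
        sequence.record(() -> { x *= factor; y *= factor; });
        return this;
    }

    /** Reads are delayed too: 'dest' is only filled once call() runs. */
    DeferredVec2 get(float[] dest) {
        sequence.record(() -> { dest[0] = x; dest[1] = y; });
        return this;
    }
}
```

Usage would be along the lines of `new DeferredVec2(seq).set(1f, 2f).scale(3f).get(out); seq.call();`, after which `out` holds the scaled values.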

theagentd · July 21, 2015, 4:48pm

Bad news. It’s slow.

| Library | Bones per second |
|---|---|
| LibGDX | 12 452k |
| JOML (mul4x3) | 31 251k |
| JOML native | 6 346k |
| JOML native, only sequence building | 11 339k |
| JOML native, only sequence.call() | 13 601k |

It looks hopeless. The native version is 1/5th as fast as the normal version. Just building the sequence is 1/3rd as fast as the normal version, and even if I benchmark only sequence.call(), it’s still less than half as fast as the normal version. Looking at the code, I can see that the argument and opcode buffer building is very inefficient. putArgs() is called 12 times for translationRotateScale(), which checks the buffer size 12 times when once should be enough. I tried to optimize that to see if it made a difference, but only managed to crash Java.exe. Still, it looks hopeless since call() itself is half as fast as the normal version.
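A rough sketch of the "check once instead of 12 times" idea, under the assumption that the arguments end up in a growable FloatBuffer. ArgumentBuffer, putEach and putAll are hypothetical names, not the actual putArgs() code; the counter exists only to make the difference visible.

```java
import java.nio.FloatBuffer;

/** Hypothetical argument buffer; the names are illustrative, not JOML's. */
final class ArgumentBuffer {
    private FloatBuffer args = FloatBuffer.allocate(16);
    int capacityChecks = 0; // counts how often the size check runs

    /** Grows the buffer if fewer than 'count' floats remain. */
    private void ensureRemaining(int count) {
        capacityChecks++;
        if (args.remaining() < count) {
            int newCapacity = Math.max(args.capacity() * 2, args.position() + count);
            FloatBuffer bigger = FloatBuffer.allocate(newCapacity);
            args.flip();      // expose the floats written so far
            bigger.put(args); // copy them into the larger buffer
            args = bigger;
        }
    }

    /** Per-argument variant: one capacity check for every float. */
    void putEach(float... values) {
        for (float v : values) {
            ensureRemaining(1);
            args.put(v);
        }
    }

    /** Batched variant: one capacity check, then a single bulk put. */
    void putAll(float... values) {
        ensureRemaining(values.length);
        args.put(values);
    }

    /** Number of floats written so far. */
    int size() {
        return args.position();
    }
}
```

Whether this alone would close the gap is doubtful given the numbers above, since call() by itself is already slower than the plain Java version.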

EDIT: My test source code: http://www.java-gaming.org/?action=pastebin&id=1310

KaiHH · July 21, 2015, 4:54pm

Many thanks for your test!

One thing, though, that makes joml-native horrendously inefficient is that you are swapping the operated-on matrix with each iteration via


int id = i % BONES;
NativeMatrix4f bone = jomlNativeBoneMatrices[id];

This causes the registers/memory to be constantly synced, and that is slow.
Could you try building a Sequence that only operates on a single bone?
But maybe this is not a likely usecase for you?

EDIT: But you are right, I also see that building the arguments buffer is by far the most expensive part. Should’ve timed that, too. Now with everything counted, the native version is only 0.03x as fast as HotSpot. :’(
So in the end: it doesn’t gain a thing.

theagentd · July 21, 2015, 5:30pm

With only 1 bone, JOML goes up to 33 344k and native JOML to 9 290k. Benchmarking only sequence.call() gives 16 284k bones per second, which is still ~half as fast as normal JOML. Argument buffer building is of course unaffected.

My guess is that the only way to benefit from SSE would be to port the entire skeleton animation to native code. That of course means that I might as well write the entire game in C instead. >____>

Still, please don’t be discouraged. JOML still has a huge number of advantages for the average user.

EDIT: Also, the reason LibGDX uses a native function for matrix multiplications could be because they’re faster on Android. I’ve heard that HotSpot is much more flakey there, and it’s possible the overhead of calling a native function is lower (just a guess) so it might simply be an optimization for low-end hardware at the cost of high-end performance.

ra4king · July 22, 2015, 9:55am

I’ve been lurking and watching this thread for a while, and I want to finally post saying that I’m really liking this library after looking through its code for a bit.

Quick questions I had:

Why did you choose to not increment the buffer’s position in the constructors and methods that read from/write to NIO Buffers?
Why do the Matrix/Vector constructors not call their respective methods rather than re-implementing the same operation?
Is a conditional really cheaper than calling ‘dest.set(…)’, especially considering Hotspot inlines aggressively: https://github.com/JOML-CI/JOML/blob/master/src/org/joml/Matrix4f.java#L492

Some inconsistencies (I’ll be making a pull request for these):

Some classes are missing constructors reading from Buffers
Some classes use both ByteBuffer and FloatBuffer and some only FloatBuffer and some missing Buffers outright

cylab · July 22, 2015, 10:02am

There is no HotSpot on Android at all… There was Dalvik 'til Kitkat (4.4) and from Lollipop (5.0) now there is Android Runtime (ART)

KaiHH · July 22, 2015, 10:07am

Thank you, ra4king, for giving JOML a try!

On to your questions:

Why did you choose to not increment the buffer’s position:
This was chosen to be compliant with how LWJGL is doing it. Since those methods are likely being used to get a Matrix4f into a ByteBuffer before uploading it to OpenGL, incrementing the buffer’s position would require the client to do a rewind(), flip(), position(int), or clear() on the buffer before handing it to a Java/OpenGL binding.
Why do the Matrix/Vector constructors not call their respective methods:
This was a rather insignificant design detail, but I think I had Effective Java’s Item 17 in mind, which states that a class should be designed for inheritance, or forbid it. JOML is designed for inheritance (for whichever reason) and therefore allows overriding Matrix4f and the other classes, including their methods. This would be dangerous had the constructor called an overridden method. That’s the point of Item 17.
Is a conditional really cheaper than calling ‘dest.set(…)’:
I honestly don’t know. Maybe someone can try it out. Just did it this way, because it reads a little bit better in the case where dest != this.
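The hazard behind the constructor answer can be shown with a small generic sketch. The class names are hypothetical and unrelated to JOML: when a base constructor calls an overridable method, the subclass override runs before the subclass's own field initializers have executed.

```java
/** Hypothetical base class whose constructor calls an overridable method. */
class UnsafeBase {
    UnsafeBase() {
        init(); // dangerous: dispatches to the subclass before its fields are set
    }

    void init() {
        // default does nothing
    }
}

/** Subclass that observes its own uninitialized state from inside init(). */
class TracingSub extends UnsafeBase {
    private final float[] values = new float[] {1f, 2f};
    boolean valuesWereNullInInit; // no initializer, so the value set in init() survives

    @Override
    void init() {
        // Called from UnsafeBase's constructor, before 'values' has been assigned.
        valuesWereNullInInit = (values == null);
    }
}

/** Safer variant: the constructor assigns fields directly and calls nothing overridable. */
class SafeBase {
    protected float x;

    SafeBase(float x) {
        this.x = x; // duplicated assignment instead of calling the overridable set(float)
    }

    SafeBase set(float x) {
        this.x = x;
        return this;
    }
}
```

The cost of the safer variant is exactly the duplication ra4king noticed: the constructor repeats what set(float) does instead of delegating to it.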

About the last two points:

Some classes are missing constructors reading from Buffers:
That’s true. If you feel that you need those other classes to also read from buffers in their constructors, then this can be added quickly. Or make a PR.
Some classes use both ByteBuffer and FloatBuffer and some only FloatBuffer:
Yepp. Those changes were rather not done with consistency in mind but came as the requirements for them came. So again, if you need other classes to handle it the same, then we can of course add it.

Thanks!

ra4king · July 22, 2015, 10:19am

Ahh makes sense… although I did not like that design at all either

Design principles getting in the way of clean code! I see the reasoning though, thanks.

Readability should be no excuse for adding a branch to critical code … I’ll write a small benchmark next time I have a chance, although I expect the impact is minimal.

Spasi · July 22, 2015, 10:26am

Note that all those increments, rewinds and flips do memory writes. They do have an unnecessary performance cost.

[quote="ra4king"]
[quote="KaiHH"]
Is a conditional really cheaper than calling ‘dest.set(…)’:
I honestly don’t know. Maybe someone can try it out. Just did it this way, because it reads a little bit better in the case where dest != this.
[/quote]
Readability should be no excuse for adding a branch to critical code … I’ll write a small benchmark next time I have a chance, although I expect the impact is minimal.
[/quote]
It’s not about inlining. Besides, if you used dest.set(…), then there would be no difference between the two branches.

It’s about semantics. With dest.set(…) you’re ensuring that all arguments will be evaluated before any memory write happens in the destination. Even if the method call is inlined, this invariant will make it all the way to the JITed code. The conditional detects that there’s no aliasing and the semantics can be different: note how this.m00 is read 3 times after dest.m00 has been written to. This results in more efficient code (fewer CPU registers required).

The increased bytecode size and the branch itself have a cost though, which may negate any benefits. It might be better to offer a version of mul with explicit no-aliasing semantics. The user would be responsible to use it when appropriate.
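A small sketch of the two code paths being discussed, using a hypothetical 2x2 matrix (Mat2, with row-major field names chosen only for this example) rather than JOML's actual Matrix4f.mul. In the non-aliased branch each result is stored as soon as it is computed; in the aliased branch every product is evaluated before set() writes anything.

```java
/** Hypothetical 2x2 matrix used only to illustrate the aliasing check; not JOML code. */
final class Mat2 {
    float m00, m01, m10, m11;

    Mat2(float m00, float m01, float m10, float m11) {
        set(m00, m01, m10, m11);
    }

    Mat2 set(float m00, float m01, float m10, float m11) {
        this.m00 = m00;
        this.m01 = m01;
        this.m10 = m10;
        this.m11 = m11;
        return this;
    }

    /** Computes this * right into dest; dest may be this or right. */
    Mat2 mul(Mat2 right, Mat2 dest) {
        if (dest != this && dest != right) {
            // No aliasing: write each result straight to dest while still reading the operands.
            dest.m00 = m00 * right.m00 + m01 * right.m10;
            dest.m01 = m00 * right.m01 + m01 * right.m11;
            dest.m10 = m10 * right.m00 + m11 * right.m10;
            dest.m11 = m10 * right.m01 + m11 * right.m11;
        } else {
            // Aliasing possible: all four results are evaluated before set() writes any field.
            dest.set(m00 * right.m00 + m01 * right.m10,
                     m00 * right.m01 + m01 * right.m11,
                     m10 * right.m00 + m11 * right.m10,
                     m10 * right.m01 + m11 * right.m11);
        }
        return dest;
    }
}
```

Both branches give the same result; the open question in the thread is only whether the extra branch and bytecode pay for themselves, which would need a benchmark.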

KaiHH · July 22, 2015, 11:03am

I did not like it first either. But you grow accustomed to it very quickly and then really dislike the other way.

ra4king · July 23, 2015, 3:07am

Memory writes = only the position, limit, and/or mark variables so not really. Also most of my use cases aren’t for uploading 1 Vector/Matrix to GL, it’s for filling up a big buffer of many Vectors/Matrices. It’s much cleaner for me to do:
[icode]
for (MyObject m : objects)
    m.position.get(buffer);
buffer.flip();
[/icode]
than
[icode]
int position = 0;
for (MyObject m : objects) {
    m.position.get(position, buffer); // assuming Vec4
    position += 4;
}
[/icode]

1 flip() after filling my buffer is better than continuously setting the position, and it is especially less error prone in case I add another get(buffer) in there in complex code and forget to adjust the position increment properly.
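For comparison, here are both styles written against plain java.nio.FloatBuffer only, with float[4] arrays standing in for the vectors; the class and method names are made up for this sketch. Both produce a buffer with position 0 and the same contents; they differ in who keeps track of the write index.

```java
import java.nio.FloatBuffer;

/** Hypothetical comparison of the two buffer-filling styles, using only java.nio. */
final class BufferFillStyles {

    private BufferFillStyles() {}

    /** Relative puts advance the buffer's position; one flip() makes it readable. */
    static FloatBuffer fillRelative(float[][] vectors) {
        FloatBuffer buffer = FloatBuffer.allocate(vectors.length * 4);
        for (float[] v : vectors) {
            buffer.put(v, 0, 4); // position moves forward by 4
        }
        buffer.flip(); // limit = floats written, position = 0
        return buffer;
    }

    /** Absolute puts leave position/limit alone; the caller tracks the index itself. */
    static FloatBuffer fillAbsolute(float[][] vectors) {
        FloatBuffer buffer = FloatBuffer.allocate(vectors.length * 4);
        int position = 0;
        for (float[] v : vectors) {
            for (int i = 0; i < 4; i++) {
                buffer.put(position + i, v[i]);
            }
            position += 4; // must match the element size, or data is silently misplaced
        }
        return buffer; // position is still 0 and limit == capacity, so no flip() is needed
    }
}
```

Note that the absolute variant only ends up "ready" without a flip() because the buffer was sized exactly; with a larger buffer the limit would still have to be set by hand.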

Spasi:

It’s not about inlining. Besides, if you used dest.set(…), then there would be no difference between the two branches.

It’s about semantics. With dest.set(…) you’re ensuring that all arguments will be evaluated before any memory write happens in the destination. Even if the method call is inlined, this invariant will make it all the way to the JITed code. The conditional detects that there’s no aliasing and the semantics can be different: note how this.m00 is read 3 times after dest.m00 has been written to. This results in more efficient code (fewer CPU registers required).

The increased bytecode size and the branch itself have a cost though, which may negate any benefits. It might be better to offer a version of mul with explicit no-aliasing semantics. The user would be responsible to use it when appropriate.

Ah, you misunderstood me, as you worded my intentions exactly. I did notice how this.m00 is read after dest.m00 has been set, which is why the conditional is there: to make sure it’s not written to before it is read again in the case where the destination is the same as either operand. I wasn’t arguing for the inlining; I was arguing that avoiding a function call by adding a branch is not worth it, as the function call will most likely be inlined by HotSpot anyway.

Spasi · July 23, 2015, 6:46am

There’s probably no right answer here; it comes down to personal opinion. My personal opinion is:

Having to deal with flip() means having to mentally track two variables; position and limit. This is more complex than dealing with position() only.
Having a method call mutate arguments (i.e. changing a buffer’s current position) is fundamentally bad practice and bad API design. I don’t know about you, but I can reason better about method calls that are free of side-effects and any mutations to my objects are explicit.
Last but not least, I’ve spent 12 years of my life reading posts on the LWJGL forum about people forgetting to .flip() a buffer.

I must still be misunderstanding you. As I explained above, you cannot use a method call there as that would lead to different semantics (and defeat the optimization).

theagentd · July 23, 2015, 8:38am

@Spasi

You always need a position to know where to write. For limit, either you already know the size of your data and have a buffer that has the exact same size so limit==capacity, or your data is smaller than your buffer, in which case you probably NEED a limit to show the number of relevant bytes in the buffer.
That’s how the entirety of NIO works though.
It’s a newbie problem, sure, but knowing that you should always flip after writing to it is easier than having to figure out if you should flip or not. Not writing position just makes you keep track of the position yourself, and possibly set the limit and position at the end with flip anyway. Following NIO’s way is better IMO.

My vote goes to incrementing position for consistency with NIO.

Riven · July 23, 2015, 9:14am

This is exactly what Spasi meant: memory writes to position/limit are relatively expensive, even if they are in L1. It is cheaper to keep track of the position with a local variable. Unless HotSpot has already hoisted position/limit into local variables, of course, but this is not guaranteed; at some point they have to be read from main memory, and eventually the local variables have to be written back to main memory.

It’s a convenience/performance tradeoff.