Java OpenGL Math Library (JOML)

To repeat myself for the n-th time: understanding the basics of compilers is uber-useful, and in ways that aren’t necessarily obvious.

I’ve lightly started experimenting with converting WSW and Insomnia to JOML. I will be editing this post as I find features I’m missing. KaiHH, you may want to add me on Skype so we can communicate through chat there (same username as here).

  • Matrix3f/d.set(Matrix4f/d): copies the rotation part (upper-left 3x3) of a 4x4 matrix.

  • Matrix4f/d.getTranslation(Vector3f/d): opposite of setTranslation(). Stores the result in the provided vector.

  • Matrix4f/d.getScale(Vector3f/d): gets the scale along the X, Y and Z axes. Stores the result in the provided vector.

  • Vector3f/d.distanceSquared(Vector3f/d): calculates the squared distance between two points.

  • Vector3f/d.lerp(Vector3f/d, float/double alpha): does linear interpolation between two vectors.

  • Vector3f/d.project(Matrix4f/d): same as mul(Matrix4f/d), but also calculates resulting w value and divides by it at the end. Proposed implementation:

    public Vector3d prj(Matrix4d mat, Vector3d dest) {
        double w = mat.m03 * x + mat.m13 * y + mat.m23 * z + mat.m33;
        if (this != dest) {
            dest.x = (mat.m00 * x + mat.m10 * y + mat.m20 * z + mat.m30) / w;
            dest.y = (mat.m01 * x + mat.m11 * y + mat.m21 * z + mat.m31) / w;
            dest.z = (mat.m02 * x + mat.m12 * y + mat.m22 * z + mat.m32) / w;
        } else {
            dest.set((mat.m00 * x + mat.m10 * y + mat.m20 * z + mat.m30) / w,
                     (mat.m01 * x + mat.m11 * y + mat.m21 * z + mat.m31) / w,
                     (mat.m02 * x + mat.m12 * y + mat.m22 * z + mat.m32) / w);
        }
        return this;
    }

Bug: Vector3d.mul(Matrix4d mat, Vector3d dest) DOES NOT ADD THE TRANSLATION! It does NOT assume that w=1.0 despite the JavaDoc saying so! Fixed version:


    public Vector3d mul(Matrix4d mat, Vector3d dest) {
        if (this != dest) {
            dest.x = mat.m00 * x + mat.m10 * y + mat.m20 * z + mat.m30;
            dest.y = mat.m01 * x + mat.m11 * y + mat.m21 * z + mat.m31;
            dest.z = mat.m02 * x + mat.m12 * y + mat.m22 * z + mat.m32;
        } else {
            dest.set(mat.m00 * x + mat.m10 * y + mat.m20 * z + mat.m30,
                     mat.m01 * x + mat.m11 * y + mat.m21 * z + mat.m31,
                     mat.m02 * x + mat.m12 * y + mat.m22 * z + mat.m32);
        }
        return this;
    }

  • Vector3f/d.setLength(float/double): normalizes the vector to a given length (i.e. scales by length / sqrt(x*x + y*y + z*z)).

  • Matrixf/d.add(Matrixf/d): sums up each element in the matrices.

  • Matrixf/d.sub(Matrixf/d): you get it.

  • Matrixf/d.fma(Matrixf/d): multiply and add version (EXTREMELY useful for skeleton animation).

  • Matrixf/d.scale(Vectorf/d): scale() taking a vector parameter.

  • Matrixf/d.scaling(Vectorf/d): scaling() taking a vector parameter.

  • Matrix*f/d.get() functions that work on ByteBuffers instead of Float/DoubleBuffers.

  • Vector3f.mul(Matrix3/4d): to multiply float vectors by double matrices.

  • Vector3f.project(Matrix4d): to project float vectors by double matrices.

  • A function to normalize the rotation part of a 3x3 or 4x4 matrix. This is useful since after generating a normal matrix, the axes are probably not unit length.

Bug: Matrix4d.rotate() resets the scale of the matrix somehow. I can’t figure out how, but it does.


System.out.println(new Matrix4d().scaling(0.0000001).rotate(0, 0, 1, 0));
/*
WRONG:
  1,000E+0  0,000E+0  0,000E+0  0,000E+0
  0,000E+0  1,000E+0  0,000E+0  0,000E+0
  0,000E+0  0,000E+0  1,000E+0  0,000E+0
  0,000E+0  0,000E+0  0,000E+0  1,000E+0
*/

System.out.println(new Matrix4d().rotation(0, 0, 1, 0).scale(0.0000001));
/*
CORRECT:
  1,000E-7  0,000E+0  0,000E+0  0,000E+0
  0,000E+0  1,000E-7  0,000E+0  0,000E+0
  0,000E+0  0,000E+0  1,000E-7  0,000E+0
  0,000E+0  0,000E+0  0,000E+0  1,000E+0
*/

  • Vector3f/d.rotate(Matrix4f/d): multiplies a vector by only the rotation part of a 4x4 matrix (= assume w=0) (= what mul() does now due to the bug).
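Something like this, roughly (a hypothetical sketch of the requested behavior, not JOML’s API; column-major layout as in the snippets above):

```java
// Hypothetical sketch: transform a direction by the rotation part of a
// 4x4 matrix, i.e. assume w = 0 so the translation column is ignored.
// The matrix is stored column-major: m[col * 4 + row].
public class DirectionTransform {
    public static double[] mulDirection(double[] m, double x, double y, double z) {
        return new double[] {
            m[0] * x + m[4] * y + m[8]  * z,  // note: no m[12] translation term
            m[1] * x + m[5] * y + m[9]  * z,  // no m[13]
            m[2] * x + m[6] * y + m[10] * z   // no m[14]
        };
    }
}
```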

Bug: The Matrix4d.normal(Matrix3d dest) fast path produces incorrect results. Commenting out the fast path makes the function calculate correct normal matrices.

Bug: Quaternionf.slerp() tends to produce NaN every now and then. I replaced the implementation with the one from LibGDX and my performance sky-rocketed. You DEFINITELY need a better implementation.

Preliminary performance results:

  • LibGDX: 48 FPS
  • JOML: 68 FPS

Skeleton animation seems to be almost exactly twice as fast. JOML uses slerp() from LibGDX. Once I integrate the frustum culling optimizations JOML has, I should be able to squeeze out a tiny bit more performance.

Thanks for your detailed feature requests!

So far the following has been implemented:

  • Matrix3 constructor and set() taking a Matrix4 - many thanks to @ra4king for the contribution!
  • Matrix4.getTranslation()
  • Vector3.distanceSquared()
  • Vector3.lerp()
  • Vector3.mulProject() - I did not want to name it “project” since that can be confused with Matrix4.project, which also takes viewport into account
  • Matrix.scale/scaling with vector parameter
  • Missing Matrix.get(ByteBuffer) methods
  • Matrix4.normalize3x3()

The following has been fixed (thanks for spotting!):

  • Vector3.mul(Matrix4)
  • Matrix4d.rotate()
  • Matrix4.normal(Matrix3) should be fixed now. Fast path should be even faster now. :slight_smile:

The following has been improved:

  • I rewrote the implementation of Quaternion.slerp(). Please try.

For the following I would like to propose “workarounds”:

  • Vector3.setLength(factor) - Please use v.normalize().mul(factor)

For the following I have further questions:

  • Matrix.add/sub/fma - I originally removed those methods from JOML since I did not see any geometrical meaning to component-wise summing/diffing two matrices. Maybe you could explain why you need those. Probably you want to use those to compose the translation of two matrices (one of which is otherwise zero everywhere else) together?

Vector3.setLength(factor) would be faster than v.normalize().mul(factor). In GLSL-speak: (vector.xyz / length * factor) is 6 instructions. vector.xyz * (factor/length) is only 4. It’s a small detail, I guess. Something that would be more useful would be a Vector3*.clampLength(float maximum), which makes sure that the length of the vector isn’t above the given value. I use that for clamping motion vectors and velocity vectors to a maximum length.
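A minimal sketch of the clampLength idea (a hypothetical method, not part of JOML): compare squared lengths first so the common case skips the sqrt entirely, and use one divide plus three multiplies when scaling is needed.

```java
public class ClampLength {
    // Returns {x, y, z} scaled down so its length does not exceed max.
    // Comparing squared lengths avoids the sqrt when no clamping is needed.
    public static float[] clampLength(float x, float y, float z, float max) {
        double len2 = (double) x * x + (double) y * y + (double) z * z;
        if (len2 <= (double) max * max || len2 == 0.0) {
            return new float[] { x, y, z };        // short enough (or zero): unchanged
        }
        float s = (float) (max / Math.sqrt(len2)); // one divide, then three muls
        return new float[] { x * s, y * s, z * s };
    }
}
```

The zero-length check is there so clamping a zero motion vector doesn’t produce NaN.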

Matrix.fma() is probably not as useful as it was with LibGDX, but it’s still my preferred way of doing skeleton animation. In skeleton animation you have up to 4 bone matrices and 4 bone weights. With some GLSL syntax to make it clearer what’s happening:


Vector3f inputPosition = ...;
Vector3f result = new Vector3f();
for(int i = 0; i < boneCount; i++){
    result += (boneMatrix[i] * inputPosition) * boneWeight[i];
}

The boneMatrix[i] * inputPosition can be done using a 4x3 multiplication. We do 3 dot products (3 muls + 2 adds each) plus add a translation, which equals 6*3 = 18 instructions per 4x3 matrix multiplication. We also multiply by the bone weight, another 3 muls, and add the result to the result vector, for a total of 18 + 3 + 3 = 24 instructions per bone. That’s a total of 96 instructions per vertex.

Another way of calculating this is to do something like this:


Vector3f inputPosition = ...;
Matrix4f skinningMatrix = new Matrix4f().zero();
for(int i = 0; i < boneCount; i++){
    skinningMatrix += boneMatrix[i] * boneWeight[i]; //Scalar multiply of matrix
}
Vector3 result = skinningMatrix * inputPosition;

This translates to 16 muls with the bone weight plus 16 adds to the skinning matrix per bone plus 18 instructions for the final multiply at the end, for a total of 146 instructions. Why would you do this? Normals. If you also have normals, the first one needs to do another 4 matrix multiplies for a total of 8. In the second one, you can reuse the calculated skinning matrix for both positions and normals, so in total you do 4 matrix fma()s and 2 bone multiplies, which is faster (192 vs 164 instructions). In addition, as you know LibGDX was not as fast as JOML so this actually translated to a really big performance increase.
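The accumulate-then-transform approach can be sketched in plain Java (hypothetical helpers, not JOML’s API; 4x3 matrices stored column-major in a double[12], with the translation in the last column):

```java
public class Skinning {
    // skin += bone * weight, component-wise: the fma accumulation step
    public static void fma4x3(double[] skin, double[] bone, double weight) {
        for (int i = 0; i < 12; i++) {
            skin[i] += bone[i] * weight;
        }
    }

    // Transform a point by the blended 4x3 matrix (implicit w = 1,
    // so the translation column m[9..11] is added at the end).
    public static double[] transformPoint(double[] m, double x, double y, double z) {
        return new double[] {
            m[0] * x + m[3] * y + m[6] * z + m[9],
            m[1] * x + m[4] * y + m[7] * z + m[10],
            m[2] * x + m[5] * y + m[8] * z + m[11]
        };
    }
}
```

Positions and normals both reuse the single blended matrix, which is where the instruction savings come from.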

Fun fact: GPUs actually have a MULADD (fma) instruction to do a multiply and add in the same clock cycle. A mat4x3 * vec4(x, y, z, 1.0) multiply therefore only takes 9 instructions to accomplish. The same works for fma’ing matrices together, as the weight multiplication can be done together with the summing. In addition, I use a 4x3 matrix for the skinning matrix. On a GPU, this means that (with normal transformation) the first technique takes 96 instructions vs 66 instructions for the second technique.

So, if I understand you correctly, you would like to have a method like the following?


Matrix4f fma4x3(Matrix4f other, float factor) {
  m00 += other.m00 * factor;
  m01 += other.m01 * factor;
  m02 += other.m02 * factor;
  m10 += other.m10 * factor;
  ...
  m32 += other.m32 * factor;
  return this;
}

EDIT:

Matrix4.add/.add4x3, .sub/.sub4x3, .mulComponentWise/.mul4x3ComponentWise and .fma4x3 are implemented (with the usual ‘this’ and ‘dest’ variants).
Regarding naming, the component-wise mul method is the only one that is a bit of an “outsider”, because proper matrix multiplication with “mul” and the same signature already existed.

Yep! That looks perfect! Functions to add, subtract and multiply values per element would be useful too, both normal and 4x3 versions.

Sorry, I’ve been busy with WSW’s release. I will pick this up again once things have calmed down, but it’ll probably take a week or two. We haven’t abandoned it; the decision to switch over still stands.

Yeah, no worries. :slight_smile:
Get to it whenever you have time. I’m on vacation for the next two weeks, so I too won’t be able to do anything on it during that time.

I’m bored (raining…boo), so some comments:

  • toString - I assume you want an exact representation - {Float/Double}.toHexString
  • rescaling by multiple divides: multiply by the reciprocal instead…the extra rounding step isn’t a worry.
  • likewise for methods over single types which promote to doubles…do you really care about the extra rounding steps?
  • composing operations into expressions whose transforms are disallowed for the compiler.
  • arccos and arcsin are always removable.
  • you might want to let the various inliners do the work for you. Smaller classfiles, lower load/link/verify times, and the first compile will happen faster (assuming more than one variant is actually used in a project). Examples: the various normalize() and normalize(…)
  • although ugly, static methods that manipulate values in a float/double array at offsets.
  • 2d vectors - maybe abuse the math and include complex number functions (1).
  • slerp/nlerp - let the user pass in the closest pair (forget checking the dot and branch)
  • nlerp - renormalize instead of normalize.
  • branching to ensure input != output for most types is a model based on when hardware had few registers. This is no longer the case (Haswell has 168 AVX registers, Sandy Bridge 144). So instead compute results to local vars and store to dest.
  1. I can quickly bang out some missing functionality…but minimally I won’t have the free time to do so for at least 4 weeks.
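To illustrate the inliner point, roughly: keep a single copy of the math in the general variant and make the no-argument variant a thin wrapper, letting HotSpot inline away the indirection instead of duplicating the code (a sketch with a hypothetical Vec3 class, not JOML’s sources):

```java
public class Vec3 {
    public double x, y, z;

    public Vec3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }

    public Vec3 normalize() {
        return normalize(this);            // thin wrapper; HotSpot inlines it
    }

    public Vec3 normalize(Vec3 dest) {     // the only copy of the actual math
        double s = 1.0 / Math.sqrt(x * x + y * y + z * z);
        dest.x = x * s; dest.y = y * s; dest.z = z * s;
        return dest;
    }
}
```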

Wait. A bunch of the quaternion operations are backwards. That’s an incredibly awful idea.

I am sorry, but I think you express your points in (vague) ways that make deriving concrete plans for changes, or seeing any benefit or added value, difficult for me.
Do you have a concrete project you want to realize with JOML, and have problems (missing functionality or bugs) that hinder you from using JOML for this?
I am unlikely to change something unless that actually adds value. And by “value” I mean: You (or anyone else) being able to use JOML more effectively and efficiently.
You know, what I observed of frameworks and applications being successful and standing the test of time, is:

  • that they derive their design by actual use cases and user needs,
  • that they adapt to new or changing requirements quickly and
  • that their design is not driven by theoretical considerations of what could be useful or what might be a better design.
There are a few more points than that, such as being easy to comprehend and consistent, all of which I try to target with JOML.
After all, the quality of a design decision is measured by how effectively and efficiently existing use cases can be satisfied, IMO.
Speaking of consistency: maybe you can point me to a concrete operation in the quaternion classes that is “backwards” and we can work on fixing that?

I need to know which points are vague to be able to clarify.

All of my comments in this thread address actual use cases in a video game context.

I’ll address backwards later.

Okay, let me make the effort and comment on each of your points, which I’d like you to clarify on:

[quote]toString - I assume you want an exact representation - {Float/Double}.toHexString
[/quote]
Why do you assume? The question I would ask is, do YOU want another representation than what is currently done?
The concrete use case and reason why the current solution was chosen, was to have matrices and their columns align for easy sysout and debugger variable inspection debugging, no matter what magnitude a floating point value has.

[quote]rescaling by multiple divides: multiply by the reciprocal instead…the extra rounding step isn’t a worry.
[/quote]
What? Who “rescales” by multiple divides and where? And why? Be more concrete! In the best case please point out a concrete line in the sources of JOML that has an issue, so that we can fix it.

[quote]likewise for methods over single types which promote to doubles…do you really care about the extra rounding steps?
[/quote]
Again, the question is: Do YOU care about how the solution works currently and do YOU have a problem with how it is currently? If not, then this statement is moot.

[quote]composing operations into expressions whose transforms are disallowed for the compiler.
[/quote]
What? Please concrete examples. This is waaaay too vague of a statement than what I can process in my mind :slight_smile:
It does not lead to a concrete desired change.

[quote]arccos and arcsin are always removable.
[/quote]
How? And where? And why?

[quote]you might want to let the various inliners do the work for you. Smaller classfiles, lower load/link/verify times and the first compile will happen faster (assuming more than one variant is actually used in a project). Examples various normalize() and normalize(…)
[/quote]
Again: what? What is an ‘inliner’ in the context of your post? What benefit does it add? How is it being used? Questions that must be answered before thinking about a way to solve whatever the problem is that you are addressing.

[quote]although ugly, static methods that manipulate values in a float/double array at offsets.
[/quote]
What is with “static methods”? Do you propose to ADD them? There are already methods to get/put 4x4 matrices into/from a float/double array at an offset.

[quote]2d vectors - maybe abuse the math and include complex number functions (1).
[/quote]
Why? Who would benefit from it in which concrete example?

[quote]slerp/nlerp - let the user pass in the closest pair (forget checking the dot and branch)
[/quote]
What is a “closest pair?” Do you mean the threshold of the dot product after which two quaternions are considered to be amenable to non-spherical linear interpolation?

[quote]nlerp - renormalize instead of normalize.
[/quote]
Where to renormalize and what?

[quote]branching to ensure input != output for most types is a model based on when hardware had few registers.
[/quote]
Did you actually measure a performance degradation because of this and experience performance issues?

Going the extra mile… impressive. A disoriented woman once said: being right is fruitless when one fails to convey.

Yes, the thing is, I do believe that @Roquen has some very good points to say and can propose improvements.
I just want to give some guidance on how that must be, yes - conveyed, for me to understand. :slight_smile:

Something I do find a bit unintuitive and “backward” is that some functions seem to return the input for chaining instead of the destination object. For example, Quaternion*.transform(Vector3*) returns the quaternion, not the resulting Vector3. That goes against both LibGDX and LWJGL and is in my opinion less useful than returning the resulting vector.

For example, with LibGDX I did this:


parentOrientation.transform(translation).scl(parentScale).add(parentTranslation);

With JOML, I need to split up the command like this:


parentOrientation.transform(translation); //returns parentOrientation
translation.mul(parentScale).add(parentTranslation);

Yes, this is being discussed currently on GitHub: https://github.com/JOML-CI/JOML/issues/43
and very likely to change in the near future.

I hate to bring up this issue again since I fear a shitstorm is going to happen again, but are the with() functions really of any use? They only increase the amount of code you have to write to accomplish the same thing.


parentOrientation.transform(translation); translation.mul(parentScale).add(parentTranslation);

parentOrientation.transform(translation).with(translation).mul(parentScale).add(parentTranslation);

I see no possible use case of it where you can’t just replace [icode]with(X).[/icode] with just [icode]; X.[/icode]

EDIT: I’m aware that this use-case will also be made redundant after the always-return-modified-object convention.

Remember that I’m stuck with just a cell phone. I’m looking at your code through a web interface on a cell’s screen and typing is less than ideal.

[quote]Why do you assume? The question I would ask is, do YOU want another representation than what is currently done?
The concrete use case and reason why the current solution was chosen, was to have matrices and their columns align for easy sysout and debugger variable inspection debugging, no matter what magnitude a floating point value has.
[/quote]
What I intended to say was “Assuming that…”; the case under that assumption is text-based serialization where error-free round trips are a very good thing.

[quote]What? Who “rescales” by multiple divides and where? And why? Be more concrete! In the best case please point out a concrete line in the sources of JOML that has an issue, so that we can fix it.
[/quote]
Rx = x/K for ‘n’ x’s, fixed K. Transform into s = (1/K) and multiply by s. This adds one rounding step but removes ‘n-1’ long-latency operations, replacing them with short-latency ones. Examples include normalization. To generalize: Java cannot legally factor an FP expression, and in this case that kills sub-expression removal.
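Roughly this (a minimal illustration; Java will not perform the transform itself because x / K and x * (1.0 / K) can round differently):

```java
public class Rescale {
    // n long-latency divides, as the source might be written today
    public static void scaleByDivide(double[] v, double k) {
        for (int i = 0; i < v.length; i++) v[i] /= k;
    }

    // one divide, then n cheap multiplies; one extra rounding step overall
    public static void scaleByRecip(double[] v, double k) {
        double s = 1.0 / k;
        for (int i = 0; i < v.length; i++) v[i] *= s;
    }
}
```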

[quote]Again, the question is: Do YOU care about how the solution works currently and do YOU have a problem with how it is currently? If not, then this statement is moot.
[/quote]
I think the real question is: do you or your potential users care? But back on topic. Take Quaternionf.setAxisAngle: it promotes three values to doubles, performs a multiply on each and converts the three back to float. One less rounding step for 5 extra operations, while adding two to the dependency chain (which doubles it).
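A float-only variant of what I mean, as a sketch (hypothetical code, not JOML’s actual setAxisAngle): narrow the sine to float once, then do the three axis multiplies entirely in float.

```java
public class AxisAngle {
    // quaternion returned as {x, y, z, w}
    public static float[] fromAxisAngle(float ax, float ay, float az, float angle) {
        double half = angle * 0.5;
        float s = (float) Math.sin(half);   // one narrowing conversion here...
        float w = (float) Math.cos(half);
        return new float[] { ax * s, ay * s, az * s, w }; // ...then float-only muls
    }
}
```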

[quote]What? Please concrete examples. This is waaaay too vague of a statement than what I can process in my mind :slight_smile:
It does not lead to a concrete desired change.
[/quote]
Java doesn’t allow relaxing of FP so the only transforms it can perform on a FP expression is one that returns the same result. (some minor exceptions, but not interesting here). Like that it cannot factorize (as above) it also cannot re-associate.

a+1+1 != a+2
a+b+c != a+c+b
abc != bca

There appear to be numerous examples of this. Like terms abc and dbc are computed. b*c cannot be computed once.
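A tiny example of the missed sub-expression removal: as written, (a*b)*c and (d*b)*c share no literal sub-expression, so only hand-factoring lets b*c be computed once; the factored result may then differ from the original in the last ulp.

```java
public class Factor {
    // As written: evaluated as (a*b)*c and (d*b)*c.
    // The compiler may not reassociate, so nothing is shared.
    public static double[] separate(double a, double b, double c, double d) {
        return new double[] { a * b * c, d * b * c };
    }

    // Hand-factored: b*c computed once; may differ in the last ulp.
    public static double[] factored(double a, double b, double c, double d) {
        double bc = b * c;
        return new double[] { a * bc, d * bc };
    }
}
```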

[quote]How? And where? And why?
[/quote]
How = trig identities…worst case atan2.
Where = search for Math.acos
Why = asin/acos are hard to approximate (read slow) in comparison to atan/atan2 and very frequently disappears.

atan, atan2 & sqrt are compiler intrinsics…asin/acos are not.

Additionally, they kinda suck, because an error of 1 ulp can make them go BOOM! Say ‘w’ of a quaternion is ±(1 + ulp(1)): acos(w) returns NaN. The input comes from dotting two nearly co-linear vectors…BOOM!
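The usual defensive fix, for what it’s worth (a sketch, not something this thread establishes JOML does): clamp the dot product into [-1, 1] before handing it to Math.acos.

```java
public class SafeAngle {
    // Math.acos(d) is NaN for |d| > 1, which rounding can produce
    // when dotting two nearly co-linear unit vectors. Clamp first.
    public static double angleBetween(double dot) {
        double d = Math.max(-1.0, Math.min(1.0, dot));
        return Math.acos(d);
    }
}
```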

[quote]Again: what? What is an ‘inliner’ in the context of your post? What benefit does it add? How is it being used? Questions that must be answered before thinking about a way to solve whatever the problem is that you are addressing.
[/quote]
What = HotSpot’s inlining functionality (I think I did a write-up somewhere)

[quote]What is with “static methods”? Do you propose to ADD them? There are already methods to get/put 4x4 matrices into/from a float/double array at an offset.
[/quote]
All input and output is in flat arrays from offsets.

[quote]Why? Who would benefit from it in which concrete example?
[/quote]


http://www.java-gaming.org/topics/vectors-what-s-the-point/24307/view.html

So animation systems, simulations, geometry, increased performance by not using trig & inverse trig. All kinds of good stuff.

[quote]What is a “closest pair?” Do you mean the threshold of the dot product after which two quaternions are considered to be amenable to non-spherical linear interpolation?
[/quote]
I’m suggesting that the contract should be that the two inputs are of unit length and their dot product >= 0. For a given pair of inputs, the slerp or nlerp will run many times…the caller can figure out the closest pair once and supply it.

For ‘nlerp’ the dot product and the branch go away. That’s a drastic complexity decrease.

[quote]Where to renormalize and what?
[/quote]
Come on. I’m taking a fuck load of time to answer your questions on a stupid small screen. nlerp is only a couple of lines of code (and if you take my suggestion above it’ll be cut in half). What do you mean by “what” and “where”? How many calls to normalize are there in the function? ONE! Assuming you don’t understand what I mean by “renormalize”: the magnitude after the lerp will be in [0.707, 1]. 1 step of Newton is sufficient.
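Roughly this (a sketch under the contract above, not JOML’s code): with unit inputs and dot >= 0, the lerped magnitude lies in [0.707, 1], so one Newton-Raphson step on 1/sqrt starting from 1 replaces the full normalize.

```java
public class Nlerp {
    // quaternions as {x, y, z, w}; contract: both unit length, dot(a, b) >= 0
    public static float[] nlerp(float[] a, float[] b, float t) {
        float x = a[0] + (b[0] - a[0]) * t;
        float y = a[1] + (b[1] - a[1]) * t;
        float z = a[2] + (b[2] - a[2]) * t;
        float w = a[3] + (b[3] - a[3]) * t;
        float m2 = x * x + y * y + z * z + w * w; // in [0.5, 1] by contract
        float s = 1.5f - 0.5f * m2;               // one Newton step for 1/sqrt(m2), s0 = 1
        return new float[] { x * s, y * s, z * s, w * s };
    }
}
```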

[quote]Did you actually measure a performance degradation because of this and experience performance issues?
[/quote]
No. I don’t need to.