Java OpenGL Math Library (JOML)

Okay, then. That’s fine by me.

One thing, though. I tried your nlerpRecursive with this = (x=0, y=0, z=0, w=1) and q = (0, 1, 0, -4.371E-8).
The -4.371E-8 is due to inaccuracies when building a rotation of 180 degrees around the Y axis, like so: Quaternionf q = new Quaternionf().rotateY((float) Math.PI)
This caused the normalization factor s (the 1.0/sqrt(...) in nlerp) to become NaN in the first iteration of your while loop, and then the whole result became NaN.

EDIT: Ah, it was an error on my side when translating and inlining that.
Now, the final version is:


    public Quaternionf nlerpIterative(Quaternionf q, float alpha, float dotThreshold, Quaternionf dest) {
        float q1x = x, q1y = y, q1z = z, q1w = w;
        float q2x = q.x, q2y = q.y, q2z = q.z, q2w = q.w;
        float dot = q1x * q2x + q1y * q2y + q1z * q2z + q1w * q2w;
        float alphaN = alpha;
        while (Math.abs(dot) < dotThreshold) {
            float scale0 = 0.5f;
            float scale1 = dot >= 0.0f ? alphaN : -alphaN;
            if (alphaN < 0.5f) {
                q2x = scale0 * q2x + scale1 * q1x;
                q2y = scale0 * q2y + scale1 * q1y;
                q2z = scale0 * q2z + scale1 * q1z;
                q2w = scale0 * q2w + scale1 * q1w;
                float s = (float) (1.0 / Math.sqrt(q2x * q2x + q2y * q2y + q2z * q2z + q2w * q2w));
                q2x *= s;
                q2y *= s;
                q2z *= s;
                q2w *= s;
                alphaN = alphaN * 2.0f;
            } else {
                q1x = scale0 * q1x + scale1 * q2x;
                q1y = scale0 * q1y + scale1 * q2y;
                q1z = scale0 * q1z + scale1 * q2z;
                q1w = scale0 * q1w + scale1 * q2w;
                float s = (float) (1.0 / Math.sqrt(q1x * q1x + q1y * q1y + q1z * q1z + q1w * q1w));
                q1x *= s;
                q1y *= s;
                q1z *= s;
                q1w *= s;
                alphaN = alphaN * 2.0f - 1.0f;
            }
            dot = q1x * q2x + q1y * q2y + q1z * q2z + q1w * q2w;
        }
        float scale0 = 1.0f - alphaN;
        float scale1 = dot >= 0.0f ? alphaN : -alphaN;
        dest.x = scale0 * q1x + scale1 * q2x;
        dest.y = scale0 * q1y + scale1 * q2y;
        dest.z = scale0 * q1z + scale1 * q2z;
        dest.w = scale0 * q1w + scale1 * q2w;
        float s = (float) (1.0 / Math.sqrt(dest.x * dest.x + dest.y * dest.y + dest.z * dest.z + dest.w * dest.w));
        dest.x *= s;
        dest.y *= s;
        dest.z *= s;
        dest.w *= s;
        return dest;
    }

Seems to work.

What about this optimisation:

Would it be applicable, or would it affect accuracy too much?
Maybe it is not even faster in HotSpot and/or on modern CPUs.

As another note on performance: I wouldn’t write into each field of dest twice. Only write the final 4 values into the object; using local variables allows HotSpot to be more aggressive.

Your intuition is correct. The missing link is halving the angle going into quaternion space and doubling it coming out: an angle of Pi/2 in quaternion space translates into Pi in 3D, as you expected.
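
To make that concrete (my own illustration, using the rotateY example from above): a rotation by an angle a around the Y axis is stored with the half-angle, which is why the w component of a 180-degree rotation ends up at (almost) zero.

    // Illustration (assumption: standard axis-angle to quaternion construction):
    // a rotation by angle a about Y is stored as (0, sin(a/2), 0, cos(a/2)).
    float a = (float) Math.PI;                      // 180 degrees in 3D
    Quaternionf q = new Quaternionf().rotateY(a);
    // q.y ~= sin(a/2) = 1, q.w ~= cos(a/2) = 0 (hence the -4.371E-8 seen above)
    float angle3d = 2.0f * (float) Math.acos(q.w);  // doubling recovers ~Pi in 3D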

On fast 1/sqrt: hardware can do this faster, but Java doesn’t give access to it. The length of the lerp’d quaternion is always too small except at the endpoints, with the worst case at t = 0.5, so Newton’s method with a reasonable initial guess works fine.
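
For reference, a minimal sketch (my own, not JOML code) of the classic bit-trick approximation with one Newton step; it assumes Float.floatToRawIntBits/intBitsToFloat are cheap intrinsics on the target JVM:

    static float invSqrtApprox(float x) {
        float xhalf = 0.5f * x;
        int i = Float.floatToRawIntBits(x);
        i = 0x5f3759df - (i >> 1);           // initial guess from the exponent bits
        float y = Float.intBitsToFloat(i);
        y = y * (1.5f - xhalf * y * y);      // one Newton-Raphson iteration
        return y;
    }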

Lerp and nlerp have the same angular error; you just have to change how you measure it. BUT whatever consumes the output then likewise needs to be aware of non-unit input. Let’s leave this for another day.

@KaiHH
Your implementation has a bug. Line 8 should be:

            float scale1 = dot >= 0.0f ? 0.5f : -0.5f;

Optimizations:

In the float versions of classes during normalization, you often do this:

    float inverseLength = (float)(1.0 / Math.sqrt(...));

As far as I know, the only operation that is actually slower on doubles than floats is division, so

    float inverseLength = 1.0f / (float)Math.sqrt(...);

should be faster. HotSpot might even be able to compute sqrt() at float precision, but don’t quote me on that.
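
If someone wants to measure it, here is a naive timing sketch (my own; the array contents are arbitrary, and a proper harness such as JMH would be more reliable than this):

    static void compareInverseSqrt(float[] values) {
        float sumA = 0.0f, sumB = 0.0f;
        long t0 = System.nanoTime();
        for (float v : values)
            sumA += (float) (1.0 / Math.sqrt(v));  // double division, then narrow to float
        long t1 = System.nanoTime();
        for (float v : values)
            sumB += 1.0f / (float) Math.sqrt(v);   // narrow first, then float division
        long t2 = System.nanoTime();
        // print the sums so HotSpot cannot eliminate the loops as dead code
        System.out.println("double div: " + (t1 - t0) + " ns, float div: " + (t2 - t1)
                + " ns (checksums " + sumA + " / " + sumB + ")");
    }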

I noticed that for dots extremely close to 1.0 or -1.0, slerp() was actually 50% faster than both nlerp and rlerp due to not doing a normalization. Since this is a pretty common case in skeleton animation (a bone that simply isn’t animated), it seems worth optimizing for. However, it is pretty pointless to even lerp at such small angles, so simply returning this quaternion is both faster and has almost no error in the first place:


    public Quaternion nlerpIterative(Quaternion q, double alpha, double dotThreshold, Quaternion dest) {
        double q1x = x, q1y = y, q1z = z, q1w = w;
        double q2x = q.x, q2y = q.y, q2z = q.z, q2w = q.w;
        double dot = q1x * q2x + q1y * q2y + q1z * q2z + q1w * q2w;
        
        if (1.0 - 1E-6 < Math.abs(dot)) {
            return dest.set(this);
        }

        ...

When slerping between two (almost) equal quaternions, this results in 35% higher performance than slerp() and 100% higher performance than nlerp and iterative nlerp without this check.

Field writes are more expensive than writes to local variables. At the end of nlerp() and nlerpIterative() this is very slightly faster:


        /*dest.x = scale0 * q1x + scale1 * q2x;
        dest.y = scale0 * q1y + scale1 * q2y;
        dest.z = scale0 * q1z + scale1 * q2z;
        dest.w = scale0 * q1w + scale1 * q2w;
        double s = 1.0 / Math.sqrt(dest.x * dest.x + dest.y * dest.y + dest.z * dest.z + dest.w * dest.w);
        dest.x *= s;
        dest.y *= s;
        dest.z *= s;
        dest.w *= s;
        return dest;*/

        double x = scale0 * q1x + scale1 * q2x;
        double y = scale0 * q1y + scale1 * q2y;
        double z = scale0 * q1z + scale1 * q2z;
        double w = scale0 * q1w + scale1 * q2w;
        double s = 1.0 / Math.sqrt(x*x + y*y + z*z + w*w);
        dest.x = x * s;
        dest.y = y * s;
        dest.z = z * s;
        dest.w = w * s;
        return dest;

Benchmarks:

Completely random quaternions
[icode]quaternion.set(r.nextDouble()*2+1, r.nextDouble()*2+1, r.nextDouble()*2+1, r.nextDouble()*2+1).normalize();[/icode]

nlerp: 84 664 566
slerp: 10 610 483
rlrp50: 51 548 933 (threshold: 0.50)
rlrp80: 36 185 362 (threshold: 0.80)
rlrp90: 31 419 572 (threshold: 0.90)
rlrp95: 24 704 936 (threshold: 0.95)
rlrp99: 10 869 289 (threshold: 0.99992)

Conclusion: iterative nlerp is more tweakable than slerp and equally fast for the same error.

Not so random quaternions
[icode]quaternion.set(r.nextDouble()+5, r.nextDouble()+5, r.nextDouble()+5, r.nextDouble()+5).normalize();[/icode]

nlerp: 110 896 134
slerp: 13 177 334
rlrp50: 105 594 541 (threshold: 0.50)
rlrp80: 106 932 808 (threshold: 0.80)
rlrp90: 106 247 803 (threshold: 0.90)
rlrp95: 106 819 391 (threshold: 0.95)
rlrp99: 23 367 739 (threshold: 0.99992)

Conclusion: iterative nlerp is almost as fast as nlerp for most thresholds, and still 75% faster than slerp for similar error.

Identical quaternions
[icode]quaternion.set(1, 1, 1, 1).normalize();[/icode]

nlerp: 120 392 068
slerp: 162 841 905
rlrp50: 217 787 085 (threshold: 0.50)
rlrp80: 219 101 292 (threshold: 0.80)
rlrp90: 219 481 200 (threshold: 0.90)
rlrp95: 219 058 424 (threshold: 0.95)
rlrp99: 219 265 376 (threshold: 0.99992)

Conclusion: iterative nlerp is 30% faster than slerp() for a negligible increase in error.

Will do performance tests in WSW when I get home tomorrow.

@Riven - Ya know? My comment about 1/sqrt is BS. I’m assuming that floatToRawIntBits and its reverse are still slow. That may not be the case; I haven’t looked in a long time. If they’re both intrinsics and produce sane results… that’d be the way to go (at least if you need to deal with the larger end of the range).

HotSpot does (or at least used to) produce sqrtss (single-precision sqrt) instructions… I don’t recall what patterns it matches.


  /**
   * Given the dot product between two quaternions 'd' which
   * are going to be lerped, find the 't' of the maximum
   * abs angular error and the error.
   * <p>
   * The error is symmetric about t=.5, returns results from [0,.5]
   * <p>
   * r[0] = expected t of max error
   * r[1] = expected max abs error
   * <p>
   * r[0] is on [1/2 - 1/(2 sqrt(3)), 1/2 - (1/2) sqrt(4/pi - 1)] ~= [.211325, .238638]
   * r[0] = ~.238638 for angle = Pi/2
   * r[0] = ~.217459 for angle = Pi/4
   * r[0] = ~.211325 for angle = 0 (in the limit)
   * <p>
   * r[1] = abs error in degrees
   */
  public static void maxLerpAngleAbsError(float[] r, float d)
  {
    final double bias = Double.MIN_NORMAL;
    
    // This could be reduced, but can't see a reason to bother.
    double ca = Math.abs(d)+bias;        // cos(a)   
    double sa = Math.sqrt(1.0-ca*ca);    // sin(a)
    double a  = Math.atan2(sa,ca);       // a
    double t0 = ca-1.0;                  // cos(a)-1
    double tn = a*sa*(sa*sa-a*sa+t0*t0);
    double td = 2.0*(a*ca-a);
    double t  = 0.5+Math.sqrt(tn)/td;
    double e  = Math.atan2(t*sa, 1.0+t*(ca-1.0)) - t*a;
    
    e = 2.0*e;                           // Quat measure -> 3D
    e = Math.toDegrees(e);               // sane measure -> insane ;)
   
    r[0] = (float)t;
    r[1] = (float)e;
  }
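
A quick usage sketch (my own), checking the Pi/2 case from the Javadoc: for two unit quaternions 90 degrees apart in quaternion space, the dot product d is cos(Pi/2) = 0, so r[0] should come out near 0.238638:

    float[] r = new float[2];
    maxLerpAngleAbsError(r, 0.0f);  // d = cos(Pi/2) = 0
    System.out.println("t of max error: " + r[0] + ", max abs error: " + r[1] + " deg");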

Thanks! Fixed.

Yeah, I would really like to have empirical evidence on whether it is faster. I did it that way because of (maybe) better accuracy, but I don’t know that either. :slight_smile:

Done.

Yeah, that was also noted by @Riven, and I changed it right after he noticed it. In other words, the corrected version was actually already in the first commit that added that method.

I was unable to find a good proof or benchmark online for why double division is slower than float division, but I’ll benchmark it when I can.

I had an interesting idea. LibGDX has a Matrix4.avg(Matrix4, float) function that extracts the translation, scale and rotation of two matrices, lerps/slerps between them and finally outputs the result as a new matrix. It was badly coded, using static temp vectors and quaternions, and its performance was horrible. However, if that could be implemented in a thread-safe and fast way, it could massively improve the performance of my skeleton animation.

This sounds like a nice idea!
I created a new pastebin http://pastebin.java-gaming.org/df83f3934371d (maybe test it first before I do a Git commit).
This implementation is basically just an inline of all the pieces: extracting the translation and scaling factors, building the rotation quaternions, then nlerpIterative’ing the rotation quaternions and lerp’ing the scale and translation factors.
It makes no assumptions about the matrix. We could however introduce some, such as assuming you only do uniform scaling (x, y, z being the same scaling factor).
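
To illustrate the extraction step (my own sketch, not the pastebin code; it assumes direct access to JOML’s column-major m fields, positive scale and no shear):

    // m is the matrix to decompose; translation sits in the fourth column
    float tx = m.m30, ty = m.m31, tz = m.m32;
    // scale factors are the lengths of the first three columns
    float sx = (float) Math.sqrt(m.m00 * m.m00 + m.m01 * m.m01 + m.m02 * m.m02);
    float sy = (float) Math.sqrt(m.m10 * m.m10 + m.m11 * m.m11 + m.m12 * m.m12);
    float sz = (float) Math.sqrt(m.m20 * m.m20 + m.m21 * m.m21 + m.m22 * m.m22);
    // dividing the upper-left 3x3 columns by sx/sy/sz leaves a pure rotation,
    // which can then be turned into the quaternion used for nlerpIterative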

@theagentd: isn’t the conceptual issue that matrices are a rather poor fit to hold this information in the first place, and that they are just convenient (relatively graspable, as opposed to quaternions) and GPUs are optimized for dot products, which multiplying a vector by a matrix basically boils down to? To summarize: matrices are nice as a final result, but should not be used as inputs for operations like interpolation.

Basically, to allow for interpolation between matrices, you’d need to deconstruct them into a translation component (vec3), a scale component and an orientation component (quaternion), lerp these components, then construct a new matrix from them.

The original data (translation, scale, orientation) is likely to be available elsewhere, so the conversion to and from a matrix (to allow for interpolation) is potentially redundant.

In conclusion: it certainly won’t hurt anybody, but… is it the right thing to do? (and should we care about doing the right thing?)

I totally agree. On various occasions I had the impression that JOML needed a TranslationRotateScale (or some abbreviation thereof) class to hold the individual pieces, perform all sorts of operations on them and then finally convert to a Matrix for OpenGL to consume.

@KaiHH: given your latest pastebin, it seems almost trivial to extract the code from that method to make it happen.


public static void interpolateTransform(
   Vec3 position1,
   Quaternion orientation1,
   Vec3 scale1,

   Vec3 position2,
   Quaternion orientation2,
   Vec3 scale2,

   Matrix4 dest
) {
   // this is where the magic happens...
}


public static void interpolateTransform( // assumes both scales are 1.0
   Vec3 position1,
   Quaternion orientation1,

   Vec3 position2,
   Quaternion orientation2,

   Matrix4 dest
) {
   // this is where the magic happens, once again...
}
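
For what it’s worth, a sketch of what the first method’s body might look like (my own; I added an interpolation factor since the signatures above don’t carry one, Vec3/Quaternion/Matrix4 stay placeholders, and it assumes a nlerp(q, alpha, dest) overload like the ones discussed earlier plus a JOML-style translationRotateScale on the matrix):

    public static void interpolateTransform(
            Vec3 position1, Quaternion orientation1, Vec3 scale1,
            Vec3 position2, Quaternion orientation2, Vec3 scale2,
            float alpha, Matrix4 dest) {
        // lerp translation and scale component-wise
        float px = position1.x + (position2.x - position1.x) * alpha;
        float py = position1.y + (position2.y - position1.y) * alpha;
        float pz = position1.z + (position2.z - position1.z) * alpha;
        float sx = scale1.x + (scale2.x - scale1.x) * alpha;
        float sy = scale1.y + (scale2.y - scale1.y) * alpha;
        float sz = scale1.z + (scale2.z - scale1.z) * alpha;
        // nlerp the orientations (any of the variants above would do)
        Quaternion q = new Quaternion();
        orientation1.nlerp(orientation2, alpha, q);
        // recompose as translation * rotation * scale
        dest.translationRotateScale(px, py, pz, q.x, q.y, q.z, q.w, sx, sy, sz);
    }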

Yes, that method was certainly a no-brainer :slight_smile:
Being able to accept a decomposition of the components as parameters is also good, but it still converts to a matrix in the end.
If that is indeed the last step in the whole transformation chain of @theagentd, then this would be fine.
However, I still feel the need for a separate class that holds these three components separately.

So you wanna keep it pure, eh? :point:


TranslationRotationScale.interpolate(trs1, trs2, alpha, trsDest);
trsDest.toMatrix(matDest);

Yes. I like pure things. ;D
I like the symmetry between input and output of the matrix.interpolate(matrix) method.
But the only reason that would justify having an additional class (which in itself is also a complexity booster, and I want to avoid adding new classes as much as possible) would be if someone needs to perform more than one such interpolation.
As I said, if a single interpolation is the final step in a transformation chain that does not need any further interpolations afterwards, your proposed method is perfectly sufficient.

Matrix deconstruction without hefty assumptions is never going to perform well. I needed to code this, without handling corner cases, for a slicing game (Kinghunt) where I had to dynamically reparent stuff; transformation chains were concatenated into matrices and I didn’t have time to change that. It was a very long day. If you have GPU Pro 5 at hand, there is a good article about ways to manage object hierarchies. http://www.crcnetbase.com/doi/abs/10.1201/b16721-29

Most double operations have longer latency than single precision…division is one example, multiply is not.

EDIT: http://www.agner.org/optimize/#manuals (Instruction tables), as an example.

@Riven and @KaiHH

I know that interpolating matrices is a bad idea in this sense, but for Java it might actually be faster. When interpolating between two animations I read from 6 different objects and store the result in 3 others. I have a feeling that interpolating between two matrices and writing to a third could be faster due to better memory locality. In addition, if I store animation bones as matrices instead, I would be able to precompute the bind pose multiplication as well. Now that I think about it, that might be possible even without matrix bones. Hmm.

Another solution would be a Transform class with inlined translation, scale and rotation. It would pretty much only need interpolation support and be convertible to a matrix.

Another suggestion: Matrix4*.getTranslationRotationScale(Vector3, Quaternion, Vector3), the inverse of translationRotateScale().

“One problem…” (implies more than one) “…is decomposing…” (yes, yes) “…matrix back to…” (linear algebra is a really nice hammer) “…ill defined problem and no robust…” (and every problem is a nail) “…approximate solution…” (cringe) “…works reasonably well in the majority of cases.” (is that like: the letter’s in the mail?).

In 3-space: general rotations have 3 degrees of freedom, translation 3, and scale 1, for 7 total degrees of freedom. scale+quat+vect = 8 values, a 4x3 matrix = 12, a 4x4 = 16. Inverting: scale is a multiplicative inverse, rotation is 3 logical sign flips, translation is 3 logical sign flips; reverse the order of composition and carry through the derivation. The only place where matrices are a logical win is for bulk transforms, since they automatically compose common sub-expressions. However, if you’re bulk transforming, then manually collecting the common sub-expressions is equivalent to converting the entire transform into a matrix (in local variables) at the point where you actually need it.
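
To make the “reverse the order of composition” point explicit (my reading of the above, stated as an assumption): if M = T * R * S, then

    M^-1 = S^-1 * R^-1 * T^-1

where S^-1 scales by 1/s (the multiplicative inverse), R^-1 is the conjugate quaternion (the three sign flips on the rotation), and T^-1 translates by -t (the three sign flips on the translation).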

And no I’m not saying anyone is an idiot if they use matrices…just to get that out of the way.

I’d like to use JOML for 2D transformations, to get away from the rather naive calculations I do with trig functions… My maths is rather spotty, so is there, or will there be, a JOML For Dummies tutorial?