Extremely Fast sine/cosine

Yep, from 50.8 us/104 ops down to 46.4. But with Math.floor it's about 90.

Current code, hopefully last one for today: http://pastebin.java-gaming.org/b2bda6b7a361e

These latest benchmarks aren't really giving a fair comparison. There are two measurable items:

  1. The method for calculating sin or cos: Java / Agent / Riven
  2. The method for calculating sin(t + x) and sin(t + y): Scalar / Sum of Angles / Sum of Angles SOA

To be a fair comparison you need to test all the possible combinations; so far Sum of Angles / Sum of Angles SOA are only tested using the slowest sin and cos method (Java).

This particular benchmark is for the sin(t + xi, yj) scenario, as presented by theagentd.
The second 'kind' of calculation implements roquen's insight on how to algebraically optimize this scenario, as opposed to making the naive formula fast.
You'll also notice that the sumOfAngles technique's speed is almost completely independent of the speed of the underlying trig functions, so the Java stdlib is used for accuracy.
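A minimal sketch of that sum-of-angles idea (helper names are mine, not from the benchmark code): sin(t + x) expands to sin(t)cos(x) + cos(t)sin(x), so only one sin/cos pair is needed per batch, while the per-element cos(x[i]) / sin(x[i]) tables can be precomputed once:

```java
// Hedged sketch of the sum-of-angles optimization (names hypothetical):
// sin(t + x) = sin(t)*cos(x) + cos(t)*sin(x), so one sin/cos pair per
// batch replaces a trig call per element.
static float[] sinOfSums(float t, float[] sinX, float[] cosX) {
    float st = (float) Math.sin(t); // the only trig calls per batch
    float ct = (float) Math.cos(t);
    float[] out = new float[sinX.length];
    for (int i = 0; i < out.length; i++) {
        out[i] = st * cosX[i] + ct * sinX[i]; // 2 mul + 1 add, no trig
    }
    return out;
}
```

As t advances each frame, only the two trig calls at the top are repeated; the tables stay fixed.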

A note on lerp: f(t) = (1-t)a + tb

It can be expressed in three forms

  1. (1-t)a + tb
  2. a - at + bt
  3. a +t(b-a)

They all have the same max chain length (3):

(1 - t)a + tb: t00 = 1 - t | t01 = t * b   t10 = t00 * a   t20 = t10 + t01

a - ta + tb: t00 = t * a | t01 = t * b   t10 = a - t00   t20 = t10 + t01

a + t(b - a): t00 = b - a   t10 = t * t00   t20 = a + t10

Forms (1) & (2) require 4 operations and (3) requires, well, 3.

I can't think of a reason why (2) might be a "better" choice…just mentioned for completeness. Personally I use (3) as my default. The one fewer instruction to issue means that it's generally faster. It is also in a form that allows a fused multiply-add (FMA)…but I don't believe HotSpot generates these instructions yet. With an FMA opcode the chain length and operation count become 2:
t00 = b - a
t10 = t * t00 + a

The potentially big BUT here is this: if t = 1,

a + 1.0*(b - a) = a + (b - a)

which generally in floating point does not result in exactly 'b'. So if having the exact values at the end-points is desired you want to avoid (3). Both (1) & (2) are exact at the endpoints ((2) must be computed as (a - at) + bt for this to be true).
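A quick demo of that endpoint issue with made-up values (any pair where b - a rounds will do):

```java
// Form (3) missing the endpoint: with a = 1e8f and b = 1.0f, (b - a)
// rounds to -1e8f in float, so a + 1.0f*(b - a) yields 0.0f, not b.
// Form (1), (1-t)*a + t*b, is exact at t = 1 by construction.
static float lerp1(float t, float a, float b) { return (1 - t) * a + t * b; }
static float lerp3(float t, float a, float b) { return a + t * (b - a); }
```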


A note on Math.floor: this has a stricter contract than the problem requires and could be replaced with something like:

[icode]floor = x >= 0 ? (int) x : (int) x - 1[/icode]

DISCLAIMER: I actually haven't thought this through enough…the x - 1 could be a problem…humm…
Also, newer hardware has a floor opcode…I don't know if HotSpot emits it or not; the numbers above seem to be saying no.
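A hedged sketch of what the cast-based replacement might look like once the negative cases are sorted out: truncation via (int) rounds toward zero, which already equals floor for x >= 0; negative non-integers need the extra subtract, and exact negative integers must not get it (that's the problem window the disclaimer hints at):

```java
// Cast-based floor sketch: (int) truncates toward zero, so it already
// matches floor for x >= 0; negative non-integers step down by 1.
static int castFloor(float x) {
    int i = (int) x;            // truncation toward zero
    return (x < i) ? i - 1 : i; // true only for negative non-integers
}
```

Note this still costs one compare per call, unlike a bias-based trick.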

My 2p: at least in the games code I've written over the last 10 years, we almost always want f(1.0) == b, that is, exact. /me ponders why this specific and very common operation never made it into a single machine code instruction.

Cas :slight_smile:

Intel has been thinking about it. The various FMA ops are hard and lerp is harder. I know both Michael Abrash and Tom Forsyth pushed for LERP while at Intel. Abrash gave a presentation on Larrabee where the first slide read: “I never did get that lerp instruction!”

Oh, I didn't have my thinky thinky cap on earlier. Form (2) converts into FMA:
t00 = a - t*a // different opcode from FMA related set
t10 = t * b + t00
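For what it's worth, JDK 9 later added Math.fma, so this form can be written directly (whether HotSpot turns it into a hardware FMA instruction is a separate question):

```java
// Form (2) via java.lang.Math.fma (JDK 9+): Math.fma(a, b, c) computes
// a*b + c with a single rounding. At t = 1 it yields exactly b, since
// t00 = a - 1*a = 0 and fma(1, b, 0) = b.
static float lerpFma(float t, float a, float b) {
    float t00 = a - t * a;      // plain multiply-subtract
    return Math.fma(t, b, t00); // fused t*b + t00
}
```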

@Roquen
From what I can gather, HotSpot wouldn't be allowed to emit FMA ops because they yield different results.


Related: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-January/016809.html

@theagentd / BurntPizza: use my fastFloor impl instead of (int)Math.floor(…).

// valid input range: -BIG_ENOUGH .. INFINITY-ish
private static final int BIG_ENOUGH_INT = 65536; // whatever makes sense
private static final float BIG_ENOUGH_FLOAT = BIG_ENOUGH_INT;
public static int fastFloor(float x) {
   return (int)(x + BIG_ENOUGH_FLOAT) - BIG_ENOUGH_INT;
}

You trade input range for accuracy.

Not a single branch, nor a cache-thrashing lookup table :point:

I wholeheartedly endorse Riven's floor. The domain reduction shouldn't be an issue here; in cases where it is, the range is probably problematic elsewhere anyway. It will flush small negative numbers to positive zero, however…that small window is important not to forget.
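That flush-to-zero window is easy to demonstrate with the float-bias version (a hedged repro, not the benchmark code):

```java
// Demo of the small-negative flush: -1e-7f is absorbed when added to
// 65536.0f (the sum rounds up to exactly 65536.0f), so the result is 0
// where a true floor gives -1. Larger negatives like -0.5f are fine.
static int fastFloor(float x) {
    return (int) (x + 65536.0f) - 65536;
}
```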

On FMA: HotSpot could not issue it in strictfp cases; otherwise it would be legal. All of this is more of an in-the-future kind of thing anyway…FMA ops are slower at the moment, if memory serves.

All of this is from cell…otherwise I would have been more verbose. For real even.

EDIT: for typos

More notes on FMA. The compiler-dev list chain starts with "a bug". Someone added to the compiler a transform that would convert Math.pow(x, 2) into x*x. The latter is both faster and more accurate. The bug filed was that the interpreter would compute Math.pow(x, 2), and once the code got compiled it would compute x*x, thus different results between the two. The proposed solution was to fix the interpreter to detect this case and actually compute x*x as well.

The person whose e-mail you linked was saying: this is not a bug; the method is not marked strictfp, so the transform should be considered legal. And if you go down the route of attempting to have bit-exact results across the interpreter and all compilers, then you're unreasonably restricting the optimizations the compiler can perform, because it's impossible for the interpreter to know how the code is going to be optimized; in fact it's part of the interpreter's job to collect information so the compiler can do its thing. The usage of FMA comes up in that context. That author missed an important point: this is exactly the situation the JVM was in before it started using SSE registers. Computation would be performed on x87 in 64-bit mode, so all 32-bit float ops were generating different results than the interpreter in non-strictfp methods.

The Stack Overflow thread doesn't apply. It's basically about how strictfp on modern CPUs doesn't do much of anything, since HotSpot isn't taking advantage of operations like FMA at the moment.

Casual reminder to make the type of

BIG_ENOUGH_FLOAT

double, otherwise there are rounding errors.
For example, the fastFloor function returns 100 for 99.999F when using just floats.
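A sketch with the constant held as a double, which fixes that case: the float input is widened to double before the add, so 99.999f + 65536.0 keeps enough bits to truncate to 65635 rather than rounding up to 65636.

```java
// fastFloor with a double bias: the float input is promoted to double,
// the add no longer loses the fractional part, and 99.999f floors to 99.
private static final int BIG_ENOUGH_INT = 65536;
private static final double BIG_ENOUGH_DOUBLE = BIG_ENOUGH_INT;

public static int fastFloor(float x) {
    return (int) (x + BIG_ENOUGH_DOUBLE) - BIG_ENOUGH_INT;
}
```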