Extremely Fast sine/cosine

Fixed it)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;                     // full period: 2*pi
	static final private int Size_SC_Ac = 5000;                   // table steps per period
	static final private int Size_SC_Ar = Size_SC_Ac + 1;         // +1 so index Size_SC_Ac is valid
	static final private float Sin[] = new float[Size_SC_Ar];
	static final private float Cos[] = new float[Size_SC_Ar];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ac;       // radians per table step
	static final private int Pi_SC_D_Rem = (int)(Pi_D / Pi_SC_D); // == Size_SC_Ac, used for wrapping
	static{
		for(int i = 0; i < Size_SC_Ar; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}
      
	static final public float sin(float r){
		if(r < 0){
			// Java's % keeps the sign of the dividend, so shift negative remainders back into range
			return Sin[((int)(r / Pi_SC_D) % Pi_SC_D_Rem) + Pi_SC_D_Rem];
		}
		else{
			return Sin[(int)(r / Pi_SC_D) % Pi_SC_D_Rem];
		}
	}

	static final public float cos(float r){
		if(r < 0){
			return Cos[((int)(r / Pi_SC_D) % Pi_SC_D_Rem) + Pi_SC_D_Rem];
		}
		else{
			return Cos[(int)(r / Pi_SC_D) % Pi_SC_D_Rem];
		}
	}
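
A minimal usage sketch (the class name FastTrig is hypothetical; the members above would live inside it):

	float approx = FastTrig.sin(1.5f);     // table lookup, no interpolation
	float exact  = (float) Math.sin(1.5);
	System.out.println("err = " + Math.abs(approx - exact));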


linear progression -0.5PI..+0.5PI:
	 java.math   64.369 ns/op
	 devmaster   29.441 ns/op
	 icecore     13.827 ns/op
	 riven       11.395 ns/op
	 kappa        8.976 ns/op

linear progression -8PI..+8PI:
	 java.math   93.948 ns/op
	 devmaster   30.792 ns/op
	 icecore     14.076 ns/op
	 riven       11.319 ns/op
	 kappa        9.439 ns/op

input from a float[] in L1 cache, -0.5PI..+0.5PI:
	 java.math   75.743 ns/op
	 devmaster   34.110 ns/op
	 icecore     23.574 ns/op
	 riven        3.197 ns/op
	 kappa       16.083 ns/op

Use Riven’s bit-mask version =)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;                   // full period: 2*pi
	static final private int Size_SC_Bits_Ac = 12;              // table size as a power of two
	static final private int Size_SC_Ar = 1 << Size_SC_Bits_Ac; // 4096 entries
	static final private int Size_SC_Bits_M = Size_SC_Ar - 1;   // bit mask for cheap wrapping
	static final private float Sin[] = new float[Size_SC_Ar];
	static final private float Cos[] = new float[Size_SC_Ar];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ar;     // radians per table step
	static{
		for(int i = 0; i < Size_SC_Ar; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}

	static final public float sin(float r){
		// & with the mask wraps both positive and negative indices into range
		return Sin[(int)(r / Pi_SC_D) & Size_SC_Bits_M];
	}

	static final public float cos(float r){
		return Cos[(int)(r / Pi_SC_D) & Size_SC_Bits_M];
	}

Now replace [icode]r / Pi_SC_D[/icode] with [icode]r * Inv_Pi_SC_D[/icode] and you’ve copied my code exactly :slight_smile:
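
In code, that suggestion amounts to precomputing the reciprocal once, so the hot path multiplies instead of divides (a sketch against the field names above; Inv_Pi_SC_D is a new name):

	static final private float Inv_Pi_SC_D = Size_SC_Ar / Pi_D; // == 1 / Pi_SC_D, precomputed

	static final public float sin(float r){
		// multiply by the reciprocal: typically cheaper than a float division
		return Sin[(int)(r * Inv_Pi_SC_D) & Size_SC_Bits_M];
	}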

Yep ^^
[spoiler]I can’t quite say he copied it: I changed the “if with %” into an “&”, with the table size aligned to a natural power of two =)
But yes, I saw this optimization here for the first time, so thanks for sharing it ;)[/spoiler]
(technically it’s the same algorithm, so no need to add a duplicate to the chart)

But I don’t know why the errors differ:


-0.5PI .. +0.5PI	min err		max err		avg err		stddev
icecore   		0.000000	0.001532	0.000485	0.000391
riven     		0.000000	0.002299	0.000609	0.000559

at 12 bits.
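
(For reference, a harness like the following sketch can produce such error statistics; the sample count is an assumption, and sin(r) stands for whichever table-based version is under test:)

	// Sweep -0.5*pi .. +0.5*pi and collect error statistics against Math.sin.
	static void errorStats() {
		int n = 1_000_000;
		double min = Double.MAX_VALUE, max = 0, sum = 0, sumSq = 0;
		for (int k = 0; k < n; k++) {
			float r = (float) (-0.5 * Math.PI + k * Math.PI / n);
			double err = Math.abs(sin(r) - Math.sin(r));
			min = Math.min(min, err);
			max = Math.max(max, err);
			sum += err;
			sumSq += err * err;
		}
		double avg = sum / n;
		double stddev = Math.sqrt(sumSq / n - avg * avg);
		System.out.printf("min %.6f  max %.6f  avg %.6f  stddev %.6f%n", min, max, avg, stddev);
	}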

After adding Icecore’s I accidentally used your method for the test, resulting in his being just as fast as yours ;); I didn’t catch it until this morning. I even updated the benchmark code with the mistake in it, but I guess no one else caught it either. On a side note, I am looking into better ways to test a range of results in JMH without increasing the test time by an order of magnitude, but the results have been pretty consistent regardless.
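
(For what it’s worth, a JMH benchmark over a range of inputs might be sketched like this; it is an assumption, not the benchmark code actually used in this thread:)

	import org.openjdk.jmh.annotations.*;
	import org.openjdk.jmh.infra.Blackhole;

	@State(Scope.Thread)
	public class SinBench {
		float[] args = new float[1024];

		@Setup
		public void setup() {
			// spread the inputs over -0.5*pi .. +0.5*pi
			for (int i = 0; i < args.length; i++)
				args[i] = (float) (-0.5 * Math.PI + i * Math.PI / args.length);
		}

		@Benchmark
		@OperationsPerInvocation(1024)
		public void tableSin(Blackhole bh) {
			for (float r : args)
				bh.consume(sin(r)); // whichever table-based sin is under test
		}
	}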

Added lerp to the code, don’t know why I didn’t before)
At 12 bits, printing the error with %f gives a largest error of 0.000003)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;
	static final private int Size_SC_Bits_Ac = 12;
	static final private int Size_SC_Ar = 1 << Size_SC_Bits_Ac;
	static final private int Size_SC_Bits_M = Size_SC_Ar - 1;
	// one extra element so the lerp can read index i + 1 without wrapping
	static final private float Sin[] = new float[Size_SC_Ar + 1];
	static final private float Cos[] = new float[Size_SC_Ar + 1];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ar;
	static{
		for(int i = 0; i < Size_SC_Ar + 1; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}

	static final public float sin_L_theagentd(float rad) {
		float index = rad / Pi_SC_D;

		int ii = (int)index;
		float alpha = index - ii;   // fractional part, used as the lerp weight

		int i = ii & Size_SC_Bits_M;

		float sin1 = Sin[i];
		float sin2 = Sin[i + 1];    // safe: the table has Size_SC_Ar + 1 entries

		return sin1 * (1f - alpha) + sin2 * alpha;
	}

	static final public float cos_L_theagentd(float rad) {
		float index = rad / Pi_SC_D;

		int ii = (int)index;
		float alpha = index - ii;

		int i = ii & Size_SC_Bits_M;

		float cos1 = Cos[i];
		float cos2 = Cos[i + 1];

		return cos1 * (1f - alpha) + cos2 * alpha;
	}

P.S. I tried adding lerp to atan, but the code became too visually heavy, so I threw it away)

That’s pretty nice, indeed!
I am currently experimenting with all sorts of fast sin/cos approximations to have the “best” of them included in JOML.
With Riven’s permission I now have his method in, but I have to say it really suffers from discretization, which becomes noticeable when one lowers the bit depth of the lookup table below, say, 10 bits. It causes (for example) a camera to stutter while rotating.
Using linear interpolation completely solves that problem!
Now I can really reduce the bit depth of the lookup table to 6 bits and still have ten times the accuracy of the non-interpolating algorithm.
I don’t know about the performance, though, since I don’t know how to really make a real-world benchmark with it.

[s]EDIT: Hmmm… there is a small bug in your linear interpolation code. If you compute sin or cos with an argument of 2.0 * PI - 0.0001, it raises an ArrayIndexOutOfBoundsException. It only happens with values just below 2 PI.

EDIT 2: I think the “bug” can be fixed by computing the modulo of (i + 1), too, like so:


float sin2 = sinTable[(i + 1) & lookupTableSizeMinus1];

[/s]

Strange, I don’t get any error:

		float res = sin_L_theagentd((float)(2.0 * Pi - 0.0001));
		float res2 = sin_L_theagentd((float)(2.0 * Pi + 0.0001));
		float res3 = sin_L_theagentd((float)(2.0 * Pi));

*I thought I had prevented this, hm…

int Size_SC_Bits_M = Size_SC_Ar - 1;
float Sin[] = new float[Size_SC_Ar + 1];

int i = ii & Size_SC_Bits_M; // i is at most Size_SC_Ar - 1, and the array has Size_SC_Ar + 1 elements
float sin2 = Sin[i + 1];     // so i + 1 is always in bounds

lol, I don’t remember why the +1/-1 offsets are there, but I do remember that they were important ^^

I solved that.

Large-table-based look-ups for sine/cosine et al. are only fast in naive benchmarks and in very narrow signal-processing cases.

Ahh sorry, @theagentd. I overlooked that +1 in the array allocation.
But regarding @Roquen’s comment: could you perhaps give a rough percentage of how much faster the table-based lookup is compared to other methods in WSW?

With so few discrete values (2^6 = 64, i.e. 360/64 = 5.6 deg per entry) you start to need to renormalize your interpolated values… I’d say you’d want at least 8 bits for game purposes.

`
cos(0.0deg) = 1.0
sin(0.0deg) = 0.0

cos(5.6deg) = 0.99522739998
sin(5.6deg) = 0.09758289975

cos(2.8deg) = 0.99880613734
sin(2.8deg) = 0.04884976979

dx = (cos(0.0deg) + cos(5.6deg)) * 0.5 = 0.99761369999
dy = (sin(0.0deg) + sin(5.6deg)) * 0.5 = 0.04879144987
err = 1.0 - sqrt(dx*dx + dy*dy)
err = 0.00119386265
`

This might seem low, but this inaccuracy quickly propagates up to noticeable errors: not in direction, but in magnitude.
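
One way to counter that is to renormalize the interpolated pair back onto the unit circle; a sketch using theagentd’s interpolated methods:

	// Pull the interpolated direction vector back onto the unit circle.
	float c = cos_L_theagentd(angle);
	float s = sin_L_theagentd(angle);
	float inv = 1f / (float) Math.sqrt(c * c + s * s);
	c *= inv;
	s *= inv;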

Yes, indeed. After applying a lot of small rotations to a matrix, there were not only rotations but also some scaling effects as well.
I ended up using 9 bits.

Since cache is that important, simply merge the cos and sin lookup tables and test whether that reduces cache misses to the point where the overall performance (in real-world tests) increases.


public static float cos(float v) {
   return sin(v + PI2); // assuming PI2 == PI / 2, since cos(v) == sin(v + pi/2)
}

with 9 bits, that would use (512+1)*4 = 2052 bytes, which hopefully lies in one 4K page.
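
Riven’s snippet drops the cos table entirely via cos(v) == sin(v + pi/2). Another possible reading of “merge” is interleaving, so the sin and cos for one index share a cache line; a sketch against the earlier field names:

	// Interleaved layout: SinCos[2*i] holds sin, SinCos[2*i + 1] holds cos.
	static final float[] SinCos = new float[2 * (Size_SC_Ar + 1)];
	static {
		for (int i = 0; i <= Size_SC_Ar; i++) {
			double d = i * Pi_SC_D;
			SinCos[2 * i]     = (float) Math.sin(d);
			SinCos[2 * i + 1] = (float) Math.cos(d);
		}
	}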

KaiHH: I’d be in a state of sin to attempt to generalize numbers. Where do you think you could get a performance increase in JOML via table look-ups?

The only time I use lots of sin() calculations is for our swinging grass to simulate a wind force based on the position of each straw.

Math.sin(): 15.01ms
9-bit lookup with interpolation: 4.45ms
6-bit lookup with interpolation: 4.45ms

This is for the entirety of the grass calculation, so it does lots of other things too.

That’s when table-based is at its best… you gotta make that cache load pay off. What’s the math? I’m not seeing an obvious sin here.


public class FastSineWindFunction implements WindFunction{

	@Override
	public float getWindX(float time, float x, float y) {
		return FastTrigonometry2.sin(time + x);
	}

	@Override
	public float getWindY(float time, float x, float y) {
		return FastTrigonometry2.sin(time + y);
	}
}

I also have an implementation which does Math.sin().

16.16 fixed point LUT.


	public static int[] sin = new int[2048];
	public static int[] cos = new int[2048];

	static {
		for (int i = 0; i < 2048; i++) {
			// 0.0030679615 == 2 * pi / 2048; values are stored as 16.16 fixed point
			sin[i] = (int) (65536.0 * Math.sin(i * 0.0030679615));
			cos[i] = (int) (65536.0 * Math.cos(i * 0.0030679615));
		}
	}

Usage


int angle = 1024;   // angle range is between 0 and 2047
// variable angles should be masked with &= 0x7FF to prevent out-of-bounds errors

int value = 100;
int multiplied = value * sin[angle] >> 16;   // multiply, then shift back down from 16.16


sum of angles.
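
The “sum of angles” remark presumably refers to the angle-addition identities; rotating a fixed-point vector applies exactly those. A sketch (widening to long before the multiply so the 16.16 products don’t overflow):

	// Rotate the 16.16 fixed-point vector (px, py) by 'angle' table steps.
	int angle = 512 & 0x7FF;   // 512 steps == 90 degrees on a 2048-step circle
	int px = 100 << 16, py = 0;
	int rx = (int) (((long) px * cos[angle] - (long) py * sin[angle]) >> 16);
	int ry = (int) (((long) px * sin[angle] + (long) py * cos[angle]) >> 16);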