Extremely Fast sine/cosine

Fixed it)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;                     // full period: 2*pi
	static final private int Size_SC_Ac = 5000;                   // table steps per period
	static final private int Size_SC_Ar = Size_SC_Ac + 1;         // +1 so index Size_SC_Ac is valid
	static final private float Sin[] = new float[Size_SC_Ar];
	static final private float Cos[] = new float[Size_SC_Ar];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ac;       // radians per table step
	static final private int Pi_SC_D_Rem = (int)(Pi_D / Pi_SC_D); // == Size_SC_Ac, used for wrapping
	static{
		for(int i = 0; i < Size_SC_Ar; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}
      
	static final public float sin(float r){
		if(r < 0){
			// Java's % keeps the sign of the dividend, so shift negative remainders back into range
			return Sin[((int)(r / Pi_SC_D) % Pi_SC_D_Rem) + Pi_SC_D_Rem];
		}
		else{
			return Sin[(int)(r / Pi_SC_D) % Pi_SC_D_Rem];
		}
	}

	static final public float cos(float r){
		if(r < 0){
			return Cos[((int)(r / Pi_SC_D) % Pi_SC_D_Rem) + Pi_SC_D_Rem];
		}
		else{
			return Cos[(int)(r / Pi_SC_D) % Pi_SC_D_Rem];
		}
	}
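
A minimal usage sketch (the class name FastTrig is hypothetical; the members above would live inside it):

	float approx = FastTrig.sin(1.5f);     // table lookup, no interpolation
	float exact  = (float) Math.sin(1.5);
	System.out.println("err = " + Math.abs(approx - exact));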


linear progression -0.5PI..+0.5PI:
	 java.math   64.369 ns/op
	 devmaster   29.441 ns/op
	 icecore     13.827 ns/op
	 riven       11.395 ns/op
	 kappa        8.976 ns/op

linear progression -8PI..+8PI:
	 java.math   93.948 ns/op
	 devmaster   30.792 ns/op
	 icecore     14.076 ns/op
	 riven       11.319 ns/op
	 kappa        9.439 ns/op

input from a float[] in L1 cache, -0.5PI..+0.5PI:
	 java.math   75.743 ns/op
	 devmaster   34.110 ns/op
	 icecore     23.574 ns/op
	 riven        3.197 ns/op
	 kappa       16.083 ns/op

Use Riven’s bit-mask version =)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;                   // full period: 2*pi
	static final private int Size_SC_Bits_Ac = 12;              // table size as a power of two
	static final private int Size_SC_Ar = 1 << Size_SC_Bits_Ac; // 4096 entries
	static final private int Size_SC_Bits_M = Size_SC_Ar - 1;   // bit mask for cheap wrapping
	static final private float Sin[] = new float[Size_SC_Ar];
	static final private float Cos[] = new float[Size_SC_Ar];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ar;     // radians per table step
	static{
		for(int i = 0; i < Size_SC_Ar; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}

	static final public float sin(float r){
		// & with the mask wraps both positive and negative indices into range
		return Sin[(int)(r / Pi_SC_D) & Size_SC_Bits_M];
	}

	static final public float cos(float r){
		return Cos[(int)(r / Pi_SC_D) & Size_SC_Bits_M];
	}

Now replace [icode]r / Pi_SC_D[/icode] with [icode]r * Inv_Pi_SC_D[/icode] and you’ve copied my code exactly :slight_smile:
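
In code, that suggestion amounts to precomputing the reciprocal once, so the hot path multiplies instead of divides (a sketch against the field names above; Inv_Pi_SC_D is a new name):

	static final private float Inv_Pi_SC_D = Size_SC_Ar / Pi_D; // == 1 / Pi_SC_D, precomputed

	static final public float sin(float r){
		// multiply by the reciprocal: typically cheaper than a float division
		return Sin[(int)(r * Inv_Pi_SC_D) & Size_SC_Bits_M];
	}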

Yep ^^
[spoiler]I can’t quite say he copied it: I changed the “if with %” into an “&”, with the table size aligned to a natural power of two =)
But yes, I saw this optimization here for the first time, so thanks for sharing it ;)[/spoiler]
(technically it’s the same algorithm, so no need to add a duplicate to the chart)

But I don’t know why the errors differ:


-0.5PI .. +0.5PI	min err		max err		avg err		stddev
icecore   		0.000000	0.001532	0.000485	0.000391
riven     		0.000000	0.002299	0.000609	0.000559

at 12 bits.
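
(For reference, a harness like the following sketch can produce such error statistics; the sample count is an assumption, and sin(r) stands for whichever table-based version is under test:)

	// Sweep -0.5*pi .. +0.5*pi and collect error statistics against Math.sin.
	static void errorStats() {
		int n = 1_000_000;
		double min = Double.MAX_VALUE, max = 0, sum = 0, sumSq = 0;
		for (int k = 0; k < n; k++) {
			float r = (float) (-0.5 * Math.PI + k * Math.PI / n);
			double err = Math.abs(sin(r) - Math.sin(r));
			min = Math.min(min, err);
			max = Math.max(max, err);
			sum += err;
			sumSq += err * err;
		}
		double avg = sum / n;
		double stddev = Math.sqrt(sumSq / n - avg * avg);
		System.out.printf("min %.6f  max %.6f  avg %.6f  stddev %.6f%n", min, max, avg, stddev);
	}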

After adding Icecore’s I accidentally used your method for the test, resulting in his being just as fast as yours ;); I didn’t catch it until this morning. I even updated the benchmark code with the mistake in it, but I guess no one else caught it either. On a side note, I am looking into better ways to test a range of results in JMH without increasing the test time by an order of magnitude, but the results have been pretty consistent regardless.
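
(For what it’s worth, a JMH benchmark over a range of inputs might be sketched like this; it is an assumption, not the benchmark code actually used in this thread:)

	import org.openjdk.jmh.annotations.*;
	import org.openjdk.jmh.infra.Blackhole;

	@State(Scope.Thread)
	public class SinBench {
		float[] args = new float[1024];

		@Setup
		public void setup() {
			// spread the inputs over -0.5*pi .. +0.5*pi
			for (int i = 0; i < args.length; i++)
				args[i] = (float) (-0.5 * Math.PI + i * Math.PI / args.length);
		}

		@Benchmark
		@OperationsPerInvocation(1024)
		public void tableSin(Blackhole bh) {
			for (float r : args)
				bh.consume(sin(r)); // whichever table-based sin is under test
		}
	}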

Added lerp to the code, don’t know why I didn’t before)
At 12 bits, printing the error with %f gives a largest error of 0.000003)


	static final private float Pi = (float)Math.PI;
	static final private float Pi_D = Pi * 2;
	static final private int Size_SC_Bits_Ac = 12;
	static final private int Size_SC_Ar = 1 << Size_SC_Bits_Ac;
	static final private int Size_SC_Bits_M = Size_SC_Ar - 1;
	// one extra element so the lerp can read index i + 1 without wrapping
	static final private float Sin[] = new float[Size_SC_Ar + 1];
	static final private float Cos[] = new float[Size_SC_Ar + 1];
	static final private float Pi_SC_D = Pi_D / Size_SC_Ar;
	static{
		for(int i = 0; i < Size_SC_Ar + 1; i++){
			double d = i * Pi_SC_D;
			Sin[i] = (float) Math.sin(d);
			Cos[i] = (float) Math.cos(d);
		}
	}

	static final public float sin_L_theagentd(float rad) {
		float index = rad / Pi_SC_D;

		int ii = (int)index;
		float alpha = index - ii;   // fractional part, used as the lerp weight

		int i = ii & Size_SC_Bits_M;

		float sin1 = Sin[i];
		float sin2 = Sin[i + 1];    // safe: the table has Size_SC_Ar + 1 entries

		return sin1 * (1f - alpha) + sin2 * alpha;
	}

	static final public float cos_L_theagentd(float rad) {
		float index = rad / Pi_SC_D;

		int ii = (int)index;
		float alpha = index - ii;

		int i = ii & Size_SC_Bits_M;

		float cos1 = Cos[i];
		float cos2 = Cos[i + 1];

		return cos1 * (1f - alpha) + cos2 * alpha;
	}

P.S. I tried adding lerp to atan, but the code became too visually heavy, so I threw it away)

That’s pretty nice, indeed!
I am currently experimenting with all sorts of fast sin/cos approximations to have the “best” of them included in JOML.
With Riven’s permission I now have his method in, but I have to say it really suffers from discretization, which becomes noticeable when one lowers the bit depth of the lookup table below, say, 10 bits. It causes (for example) a camera to stutter while rotating.
Using linear interpolation completely solves that problem!
Now I can really reduce the bit depth of the lookup table to 6 bits and still have ten times the accuracy of the non-interpolating algorithm.
I don’t know about the performance, though, since I don’t know how to really make a real-world benchmark with it.

[s]EDIT: Hmmm… there is a small bug in your linear interpolation code. If you compute sin or cos with an argument of 2.0 * PI - 0.0001, it raises an ArrayIndexOutOfBoundsException. It only happens with values just below 2 PI.

EDIT 2: I think the “bug” can be fixed by computing the modulo of (i + 1), too, like so:


float sin2 = sinTable[(i + 1) & lookupTableSizeMinus1];

[/s]

Strange, I don’t get any error:

		float res = sin_L_theagentd((float)(2.0 * Pi - 0.0001));
		float res2 = sin_L_theagentd((float)(2.0 * Pi + 0.0001));
		float res3 = sin_L_theagentd((float)(2.0 * Pi));

*I thought I had prevented this, hm…

int Size_SC_Bits_M = Size_SC_Ar - 1;
float Sin[] = new float[Size_SC_Ar + 1];

int i = ii & Size_SC_Bits_M; // i is at most Size_SC_Ar - 1, and the array has Size_SC_Ar + 1 elements
float sin2 = Sin[i + 1];     // so i + 1 is always in bounds

lol, I don’t remember why the +1/-1 offsets are there, but I do remember that they were important ^^

I solved that.

Large-table-based look-ups for sine/cosine et al. are only fast in naive benchmarks and in very narrow signal-processing cases.

Ahh sorry, @theagentd. I overlooked that +1 in the array allocation.
But regarding @Roquen’s comment: could you perhaps give a rough percentage of how much faster the table-based lookup is compared to other methods in WSW?

With so few discrete values (2^6 = 64, i.e. 360/64 = 5.6 deg per entry) you start to need to renormalize your interpolated values… I’d say you’d want at least 8 bits for game purposes.

`
cos(0.0deg) = 1.0
sin(0.0deg) = 0.0

cos(5.6deg) = 0.99522739998
sin(5.6deg) = 0.09758289975

cos(2.8deg) = 0.99880613734
sin(2.8deg) = 0.04884976979

dx = (cos(0.0deg) + cos(5.6deg)) * 0.5 = 0.99761369999
dy = (sin(0.0deg) + sin(5.6deg)) * 0.5 = 0.04879144987
err = 1.0 - sqrt(dx*dx + dy*dy)
err = 0.00119386265
`

This might seem low, but this inaccuracy quickly propagates up to noticeable errors: not in direction, but in magnitude.
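
One way to counter that is to renormalize the interpolated pair back onto the unit circle; a sketch using theagentd’s interpolated methods:

	// Pull the interpolated direction vector back onto the unit circle.
	float c = cos_L_theagentd(angle);
	float s = sin_L_theagentd(angle);
	float inv = 1f / (float) Math.sqrt(c * c + s * s);
	c *= inv;
	s *= inv;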

Yes, indeed. After applying a lot of small rotations to a matrix, there were not only rotations but also some scaling effects as well.
I ended up using 9 bits.

Since cache is that important, simply merge the cos and sin lookup tables and test whether that reduces cache misses to the point where the overall performance (in real-world tests) increases.


public static float cos(float v) {
   return sin(v + PI2); // assuming PI2 == PI / 2, since cos(v) == sin(v + pi/2)
}

with 9 bits, that would use (512+1)*4 = 2052 bytes, which hopefully lies in one 4K page.
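
Riven’s snippet drops the cos table entirely via cos(v) == sin(v + pi/2). Another possible reading of “merge” is interleaving, so the sin and cos for one index share a cache line; a sketch against the earlier field names:

	// Interleaved layout: SinCos[2*i] holds sin, SinCos[2*i + 1] holds cos.
	static final float[] SinCos = new float[2 * (Size_SC_Ar + 1)];
	static {
		for (int i = 0; i <= Size_SC_Ar; i++) {
			double d = i * Pi_SC_D;
			SinCos[2 * i]     = (float) Math.sin(d);
			SinCos[2 * i + 1] = (float) Math.cos(d);
		}
	}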

KaiHH: I’d be in a state of sin to attempt to generalize numbers. Where do you think you could get a performance increase in JOML via table look-ups?

The only time I use lots of sin() calculations is for our swinging grass to simulate a wind force based on the position of each straw.

Math.sin(): 15.01ms
9-bit lookup with interpolation: 4.45ms
6-bit lookup with interpolation: 4.45ms

This is for the entirety of the grass calculation, so it does lots of other things too.

That’s when table-based is at its best… you gotta make that cache load pay off. What’s the math? I’m not seeing an obvious sin here.


public class FastSineWindFunction implements WindFunction{

	@Override
	public float getWindX(float time, float x, float y) {
		return FastTrigonometry2.sin(time + x);
	}

	@Override
	public float getWindY(float time, float x, float y) {
		return FastTrigonometry2.sin(time + y);
	}
}

I also have an implementation which does Math.sin().

16.16 fixed point LUT.


	public static int[] sin = new int[2048];
	public static int[] cos = new int[2048];

	static {
		for (int i = 0; i < 2048; i++) {
			// 0.0030679615 == 2 * pi / 2048; values are stored as 16.16 fixed point
			sin[i] = (int) (65536.0 * Math.sin(i * 0.0030679615));
			cos[i] = (int) (65536.0 * Math.cos(i * 0.0030679615));
		}
	}

Usage


int angle = 1024;   // angle range is between 0 and 2047
// variable angles should be masked with &= 0x7FF to prevent out-of-bounds errors

int value = 100;
int multiplied = value * sin[angle] >> 16;   // multiply, then shift back down from 16.16


sum of angles.
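
The “sum of angles” remark presumably refers to the angle-addition identities; rotating a fixed-point vector applies exactly those. A sketch (widening to long before the multiply so the 16.16 products don’t overflow):

	// Rotate the 16.16 fixed-point vector (px, py) by 'angle' table steps.
	int angle = 512 & 0x7FF;   // 512 steps == 90 degrees on a 2048-step circle
	int px = 100 << 16, py = 0;
	int rx = (int) (((long) px * cos[angle] - (long) py * sin[angle]) >> 16);
	int ry = (int) (((long) px * sin[angle] + (long) py * cos[angle]) >> 16);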