Are Sin/Cos lookup tables still relevant today with regards to performance?

heisenbergman · June 10, 2013, 9:06am

Just asking because I created this thread on TIG: Is there a known system of converting degrees to (x,y) coordinates? (which so blatantly exposes that I totally fail at trigonometry, but that’s beside the point, lol :P).

If you’ll notice somewhere in the middle of Page 1, someone drops by the thread and suggests that I could implement Cos/Sin lookup by storing the values into a table for future reference. He subsequently gets ripped to shreds by some of the members of TIG.

Only later on by some serendipitous turn of fate (particularly by the timely necrobump of this thread) do I realize that he was in fact referencing Riven’s sin/cos lookup thread here on JGO.

That suggestion to use sin/cose lookup tables has gotten some hostile responses on that thread. Such as:

[quote]It is NOT an alternative. Just forget about it, ok?

Reasoning: If you use sin/cos like this in normal code, the first level cache is usually filled with the current data. A table-based approach would require parts of the table to be loaded to first level cache, and would flush other data from there. But the table needs to be in L1 cache to be efficient, even a L2 access would already take longer than the basic math to calculate a sin/cos. It’s just a few multiplications and additions after all. Look up taylor series if you want to see how few operations exactly. And all of those happen in registers without memory access, some parts even parallelized by the CPU’s OOO pipeline.
[/quote]

[quote]What I see here is deprecated wisdom. Ten years ago a table might have saved the day. Today’s CPUs are far more complex and incredibly more smart, and the burden has shifted heavily towards memory access.
[/quote]
Now, I wonder… is this approach towards looking-up sin/cos values still relevant for performance today? Or do the posts calling this logic “deprecated wisdom” have some value?

Riven · June 10, 2013, 10:23am

I don’t really care about people using CAPITALS to make their point

All I know is that in real-world use cases, my sin/cos lookup tables make a REAL difference in performance. :point:

pjt33 · June 10, 2013, 10:53am

[quote=“heisenbergman,post:1,topic:42462”]
There is only one way to get a useful answer to this: profile. I would start out using the standard library and only look at replacements if you see Math.sin or Math.cos actually showing up in your profiling reports.

princec · June 10, 2013, 11:03am

+1 to what Riven said - we get significant performance boosts using LUTs. Many of the TIGs deriders probably forget that the intrinsic sin/cos functions have to be mathematically correct for all possible parameter values whereas we can make an optimisation knowing that we have a pretty small and finite set of values and a less stringent requirement for accuracy.

Cas

Oskuro · June 10, 2013, 2:02pm

[quote=“heisenbergman,post:1,topic:42462”]
That someone being me :emo:

heisenbergman · June 10, 2013, 2:29pm

oh hai thar Oskuro! I should have known you’d be here. I appreciate the contribution to that thread. I might not be using this approach for now but it’s nice to be aware of it.

Oskuro · June 10, 2013, 2:44pm

You’re welcome!

I essentially made the suggestion because I’m adopting the approach for my game, so I kind of had it on my mind.

Wish I had it running so I could do some profiling, though.

Roquen · June 10, 2013, 3:37pm

It depends on the real usage patterns, including how memory is being accessed otherwise and what you’re comparing it against. My experience is no…but that doesn’t mean a alot.

Cero · June 10, 2013, 4:28pm

Sry I do that too.
Reason is in simple chats/irc you dont have formatting often times, and its a habit =D

Oskuro · June 10, 2013, 9:00pm

I’m still having trouble understanding how exactly is Riven using the SIN_MASK to filter angle values.

I kind of understand it, but not enough to tinker with it so I can clamp other value ranges.

relminator · June 11, 2013, 1:40am

SIN_MASK is a powof2value-1. Basically you construct your lut to have a pow of two max index so that you don’t use the slower mod % or an if then else to check for out of range. AND is much faster than a DIV.

I would hazard a guess that riven converted his angles to Bradians (binary radians where a full revolution is a POT)

Been using this trick in some of my older games and gfx demos.

There’s also another trick to minimize memory use by 1/8 by “octantifying” angles since sin and cos are periodic (actually 1/16 if you factor the fact that sin and cos are complimentary so you can just calculate either by adding 1/2pi ti the index). This is not useful on the pc though as we have memory to burn.

Roquen · June 11, 2013, 5:26am

“memory to burn” is both true and very untrue. Memory size isn’t so important…memory motion is very very important.

relminator · June 11, 2013, 6:05am

I was speaking in terms of size instead of speed.

Roquen · June 11, 2013, 12:10pm

Well the real answer that that thread is: don’t use angles…angles hate you, so hate 'em back.

Riven · June 11, 2013, 1:38pm

I’d like to say that ‘my’ lookup tables take 16KB of memory, or 4 pages. That means that they are basically always in L2 cache, and not big enough to push other data out. They can even pretty much fully stay within L1 cache, in your inner loop.


math=202,067us	full[]=7,586us	half[]=12,851us	taylor=26,783us
math=201,610us	full[]=7,701us	half[]=12,878us	taylor=26,448us
math=202,474us	full[]=7,672us	half[]=12,886us	taylor=26,438us
math=202,321us	full[]=7,592us	half[]=12,864us	taylor=26,306us

I know it’s micro-benchmark, but it’s rotating vectors, which is a pretty common use case:


import java.util.Random;

public class FastMathBench {
	public static void main(String[] args) {
		System.out.println(Math.sin(0.5f));
		System.out.println(sinFull(0.5f));
		System.out.println(sinHalf(0.5f));
		System.out.println(sinTaylor(0.5f));
		System.out.println();

		System.out.println(Math.sin(4.5f));
		System.out.println(sinFull(4.5f));
		System.out.println(sinHalf(4.5f));
		System.out.println(sinTaylor(4.5f));
		System.out.println();

		float[] angles = new float[1024 * 1024];
		float[] vectors = new float[angles.length * 2];

		Random r = new Random(235);
		for (int i = 0; i < angles.length; i++) {
			angles[i] = (float) (r.nextDouble() * Math.PI * 2.0);
			vectors[i * 2 + 0] = r.nextFloat();
			vectors[i * 2 + 1] = r.nextFloat();
		}

		for (int i = 0; i < 16; i++) {
			long t0 = System.nanoTime();
			benchMathSin(vectors, angles);
			long t1 = System.nanoTime();
			benchLookupFullSin(vectors, angles);
			long t2 = System.nanoTime();
			benchLookupHalfSin(vectors, angles);
			long t3 = System.nanoTime();
			benchLookupTaylorSin(vectors, angles);
			long t4 = System.nanoTime();

			long tMath = (t1 - t0) / 1000;
			long tLUT1 = (t2 - t1) / 1000;
			long tLUT2 = (t3 - t2) / 1000;
			long tTayl = (t4 - t3) / 1000;

			System.out.println("math=" + tMath + "us\tfull[]=" + tLUT1 + "us\thalf[]=" + tLUT2 + "us\ttaylor=" + tTayl + "us");
		}
	}

	private static void benchMathSin(float[] vectors, float[] angles) {
		for (int i = 0; i < angles.length; i++) {
			float x = vectors[i * 2 + 0];
			float y = vectors[i * 2 + 1];
			float a = angles[i];

			float cos = (float) Math.cos(a);
			float sin = (float) Math.sin(a);

			vectors[i * 2 + 0] = x * cos - y * sin;
			vectors[i * 2 + 1] = y * cos + x * sin;
		}
	}

	private static void benchLookupFullSin(float[] vectors, float[] angles) {
		for (int i = 0; i < angles.length; i++) {
			float x = vectors[i * 2 + 0];
			float y = vectors[i * 2 + 1];
			float a = angles[i];

			float cos = sinFull(a);
			float sin = cosFull(a);

			vectors[i * 2 + 0] = x * cos - y * sin;
			vectors[i * 2 + 1] = y * cos + x * sin;
		}
	}

	private static void benchLookupHalfSin(float[] vectors, float[] angles) {
		for (int i = 0; i < angles.length; i++) {
			float x = vectors[i * 2 + 0];
			float y = vectors[i * 2 + 1];
			float a = angles[i];

			float cos = cosHalf(a);
			float sin = sinHalf(a);

			vectors[i * 2 + 0] = x * cos - y * sin;
			vectors[i * 2 + 1] = y * cos + x * sin;
		}
	}

	private static void benchLookupTaylorSin(float[] vectors, float[] angles) {
		for (int i = 0; i < angles.length; i++) {
			float x = vectors[i * 2 + 0];
			float y = vectors[i * 2 + 1];
			float a = angles[i];

			float cos = cosTaylor(a);
			float sin = sinTaylor(a);

			vectors[i * 2 + 0] = x * cos - y * sin;
			vectors[i * 2 + 1] = y * cos + x * sin;
		}
	}

	private static final float RAD, DEG, SIN_TO_COS;
	private static final int SIN_BITS, SIN_MASK, SIN_MASK2, SIN_COUNT, SIN_COUNT2;
	private static final float radFull, radToIndex;
	private static final float degFull, degToIndex;
	private static final float[] sinFull, sinHalf;

	static {
		RAD = (float) Math.PI / 180.0f;
		DEG = 180.0f / (float) Math.PI;
		SIN_TO_COS = (float) (Math.PI * 0.5f);

		SIN_BITS = 12;
		SIN_MASK = ~(-1 << SIN_BITS);
		SIN_MASK2 = SIN_MASK >> 1;
		SIN_COUNT = SIN_MASK + 1;
		SIN_COUNT2 = SIN_MASK2 + 1;

		radFull = (float) (Math.PI * 2.0);
		degFull = (float) (360.0);
		radToIndex = SIN_COUNT / radFull;
		degToIndex = SIN_COUNT / degFull;

		sinFull = new float[SIN_COUNT];
		for (int i = 0; i < SIN_COUNT; i++) {
			sinFull[i] = (float) Math.sin((i + Math.min(1, i % (SIN_COUNT / 4)) * 0.5) / SIN_COUNT * radFull);
		}

		sinHalf = new float[SIN_COUNT2];
		for (int i = 0; i < SIN_COUNT2; i++) {
			sinHalf[i] = (float) Math.sin((i + Math.min(1, i % (SIN_COUNT / 4)) * 0.5) / SIN_COUNT * radFull);
		}
	}

	//

	public static final float sinFull(float rad) {
		return sinFull[(int) (rad * radToIndex) & SIN_MASK];
	}

	public static final float cosFull(float rad) {
		return sinFull(rad + SIN_TO_COS);
	}

	//

	public static final float sinHalf(float rad) {
		int index1 = (int) (rad * radToIndex) & SIN_MASK;
		int index2 = index1 & SIN_MASK2;
		// int mul = (((index1 - index2) >>> 31) << 1) + 1;
		int mul = ((index1 == index2) ? +1 : -1);
		return sinHalf[index2] * mul;
	}

	public static final float cosHalf(float rad) {
		return sinHalf(rad + SIN_TO_COS);
	}

	//

	public static final float sinTaylor(float rad) {
		// TODO: clamp to -PI..+PI

		double x = rad;

		double x2 = x * x;
		double x3 = x2 * x;
		double x5 = x2 * x3;
		double x7 = x2 * x5;
		double x9 = x2 * x7;
		double x11 = x2 * x9;
		double x13 = x2 * x11;
		double x15 = x2 * x13;
		double x17 = x2 * x15;

		double val = x;
		val -= x3 * 0.16666666666666666666666666666667;
		val += x5 * 0.00833333333333333333333333333333;
		val -= x7 * 1.984126984126984126984126984127e-4;
		val += x9 * 2.7557319223985890652557319223986e-6;
		val -= x11 * 2.5052108385441718775052108385442e-8;
		val += x13 * 1.6059043836821614599392377170155e-10;
		val -= x15 * 7.6471637318198164759011319857881e-13;
		val += x17 * 2.8114572543455207631989455830103e-15;
		return (float) val;
	}

	public static final float cosTaylor(float rad) {
		return sinTaylor(rad + SIN_TO_COS);
	}
}

relminator · June 11, 2013, 2:13pm

Slightly offtopic:

Could luts being faster in java be attributed to how Math.xxx behave? Ie. returning doubles instead of floats?

I remember when I coded this 256 byte demo:

www.pouet.net/prod.php?which=24821

That st0 returns a 32bit float. I’m not sure about the newer pc architectures though.

Roquen · June 11, 2013, 2:35pm

@Riven: exactly. If you don’t need continuous sin/cos and you’re intensely calling the table-based function then the big hit comes up front and after that table will be (on newer cards) all in L1. You have a ~5 cycle latency to get a 32-bit out of L1 into a register which will likely to be hidden behind other ALU ops. So win in that situation. WRT the taylor series, you can get within a 3-ulp accurate result with about 6 muls using mini-max, ignoring range reduction and you’re benchmark version isn’t formed as Horner, so you’re already doing twice as many muls as needed and have a very long dependency chain.

Riven · June 11, 2013, 2:38pm

I actually reduced the dependency chain, but HotSpot didn’t like it, and produced slower native code.

Oskuro · June 11, 2013, 2:41pm

Actually, I understand that. My problem is understanding why the MASK works (Yes, extreme ignorance of binary arithmetics, sorry).

For example, in my game I’m simplifying angles so there are only 16 possible options, so my mask is built like:


int MASK = ~(-1 << 4);

// ...

public static double lut_sin(int value) { return sin_lut[value & MASK]; }

Which means that whatever the integer value I punch in, I get and index value between 0 and 15 (sorry if the code doesn’t match this assumption, I don’t have the actual working source at hand now).

So I’m shifting -1 four bits, and then negating it. I get that. My guess is that shifting the -1 is to have the sign bit into the resulting value.

What I don’t understand is how that results in the angle value being parsed correctly despite it being negative or larger than the boundary.

And yes, I’m experimenting and coming to a conclusion myself, but wouldn’t mind some help, I love these bit-arithmetic tricks

Roquen · June 11, 2013, 2:47pm

@Riven: interesting. Can you spew out the asm?