Extremely Fast atan2

Cool, thanks for the benchmark. There's also an implementation of fast sin/cos found here which you might also want to try.

Very important
My benchmarks were flawed and the JIT was doing magic, so I updated them with more realistic versions. The JIT precomputed and cached the results because the parameter wasn’t changing. The corrected benchmarks reference a public variable instead, preventing the JIT from caching the result.
Riven’s is the fastest, kappa is the runner up.
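For illustration, a minimal sketch of the pitfall and the fix (hypothetical names, not the actual benchmark harness): with loop-invariant arguments the JIT can hoist or cache the atan2 call, so the loop stops measuring it; reading a mutable public field every iteration prevents that.

	public static volatile float blackhole; // consume results so the call isn't dead code
	public static float inY = 0.7f, inX = 0.3f; // public and mutable: the JIT can't treat them as constants

	static void flawed(int iterations) {
		float y = 0.7f, x = 0.3f; // loop-invariant arguments: the JIT may hoist/cache the result
		for (int i = 0; i < iterations; i++) {
			blackhole = (float) Math.atan2(y, x);
		}
	}

	static void corrected(int iterations) {
		for (int i = 0; i < iterations; i++) {
			inY += 1e-7f; // keep the inputs changing between iterations
			blackhole = (float) Math.atan2(inY, inX);
		}
	}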

These are great optimisations.

Ordering points around the player’s position in an anti-clockwise direction is an interesting use case. I made heavy use of atan2 for ordering points in pathfinding and line-of-sight algorithms.
But then I heard about an algorithm called relativeCCW that can be used to compare points’ relative angles much more quickly, because it avoids trigonometry entirely. It’s in the JDK’s java.awt.geom.Line2D source:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/awt/geom/Line2D.java#Line2D.relativeCCW(double%2Cdouble%2Cdouble%2Cdouble%2Cdouble%2Cdouble)

Might be interesting to people using atan2 for comparing angles.
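For reference, a minimal sketch of the idea behind relativeCCW (my own simplified version, not the JDK code): the ordering of two points around an origin can be decided from the sign of a 2D cross product, with no trigonometry. A full angular sort still needs the points split into half-planes first, since the cross-product sign alone cannot totally order points spanning more than 180 degrees.

	// Returns +1 if b is counter-clockwise from a around the origin o,
	// -1 if clockwise, 0 if collinear. Hypothetical helper, not JDK API.
	static int ccw(double ox, double oy, double ax, double ay, double bx, double by) {
		double cross = (ax - ox) * (by - oy) - (ay - oy) * (bx - ox);
		return cross > 0 ? 1 : (cross < 0 ? -1 : 0);
	}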

Last op ^^


	static final private int Size_Ac = 1000;
	static final private int Size_Ar = Size_Ac + 1;
	static final private float Pi = (float)Math.PI;
	static final private float Pi_H = Pi / 2;
	
	static final private float Atan2[] = new float[Size_Ar];
	static final private float Atan2_PM[] = new float[Size_Ar];
	static final private float Atan2_MP[] = new float[Size_Ar];
	static final private float Atan2_MM[] = new float[Size_Ar];
	
	static final private float Atan2_R[] = new float[Size_Ar];
	static final private float Atan2_RPM[] = new float[Size_Ar];
	static final private float Atan2_RMP[] = new float[Size_Ar];
	static final private float Atan2_RMM[] = new float[Size_Ar];
	// Precompute atan2 for the first octant (x = 1, y/x in [0..1]); the other
	// seven octants are covered by the sign/offset symmetries stored below.
	static{
		for(int i = 0; i <= Size_Ac; i++){
			double d = (double)i / Size_Ac;
			double x = 1;
			double y = x * d;
			float v = (float)Math.atan2(y, x);
			Atan2[i] = v;
			Atan2_PM[i] = Pi - v;
			Atan2_MP[i] = -v;
			Atan2_MM[i] = -Pi + v;
			
			Atan2_R[i] = Pi_H - v;
			Atan2_RPM[i] = Pi_H + v;
			Atan2_RMP[i] = -Pi_H + v;
			Atan2_RMM[i] = -Pi_H - v;
			// or
			/*
			Atan2[i] = (float)Math.atan2(y, x);
			Atan2_PM[i] = (float)Math.atan2(y, -x);
			Atan2_MP[i] = (float)Math.atan2(-y, x);
			Atan2_MM[i] = (float)Math.atan2(-y, -x);
			
			Atan2_R[i] = (float)Math.atan2(x, y);
			Atan2_RPM[i] = (float)Math.atan2(x, -y);
			Atan2_RMP[i] = (float)Math.atan2(-x, y);
			Atan2_RMM[i] = (float)Math.atan2(-x, -y);
			*/
		}
	}
	// Table-based atan2: pick the octant from the signs and relative magnitudes
	// of y and x, then index the matching table by the ratio of the smaller
	// magnitude to the larger one (always in [0..1]).
	static final public float atan2_Op_2(float y, float x){
		if(y < 0){
			if(x < 0){
				// here y < x means |y| > |x|, since both are negative (-y > -x)
				if(y < x) return Atan2_RMM[(int)(x / y * Size_Ac)];
				else return Atan2_MM[(int)(y / x * Size_Ac)];
			}
			else{
				y = -y;
				if(y > x) return Atan2_RMP[(int)(x / y * Size_Ac)];
				else return Atan2_MP[(int)(y / x * Size_Ac)];
			}
		}
		else{
			if(x < 0){
				x = -x;
				if(y > x) return Atan2_RPM[(int)(x / y * Size_Ac)];
				else return Atan2_PM[(int)(y / x * Size_Ac)];
			}
			else{
				if(y > x) return Atan2_R[(int)(x / y * Size_Ac)];
				else return Atan2[(int)(y / x * Size_Ac)];
			}
		}
	}

Value check


	static public void main(String[] args){
		int range = 100;
		float ep = (Pi * 2) / (Size_Ac * 6); // passes the check
		//float ep = (Pi * 2) / (Size_Ac * 4); // regular bound
		//float ep = (Pi * 2) / (Size_Ac * 8); // theoretical bound, but float rounding in the error calculation gives different results
		System.out.println("Accuracy Check: " + ep);
		for(int x = -range; x <= range; x++){
			for(int y = -range; y <= range; y++){
				float a = (float)Math.atan2(x, y);
				float b = atan2_Op_2(x, y);
				float dif = a - b;
				if(Math.abs(dif) > ep){
					System.out.println("Dif x=" + x + " y= " + y);
					System.out.println(a);
					System.out.println(b);
				}
			}
		}
		System.out.println("Accuracy Check End");
	}

I ran the benchmark with the addition of Icecore’s atan2. I’ll be updating the main post shortly. It appears to be in first place by a large margin.

Also, I updated the accuracy measure to actually have somewhat relevant data (as opposed to me just randomly testing a couple of values).

Nice :)

Most interesting: in theory, this pattern can be used to cache the results of any function that takes two primitive parameters (like floats).


         double d = (double)i / Size_Ac;
         double x = 1;
         double y = x * d;
         Ar_PP[i] = fun(y, x);
         Ar_PM[i] = fun(y, -x);
         Ar_MP[i] = fun(-y, x);
         Ar_MM[i] = fun(-y, -x);
         
         Ar_RPP[i] = fun(x, y);
         Ar_RPM[i] = fun(x, -y);
         Ar_RMP[i] = fun(-x, y);
         Ar_RMM[i] = fun(-x, -y);
         
         // the lookup works the same way as for atan2
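
A hypothetical sketch of the matching lookup side (my own, assuming that, like atan2, fun(y, x) depends only on the ratio y/x, i.e. is scale-invariant; Ar_* and fun are the names from the snippet above, and the branching mirrors atan2_Op_2):

	static float fun_Lookup(float y, float x){
		if(y < 0){
			if(x < 0){
				if(y < x) return Ar_RMM[(int)(x / y * Size_Ac)]; // |y| > |x|
				else return Ar_MM[(int)(y / x * Size_Ac)];
			}
			else{
				y = -y;
				if(y > x) return Ar_RMP[(int)(x / y * Size_Ac)];
				else return Ar_MP[(int)(y / x * Size_Ac)];
			}
		}
		else{
			if(x < 0){
				x = -x;
				if(y > x) return Ar_RPM[(int)(x / y * Size_Ac)];
				else return Ar_PM[(int)(y / x * Size_Ac)];
			}
			else{
				if(y > x) return Ar_RPP[(int)(x / y * Size_Ac)];
				else return Ar_PP[(int)(y / x * Size_Ac)];
			}
		}
	}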

But keep in mind that, because of the limited accuracy, the result may be the previous or next cached value.

Example:
cached values (1, 2, 3, 4, 5): you want 3 but may get 4
cached values (4, 0, 9, 1, 2, 4): you want 9 but may get 0


         float ep = (maxReturnVal - minReturnVal) / (Size_Ac * 4); // max/min over the cached function's return values

In practice it’s not so easy, and it only works under certain conditions, but why not try it on a custom function? :)
(Don’t forget to check that the results are equal within epsilon.)

@mooman219: Thanks for the comparison! :slight_smile: Could you maybe add the atan2 implementation from the Apache FastMath library? http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/util/FastMath.html#atan2(double,%20double)

It would be really interesting to see how it performs compared to the algorithms here.

I’ll include it in the benchmark if you would like. Results so far have it at ~400 ns/op; I’ll update the main post shortly.

Edit: Main post is updated. It is about 1.58x faster than the default while maintaining full accuracy. I’m assuming they had an accuracy contract to fulfill that prevented the heavy optimization we’re doing in this thread.

With a lookup table, we can simply increase its size to increase accuracy.


static final private int Size_Ac = 1000000;

1000000 gives “Largest Error 0.000001”
(a 1,000,000-entry float[] is about 4 MB)
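A quick way to verify a figure like that (hypothetical check, not from the post) is to scan random inputs and track the largest deviation from Math.atan2:

	// Empirically measure the largest error of the table-based atan2.
	static float largestError(int samples) {
		float worst = 0;
		for (int i = 0; i < samples; i++) {
			float y = (float) (Math.random() * 200 - 100);
			float x = (float) (Math.random() * 200 - 100);
			float dif = Math.abs((float) Math.atan2(y, x) - atan2_Op_2(y, x));
			if (dif > worst) worst = dif;
		}
		return worst;
	}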

  • This can also be used to reduce memory usage at a small performance cost, since each derived table is just an offset and/or negation of the base table:

Atan2_RMP[i] = -Pi_H + v;

Atan2_RMP[(int)(x / y * Size_Ac)];
==
-Pi_H + Atan2[(int)(x / y * Size_Ac)];
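
As a hypothetical sketch (not from the original post), the whole function can then run off the single Atan2[] table by applying the offset/negation at lookup time, trading one extra add per call for roughly an eighth of the memory:

	static final public float atan2_singleTable(float y, float x){
		if(y < 0){
			if(x < 0){
				if(y < x) return -Pi_H - Atan2[(int)(x / y * Size_Ac)]; // was Atan2_RMM
				else return -Pi + Atan2[(int)(y / x * Size_Ac)];        // was Atan2_MM
			}
			else{
				y = -y;
				if(y > x) return -Pi_H + Atan2[(int)(x / y * Size_Ac)]; // was Atan2_RMP
				else return -Atan2[(int)(y / x * Size_Ac)];             // was Atan2_MP
			}
		}
		else{
			if(x < 0){
				x = -x;
				if(y > x) return Pi_H + Atan2[(int)(x / y * Size_Ac)];  // was Atan2_RPM
				else return Pi - Atan2[(int)(y / x * Size_Ac)];         // was Atan2_PM
			}
			else{
				if(y > x) return Pi_H - Atan2[(int)(x / y * Size_Ac)];  // was Atan2_R
				else return Atan2[(int)(y / x * Size_Ac)];
			}
		}
	}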

P.S. Increasing the sin/cos lookup table size also works for Riven’s lookup table.

Wow! Thanks! That’s actually slower than I thought… Time to switch to a faster algorithm I guess :wink:

@klaus
If you’re looking for the same accuracy, Icecore’s is very nice, as noted in the main post. I have the accuracy set to 100000, which gives a max error of 0.0001. It also performs the best.

I’m right now in my IDE doing a search and replace to use Icecore’s algo from now on :point:

Thanks a lot for this thread! Can really use the spare CPU cycles :wink:

Table lookups are almost always slow (as in much worse than using the default) in real use cases, especially as table sizes get large.

Then we must implement the… final solution [sarcasm].
We use a simple loop to generate the corresponding switch statement with its thousands of cases and paste it into a static method. You avoid bounds checking and the caching problem if it compiles to a tableswitch. Have fun loading that monster of a file in your IDE, though.
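For what it’s worth, a hypothetical sketch of such a generator (it only prints the method; whether javac emits a tableswitch depends on the cases being dense, which they are here):

	// Prints a static method whose switch body is the whole first-octant table.
	static void printSwitchTable(int size) {
		System.out.println("static float atan2Table(int i) {");
		System.out.println("    switch (i) {");
		for (int i = 0; i <= size; i++) {
			float v = (float) Math.atan2((double) i / size, 1.0);
			System.out.println("        case " + i + ": return " + v + "f;");
		}
		System.out.println("        default: return 0f;");
		System.out.println("    }");
		System.out.println("}");
	}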

I’ll try it in the morning and see what we get, even though it sounds extremely silly. I will use Icecore’s method with the table set to 1000 elements.

Don’t bother. You can’t get meaningful data. People should just know that approximations that use anything other than a few lookups will always show near-best-case, or outright unrealizable, performance in a microbenchmark. They have to test it in place to know if it’s a win or not.

If you’re bored try out this one: https://github.com/roquendm/JGO-Grabbag/blob/master/src/roquen/math/Float32.java

It’ll show close to worst-case performance in such a benchmark.

A switch statement is still a lookup-table behind the scenes. It’s all memory - whether that is ‘data’ or ‘code’ is irrelevant.

@Riven
Yeah, using the switch statement was actually really slow. However, I have a different approach I think I’ll try.
Given that these benchmarks are often unrealistic because the array can usually fit in cache, why not just allocate the array off-heap, get the base address and the size of each element, then use dead reckoning to pull the value? If anything it should be faster: it avoids bounds checks, Unsafe should get compiled down by the JIT, and it doesn’t have to load the array.
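For concreteness, a hypothetical sketch of that idea (sun.misc.Unsafe pulled out via reflection since it isn’t public API; class and method names are mine). As the reply below explains, the CPU treats this access the same as an ordinary array read:

	import java.lang.reflect.Field;
	import sun.misc.Unsafe;

	public class OffheapAtan2Table {
		private static final Unsafe UNSAFE;
		static {
			try {
				Field f = Unsafe.class.getDeclaredField("theUnsafe");
				f.setAccessible(true);
				UNSAFE = (Unsafe) f.get(null);
			} catch (Exception e) {
				throw new ExceptionInInitializerError(e);
			}
		}

		private static final int SIZE = 1000;
		private static final long BASE = UNSAFE.allocateMemory((SIZE + 1) * 4L);
		static {
			// Fill the off-heap block with the same first-octant values as the on-heap table.
			for (int i = 0; i <= SIZE; i++) {
				UNSAFE.putFloat(BASE + i * 4L, (float) Math.atan2((double) i / SIZE, 1.0));
			}
		}

		// "Dead reckoning": compute the address directly, with no bounds check.
		static float lookup(int index) {
			return UNSAFE.getFloat(BASE + index * 4L);
		}
	}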

I think you don’t quite understand yet how memory/caching works. :slight_smile: The JVM doesn’t load ‘the array’ into CPU cache at all; it merely calculates a memory address and tells the CPU to retrieve the value at that address. The CPU ‘notices’ memory access patterns and pulls entire pages (4K) from your system memory, moving them through L3, L2 and L1 cache. Whether that memory access is caused by a field access, an array access or by Unsafe is all the same to the CPU.

The bounds check is very likely to be removed entirely as the masking basically does that implicitly.