Fast Atan2 [Again]

mooman219 · August 30, 2017, 10:58am

Background:
Back at it with some Atan2 benchmarks. I found the source for my old benchmarks and wanted to run this again with some different settings. The purpose of this post is to find a faster way to do Math.atan2(y,x). To compare various methods, I setup a simple JMH benchmark and accuracy test. Criticisms and new atan2 submissions are appreciated.

[u]Outcome:[/u] Fastest implementation with lookup table: [b]Icecore[/b] Fastest implementation without lookup table: [b]Diamond[/b]

Note: Fastest doesn’t mean most accurate. Please take into account the average and largest error for each algorithm.

[u]Relevant Data:[/u]

A lower average means higher accuracy.
Apache       : Average Error 0.00000 / Largest Error 0.00000 : 36,012,001 samples : 1488.165 ±   7.285  ms/op
Default      : Average Error 0.00000 / Largest Error 0.00000 : 36,012,001 samples : 1773.411 ± 193.228  ms/op
Diamond      : Average Error 0.04318 / Largest Error 0.07112 : 36,012,001 samples :   47.299 ±   1.467  ms/op
DSPAccurate  : Average Error 0.00344 / Largest Error 0.01015 : 36,012,001 samples :   72.539 ±   3.702  ms/op
DSPFast      : Average Error 0.04318 / Largest Error 0.07111 : 36,012,001 samples :   56.547 ±   0.662  ms/op
Icecore      : Average Error 0.00038 / Largest Error 0.00098 : 36,012,001 samples :   58.265 ±   0.686  ms/op
Kappa        : Average Error 0.00224 / Largest Error 0.00468 : 36,012,001 samples :   67.622 ±  21.004  ms/op
Riven        : Average Error 0.00291 / Largest Error 0.00787 : 36,012,001 samples :   99.701 ±   2.846  ms/op
Baseline     :                                               : 36,012,001 samples :   36.155 ±   4.251  ms/op

Note: Keep in mind these are micro benchmarks and not indicative of real world performance; the JMH can only do so much to make sure the JIT doesn’t interfere. There may be a fair amount of branch prediction and caching that’s going on, and lookup tables might just be sitting in a cpu cache.

[u]Source:[/u]

public class Atan2 {

    ///////////////////////////////////////
    // Benchmark Settings
    ///////////////////////////////////////

    private static final double LOW_D = -3;
    private static final double HIGH_D = 3;
    private static final double INC_D = 0.001D;
    private static final float LOW_F = (float) LOW_D;
    private static final float HIGH_F = (float) HIGH_D;
    private static final float INC_F = (float) INC_D;

    public static Options buildOptions() {
        return new OptionsBuilder()
                .include(Atan2.class.getSimpleName())
                .warmupTime(new TimeValue(2, TimeUnit.SECONDS))
                .warmupIterations(4)
                .measurementTime(new TimeValue(2, TimeUnit.SECONDS))
                .measurementIterations(4)
                .forks(1)
                .mode(Mode.AverageTime)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .build();
    }

    ///////////////////////////////////////
    // Accuracy
    ///////////////////////////////////////

    public static void printError() {
        Atan2Benchmark[] benchmarks = new Atan2Benchmark[]{
            new Apache(),
            new Default(),
            new Diamond(),
            new DSPAccurate(),
            new DSPFast(),
            new Icecore(),
            new Kappa(),
            new Riven(),};

        System.out.println(String.format("A lower average means higher accuracy."));
        for (Atan2Benchmark benchmark : benchmarks) {
            benchmark.printError(LOW_D, HIGH_D, INC_D);
        }
    }

    ///////////////////////////////////////
    // Baseline
    ///////////////////////////////////////

    @Benchmark
    public double baseline() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += x + y;
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Default atan2
    ///////////////////////////////////////

    public static final class Default extends Atan2Benchmark {

        public Default() {
            super("Default");
        }

        @Override
        public float test(float x, float y) {
            return (float) Math.atan2(x, y);
        }
    }

    @Benchmark
    public double atan2_default() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += Math.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Apache's FastMath.atan2 ( http://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/util/FastMath.html )
    ///////////////////////////////////////

    public static final class Apache extends Atan2Benchmark {

        public Apache() {
            super("Apache");
        }

        @Override
        public float test(float x, float y) {
            return (float) FastMath.atan2(x, y);
        }
    }

    @Benchmark
    public double atan2_apache() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += FastMath.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Riven's atan2 ( http://www.java-gaming.org/index.php?topic=14647.0 )
    ///////////////////////////////////////

    public static final class Riven extends Atan2Benchmark {

        public Riven() {
            super("Riven");
        }

        @Override
        public float test(float x, float y) {
            return Riven.atan2(x, y);
        }

        private static final int ATAN2_BITS = 7;

        private static final int ATAN2_BITS2 = ATAN2_BITS << 1;
        private static final int ATAN2_MASK = ~(-1 << ATAN2_BITS2);
        private static final int ATAN2_COUNT = ATAN2_MASK + 1;
        private static final int ATAN2_DIM = (int) Math.sqrt(ATAN2_COUNT);

        private static final float INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);

        private static final float[] atan2 = new float[ATAN2_COUNT];

        static {
            for (int i = 0; i < ATAN2_DIM; i++) {
                for (int j = 0; j < ATAN2_DIM; j++) {
                    float x0 = (float) i / ATAN2_DIM;
                    float y0 = (float) j / ATAN2_DIM;
                    atan2[j * ATAN2_DIM + i] = (float) Math.atan2(y0, x0);
                }
            }
        }

        public static final float atan2(float y, float x) {
            float add, mul;

            if (x < 0.0f) {
                if (y < 0.0f) {
                    x = -x;
                    y = -y;

                    mul = 1.0f;
                } else {
                    x = -x;
                    mul = -1.0f;
                }
                add = -3.141592653f;
            } else {
                if (y < 0.0f) {
                    y = -y;
                    mul = -1.0f;
                } else {
                    mul = 1.0f;
                }
                add = 0.0f;
            }

            float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);

            int xi = (int) (x * invDiv);
            int yi = (int) (y * invDiv);

            return (atan2[yi * ATAN2_DIM + xi] + add) * mul;
        }
    }

    @Benchmark
    public float atan2_riven() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Riven.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // DSP's atan2 ( http://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization )
    ///////////////////////////////////////

    public static final class DSPFast extends Atan2Benchmark {

        public DSPFast() {
            super("DSPFast");
        }

        @Override
        public float test(float x, float y) {
            return DSPFast.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2
        private static final float PI_4 = 0.7853982F; // Math.PI / 4
        private static final float PI_3_4 = 2.3561945F; // (Math.PI / 4) * 3
        private static final float MINUS_PI_2 = -1.5707963F; // Math.PI / -2

        public static final float atan2(float y, float x) {
            float r;
            float abs_y = Float.intBitsToFloat(Float.floatToRawIntBits(y) << 1 >>> 1);
            if (x == 0.0F) {
                if (y > 0.0F) {
                    return PI_2;
                }
                if (y == 0.0F) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            } else if (x > 0) {
                r = (x - abs_y) / (x + abs_y);
                r = PI_4 - PI_4 * r;
            } else {
                r = (x + abs_y) / (abs_y - x);
                r = PI_3_4 - PI_4 * r;
            }
            return y < 0 ? -r : r;
        }
    }

    public static final class DSPAccurate extends Atan2Benchmark {

        public DSPAccurate() {
            super("DSPAccurate");
        }

        @Override
        public float test(float x, float y) {
            return DSPAccurate.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2
        private static final float PI_4 = 0.7853982F; // Math.PI / 4
        private static final float PI_3_4 = 2.3561945F; // (Math.PI / 4) * 3
        private static final float MINUS_PI_2 = -1.5707963F; // Math.PI / -2

        public static final float atan2(float y, float x) {
            float r;
            float c;
            float abs_y = Float.intBitsToFloat(Float.floatToRawIntBits(y) << 1 >>> 1);
            if (x == 0.0F) {
                if (y > 0.0F) {
                    return PI_2;
                }
                if (y == 0.0F) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            } else if (x > 0) {
                r = (x - abs_y) / (x + abs_y);
                c = PI_4;
            } else {
                r = (x + abs_y) / (abs_y - x);
                c = PI_3_4;
            }
            r = 0.1963F * r * r * r - 0.9817F * r + c;
            return y < 0 ? -r : r;
        }
    }

    @Benchmark
    public float atan2_dsp_fast() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += DSPFast.atan2(x, y);
            }
        }
        return sum;
    }

    @Benchmark
    public float atan2_dsp_accurate() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += DSPAccurate.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // kappa's atan2 ( http://www.java-gaming.org/topics/extremely-fast-atan2/36467/msg/346112/view.html#msg346112 )
    ///////////////////////////////////////

    public static final class Kappa extends Atan2Benchmark {

        public Kappa() {
            super("Kappa");
        }

        @Override
        public float test(float x, float y) {
            return Kappa.atan2(x, y);
        }

        private static final float PI = 3.1415927f;
        private static final float PI_2 = PI / 2f;
        private static final float MINUS_PI_2 = -PI_2;
        private static final float C_M = 0.2808722f;
        private static final float C_A = 0.2808722f;

        public static final float atan2(float y, float x) {
            if (x == 0.0f) {
                if (y > 0.0f) {
                    return PI_2;
                }
                if (y == 0.0f) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            }

            final float atan;
            final float z = y / x;
            if (Math.abs(z) < 1.0f) {
                atan = z / (z * z * C_M + 1.0f);
                if (x < 0.0f) {
                    return (y < 0.0f) ? atan - PI : atan + PI;
                }
                return atan;
            } else {
                atan = PI_2 - z / (z * z + C_A);
                return (y < 0.0f) ? atan - PI : atan;
            }
        }
    }

    @Benchmark
    public float atan2_kappa() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Kappa.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Icecore's atan2 ( http://www.java-gaming.org/topics/extremely-fast-atan2/36467/msg/346145/view.html#msg346145 )
    ///////////////////////////////////////

    public static final class Icecore extends Atan2Benchmark {

        public Icecore() {
            super("Icecore");
        }

        @Override
        public float test(float x, float y) {
            return Icecore.atan2(x, y);
        }

        private static final float PI = (float) Math.PI;
        private static final float PI_2 = PI / 2;
        private static final float PI_NEG_2 = -PI_2;
        private static final int SIZE = 1024;
        private static final float ATAN2[];

        static {
            final int Size_Ar = SIZE + 1;
            ATAN2 = new float[Size_Ar];
            for (int i = 0; i <= SIZE; i++) {
                double d = (double) i / SIZE;
                double x = 1;
                double y = x * d;
                ATAN2[i] = (float) Math.atan2(y, x);
            }
        }

        public static final float atan2(float y, float x) {
            if (y < 0) {
                if (x < 0) {
                    // (y < x) because == (-y > -x)
                    if (y < x) {
                        return PI_NEG_2 - ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return ATAN2[(int) (y / x * SIZE)] - PI;
                    }
                } else {
                    y = -y;
                    if (y > x) {
                        return ATAN2[(int) (x / y * SIZE)] - PI_2;
                    } else {
                        return -ATAN2[(int) (y / x * SIZE)];
                    }
                }
            } else {
                if (x < 0) {
                    x = -x;
                    if (y > x) {
                        return PI_2 + ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return PI - ATAN2[(int) (y / x * SIZE)];
                    }
                } else {
                    if (y > x) {
                        return PI_2 - ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return ATAN2[(int) (y / x * SIZE)];
                    }
                }
            }
        }
    }

    @Benchmark
    public float atan2_icecore() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Icecore.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Diamond's atan2 ( https://stackoverflow.com/questions/1427422/cheap-algorithm-to-find-measure-of-angle-between-vectors/14675998#14675998 )
    ///////////////////////////////////////

    public static final class Diamond extends Atan2Benchmark {

        public Diamond() {
            super("Diamond");
        }

        @Override
        public float test(float x, float y) {
            return Diamond.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2

        public static final float atan2(float y, float x) {
            float angle;
            if (y == 0f && x >= 0f) {
                return 0;
            } else if (y >= 0f) {
                if (x >= 0f) {
                    angle = y / (x + y);
                } else {
                    angle = 1f - x / (-x + y);
                }
            } else {
                if (x < 0f) {
                    angle = -2f + y / (x + y);
                } else {
                    angle = -1f + x / (x - y);
                }
            }
            return angle * PI_2;
        }
    }

    @Benchmark
    public float atan2_diamond() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Diamond.atan2(x, y);
            }
        }
        return sum;
    }
}

[u]Updates:[/u] 1: Added baseline benchmark. Cleaned up DSPFast/DSPAccurate implementation to look more sane with no performance loss. 2: Improved Kappa's accuracy with no performance loss. Updated Icecore's submission. Retested all algorithms. 3: Added the Diamond implementation.

Icecore · August 30, 2017, 6:07pm

I also have this - it’s more Cache friendly, but have more OPs ^^


private static final float Pi = (float) Math.PI;
private static final float Pi_H = Pi / 2;
//4 x 4000 = 16 kb L1 Cache
private static final int Size_Ac = 4000;
private static final float Atan2[];

static{
	final int Size_Ar = Size_Ac + 1;
	Atan2 = new float[Size_Ar];
	for(int i = 0; i <= Size_Ac; i++){
		double d = (double) i / Size_Ac;
		double x = 1;
		double y = x * d;
		Atan2[i] = (float) Math.atan2(y, x);
	}
}

static public final float atan2_1(float y, float x){
	if(y < 0){
		if(x < 0){
			// (y < x) because == (-y > -x)
			if(y < x) return -(Pi_H + Atan2[(int) (x / y * Size_Ac)]);
			else return Atan2[(int) (y / x * Size_Ac)] - Pi;
		}
		else{
			y = -y;
			if(y > x) return Atan2[(int) (x / y * Size_Ac)] - Pi_H;
			else return -Atan2[(int) (y / x * Size_Ac)];
		}
	}
	else{
		if(x < 0){
			x = -x;
			if(y > x) return Pi_H + Atan2[(int) (x / y * Size_Ac)];
			else return Pi - Atan2[(int) (y / x * Size_Ac)];
		}
		else{
			if(y > x) return Pi_H - Atan2[(int) (x / y * Size_Ac)];
			else return Atan2[(int) (y / x * Size_Ac)];
		}
	}
}

mooman219 · August 30, 2017, 7:11pm

@Icecore
It appears to be slower.

Benchmark                  Mode  Cnt   Score   Error  Units
ATAN2.atan2_icecore        avgt    4  53.079 ± 0.157  ms/op
ATAN2.atan2_icecore_cache  avgt    4  58.945 ± 2.317  ms/op
ATAN2.baseline             avgt    4  34.576 ± 0.274  ms/op

Source: https://gist.github.com/mooman219/7a881b2d043354d44036e0df35d1e0b5

Icecore · September 1, 2017, 9:28am

It have more accuracy because use bigger array and use less memory because its 1 array – not 8 )
(1000 x 8 x 4(f) = 32kb 4000 x 4(f) = 16 kb)
More array size more accuracy ^^

also may try this - less on one OP =)


++		private static final float Pi_H_M = -Pi_H;
--		if(y < x) return -(Pi_H + Atan2[(int) (x / y * Size_Ac)]);
++		if(y < x) return Pi_H_M - Atan2[(int) (x / y * Size_Ac)];

mooman219 · September 9, 2017, 12:49am

I added your changes and updated the benchmark @icecore. I also improved the accuracy of Kappa’s atan2. The 0.28 constant seemed suspicious, so I played around with it, getting a lower average error and lower largest error with the constant 0.2808722 instead.

mooman219 · September 9, 2017, 7:54am

Had to fix some issues, but if you’re willing to deal with an average error of 2.4 degrees, and a max error of 4.07 degrees, then boy do I have the algorithm for you. Adapted from “encoding 2d angles without trigonometry”, I bring you the diamond method. It’s the fastest by far.

private static final float PI_2 = 1.5707963F; // Math.PI / 2

public static final float atan2(float y, float x) {
	float angle;
	if (x == 0f && y == 0f) {
		return 0;
	}
	if (y >= 0f) {
		if (x >= 0f) {
			angle = y / (x + y);
		} else {
			angle = 1f - x / (-x + y);
		}
	} else {
		if (x < 0f) {
			angle = -y / (-x - y) - 2f;
		} else {
			angle = x / (x - y) - 1f;
		}
	}
	return angle * PI_2;
}

Roquen · September 9, 2017, 11:38am

Broken record time. Any benchmark that doesn’t reflect an actual usage pattern is wishful thinking. If you’re linearly walking through angles you can compute with tiny errors in a couple of products and adds (I’m tired and was thinking forward here, opps)…no lookups no branches.

mooman219 · September 9, 2017, 8:49pm

Completely fair and expected. I first made the post with the disclaimer below the results that this is a microbenchmark and has caveats. A LUT could improve the performance of your algorithm, but create some cache pollution that slows down other aspects of your program. LUTs aside, this post isn’t a comprehensive search for the best Atan2, it takes a couple and tests some metrics on them. If you’re writing a game, this will let you swap in various algorithms and you can see how they perform. It’ll hopefully give you some reassurance that you’re using a reasonable algorithm when compared to others.

Roquen · September 11, 2017, 9:00am

My expectation is that the branches will be the much bigger problem in most cases. If you’re making enough atan/atan2 calls to be statistically important then it’s pretty likely they will bunch up in time and the cost of loading the LUT to cache won’t be a killer. For the branches it’s very unlikely that the probability of each won’t be in the neighborhood of .5 so ~n/2 branch mispredictions per call where ‘n’ is the number of branches along a path.

EDIT: The way the benchmark is structured means the branch will statistically always be correctly predicted.

Icecore · September 12, 2017, 10:04am

Let make next Contest “Fast Sqrt”
one from the Baselines https://en.wikipedia.org/wiki/Fast_inverse_square_root


float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}

kappa · September 12, 2017, 10:52am

From my tests in the past, Java’s Math.sqrt() is pretty damn fast these days, only worth it if you can beat that method substantially.

Abuse · September 12, 2017, 1:03pm

My understanding is that the above approximation is no-longer relevant 'cos we have x86 instructional support (in various forms).

Icecore · September 12, 2017, 4:44pm

Yes, ok, my mistake - why i even trying…
[spoiler]– no one cares)[/spoiler]

SiestaGuru · December 6, 2017, 11:01am

These are the fastest I could find, they’re slight modifications of algos made by others and found on this website.
Basically I removed a couple of tiny unnecessary calculations, like unneeded assignments and taking a negative of some value when it could just as easily be restructured to not need to do that.

With lookup table, change the SIZE for higher precision if desired.
At the current Size of 2520, the average error is ~0.0000819 max error ~0.000489
Dies at x=0,y=0 without an extra check


   private static final int  SIZE = 2520; 

  //Based on a method by Zom-B
      public static final float atan2(float y, float x){
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                     //Add a check for x=0,y=0 here if needed
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y)];
            }
        }
        else if (y >= 0)
        {
            if (-x >= y)
                return ATAN2_TABLE_NPY[(int)(EZIS * y / x)];
            else
                return ATAN2_TABLE_NPX[(int)(EZIS * x / y)];
        }
        else
        {
            if (x <= y) 
                return ATAN2_TABLE_NNY[(int)(SIZE * y / x)];
            else
                return ATAN2_TABLE_NNX[(int)(SIZE * x / y)];
        }
    }

private static final float PI = (float)Math.PI;


//If the static initiation is too much, you can also pre-calculate arrays and just hardcode the values in.

private static final float        STRETCH            = PI;
private static final float        NEGATIVE_STRETCH            = -PI;
private static final int        EZIS            = -SIZE;
private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];
static
{
    for (int i = 0; i <= SIZE; i++)
    {
        float f = (((float)i) + 0.5f) / SIZE;  //the + 0.5f is to account for the fact that the int cast in the atan2 function rounds down
        ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / PI);
        ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] + NEGATIVE_STRETCH * 0.5f;
        ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
        ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
        ATAN2_TABLE_NNX[i] = NEGATIVE_STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
    }
}

And without lookup table. This is faster, but rather inaccurate with an average error of ~0.021, max error for general usage ~0.071. However, if x and y are both 0 it’ll die.


private static final float PI = (float)Math.PI;
private static final float PIOVER4 = PI / 4f;
private static final float PITHREEQUARTERS = 3f * PIOVER4;

   //Based on https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
    public static final float atan2(float y, float x) {
        if (y >=0) {
            if (x >= 0) {
                //Add a check for x=0,y=0 here if needed
                return PIOVER4 - PIOVER4 * ((x - y) / (x + y));
            } else {
                return PITHREEQUARTERS - PIOVER4 * ((x + y) / (y - x));
            }
        } else if (x >= 0) {
            return PIOVER4 * ((x + y) / (x - y)) - PIOVER4;
        } else {
            return PIOVER4 * ((y - x) / (y + x)) - PITHREEQUARTERS;
        }
    }

If you want to go really extreme, there may also be a tiny possible optimization which will be hardware dependent. On some architectures subtraction can take a tiny amount more than addition afaik. So instead of subtracting PIOVER4, it may be worth trying + NEGATIVE_PIOVER4. Though there may be a downside in that caching gets harder with 2 variables.

Icecore · December 6, 2017, 9:40pm

I f**** can’t believe….

How this even possible Its very weird
http://www.java-gaming.org/topics/13-8x-faster-atan2-updated/14647/view.html

because I develop same code after 5 years
(after many iterations)
– I never see it or something similar
even


my
//(y < x) because == (-y > -x)
if (y < x)

Zom-B
if (x <= y) // (-x >= -y)

/////
if its was best solution why it was not included to test from start?
or maybe its fake (forum modify???) - not looks so

I even don’t know…

Or maybe its time Travelers XD
Zom-B – You cool, cheers, you’re the best =)
Maybe its even me? ^^ (if its really me - Zom-B : ha-ha I understand the humor behind it XDDDD)

p.s
its my last post
I leave this name and all previous (same as forum)
it was decision long before this accident (but this – show that I do all right)
Maybe we see again – or maybe not, who knows

SiestaGuru · December 6, 2017, 11:31pm

Haha, it does look quite similar. I was wondering if one of you had taken some bits off the other. I suppose it’s just that doing a structure like that is sort of a natural conclusion to reach because it limits the amount of operations you need to do. Great minds think alike?
Mine ends up being a bit faster for my test case than yours, but I imagine it’s possible that it may depend on the JIT compiler and architecture because while it saves on operations, it needs more arrays. I haven’t tested whether Zom-B’s version is faster, it might not be since it does an unnecessary +0.5 which could just be baked into the array generator. I did notice that your version didn’t account for the (int) cast rounding down, there’s quite a bit of an average and max accuracy difference so it’s probably worth doing.

Oh yeah for Math.Sqrt, it’s almost unbeatable, but I found two approximation methods that are faster.
One is just a simple lookup table for the expected range of input values with a small check for anything above if you think the value can exceed the range.
Make sure to account for the int cast rounding down in the table generation again


    //Uses a lookup table with 3600 elements
    public static float sqrt(float x) {
        if(x >= 100f){
            return (float)Math.sqrt(x);
        }else{
            return sqrtTable[(int)(x*36f)];
        }
    }

The others are these abominations, which don’t require tables and are slightly faster but have worse accuracy.
See http://h14s.p5r.org/2012/09/0x5f3759df.html for an explanation
For at least the inverted one you can add a step of Newton’s method for more accuracy, but doing that makes the speed advantage disappear.


public static float sqrt(float x) {
        return Float.intBitsToFloat(0x1fbd1df5 + (Float.floatToIntBits(x) >> 1));
}
public static float invertedSqrt(float x) {
        return  Float.intBitsToFloat(0x5f3759df - (Float.floatToIntBits(x) >> 1));
}

In both cases my benchmarks give me an about a tenfold speed increase compared to Math.sqrt(). However, Math.sqrt is comparatively so light, that the difference will usually seem way less because things like the benchmark-loop, method calls etc. are very significant. In my original benchmark I got only a 3-fold increase because I hadn’t filtered out those factors.
So the tenfold increase isn’t nearly as significant as it sounds, and in the vast majority of cases, Math.sqrt is more than enough.

Oh and btw, don’t use Math.hypot() to calculate a hypotenuse. Math.sqrt(dxdx + dydy) is waaay faster for whatever reason.

Edit: Corrections on the sqrt methods

princec · December 7, 2017, 9:03am

Math.hypot() is much more accurate (unnecessarily so for our purposes).

Cas

SiestaGuru · December 7, 2017, 12:32pm

Found another one worth considering. This one is based on the Icecore’s method and uses only one table. The difference here is that by using signum, it avoids all but one if/else statement and so can do better at branch prediction with random data.
On some tests, this gets the highest speed of any of the lookup table methods, but does worse on others. Its best environment seems to be a lot of executions on random data. If there’s a low amount of executions or the data is structured, other methods are better.
This happens to work fine for x=0,y=0


    public static final float atan2BranchpredictionHelper(float y, float x) {
        //Basically Math.signum(), except they're slightly faster and can't return 0
        float signX = Float.intBitsToFloat((Float.floatToIntBits(x) & DESTROY_ALL_BUT_SIGN) | SETTO1);
        float signY = Float.intBitsToFloat((Float.floatToIntBits(y) & DESTROY_ALL_BUT_SIGN) | SETTO1);

        if(Math.abs(y) >= Math.abs(x)){
            return signY*HALFPI - signX*signY*ATAN[(int) (Math.abs(x) / Math.abs(y) * SIZE)];
        }else{
            return (signX-1)*HALFPI_NEG*signY + signX*signY*ATAN[(int) (Math.abs(y) / Math.abs(x) * SIZE)];
        }
    }

    static {
        final int DESTROY_ALL_BUT_SIGN = 0b10000000000000000000000000000000;
        final int SETTO1 = Float.floatToIntBits(1f);
        final int  SIZE = 1024;
        final float HALFPI = (float)Math.PI / 2f;
        final float HALFPI_NEG = -HALFPI;
        final int Size_Ar = SIZE + 1;
        ATAN = new float[Size_Ar];
        for (int i = 0; i <= SIZE; i++) {
            ATAN[i] = (float) Math.atan2((((double)i)+0.5) / SIZE, 1);
        }
        ATAN[0] = 0;

    }

For reference, my nanosecs benchmarks for 1 million and 100 million executions of different methods (includes the computations necessary for running through the loop)
The winner changes depending on the execution count, so I have no idea which one would win in a real-world scenario.

This new method seems terrible on low execution counts and is even worse than the default, but somehow becomes the winner at very high execution counts.
Icecore’s takes over my 8-array method at some point (once caching kicks in properly?)
The arrayless version is always fast


10,000:
Math.atan2  1113710  #4                      111.4 ns / calc
Icecore  750496    #3                            75.0  ns / calc    
My 8-array lookup (based on Zom-Bs)  551749   #2       55.2     ns / calc    
This new method    2819284  #5                    281.9  ns / calc    
The array-less approximation   414997    #1        41.5    ns / calc 
Dummy method (return x+y)   247613

1,000,000: 
Math.atan2   87623524  #5                   87.6   ns / calc
Icecore   19311236  #3                      19.3   ns / calc
My 8-array lookup (based on Zom-Bs)  21043795  #4      21.0   ns / calc                
This new method   14324703  #2                 14.3   ns / calc    
The array-less approximation  13263142   #1         13.3   ns / calc  
Dummy method (return x+y) 6273097  

100,000,000:
Math.atan2  8471197796  #5                  84.7   ns / calc
Icecore  1446387298  #3                         14.5   ns / calc
My 8-array lookup (based on Zom-Bs)  1707782769  #4      17.1   ns / calc
This new method  932633308   #1                      9.3  ns / calc
The array-less approximation   1004009940  #2       10.0 ns/calc
Dummy method (return x+y)  138700292               

Errors on 1 million executions, using the same size array (1024) for all lookup tables. The only real difference among the lookup tables is correcting the int cast
Average:
This new method   0.000191705          #1/#2
Icecore   0.00038336442        #3
Arrayless  0.043115597          #4
Eight arrays  0.000191705             #1/#2

Max errors:
This new method  0.0004.8828125        #1/#2
Icecore  0.0009.765129          #3
Arrayless   0.07111478             #4
Eight arrays  0.0004.8828125          #1/#2

Roquen · December 11, 2017, 1:22pm

Yea: hypot avoids overflow & underflow and is significantly more accurate than sqrt of sum of squares…and like princec said it’s very unlikely you care.

Those ‘fast’ square roots are good to about 11 or 12 bits and so are only useful for horseshoes and handgrenades kinda computations. Take Haswell/Broadwell: sqrtss is 1 uOP to issue, completes in 11 cycles and is ulp correct. (Yeah, it’s too bad java doesn’t expose ~sqrt and ~1/sqrt, well and a long list of other things).

Have I ever mentioned that microbenchmarks are only as meaningful as the testing scheme and how well they reflect an an usage?