13.8x faster atan2 (updated)

Hi folks,

I’ve created an algorithm that calculates atan2 via a lookup table.

All values for Y and X are allowed. You can control the RAM-usage/accuracy trade-off by setting the number of bits used in each dimension.

I used 7 bits per dimension, which means 2^7 = 128 values for each dimension, so 128 * 128 = 16384 (16K) entries in the lookup table.
This results in an accuracy of 0.0078125 (1.0 / 128) in each dimension after the coords are normalized.

Updated


   private static final int ATAN2_BITS = 7;

   private static final int ATAN2_BITS2 = ATAN2_BITS << 1;
   private static final int ATAN2_MASK = ~(-1 << ATAN2_BITS2);
   private static final int ATAN2_COUNT = ATAN2_MASK + 1;
   private static final int ATAN2_DIM = (int) Math.sqrt(ATAN2_COUNT);

   private static final float INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);
   private static final float DEG = 180.0f / (float) Math.PI;

   private static final float[] atan2 = new float[ATAN2_COUNT];



   static
   {
      for (int i = 0; i < ATAN2_DIM; i++)
      {
         for (int j = 0; j < ATAN2_DIM; j++)
         {
            float x0 = (float) i / ATAN2_DIM;
            float y0 = (float) j / ATAN2_DIM;

            atan2[j * ATAN2_DIM + i] = (float) Math.atan2(y0, x0);
         }
      }
   }


   /**
    * ATAN2
    */

   public static final float atan2Deg(float y, float x)
   {
      return FastMath.atan2(y, x) * DEG;
   }

   public static final float atan2DegStrict(float y, float x)
   {
      return (float) Math.atan2(y, x) * DEG;
   }

   public static final float atan2(float y, float x)
   {
      float add, mul;

      if (x < 0.0f)
      {
         if (y < 0.0f)
         {
            x = -x;
            y = -y;

            mul = 1.0f;
         }
         else
         {
            x = -x;
            mul = -1.0f;
         }

         add = -3.141592653f;
      }
      else
      {
         if (y < 0.0f)
         {
            y = -y;
            mul = -1.0f;
         }
         else
         {
            mul = 1.0f;
         }

         add = 0.0f;
      }

      float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);

      int xi = (int) (x * invDiv);
      int yi = (int) (y * invDiv);

      return (atan2[yi * ATAN2_DIM + xi] + add) * mul;
   }

The lengthy if/else-chain might seem inefficient, but it is really the fastest way to normalize X and Y.
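As a rough accuracy check, here is a self-contained sketch (the class name `Atan2ErrorCheck` and the grid of test points are mine, not part of the original post) that rebuilds the 7-bit table exactly as above and measures the worst-case deviation from `Math.atan2`:

```java
// Self-contained accuracy check: rebuilds the 7-bit table from the post
// and measures the worst-case error against Math.atan2 over a test grid.
public class Atan2ErrorCheck
{
    private static final int ATAN2_BITS = 7;
    private static final int ATAN2_BITS2 = ATAN2_BITS << 1;
    private static final int ATAN2_MASK = ~(-1 << ATAN2_BITS2);
    private static final int ATAN2_COUNT = ATAN2_MASK + 1;
    private static final int ATAN2_DIM = (int) Math.sqrt(ATAN2_COUNT);
    private static final float INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);
    private static final float[] atan2 = new float[ATAN2_COUNT];

    static
    {
        for (int i = 0; i < ATAN2_DIM; i++)
            for (int j = 0; j < ATAN2_DIM; j++)
                atan2[j * ATAN2_DIM + i] =
                    (float) Math.atan2((float) j / ATAN2_DIM, (float) i / ATAN2_DIM);
    }

    // Same quadrant folding + table lookup as in the post.
    public static float atan2(float y, float x)
    {
        float add, mul;
        if (x < 0.0f)
        {
            if (y < 0.0f) { x = -x; y = -y; mul = 1.0f; }
            else          { x = -x;         mul = -1.0f; }
            add = -3.141592653f;
        }
        else
        {
            if (y < 0.0f) { y = -y; mul = -1.0f; }
            else          {         mul = 1.0f; }
            add = 0.0f;
        }
        float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);
        return (atan2[(int) (y * invDiv) * ATAN2_DIM + (int) (x * invDiv)] + add) * mul;
    }

    // Worst-case absolute error (radians) over a grid of test points.
    public static double maxError()
    {
        double maxErr = 0.0;
        for (int i = -200; i <= 200; i++)
            for (int j = -200; j <= 200; j++)
            {
                if (i == 0 && j == 0) continue; // atan2(0, 0) is degenerate
                float y = i * 0.5f, x = j * 0.5f;
                maxErr = Math.max(maxErr, Math.abs(atan2(y, x) - Math.atan2(y, x)));
            }
        return maxErr;
    }

    public static void main(String[] args)
    {
        System.out.println("max error (rad): " + maxError());
    }
}
```

On a grid like this, the worst-case error should land near the 1/127 cell size of the normalized table, i.e. somewhat below 0.01 rad.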

Benchmark:


      float min = -100;
      float max = +100;
      float step = 0.12f;

      for (int i = 0; i < 8; i++)
      {
         long t0A = System.nanoTime() / 1000000L;
         float sumA = 0.0f;
         for (float y = min; y < max; y += step)
            for (float x = min; x < max; x += step)
               sumA += FastMath.atan2(y, x);
         long t1A = System.nanoTime() / 1000000L;

         long t0B = System.nanoTime() / 1000000L;
         float sumB = 0.0f;
         for (float y = min; y < max; y += step)
            for (float x = min; x < max; x += step)
               sumB += Math.atan2(y, x);
         long t1B = System.nanoTime() / 1000000L;

         System.out.println();
         System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
         System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
         System.out.println("factor: " + ((float) (t1B - t0B) / (t1A - t0A)));
      }

Results:


...

FastMath: 84ms, sum=-2705.0369
JavaMath: 1161ms, sum=-2690.9863
factor: 13.821428

FastMath: 84ms, sum=-2705.0369
JavaMath: 1164ms, sum=-2690.9863
factor: 13.857142

Have fun! (or beat me!)

great work Riven… how about you do the same for the other trigonometry ops? :slight_smile:
How about a quick accuracy comparison: for your 7-bit example, at which digit does it start to differ from Java’s atan2? I don’t really understand how it works yet (I only glanced at it), so I can’t follow your accuracy explanation.

I think these are pretty common… anyway:


   private static final float RAD = (float) Math.PI / 180.0F;

   private static final int SIN_BITS = 12;
   private static final int SIN_MASK = ~(-1 << SIN_BITS);
   private static final int SIN_COUNT = SIN_MASK + 1;

   private static final float radFull = (float) (Math.PI * 2.0);
   private static final float degFull = (float) (360.0);
   private static final float radToIndex = SIN_COUNT / radFull;
   private static final float degToIndex = SIN_COUNT / degFull;

   private static final float[] sin = new float[SIN_COUNT];
   private static final float[] cos = new float[SIN_COUNT];

   static
   {
      for (int i = 0; i < SIN_COUNT; i++)
      {
         sin[i] = (float) Math.sin((float) (i + 0.5f) / SIN_COUNT * radFull);
         cos[i] = (float) Math.cos((float) (i + 0.5f) / SIN_COUNT * radFull);
      }
   }



   /**
    * SIN / COS (RAD)
    */

   public static final float sin(float rad)
   {
      return sin[(int) (rad * radToIndex) & SIN_MASK];
   }

   public static final float cos(float rad)
   {
      return cos[(int) (rad * radToIndex) & SIN_MASK];
   }



   /**
    * SIN / COS (DEG)
    */

   public static final float sinDeg(float deg)
   {
      return sin[(int) (deg * degToIndex) & SIN_MASK];
   }

   public static final float cosDeg(float deg)
   {
      return cos[(int) (deg * degToIndex) & SIN_MASK];
   }



   /**
    * SIN / COS (DEG - STRICT)
    */

   public static final float sinDegStrict(float deg)
   {
      return (float) Math.sin(deg * RAD);
   }

   public static final float cosDegStrict(float deg)
   {
      return (float) Math.cos(deg * RAD);
   }

My fairly solid benchmarks say it is about 60x faster than the Math operations, but I suspect that is some weird (read: unrealistic) optimisation in the JIT, although I tried to rule that out.

It’s fair to assume a factor 10 speedup though.

About the update:

  1. doubled the accuracy while using the same bit-count
  2. made some small optimisations, resulting in 40-50% faster code
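One way to double a table’s accuracy at the same bit-count is the half-cell centering visible in the sin/cos initializer above (the `(i + 0.5f)`): since the lookup truncates the index, sampling at cell centers makes it effectively round to the nearest half-cell instead of always rounding down. A self-contained sketch (table size 256 and the class name are my choices, not from the post) comparing an edge-sampled table against a center-sampled one:

```java
// Compares an edge-sampled sin table against a center-sampled one
// (the "(i + 0.5f)" trick): centering roughly halves the worst-case error.
public class SinTableCentering
{
    static final int COUNT = 256; // arbitrary table size for the demo
    static final float RAD_FULL = (float) (Math.PI * 2.0);
    static final float RAD_TO_INDEX = COUNT / RAD_FULL;
    static final float[] edge = new float[COUNT];
    static final float[] center = new float[COUNT];

    static
    {
        for (int i = 0; i < COUNT; i++)
        {
            edge[i] = (float) Math.sin(i / (float) COUNT * RAD_FULL);    // sample at cell edge
            center[i] = (float) Math.sin((i + 0.5f) / COUNT * RAD_FULL); // sample at cell center
        }
    }

    // Worst-case error of a table over a dense sweep of one full turn.
    public static double maxError(float[] table)
    {
        double maxErr = 0.0;
        for (int i = 0; i < 100000; i++)
        {
            float rad = i * (RAD_FULL / 100000.0f);
            float approx = table[(int) (rad * RAD_TO_INDEX) & (COUNT - 1)];
            maxErr = Math.max(maxErr, Math.abs(approx - Math.sin(rad)));
        }
        return maxErr;
    }

    public static void main(String[] args)
    {
        System.out.println("edge-sampled   max error: " + maxError(edge));
        System.out.println("center-sampled max error: " + maxError(center));
    }
}
```

With 256 entries the cell width is 2π/256 ≈ 0.0245 rad, so the edge-sampled table can be off by about a full cell, while the centered one stays near half of that.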

Sorry for bumping this :clue:

Is there a way to get the newest versions of your fast math operations? I know there are quite a few, or I might have dreamt that. Let’s hope!

When is the accuracy a problem with this? How many pixels off can we get, on a 1000 x 1000 plane? I suspect it to not be a problem at all?

This is faster and a bit more accurate (at least on the -100.0f to 100.0f range you’re testing).

int i;
if ( x < y )
	i = ATAN2_DIM_MINUS_1 * ATAN2_DIM + (int)(x * (ATAN2_DIM_MINUS_1 / y));
else
	i = (int)(y * (ATAN2_DIM_MINUS_1 / x)) * ATAN2_DIM + ATAN2_DIM_MINUS_1;

return (atan2[i] + add) * mul;

I improved it. Now it uses O(N) memory instead of O(N²), and because it eliminates unnecessary assignments, multiplications, additions and two-dimensional addressing, it’s a lot faster. I increased the table size for more precision.

The theory:

I added the option to return results not only in -Pi to Pi, but in any range you want.
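For instance, with STRETCH = 1 the result is the angle already divided by π, i.e. in the range -1..1. The scaling amounts to this (an illustrative helper expressed via `Math.atan2` for clarity, not code from the post):

```java
// Illustrative helper: what a STRETCH-scaled atan2 returns,
// expressed via Math.atan2 rather than the lookup tables.
public class StretchedAtan2
{
    public static double atan2Stretched(double y, double x, double stretch)
    {
        return Math.atan2(y, x) * stretch / Math.PI; // output swings from -stretch to +stretch
    }

    public static void main(String[] args)
    {
        System.out.println(atan2Stretched(1, 0, 1.0));  // straight up:   ~0.5
        System.out.println(atan2Stretched(0, -1, 1.0)); // straight left: ~1.0
    }
}
```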

public class FastMath
{
    public static void main(String[] args)
    {
        float min = -100;
        float max = +100;
        float step = 0.12f;

        for (int i = 0; i < 8; i++)
        {
            long t0A = System.nanoTime() / 1000000L;
            float sumA = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumA += atan2(y, x);
            long t1A = System.nanoTime() / 1000000L;

            long t0B = System.nanoTime() / 1000000L;
            float sumB = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumB += Math.atan2(y, x);
            long t1B = System.nanoTime() / 1000000L;

            System.out.println();
            System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
            System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
            System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A));
        }
    }

    private static final int           SIZE                 = 1024;
    private static final float        STRETCH            = Math.PI;
    // Output will swing from -STRETCH to STRETCH (default: Math.PI)
    // Useful to change to 1 if you would normally do "atan2(y, x) / Math.PI"

    // "SIZE" spelled backwards: the negated table size, used for the quadrants with a negative ratio
    private static final int        EZIS            = -SIZE;
    private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];

    static
    {
        for (int i = 0; i <= SIZE; i++)
        {
            float f = (float)i / SIZE;
            ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI);
            ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f;
            ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
            ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
            ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        }
    }

    /**
     * ATAN2
     */

    public static final float atan2(float y, float x)
    {
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)];
            }
        }
        else
        {
            if (y >= 0)
            {
                if (-x >= y)
                    return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)];
            }
            else
            {
                if (x <= y) // (-x >= -y)
                    return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)];
            }
        }
    }
}
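Each of the eight tables encodes one octant identity, e.g. atan2(y, x) = atan(y/x) for x >= y >= 0, and atan2(y, x) = π/2 - atan(x/y) for y > x >= 0, with sign/offset variants for the other quadrants. A self-contained check of those identities (independent of the tables, assuming STRETCH = π; the class name is mine):

```java
// Verifies the octant identities that the eight ATAN2_TABLE_* arrays encode
// (with STRETCH = PI, so the output range is the usual -PI..PI).
public class OctantIdentityCheck
{
    // Fold atan2 into a first-octant atan, mirroring the eight-table scheme.
    public static double foldedAtan2(double y, double x)
    {
        if (x >= 0)
        {
            if (y >= 0)
                return (x >= y) ? Math.atan(y / x)                 // PPY
                                : Math.PI / 2 - Math.atan(x / y);  // PPX
            return (x >= -y) ? -Math.atan(-y / x)                  // PNY
                             : Math.atan(-x / y) - Math.PI / 2;    // PNX
        }
        if (y >= 0)
            return (-x >= y) ? Math.PI - Math.atan(y / -x)         // NPY
                             : Math.PI / 2 + Math.atan(-x / y);    // NPX
        return (x <= y) ? Math.atan(y / x) - Math.PI               // NNY
                        : -Math.PI / 2 - Math.atan(x / y);         // NNX
    }

    // Worst-case deviation of the folded form from Math.atan2 over a grid.
    public static double maxError()
    {
        double maxErr = 0.0;
        for (int i = -50; i <= 50; i++)
            for (int j = -50; j <= 50; j++)
            {
                if (i == 0 && j == 0) continue;
                double y = i * 0.1, x = j * 0.1;
                maxErr = Math.max(maxErr, Math.abs(foldedAtan2(y, x) - Math.atan2(y, x)));
            }
        return maxErr;
    }

    public static void main(String[] args)
    {
        System.out.println("max identity error: " + maxError());
    }
}
```

The identities themselves are exact, so any remaining error in the real lookup version comes purely from table quantization.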

Nice! I’ll do some benchmarks when I get home. My atan2 was basically a huge hack that would explode when x or y approached zero.

WARNING: micro-benchmarking heavily table-based code is very misleading, as the tables will remain cached for the duration of the benchmark, whereas that is unlikely under real usage patterns. Likewise, walking linearly through the test range will cause the branch predictors to guess “right” virtually 100% of the time, which will not be the case in real usage.

I just benchmarked my real application and it went from 20fps to 70fps. It’s not only atan2, the 4th operation in the following sequence is atan2. Other operations include loading an image (memory intensive).

processing times in milliseconds.

before:
ms=1.298769 1.79073 1.291225 31.39561 0.686121 1.968686 0.729702 1.721866

after:
ms=1.283403 1.842692 1.211607 4.929677 0.801498 2.235479 0.755683 1.747324

From 31ms to 5ms is a 6x improvement.

My raw code+benchmark give this:

FastMath: 42ms, sum=-2688.061
JavaMath: 562ms, sum=-2690.9863
factor: 13.380953

FastMath: 40ms, sum=-2688.061
JavaMath: 519ms, sum=-2690.9863
factor: 12.975

In contrast, on my machine, Riven’s original algorithm+benchmark was 8.5x faster out-of-the-box, as opposed to the claimed 13.8x.

Hum. Why are you calling atan2 so much? That seems excessive.

Zom-B, your version is definitely more accurate and requires half the look-up table size, but it’s also slower than Riven’s version (~38% slower). Also, the JavaMath timing in your results looks suspiciously high; on which JVM and what kind of CPU did you run the test on? Did you use the -server flag?

I use JDK 1.7 with Eclipse Indigo on Core2(duo) 2.5GHz and 32-bit windows XP. I don’t use the -server flag.

I don’t see how my version could possibly be slower, unless you copied the early double variant which I soon replaced with the float variant.

My algorithm uses 3 comparisons, 1 multiplication, 1 division, 1 addition, 1 table look-up and 0~1 negations.

Riven’s version uses 3 comparisons, 4 multiplications, 1 division, 1 addition, 1 table look-up, 0~2 negations, 5~7 assignments and 2 casts.

As usual, and already mentioned in this thread, caching is the bottleneck in lookup tables. By making your access patterns very specific, you can make either of the two ‘win’ in performance.

Just FYI:


$ javac FastMath.java && java -cp . FastMath
FastMath.java:33: error: possible loss of precision
    private static final float        STRETCH            = Math.PI;
                                                               ^
  required: float
  found:    double
1 error

Here it is with random data and 10 million iterations:


Running FastMath 

FastMath: 1173ms, sum=157309.11
JavaMath: 2090ms, sum=158855.06
factor: 1.7817562

Modified code below:


import java.util.Random;

public class FastMath
{
    static Random rng = new Random();

    public static void main(String[] args)
    {
        int iterations = 10000000; // 10 million
        float min = -100;
        float max = +100;
        float step = 0.12f;


        long t0A = System.nanoTime() / 1000000L;
        float sumA = 0.0f;
        for (int i = 0; i < iterations; i++) 
            sumA += atan2(randFloat(min,max), randFloat(min,max));
        long t1A = System.nanoTime() / 1000000L;
        
        long t0B = System.nanoTime() / 1000000L;
        float sumB = 0.0f;
        for (int i = 0; i < iterations; i++) 
            sumB += Math.atan2(randFloat(min,max), randFloat(min,max));
        long t1B = System.nanoTime() / 1000000L;
        
        System.out.println();
        System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
        System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
        System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A));
    }

    private static float randFloat(float min, float max) {
        return min + (float)(rng.nextDouble() * ((double)max - (double)min));
    }


    private static final int           SIZE                 = 1024;
    private static final float         STRETCH            = (float)Math.PI;
    // Output will swing from -STRETCH to STRETCH (default: Math.PI)
    // Useful to change to 1 if you would normally do "atan2(y, x) / Math.PI"

    // "SIZE" spelled backwards: the negated table size, used for the quadrants with a negative ratio
    private static final int        EZIS            = -SIZE;
    private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];

    static
    {
        for (int i = 0; i <= SIZE; i++)
        {
            float f = (float)i / SIZE;
            ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI);
            ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f;
            ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
            ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
            ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        }
    }

    /**
     * ATAN2
     */

    public static final float atan2(float y, float x)
    {
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)];
            }
        }
        else
        {
            if (y >= 0)
            {
                if (-x >= y)
                    return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)];
            }
            else
            {
                if (x <= y) // (-x >= -y)
                    return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)];
            }
        }
    }
}

I forgot to re-seed the RNG between the two tests. So I decided to fix that, but rather than do the easy thing, I went with a big array of samples instead; now it’s 10 million iterations cycling over an array of 50,000 random floats. The results were interesting: FastMath’s relative performance more than doubled compared to the previous benchmark.

FastMath: 366ms, sum=294547.25
JavaMath: 1333ms, sum=294536.22
factor: 3.6420765
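A sketch of that precomputed-sample harness (class and interface names are mine; FastMath.atan2 can be dropped in where `Math.atan2` is used below):

```java
import java.util.Random;

// Benchmark harness that cycles over a precomputed array of random samples,
// so the RNG cost is paid once up front and both candidates see identical data.
public class SampledBench
{
    interface Atan2 { float apply(float y, float x); }

    // Sums the results so the JIT cannot eliminate the calls.
    public static float run(Atan2 f, float[] samples, int iterations)
    {
        float sum = 0.0f;
        int n = samples.length;
        for (int i = 0; i < iterations; i++)
            sum += f.apply(samples[i % n], samples[(i + 1) % n]);
        return sum;
    }

    public static float[] makeSamples(int count, float min, float max, long seed)
    {
        Random rng = new Random(seed);
        float[] s = new float[count];
        for (int i = 0; i < count; i++)
            s[i] = min + rng.nextFloat() * (max - min);
        return s;
    }

    public static void main(String[] args)
    {
        float[] samples = makeSamples(50000, -100f, 100f, 42L);
        long t0 = System.nanoTime();
        float sum = run((y, x) -> (float) Math.atan2(y, x), samples, 10_000_000);
        long t1 = System.nanoTime();
        System.out.println("JavaMath: " + (t1 - t0) / 1_000_000L + "ms, sum=" + sum);
    }
}
```

Fixing the seed also makes the two runs directly comparable, since both implementations see exactly the same inputs in the same order.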

Not (too) surprising (to me) actually.

java.util.Random is relatively slow, compared to calculating the index and fetching the value of an entry of a lookup table.

But as always, it proves that micro benchmarks are hard.

Having said that, how do my code and his compare to each other on your system (and in your test case)?

Sure Random is expensive, but both should have been billed equally. Cache effect I guess.

I’m just using ZomB’s version; I didn’t actually try the first chunk of code, since I thought they were basically the same. I’ll try yours out when I get some time.