Arithmetic Performance

                Ints    Longs   Floats  Doubles
Addition        1       2.12    4.9     6.22
Subtraction     0.98    2.12    5.04    6.31
Multiplication  1.08    2.8     4.14    4.98
Division        2.49    16.08   3.55    4.73
Comparison      7.18    7.33    9.73    11.92
Cast to int     -       0.69    2.22    2.1
random          41.86   78.59   52.88   91.04
Math.sin/cos    -       -       -       807.1
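For context, numbers like these come from timing one operation in a tight loop and comparing across types; a minimal sketch of the idea (my own illustration, not the linked article's code):

int count = 1000000;
float a = 1.5f, b = 0.5f, acc = 0f;
long start = System.nanoTime();
for (int i = 0; i < count; i++)
	acc += a * b; // swap in the operation and types under test
long elapsed = System.nanoTime() - start;
// print acc too, so the loop can't be optimised away entirely
System.out.println("float mul+add: " + elapsed / 1000000f + " ms (" + acc + ")");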

http://www.rbgrn.net/content/286-how-to-test-android-performance-using-fps

Kev

Thanks for the info Kev.

Should also point out that these were taken running on the G1, which has no FPU.

Time to replace my sin/cos calls with table lookups!

[quote]no FPU[/quote]

The FP factors do look a lot better than expected though. Well, except for that sin/cos stuff, that is. :wink:

Benchmarking Math.sin/cos isn’t very useful. They should benchmark FloatMath.sin/cos, since that is what you would use on Android.

On a related note, I have experimented with fixed point and found it doesn’t really make a difference. Just go ahead and use floats. If you are writing code that is going to be slow with float, it is going to be slow in fixed point too, and your only recourse is to move it to native code (which is very easy, kudos to the Android team!).
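For what it's worth, the native route is just a couple of lines of glue; a minimal sketch (class and library names are hypothetical):

public class NativeMath {
	static {
		System.loadLibrary("nativemath"); // libnativemath.so, built with the NDK
	}

	// implemented in C roughly as:
	// JNIEXPORT jfloat JNICALL
	// Java_NativeMath_sumSines(JNIEnv* env, jclass clazz, jfloatArray angles) { ... }
	public static native float sumSines(float[] angles);
}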

Lookup tables still seem the way to go over FloatMath - just got a stable framerate after dropping them in.

Kev

In hardware and software, a big part of the cost will be argument reduction. So if you're on a limited range, you might want to try an approximation.

I’m not clear on this, did you find lookup tables faster than FloatMath for sin/cos?

Yes, considerably.

Kev

About cos and sin: it should be doable for float in about 40 to 100 (on the scale of the table above) using the Taylor series ( http://www.java-gaming.org/index.php/topic,16296.0.html ), depending on the precision (length of the series) you require.

high precision:
1.0+x2*(f2+x2*(f4+x2*(f6+x2*(f8+x2*(f10+x2*(f12+x2*(f14+x2*(f16+x2*(f18+x2*f20)))))))));

10 multiplications and 10 additions => 49 + 41 = 90 (for floats)

low precision:
1.0+x2*(f2+x2*(f4+x2*(f6+x2*(f8))));

4 multiplications and 4 additions => 19.6 + 16.5 = 36.1 (for floats)

If you know that your angle values are in the correct range, you can inline it and get the best efficiency.
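A minimal sketch of that in Java, assuming the angle has already been reduced to a small range around 0 (the f* values are the standard cosine Taylor coefficients, (-1)^k / (2k)!):

static final float f2 = -1f / 2f;
static final float f4 = 1f / 24f;
static final float f6 = -1f / 720f;
static final float f8 = 1f / 40320f;

// low-precision version; accuracy degrades away from 0
static float cosTaylor(float x) {
	float x2 = x * x;
	return 1f + x2 * (f2 + x2 * (f4 + x2 * (f6 + x2 * f8)));
}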

Thanks DzzD. A lookup table is accurate enough for my purposes in this case though.

I see a sin lookup table being about 1.7x faster than FloatMath.sin. Does that match your experience?

How about for sqrt? I find even a lookup table to be about the same speed as FloatMath.sqrt.

FWIW, I found this to be about 2x faster than Math.atan2 (on my G1):

	static public float atan2 (float y, float x) {
		float abs_y = y < 0 ? -y : y;
		float angle;
		// 0.7853981633974483f = PI/4, 2.356194490192345f = 3*PI/4
		if (x >= 0)
			angle = 0.7853981633974483f - 0.7853981633974483f * (x - abs_y) / (x + abs_y);
		else
			angle = 2.356194490192345f - 0.7853981633974483f * (x + abs_y) / (abs_y - x);
		return y < 0 ? -angle : angle;
	}

Source: http://dspguru.com/comp.dsp/tricks/alg/fxdatan2.htm

Gah, my sin lookup stuff sucks. Can anyone share a good implementation?

My sin/cos lookup tables are super basic, but they do the job for me:


static final float TO_RADIANS = (float) (Math.PI / 180.0); // degrees -> radians

float[] sinVals = new float[360];
float[] cosVals = new float[360];
for (int i = 0; i < 360; i++) {
	sinVals[i] = (float) Math.sin(i * TO_RADIANS);
	cosVals[i] = (float) Math.cos(i * TO_RADIANS);
}
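One thing to watch: the index needs wrapping for negative or >= 360 degree angles, e.g. (my addition, assuming an int angleDegrees):

int deg = ((angleDegrees % 360) + 360) % 360; // wrap into [0, 360)
float s = sinVals[deg];
float c = cosVals[deg];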

What about this one:


   private static final int     SIN_BITS              = 12; // adjust these .......
   private static final int     ATAN2_BITS            = 7; // adjust these .......

   private static final float   RAD                   = (float) Math.PI / 180.0f;
   private static final float   DEG                   = 180.0f / (float) Math.PI;

   private static final int     SIN_MASK              = ~(-1 << SIN_BITS);
   private static final int     SIN_COUNT             = SIN_MASK + 1;

   private static final float   radFull               = (float) (Math.PI * 2.0);
   private static final float   degFull               = (float) (360.0);
   private static final float   radToIndex            = SIN_COUNT / radFull;
   private static final float   degToIndex            = SIN_COUNT / degFull;

   private static final float[] sin                   = new float[SIN_COUNT];
   private static final float[] cos                   = new float[SIN_COUNT];

   private static final int     ATAN2_BITS2           = ATAN2_BITS << 1;
   private static final int     ATAN2_MASK            = ~(-1 << ATAN2_BITS2);
   private static final int     ATAN2_COUNT           = ATAN2_MASK + 1;
   private static final int     ATAN2_DIM             = (int) Math.sqrt(ATAN2_COUNT);

   private static final float   INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);

   private static final float[] atan2                 = new float[ATAN2_COUNT];

   static
   {
      for (int i = 0; i < SIN_COUNT; i++)
      {
         sin[i] = (float) Math.sin((i + 0.5f) / SIN_COUNT * radFull);
         cos[i] = (float) Math.cos((i + 0.5f) / SIN_COUNT * radFull);
      }

      for (int i = 0; i < ATAN2_DIM; i++)
      {
         for (int j = 0; j < ATAN2_DIM; j++)
         {
            // 0 .. 1
            float x0 = (float) i / ATAN2_DIM;
            float y0 = (float) j / ATAN2_DIM;

            atan2[j * ATAN2_DIM + i] = (float) Math.atan2(y0, x0);
         }
      }
   }

   public static final float sin(float rad)
   {
      // scale radians to a table index; masking wraps any angle into range
      return sin[(int) (rad * radToIndex) & SIN_MASK];
   }

   public static final float cos(float rad)
   {
      return cos[(int) (rad * radToIndex) & SIN_MASK];
   }

   public static final float atan2(float y, float x)
   {
      float add, mul;

      if (x < 0.0f)
      {
         if (y < 0.0f)
         {
            x = -x;
            y = -y;

            mul = 1.0f;
         }
         else
         {
            x = -x;
            mul = -1.0f;
         }

         add = -3.141592653f;
      }
      else
      {
         if (y < 0.0f)
         {
            y = -y;
            mul = -1.0f;
         }
         else
         {
            mul = 1.0f;
         }

         add = 0.0f;
      }

      // scale so the larger of x, y maps onto the last row/column of the
      // table (note: x == y == 0 will divide by zero here)
      float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);

      int xi = (int) (x * invDiv);
      int yi = (int) (y * invDiv);

      return (atan2[yi * ATAN2_DIM + xi] + add) * mul;
   }
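For scale, if I read the constants right: SIN_BITS = 12 gives two 4096-entry sin/cos tables, and ATAN2_BITS = 7 gives one 128x128 (16384-entry) atan2 table. Usage is just:

float s = sin(1.5f);          // any angle in radians; the mask wraps it
float c = cos(1.5f);
float a = atan2(-1.0f, 0.5f); // the add/mul sign logic restores the quadrant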

What kind of error do you need? How much memory are you willing to use?

Approaches can be roughly classified as:

  1. Single array lookup after quantisation - Ranger’s and Riven’s implementations
  2. Multiple array lookups with interpolation (i.e. two lookups with linear interpolation, 3 with quadratic, …)
  3. Multiple array lookups with angle addition formulae - I’ll come back to this
  4. Direct approximation by polynomial / rational function - DzzD's implementation

On #3: sin(A + B) = sin(A) cos(B) + cos(A) sin(B), cos(A + B) = cos(A) cos(B) - sin(A) sin(B)
One approach would be to multiply the input x by (1 << 24) / (2 * Pi), mask off the bottom three bytes individually and do six lookups (total table space required: 256 * 6 entries, probably optimisable to 256 * 3 if you want to trade memory for some subtractions). 8 multiplications and 4 additions gets you sin x and cos x (or 6 multiplications and 3 additions if you only want one of them).
With a bit more memory (16k entries if you’re using 24 bits of the input) you can reduce that to four lookups, two multiplications, and one addition for sin, same for cos (although obv. if doing both you don’t double the lookups).
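Here's a sketch of the first variant in Java, just to make it concrete (my own illustration of the above, untested; negative angles rely on the two's-complement wrap):

private static final float[] sinHi = new float[256], cosHi = new float[256];
private static final float[] sinMid = new float[256], cosMid = new float[256];
private static final float[] sinLo = new float[256], cosLo = new float[256];

private static final float RAD_TO_FIXED = (float) ((1 << 24) / (2.0 * Math.PI));

static {
	for (int i = 0; i < 256; i++) {
		// each table holds sin/cos of one byte's contribution to the angle
		sinHi[i] = (float) Math.sin(i * 2.0 * Math.PI / (1 << 8));
		cosHi[i] = (float) Math.cos(i * 2.0 * Math.PI / (1 << 8));
		sinMid[i] = (float) Math.sin(i * 2.0 * Math.PI / (1 << 16));
		cosMid[i] = (float) Math.cos(i * 2.0 * Math.PI / (1 << 16));
		sinLo[i] = (float) Math.sin(i * 2.0 * Math.PI / (1 << 24));
		cosLo[i] = (float) Math.cos(i * 2.0 * Math.PI / (1 << 24));
	}
}

public static float sin(float rad) {
	int f = (int) (rad * RAD_TO_FIXED); // 24-bit fixed-point angle, wraps mod 2*PI
	int a = (f >> 16) & 0xFF, b = (f >> 8) & 0xFF, c = f & 0xFF;
	// sin/cos of (A + B) by the angle-addition formulae...
	float sAB = sinHi[a] * cosMid[b] + cosHi[a] * sinMid[b];
	float cAB = cosHi[a] * cosMid[b] - sinHi[a] * sinMid[b];
	// ...then add C: six lookups, 6 multiplications, 3 additions for sin alone
	return sAB * cosLo[c] + cAB * sinLo[c];
}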

I’d suggest timing random table reads (if possible) on the lowest end target. Many modern processors (including those used in embedded systems) will have large timing hits for reading from big tables and an approximation will be faster.

For approximations, truncated power series are pretty bad (bang for your buck) in float or fixed point. Think minimax, generalized Padé, et al.

(edit: fixed typo)

NB: I think you would be surprised at how precise and fast the Taylor series (the long version) can be, since addition and multiplication are usually very fast on any CPU, especially if you inline it (which avoids a method call). It can be nearly as fast as a lookup table (given cache misses on random array access and the other costs of array indexing). You should really give it a try.

EDIT: Another problem you may have with the lookup table is casting from a floating point value to int (to index the lookup array), as casting from floating point to integer is terribly slow.

EDIT2:

About the minimax: you should have looked at the original source code.

    // Taylor coefficients for cosine: f2k = (-1)^k / (2k)!
    private static final double PI  = Math.PI;
    private static final double PI2 = Math.PI * 2.0;
    private static final double f2  = -1.0 / 2.0;
    private static final double f4  =  1.0 / 24.0;
    private static final double f6  = -1.0 / 720.0;
    private static final double f8  =  1.0 / 40320.0;
    private static final double f10 = -1.0 / 3628800.0;
    private static final double f12 =  1.0 / 479001600.0;
    private static final double f14 = -1.0 / 87178291200.0;
    private static final double f16 =  1.0 / 20922789888000.0;

    public static final double cos(double x)
    {
        if (x < 0.0) x = -x; // cosine is even

        if (x < PI2)
        {
            if (x < PI)
            {
                double x2 = x * x;
                return 1.0 + x2 * (f2 + x2 * (f4 + x2 * (f6 + x2 * (f8 + x2 * (f10 + x2 * (f12 + x2 * (f14 + x2 * f16)))))));
                // MORE PRECISE, BUT NOT REPRODUCIBLE ON THE MS JVM: extend the series with f18 and f20 terms
            }
            else
            {
                // cos(x) = -cos(x - PI)
                x -= PI;
                double x2 = x * x;
                return -(1.0 + x2 * (f2 + x2 * (f4 + x2 * (f6 + x2 * (f8 + x2 * (f10 + x2 * (f12 + x2 * (f14 + x2 * f16))))))));
            }
        }

        // reduce into [-PI, PI) and use the half-period identity again
        x %= PI2;
        x -= PI;
        double x2 = x * x;
        return -(1.0 + x2 * (f2 + x2 * (f4 + x2 * (f6 + x2 * (f8 + x2 * (f10 + x2 * (f12 + x2 * (f14 + x2 * f16))))))));
    }

Humm. I’m only seeing truncated series in the other thread.

Yes, sure, the series is truncated (it must be), but the good thing is that cosine is symmetric and periodic, so you can always map an input angle into the range 0 to 0.5*PI (using, for example, a modulo) and recover the result from there. That way you are always using something like the 'best range' for the series, so it just gives you a good result.

Sorry, I’m not being clear. Truncated power series have exploding error away from the expansion point. Here are a couple of quick references:

minimax
Remez

In an attempt to clarify, here's a quick example for sine.

All these numbers are bad; I'm mucking around in a computer algebra program. (Below, x2 means x*x.)

F1: x*(1 + x2*(-0.166667f + 0.00833333f * x2))
F2: x*(1 + x2*(-0.16605f + 0.00761f * x2))
F3: x*(0.999892f + x2*(-0.165961f + 0.00760319f * x2))

“F1” is a truncated power series (again numbers are bad and only for visual comparison) and the other two are minimax approximations with the following design:

input range = +/- Pi/2
minimize absolute error

“F2” was simply given one fewer coefficient to play around with.

AbsErrorMax(F1) = 0.00452486 @ Pi/2
AbsErrorMax(F2) = 0.000121018 @ 0.833326
AbsErrorMax(F3) = 0.0000809254 @ 0.868416
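To make "F3" usable over any input, here's a sketch with range reduction bolted on (the reduction is my addition; the coefficients are the rough ones above):

static final float PI = (float) Math.PI;
static final float PI2 = (float) (2.0 * Math.PI);

static float sinMinimax(float x) {
	// reduce to roughly (-PI, PI]
	x %= PI2;
	if (x > PI) x -= PI2;
	else if (x < -PI) x += PI2;
	// fold into [-PI/2, PI/2], the range the approximation was designed for
	if (x > PI / 2) x = PI - x;
	else if (x < -PI / 2) x = -PI - x;
	float x2 = x * x;
	return x * (0.999892f + x2 * (-0.165961f + 0.00760319f * x2)); // "F3"
}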

OK, you are right; at first I misunderstood and read 'min & max' rather than 'minimax'. But... it comes to the same thing in the particular case of cosine, as the minimax error becomes more and more noticeable the farther you get from 0, so clamping the input into 0 to 0.5*PI makes the error so small that the real Math.cos may introduce bigger errors than the Taylor series. An example is given in the original source code, where:

MathX.cos(-Math.PI*0.25) = 0.7071067811865476
Math.cos(-Math.PI*0.25)  = 0.7071067811865476

MathX.cos(-Math.PI*0.5)  = 0.0
Math.cos(-Math.PI*0.5)   = 6.123233995736766E-17

What I finally mean is just that Math.cos will also introduce relative error compared with the true cosine,

and the Taylor series for cosine (because cosine is periodic) will have a very, very tiny error and will in any case give a better result than a lookup for the same time cost.

EDIT: As you can see in your own sample, even with only three polynomial coefficients it gives quite a good result, so imagine with 10.

Riven, thanks for the lookup code (I was using modulus instead of masking…).

Your atan2 lookup, run on my desktop, takes about 1.6 times longer than the atan2 I posted above. However, yours is quite a bit more accurate. It is good to have options! Both are waaay faster (5+ times) than Math.atan2. That was on the desktop; on the G1 I see (10k iterations with 20k precalculated random numbers):
Math.atan2: 159.66797
Riven’s atan2: 58.86841
Nate’s atan2: 48.49243
So, a smaller difference all around.

I played with Riven’s lookup table and DzzD’s low precision formula. Benchmark code:

// warm up first, then log three timed runs
for (int i = 0; i < 10; i++)
	test(false);
test(true);
test(true);
test(true);

public void test (boolean log) {
	int count = 50000;

	long s, e;
	Random random = new Random();

	float[] numbers = new float[count];
	for (int i = 0; i < count; i++)
		numbers[i] = random.nextFloat();

	// note: results are discarded; a JIT could in principle optimise these
	// loops away, so treat the numbers as rough
	s = System.nanoTime();
	for (int i = 0; i < count; i++)
		Math.cos(numbers[i]);
	e = System.nanoTime();
	if (log) System.out.println("Java: " + (e - s) / 1000000f);

	s = System.nanoTime();
	for (int i = 0; i < count; i++)
		FastMath2.cos(numbers[i]); // Riven
	e = System.nanoTime();
	if (log) System.out.println("Riven: " + (e - s) / 1000000f);

	s = System.nanoTime();
	for (int i = 0; i < count; i++)
		FastMath.cos(numbers[i]); // DzzD
	e = System.nanoTime();
	if (log) System.out.println("DzzD: " + (e - s) / 1000000f);

	if (log) {
		System.out.println("Input: " + numbers[3]);
		System.out.println("DzzD: " + FastMath.cos(numbers[3]));
		System.out.println("Riven: " + FastMath2.cos(numbers[3]));
		System.out.println("Java: " + Math.cos(numbers[3]));
		System.out.println();
	}
}

Results on my G1:

[quote]Speed:

Java: 327.54517
Riven: 80.47485
DzzD: 149.93286

Java: 330.07813
Riven: 79.10156
DzzD: 150.08545

Java: 331.29883
Riven: 80.62744
DzzD: 150.23804

Accuracy:

Input: 0.16849637
DzzD: 0.9169282
Riven: 0.98592603
Java: 0.9858380402606572

Input: 0.068103254
DzzD: 0.96614116
Riven: 0.99767107
Java: 0.9976818695836023

Input: 0.99370223
DzzD: 0.5429537
Riven: 0.5459677
Java: 0.5455909445089435
[/quote]
Accuracy probably varies more when the input is not between 0 and 1.
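A one-line change to the input setup above would check that (my suggestion; I haven't run it):

for (int i = 0; i < count; i++)
	numbers[i] = (random.nextFloat() - 0.5f) * 100f; // roughly [-50, 50)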

Edit: Anyone have a good solution for sqrt?