Arithmetic Performance

Riven · August 9, 2009, 12:07pm

@SIMD

FYI, the Sun JVM doesn’t have any SIMD code. Maybe in System.arraycopy() but that’s it.

Because ‘throughput’ cannot be expressed in ‘cycles’. Throughput is the inverse of the number of cycles it takes to perform some operation.

Appears. Yeah. Vague handwaving is a perfectly solid foundation of an argument. To make all data fit in cache, reduce the ‘elems’ variable and test again… I hope I’m not asking too much.

Further, more than 75% of the operations are on [local variables + a 4x4 matrix], if that is going to be memory bound…

Running with 256 ‘elems’ (2x float[256] = 2K RAM and 2x double[256] = 4K RAM)


float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 13ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
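
For anyone wanting to reproduce this kind of cache-resident comparison, here is a minimal sketch. It is not the benchmark that produced the numbers above: the class name, the `dotF`/`dotD` kernels and the run counts are made up for illustration, and "midpoint" is assumed to mean the median of the sorted timings.

```java
import java.util.Arrays;

class CacheFitTiming {
    // Hypothetical kernel: dot product over float arrays.
    static float dotF(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Same kernel in double precision.
    static double dotD(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Median of the samples (assumed meaning of "midpoint").
    static long median(long[] samples) {
        long[] s = samples.clone();
        Arrays.sort(s);
        return s[s.length / 2];
    }

    public static void main(String[] args) {
        int elems = 256; // 2x float[256] = 2K, 2x double[256] = 4K
        int runs = 31;
        int reps = 20000;

        float[] fa = new float[elems], fb = new float[elems];
        double[] da = new double[elems], db = new double[elems];
        for (int i = 0; i < elems; i++) {
            fa[i] = fb[i] = (float) Math.random();
            da[i] = db[i] = Math.random();
        }

        long[] tf = new long[runs], td = new long[runs];
        float sinkF = 0f;
        double sinkD = 0.0;
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            for (int k = 0; k < reps; k++) sinkF += dotF(fa, fb);
            long t1 = System.nanoTime();
            for (int k = 0; k < reps; k++) sinkD += dotD(da, db);
            long t2 = System.nanoTime();
            tf[r] = t1 - t0;
            td[r] = t2 - t1;
        }

        System.out.println("float  median: " + median(tf) / 1000 + "us");
        System.out.println("double median: " + median(td) / 1000 + "us");
        // Print the sinks so the JIT cannot discard the loops.
        System.out.println(sinkF + " " + sinkD);
    }
}
```

Taking the median rather than the mean keeps the occasional slow run (JIT compilation, GC, scheduler noise) from skewing the result.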

DzzD · August 9, 2009, 12:18pm

[quote]I’d be happy to be proven wrong with a counter-example or pointer to CPU specification!
[/quote]
I will try to when I get some time

The transition from float to double is the same story as the old transition (before the FPU was born) of the CPU word from 16-bit to 32-bit, and the present/near future is the transition from 32-bit to 64-bit.

15 to 20 years ago, accessing a 16-bit word was faster; today it has become slower, because all architectures have since moved to 32-bit and been designed around it.

Roquen · August 9, 2009, 4:16pm

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

sin core is between 1.5-1.8 times faster in float.
ortho polynomial is between 1.2-1.5 times faster.


public class Foo
{
  public static float approxF(float x)
  {
    float x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double approxD(double x)
  {
    double x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double evalD(int n, double x)
  {
    double r = 1;
    double a = 1;
    double b = x;
    int    i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static float evalF(int n, float x)
  {
    float r = 1;
    float a = 1;
    float b = x;
    int   i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static void main(String[] args)
  {
    double x = 1.5707;
    int    i;

    try {
      if (args.length != 0)
        x = Double.parseDouble(args[0]);
    }
    catch(Exception e) {}

    float  f = approxF((float)x);
    double d = approxD(x);
    int    e = 100000;

    long t0 = System.nanoTime();

    for(i=0; i<e; i++)
      d = approxD(d);

    long t1 = System.nanoTime();
    
    for(i=0; i<e; i++)
      f = approxF(f);

    long t2 = System.nanoTime();

    t0 = t1-t0;
    t1 = t2-t1;
    
    System.out.printf("double = %d (x %f)\n", t0, (1.f*t0)/t1);
    System.out.printf("float  = %d\n", t1);

    e = 20000;
    d = evalD(10, d);
    f = evalF(10, f);

    t0 = System.nanoTime();

    for(i=10; i<e; i++)
      d += evalD(i, .332);
    
    t1 = System.nanoTime();
    
    for(i=10; i<e; i++) 
      f += evalF(i, .332f);
    
    t2 = System.nanoTime();
    
    t0 = t1-t0;
    t1 = t2-t1;

    System.out.printf("double = %d (x %f)\n", t0, 1.0*t0/t1);
    System.out.printf("float  = %d\n", t1);

    System.out.printf("%f\n", f);
    System.out.printf("%f\n", d);
  }
}

@DzzD: I started programming in the 8-bit days, but that isn’t my problem. I use doubles and multi-precision elements all the time…when needed.

There are three important trends that form my opinion about doubles being unlikely to outperform floats on consumer hardware in my lifetime. The first two have already been mentioned: SIMD and the speed gap with main memory. The other is die size reductions. Making the channels narrower requires increased energy consumption (and therefore heat). Thus it is becoming more and more important to shut down subsystems which are not in use to reduce heat (and battery drain in the case of notebooks). Computation of doubles requires wider data paths and more stages (to slightly vulgarize).

In real life I do a fair amount of low level and optimization work so I read a fair number of CPU design specs and semi-keep up with hardware design research and am seeing nothing to contradict this opinion. However I’ll admit that I never expected to see anything like the new decimal formats from IEEE 754-2008 either.

Riven · August 9, 2009, 4:40pm

Roquen:

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

sin core is between 1.5-1.8 times faster in float.
ortho polynomial is between 1.2-1.5 times faster.

  public static double approxD(double x)
  {
    double x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

Do you notice you are mixing doubles and floats here??

Roquen · August 9, 2009, 4:58pm

Oops. I’d like to say that was on purpose, but it wasn’t. Doesn’t change the timing though.

pjt33 · August 9, 2009, 6:55pm

Should that read IEEE 754-2008?

Roquen · August 9, 2009, 7:52pm

Yep, that should be 754. Probably more commonly known by its working title of 754r.

Nate · August 9, 2009, 8:51pm

For sqrt, I tried Float.floatToRawIntBits, but it is slow on Android. I also played with DzzD’s code, but as he indicated, it needs some smarts to be fast enough. Here is Riven’s benchmark code, modified to show DzzD’s sqrt algorithm:

	public static void testSqr (boolean log) {
		int count = 50000;

		long s, e;

		float[] numbers = new float[count];
		for (int i = 0; i < count; i++)
			numbers[i] = (float)(Math.random() * 65536) + 1;
// for (int i = 0; i < count / 2; i++)
// numbers[i] = (float)(Math.random() * -65536);

		// ensure the JVM doesn't optimize those silly calls away!!
		double[] java_dst = new double[count];
		float[] dzzd_dst = new float[count];
		float[] roquen_dst = new float[count];

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			java_dst[i] = Math.sqrt(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("FloatMath.sqrt:  " + (e - s) / 1000 + "us");

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			dzzd_dst[i] = sqrtDzzD(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("sqrtDzzD: " + (e - s) / 1000 + "us");

		if (log) {
			double dzzdAvgErr = 0.0;
			double dzzdMaxErr = 0.0;
			for (int i = 0; i < count; i++) {
				double dzzdErr = Math.abs(Math.sqrt(numbers[i]) - sqrtDzzD(numbers[i]));

				dzzdAvgErr += dzzdErr;
				if (dzzdErr > dzzdMaxErr) dzzdMaxErr = dzzdErr;
			}
			dzzdAvgErr /= count;

			System.out.println("Input: " + numbers[3]);
			System.out.println("DzzD: " + sqrtDzzD(numbers[3]) + ", avg.error=" + dzzdAvgErr + ", max.error=" + dzzdMaxErr);
			System.out.println("FloatMath: " + Math.sqrt(numbers[3]));
			System.out.println("~~~prevent opt. ~~~" + dzzd_dst[13] + "~~~" + java_dst[13] + "~~~" + roquen_dst[13]);
			System.out.println();
		}
	}

	static public float sqrtDzzD (float n) {
		if (n == 0) return 0;
		float r = n * 0.5f;
		float r2 = r * r;
		float d = r * 0.5f;
		float error;
		do {
			r += r2 > n ? -d : d;
			r2 = r * r;
			d *= 0.5f;
			if (d == 0) return r;
			error = r2 - n;
		} while ((error < 0 ? -error : error) > 1);
		return r;
	}

Roquen · August 10, 2009, 7:05am

On sqrt I have a couple of questions and maybe I can come up with a different initial guess for a N-R based method.

How many N-R steps can you perform before they equal the speed of the default sqrt? and 1/sqrt?
If they exist, Math.getExponent and Math.scalb can be used instead of bit inspection, but I’d expect them to be slow as well (worth checking).
Speed of conversion: fp to int and int to fp.
Speed of either lead or trailing zero count of 32 bit ints.
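
To make the "how many N-R steps" question concrete, here is a small sketch of the Newton-Raphson iteration for 1/sqrt(x). The class and method names are invented for illustration, and the starting guess is picked by hand rather than derived from the exponent. Each step costs three multiplies and a subtract (plus one multiply for the half of x, which can be hoisted), so the question becomes how many such steps fit in the time of one library sqrt call.

```java
class NewtonInvSqrt {
    // One Newton-Raphson step for r ~ 1/sqrt(x):
    //   r' = r * (1.5 - 0.5 * x * r * r)
    static float step(float x, float r) {
        return r * (1.5f - 0.5f * x * r * r);
    }

    // Refine an initial guess with a fixed number of steps.
    static float invSqrt(float x, float guess, int steps) {
        float r = guess;
        for (int i = 0; i < steps; i++) r = step(x, r);
        return r;
    }

    public static void main(String[] args) {
        float x = 2.0f;
        float guess = 0.7f; // hand-picked guess near 1/sqrt(2)
        double exact = 1.0 / Math.sqrt(x);
        for (int steps = 0; steps <= 3; steps++) {
            float r = invSqrt(x, guess, steps);
            System.out.println(steps + " steps: " + r + "  error=" + Math.abs(r - exact));
        }
        // sqrt(x) itself can be recovered as x * invSqrt(x).
        System.out.println("sqrt ~ " + x * invSqrt(x, guess, 3));
    }
}
```

The error roughly squares with every step, so a better initial guess saves whole iterations. That is why the quality of the guess matters more than shaving cycles off the step itself.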

pjt33 · August 10, 2009, 7:41am

Doesn’t change the bytecode - javac will create double entries in the constant pool even though the constants themselves were floats, because they’re extended before they’re used.
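
A quick way to see what that widening does to the value: the sketch below (class and method names are mine, and 0.1 is used instead of the polynomial coefficients because its behaviour is well known) compares a float literal stored in a double against the double literal written with the same digits.

```java
class FloatLiteralWidening {
    // True if the float, once widened to double, equals the given double.
    static boolean sameAsDouble(float f, double d) {
        return (double) f == d;
    }

    public static void main(String[] args) {
        double fromFloat = 0.1f;  // rounded to float first, then widened
        double fromDouble = 0.1;  // rounded to double directly

        System.out.println(fromFloat);
        System.out.println(fromDouble);
        System.out.println("0.1f vs 0.1: " + sameAsDouble(0.1f, 0.1));
        // Values exactly representable in float survive the widening unchanged.
        System.out.println("0.5f vs 0.5: " + sameAsDouble(0.5f, 0.5));
    }
}
```

So the double version of the polynomial runs on double hardware paths but with coefficients that only carry float accuracy. That leaves the timing comparison valid, while the double result would be no more accurate than the float one.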

Trailing zero count is pretty fast. Code is adapted from http://graphics.stanford.edu/~seander/bithacks.html

private static final int[] trailLut = new int[] {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };

public static int countTrailingBits(int x) {
  return trailLut[((x & -x) * 0x77cb531) >>> 27];
}

Finding the integer log base 2 of an integer is also covered in the page linked, although it’s a bit slower.
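
For comparison, here is a sketch that checks the table lookup against the JDK 1.5 `Integer` methods and gets the integer log base 2 from a leading-zero count. The class name and the `log2` helper are invented for the example; whether the library calls are faster than the table on any given VM would need measuring.

```java
class BitCounts {
    private static final int[] TRAIL_LUT = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

    // De Bruijn lookup: index of the lowest set bit of x (x != 0).
    static int countTrailingBits(int x) {
        return TRAIL_LUT[((x & -x) * 0x77cb531) >>> 27];
    }

    // floor(log2(x)) for x > 0, via the leading-zero count.
    static int log2(int x) {
        return 31 - Integer.numberOfLeadingZeros(x);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 32; i++) {
            int x = 1 << i;
            if (countTrailingBits(x) != Integer.numberOfTrailingZeros(x))
                System.out.println("mismatch at bit " + i);
        }
        System.out.println("log2(1000) = " + log2(1000));
    }
}
```

One difference to be aware of: for x == 0 the table version returns 0, while `Integer.numberOfTrailingZeros(0)` returns 32.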

Roquen · August 10, 2009, 7:57am

I finally got ‘unlazy’ and did a quick web search on Android devices. All that I saw are based on the ARM-11. So lead zero counting is a hardware instruction and should be fast (unless someone forgot to hook it in).

I didn’t get any idea about float support. Not being able to find tech specs is a big pet peeve of mine. The ARM-11 does not have an FPU, but ARM provides various FPUs as coprocessors.

I’ll try to throw together a zero counting base guess version. (on the log-2, I guess you see where I’m going with the zero counting).

pjt33 · August 10, 2009, 10:27am

When you say “forgot to hook it in” - to what? There isn’t a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don’t see why it should be there - the spec surely isn’t designed around a particular CPU?

Re log 2 - yes, it was pretty obvious. I implemented a fixed-point sqrt once, but I can’t remember what I did about the initial guess; just that it involved a lookup table.
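
As an illustration of the log-2 idea (not the lookup-table version mentioned above, whose details are not given), here is a sketch of an integer square root that takes its initial guess from the leading-zero count and then refines with Newton steps. The class and method names are hypothetical.

```java
class IntSqrt {
    // floor(sqrt(n)) for n >= 0.
    static int isqrt(int n) {
        if (n < 2) return n;
        // Number of significant bits in n, so n < 2^bits.
        int bits = 32 - Integer.numberOfLeadingZeros(n);
        // 2^ceil(bits/2) is always above sqrt(n), so the iteration
        // below only ever moves downwards until it settles.
        int r = 1 << ((bits + 1) >> 1);
        while (true) {
            int next = (r + n / r) >> 1; // Newton step for r*r = n
            if (next >= r) return r;     // stopped decreasing: done
            r = next;
        }
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 15, 16, 17, 1000000, Integer.MAX_VALUE};
        for (int n : samples)
            System.out.println("isqrt(" + n + ") = " + isqrt(n));
    }
}
```

Starting from a power of two just above the true root keeps the guess within a factor of two, so only a handful of divisions are needed. A lookup table on the top few bits would tighten the guess further at the cost of a memory access.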

Roquen · August 10, 2009, 10:54am

That Integer.numberOfLeadingZeros is replaced at link time by a native method, rather than executing the bytecode. I haven’t looked into the guts of Dalvik, but I’d expect it to do this.

Hansdampf · August 10, 2009, 11:04am

for those who need a faster integer sqrt:
http://atoms.alife.co.uk/sqrt/SquareRoot.java
I did not test it under Android. With jre1.6 the performance factor is 2.4 (5.3 with -server)

pjt33 · August 10, 2009, 11:32am

Didn’t realise it was in Integer. Ooh. I’ll have to look and see what else they added in 1.5.

Roquen · August 12, 2009, 1:59pm

This is a quick hack to test a workaround for slow floatToRawIntBits on Android. As I don’t have hardware I can’t do timing tests myself. If someone with hardware is willing to test please try timing “getExponent” against “timeMe”.

Has the following methods: getExponent, getSignificand, scalb and a first pass isqrt. These were thrown together so probably have bugs and isqrt is a quick pass of making a reference version (included) not completely dog-slow.


  /**
   * Float value x = f 2<sup>k</sup>, where 1 &le; f &lt; 2 and k is an integer.
   * <p>
   * Returns k.
   * <p>
   * Differs from JDK version in that the exponents of denormals
   * are returned as if representable.
   * @see #getSignificand(float)
   */

  public static int getExponent(float x)
  {
    int bs = 31;

    if (x < 0)
      x = -x;
    
    if (x < Float.POSITIVE_INFINITY) {
      if (x >= 1) {
        while (x > 0x1p30) {
          x  *= 0x1p-30;
          bs += 30;
        }
      }
      else {
        if (x == 0) return 0;

        do {
          x  *= 0x1p30;
          bs -= 30;
        } while(x < 1);
      }
      
      return bs-Integer.numberOfLeadingZeros((int)x);
    }

    // x is +/-infinity or NaN
    return 128;
  }
  


  /**
   * Float value x = f 2<sup>k</sup>, where 1 &le; f &lt; 2 and k is an integer.
   * <p>
   * Returns abs(f).
   * @see #getExponent(float)
   */
  public static float getSignificand(float x)
  {    
    if (x < 0) x = -x;
    
    // infinity and NaN
    
    if (x >= 1) {
      if (x >= 0x1p30) {
        if (x == Float.POSITIVE_INFINITY) return -1;
        do {
          x *= 0x1p-30;
        } while (x > 0x1p30);
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x *= 0x1p30;
      } while(x < 1);
    }
    
    if (x < 2) return x;
    
    // scale to [1,2)
    int bs = Integer.numberOfLeadingZeros((int)x);
    x  *= (1<<bs) * 0x1p-31f;
    
    return x;
  }

  /**
   * Returns: x * 2<sup>n</sup>
   */
  public static float scalb(float x, int n)
  {
    if (n >= 0) {
      while(n > 30) {
        x *= 0x1p30f;
        n -= 30;
      }
      return x * (1 << n);
    }
    
    while(n < -30) {
      x *= 0x1p-30f;
      n += 30;
    }
    
    return x * (1<<(31+n)) * 0x1p-31f;
  }

  // constants for 1st order inverse sqrt guess
  private static final float a0 =  0x1.439cfep+0f;
  private static final float a1 = -0x1.253f20p-2f;

  // constants for 2nd order inverse sqrt guess
  private static final float b0 =  0x1.946338p+0f;
  private static final float b1 = -0x1.7605f6p-1f;
  private static final float b2 =  0x1.2e76d0p-3f;
  
  /** sqrt(2)/2 */
  private static final float sqrt2o2 = 0x1.6a09e6p-1f;
  
  // reference version: shows what's happening in isqrt.
  public static float isqrtRef(float x)
  {
    // get same info as Float.floatToRaw bits
    // (but in fp instead of integer)
    float man = getSignificand(x);
    int   exp = getExponent(x);
    
    float r;

    // make an initial guess:  1 <= man < 2
    // two examples:
  //r =  a0 + a1 * man;
    r =  b0 + man*(b1 + man*b2);
   
    // apply the exponent term: 1/sqrt(2^k) = 2^(-k/2)
    if (exp != 0) {
      r = scalb(r,-(exp>>1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }
    
    float hx  = x * 0.5f;

    // perform some number of N-R steps
    r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
    
    return r;
  }
  
  // temp hack for unoptimized isqrt: skips the scaling loops in scalb,
  // so it is only valid for -31 <= n <= 30
  private static float scalb_temp_hack(float x, int n)
  {
    if (n >= 0)
      return x * (1 << n);
  
    return x * (1<<(31+n)) * 0x1p-31f;
  }
  
  public static float isqrt(float y)
  {
    float x  = y;
    int   bs = 31;

    // scale if needed to bring into integer range
    if (x >= 1) {
      while (x > 0x1p30) {
        x  *= 0x1p-30;
        bs += 30;
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x  *= 0x1p30;
        bs -= 30;
      } while(x < 1);
    }

    // isolate exponent and significand
    int   lzc = Integer.numberOfLeadingZeros((int)x);
    int   exp = bs-lzc;
    float man = x;
    float hx = 0.5f*y;
    
    if (x >= 2)
      man  *= (1<<lzc) * 0x1p-31f;
    
    // make an initial guess
     float r =  a0 + a1 * man;
   //float r =  b0 + man*(b1 + man*b2);

    // apply the exponent term: 1/sqrt(2^k) = 2^(-k/2)
    if (exp != 0) {
      r = scalb_temp_hack(r, -(exp >> 1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }

    // perform some number of N-R steps
    r = r * (1.5f - hx * r * r);
    //r = r * (1.5f - hx * r * r);
    //r = r * (1.5f - hx * r * r);
    
    return r;
  }
   
  @SuppressWarnings("boxing")
  public static float timeSqrt()
  {
    float x,r0=0;
    x = 65536;

    long t0,t1;

    // test some large values
    x  = 0x1.133321p122f;
    t0 = System.nanoTime();
    
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
      r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p122f * 2);

    t1 = System.nanoTime();
    
    System.out.printf("time (big)   = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
    
    // test some small values
    x  = 0x1.133321p-122f;
    t0 = System.nanoTime();
    
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
      r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p-122f * 2);

    t1 = System.nanoTime();
    
    System.out.printf("time (small) = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
    
    return r0;
  }

  public static float timeMe(float f)
  {
    return Float.intBitsToFloat(Float.floatToRawIntBits(f) + 1);
  }

Nate · August 15, 2009, 4:46am

It took me a bit to figure out the 1.6 Math.nextUp call was causing Android to throw VerifyError!
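
For anyone else hitting this, a possible stand-in for `Math.nextUp` that avoids the 1.6 API is sketched below. It uses the same bit trick as `timeMe` above. The class and method names are invented, it only handles finite, non-negative inputs, and it still goes through `floatToRawIntBits`, which was reported as slow on Android earlier in the thread, so it is for keeping the test loop compiling rather than for speed.

```java
class NextUpCompat {
    // Next representable float above f.
    // Assumes f is finite and >= 0; adding 0.0f turns -0.0f into +0.0f.
    static float nextUpPositive(float f) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(f + 0.0f) + 1);
    }

    public static void main(String[] args) {
        float x = 1.0f;
        System.out.println(nextUpPositive(x));
        System.out.println(nextUpPositive(0.0f));
    }
}
```

This works because, for non-negative floats, the IEEE 754 bit patterns are ordered the same way as the values, so adding one to the integer representation steps to the next float.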

Here are the milliseconds I got on the G1 with 10000 iterations (passing the i to the method):

[quote]FloatMath.sqrt: 49.102783
isqrt: 134.52148

FloatMath.sqrt: 45.98999
isqrt: 139.4043
[/quote]
Maybe FloatMath.sqrt is the fastest we are going to get? I don’t have any proof it is a bottleneck.

What do you guys think of these methods? I forgot where I scrounged them up.

	/**
	 * Fixed point multiply.
	 */
	static public int multiply (int x, int y) {
		long z = (long)x * (long)y;
		return ((int)(z >> 16));
	}

	/**
	 * Fixed point divide.
	 */
	static public int divide (int x, int y) {
		return (int)(((((long)x) << 32) / y) >> 16);
	}

pjt33 · August 15, 2009, 10:41am

Not sure why you wouldn’t just do:

	/**
	 * Fixed point divide.
	 */
	static public int divide (int x, int y) {
		return (int)((((long)x) << 16) / y);
	}

Not that I tend to favour 16.16 fixed point anyway. 24.8 is usually good enough and most of the time it allows you to fit intermediate results in 32 bits.
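
To make the 16.16 arithmetic easier to try out, here is a small self-contained sketch that wraps the multiply and both divide variants from this thread. The class name, `toFixed`, `toDouble` and `divideWide` are names I made up for the example.

```java
class Fixed16 {
    static final int ONE = 1 << 16; // 1.0 in 16.16

    static int toFixed(double v)  { return (int) Math.round(v * ONE); }
    static double toDouble(int f) { return f / (double) ONE; }

    // 16.16 multiply: widen to 64 bits, then drop the extra 16 fraction bits.
    static int multiply(int x, int y) {
        return (int) (((long) x * (long) y) >> 16);
    }

    // The original divide: shift by 32, divide, shift back by 16.
    static int divideWide(int x, int y) {
        return (int) (((((long) x) << 32) / y) >> 16);
    }

    // The simpler divide: pre-shift the dividend by 16 only.
    static int divide(int x, int y) {
        return (int) ((((long) x) << 16) / y);
    }

    public static void main(String[] args) {
        int a = toFixed(3.0), b = toFixed(2.0);
        System.out.println(toDouble(multiply(a, b)));
        System.out.println(toDouble(divide(a, b)));
        System.out.println(toDouble(divideWide(a, b)));
        // Raw values for a negative quotient that is not exact.
        System.out.println(divide(-1, 3) + " vs " + divideWide(-1, 3));
    }
}
```

The two divides agree for non-negative quotients. For negative, inexact quotients they can differ in the last bit, because the integer division truncates toward zero while the final `>> 16` in the wide version rounds toward negative infinity.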

Nate · August 15, 2009, 7:39pm

Thanks pjt33. Your method seems to work just as well and is simpler.

In general, I haven’t found using fixed point to be any faster than just using floats, even though the G1 doesn’t have an FPU. I only use fixed point with OpenGL ES, which is 16.16. I found any real number crunching is going to have to be native code. I didn’t try using fixed point in native code as it doesn’t seem to be a bottleneck.

pjt33 · August 15, 2009, 9:19pm

Looking at some of my cryptic comments around some fixed point code I have it seems that I was at one point convinced that shifting stuff as far left as possible before division was more accurate. However, what I was doing was the equivalent of (this code for 16.16):

return (int)((((long)x) << 32) / (y << 16));

Shrug. Actually what I was doing was a bit more complicated (overflow detection was included). It was for Java4k, and I discovered that using fixed point I got better compression. There was plenty of integer arithmetic going on, so using only integer arithmetic rather than a mix of integer and float meant lower entropy in the bytecodes used.

I have actually worked in a company which strongly encouraged use of fixed point for various reasons; one case was networked multiplayer physics engines, which otherwise required strictfp for consistency.