Arithmetic Performance

There’s a widespread misconception that doubles are somehow ‘better’ than floats, and this is simply not the case. They merely have more precision, and you pay for that in memory usage and (usually) execution time (some ops have the same execution time for both). It’s true that some usages require the extra precision, but it’s likewise true that some usages require more precision than even doubles provide. I find this somewhat strange, because you don’t find people thinking that 64-bit integers are somehow ‘better’ than narrower integers; they happily use whichever is most appropriate for the given usage.

Not at all; they are simply not exposed at the high level. It’s the compiler’s job to use these instructions (the runtime’s, in the case of a VM). I’d expect vectorization support to improve, as runtimes like Mono have been working on this front. It’ll probably require a new compiler framework, as Sun’s seems to be showing its age (I haven’t paid attention to what the next generation is up to). Both VMKit and Shark/Zero are based on LLVM, so they have “promise” in this respect (although LLVM needs work on auto-vectorization as well). If we had a VM that performed a reasonable amount of vectorization, there would be a huge speed difference between floats and doubles on SIMD hardware. But even ignoring SIMD, scalar float operations are generally faster than double (and never slower) on all CPU architectures I’m familiar with. Couple this with the ever-increasing speed gap between main memory and the CPU, and moving data becomes more and more of an issue.

Not sure what doesn’t make sense. These are measures of the time required for the execution unit to complete the computation.

Your example appears to be memory bound. As a simple counter-example: the speed difference between float and double multiply is much narrower than for divide, but if you take my last minimax sin example and simply change float to double, you’re likely to see a speed difference of 1.2-2.0x depending on your hardware (assuming the Sun VM).

The question of float vs. double performance is really a hardware question (and the indicated thread was about hardware), and my original statement was to that effect. My follow-up was in reply to “IMHO double is probably faster or at least same speed (and for sure will become faster)”, which I claim is incorrect. Is it possible to have two otherwise identical routines where the double version is faster than the float one? Yes, but that will be a case where stalls in the double version are hidden by some other stall (probably a memory read or write) better than in the float version. This will be a rare occurrence, tightly coupled to an exact hardware configuration, and more of a desktop issue for CPUs with many functional units.

I’d be happy to be proven wrong with a counter-example or pointer to CPU specification! :wink:

@SIMD

FYI, the Sun JVM doesn’t have any SIMD code. Maybe in System.arraycopy() but that’s it.

Because ‘throughput’ cannot be expressed in ‘cycles’. Throughput is the inverse of the number of cycles it takes to perform some operation.
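To illustrate the latency/throughput distinction (a hypothetical sketch; the class and method names are mine, not from the thread): a dependent chain must pay the full instruction latency every iteration, while independent chains can retire at the (inverse) throughput rate because the pipeline overlaps them.

```java
public class LatencyVsThroughput {
    // Latency-bound: each multiply depends on the previous result,
    // so each iteration waits for the full instruction latency.
    static float dependentChain(float x, int n) {
        for (int i = 0; i < n; i++)
            x = x * 1.0000001f;
        return x;
    }

    // Throughput-limited: four independent chains can overlap in the
    // pipeline, so operations complete at the throughput rate.
    static float independentChains(float x, int n) {
        float a = x, b = x, c = x, d = x;
        for (int i = 0; i < n; i += 4) {
            a *= 1.0000001f;
            b *= 1.0000001f;
            c *= 1.0000001f;
            d *= 1.0000001f;
        }
        return a + b + c + d;
    }

    public static void main(String[] args) {
        // Same total multiply count; the second version typically runs
        // noticeably faster on a pipelined FPU.
        System.out.println(dependentChain(1f, 1000));
        System.out.println(independentChains(1f, 1000));
    }
}
```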

Appears. Yeah. Vague handwaving is a perfectly solid foundation for an argument. To make all the data fit in cache, reduce the ‘elems’ variable and test again… :slight_smile: I hope I’m not asking too much.

Further, more than 75% of the operations are on [local variables + a 4x4 matrix], if that is going to be memory bound…

Running with 256 ‘elems’ (2x float[256] = 2K RAM and 2x double[256] = 4K RAM)


float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 13ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)

[quote]I’d be happy to be proven wrong with a counter-example or pointer to CPU specification!
[/quote]
I will try to when I get some time :stuck_out_tongue:

The transition from float to double is the same story as the old transition (before the FPU was born) of the CPU word from 16-bit to 32-bit, and the present/near-future transition from 32-bit to 64-bit.

15-20 years ago accessing a 16-bit word was faster, and today it has become slower because all the architectures have now reached, and been designed around, 32 bits.

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

The sin core is between 1.5-1.8 times faster in float.
The orthogonal polynomial is between 1.2-1.5 times faster.


public class Foo
{
  public static float approxF(float x)
  {
    float x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double approxD(double x)
  {
    double x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double evalD(int n, double x)
  {
    double r = 1;
    double a = 1;
    double b = x;
    int    i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static float evalF(int n, float x)
  {
    float r = 1;
    float a = 1;
    float b = x;
    int   i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static void main(String[] args)
  {
    double x = 1.5707;
    int    i;

    try {
      if (args.length != 0)
	x = Double.parseDouble(args[0]);
    }
    catch(Exception e) {}

    float  f = approxF((float)x);
    double d = approxD(x);
    int    e = 100000;

    long t0 = System.nanoTime();

    for(i=0; i<e; i++)
      d = approxD(d);

    long t1 = System.nanoTime();
    
    for(i=0; i<e; i++)
      f = approxF(f);

    long t2 = System.nanoTime();

    t0 = t1-t0;
    t1 = t2-t1;
    
    System.out.printf("double = %d (x %f)\n", t0, (1.f*t0)/t1);
    System.out.printf("float  = %d\n", t1);

    e = 20000;
    d = evalD(10, d);
    f = evalF(10, f);

    t0 = System.nanoTime();

    for(i=10; i<e; i++)
      d += evalD(i, .332);
    
    t1 = System.nanoTime();
    
    for(i=10; i<e; i++) 
      f += evalF(i, .332f);
    
    t2 = System.nanoTime();
    
    t0 = t1-t0;
    t1 = t2-t1;

    System.out.printf("double = %d (x %f)\n", t0, 1.0*t0/t1);
    System.out.printf("float  = %d\n", t1);

    System.out.printf("%f\n", f);
    System.out.printf("%f\n", d);
  }
}

@DzzD: I started programming in the 8-bit days, but that isn’t my problem. I use doubles and multi-precision elements all the time…when needed.

There are three important trends that form my opinion that doubles are unlikely to outperform floats on consumer hardware in my lifetime. The first two have already been mentioned: SIMD and the speed gap with main memory. The third is die-size reduction. Making the channels narrower increases energy consumption (and therefore heat), so it is becoming more and more important to shut down subsystems that are not in use to reduce heat (and battery drain in the case of notebooks). Computing with doubles requires wider data paths and more stages (to oversimplify slightly).

In real life I do a fair amount of low-level and optimization work, so I read a fair number of CPU design specs and semi-keep up with hardware design research, and I’m seeing nothing to contradict this opinion. However, I’ll admit that I never expected to see anything like the new decimal formats from IEEE 754-2008 either.

Do you notice you are mixing doubles and floats here??

Oops. I’d like to say that was on purpose, but it wasn’t. It doesn’t change the timing though.

Should that read IEEE 754-2008?

Yep, that should be 754. It’s probably more commonly known by its working title of 754r.

For sqrt, I tried Float.floatToRawIntBits, but it is slow on Android. I also played with DzzD’s code, but as he indicated, it needs some smarts to be fast enough. Here is Riven’s benchmark code, modified to show DzzD’s sqrt algorithm:

	public static void testSqr (boolean log) {
		int count = 50000;

		long s, e;

		float[] numbers = new float[count];
		for (int i = 0; i < count; i++)
			numbers[i] = (float)(Math.random() * 65536) + 1;
// for (int i = 0; i < count / 2; i++)
// numbers[i] = (float)(Math.random() * -65536);

		// ensure the JVM doesn't optimize those silly calls away!!
		double[] java_dst = new double[count];
		float[] dzzd_dst = new float[count];
		float[] roquen_dst = new float[count];

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			java_dst[i] = Math.sqrt(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("FloatMath.sqrt:  " + (e - s) / 1000 + "us");

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			roquen_dst[i] = sqrtDzzD(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("sqrtDzzD: " + (e - s) / 1000 + "us");

		if (log) {
			double dzzdAvgErr = 0.0;
			double dzzdMaxErr = 0.0;
			for (int i = 0; i < count; i++) {
				double dzzdErr = Math.abs(Math.sqrt(numbers[i]) - sqrtDzzD(numbers[i]));

				dzzdAvgErr += dzzdErr;
				if (dzzdErr > dzzdMaxErr) dzzdMaxErr = dzzdErr;
			}
			dzzdAvgErr /= count;

			System.out.println("Input: " + numbers[3]);
			System.out.println("DzzD: " + sqrtDzzD(numbers[3]) + ", avg.error=" + dzzdAvgErr + ", max.error=" + dzzdMaxErr);
			System.out.println("FloatMath: " + Math.sqrt(numbers[3]));
			System.out.println("~~~prevent opt. ~~~" + dzzd_dst[13] + "~~~" + java_dst[13] + "~~~" + roquen_dst[13]);
			System.out.println();
		}
	}

	static public float sqrtDzzD (float n) {
		if (n == 0) return 0;
		float r = n * 0.5f;
		float r2 = r * r;
		float d = r * 0.5f;
		float error;
		do {
			r += r2 > n ? -d : d;
			r2 = r * r;
			d *= 0.5f;
			if (d == 0) return r;
			error = r2 - n;
		} while ((error < 0 ? -error : error) > 1);
		return r;
	}

On sqrt I have a couple of questions and maybe I can come up with a different initial guess for a N-R based method.

  1. How many N-R steps can you perform before they equal the speed of the default sqrt? And of 1/sqrt?
  2. If they exist, Math.getExponent and Math.scalb could be used instead of bit inspection, but I’d expect them to be slow as well (worth checking).
  3. Speed of conversion: fp to int and int to fp.
  4. Speed of either leading or trailing zero count of 32-bit ints.
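A rough harness for question 1 might look like the following (hypothetical sketch; the class name and the deliberately crude constant guess are mine, chosen purely to isolate the per-step cost of the refinement — a real version would use a proper seed like the ones later in the thread):

```java
public class NRCost {
    // 1/sqrt via Newton-Raphson with a fixed number of steps.
    // The 0.5f starting guess is a placeholder, not a real seed.
    static float nrInvSqrt(float x, int steps) {
        float r = 0.5f;
        float hx = 0.5f * x;
        for (int i = 0; i < steps; i++)
            r = r * (1.5f - hx * r * r);
        return r;
    }

    public static void main(String[] args) {
        float sum = 0;
        // Time 0..4 refinement steps to see where the crossover is.
        for (int k = 0; k <= 4; k++) {
            long t0 = System.nanoTime();
            for (int i = 1; i < 100000; i++)
                sum += nrInvSqrt(i, k);
            long t1 = System.nanoTime();
            System.out.printf("%d steps: %dus%n", k, (t1 - t0) / 1000);
        }
        // Baseline: the library sqrt.
        long t0 = System.nanoTime();
        for (int i = 1; i < 100000; i++)
            sum += 1f / (float) Math.sqrt(i);
        long t1 = System.nanoTime();
        System.out.printf("Math.sqrt: %dus (%f)%n", (t1 - t0) / 1000, sum);
    }
}
```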

Doesn’t change the bytecode - javac will create double entries in the constant pool even though the constants themselves were floats, because they’re extended before they’re used.
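A quick way to see this (hypothetical snippet, names mine): the float literal is rounded to float precision first and then widened to a double constant by javac, so the resulting constant is not the same double as the full-precision double literal.

```java
public class ConstWidening {
    public static void main(String[] args) {
        double fromFloat  = -2.39e-08f; // float literal, widened at compile time
        double fromDouble = -2.39e-08;  // true double constant

        // Both end up as double entries in the constant pool, but the
        // first one carries only float precision.
        System.out.println(fromFloat == fromDouble); // false
    }
}
```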

Trailing zero count is pretty fast. Code is adapted from http://graphics.stanford.edu/~seander/bithacks.html

private static final int[] trailLut = new int[] {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };

public static int countTrailingBits(int x) {
  return trailLut[((x & -x) * 0x77cb531) >>> 27];
}

Finding the integer log base 2 of an integer is also covered in the page linked, although it’s a bit slower.
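On the JVM side there is also a one-liner via the leading-zero count, which should be fast wherever numberOfLeadingZeros maps to a hardware CLZ instruction (the class name is mine):

```java
public class Log2 {
    // floor(log2(x)) for x > 0, derived from the leading-zero count.
    public static int log2(int x) {
        return 31 - Integer.numberOfLeadingZeros(x);
    }

    public static void main(String[] args) {
        System.out.println(log2(1));    // 0
        System.out.println(log2(16));   // 4
        System.out.println(log2(1000)); // 9
    }
}
```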

I finally got ‘unlazy’ and did a quick web search on Android devices. All that I saw are based on the ARM11, so leading-zero counting is a hardware instruction and should be fast (unless someone forgot to hook it in).

I didn’t get any idea about float support. Not being able to find tech specs is a big pet peeve of mine. The ARM11 does not have an FPU, but ARM provides various FPUs as coprocessors.

I’ll try to throw together a zero-counting-based guess version. (On the log-2: I guess you see where I’m going with the zero counting.)

When you say “forgot to hook it in” - to what? There isn’t a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don’t see why it should be there - the spec surely isn’t designed around a particular CPU?

Re log 2 - yes, it was pretty obvious :slight_smile: I implemented a fixed-point sqrt once, but I can’t remember what I did about the initial guess; just that it involved a lookup table.

That Integer.numberOfLeadingZeros is replaced at link time by a native method, rather than executing the bytecode. I haven’t looked into the guts of Dalvik, but I’d expect it to do this.

for those who need a faster integer sqrt:
http://atoms.alife.co.uk/sqrt/SquareRoot.java
I did not test it under Android. With jre1.6 the performance factor is 2.4 (5.3 with -server)

Didn’t realise it was in Integer. Ooh. I’ll have to look and see what else they added in 1.5.

This is a quick hack to test a workaround for slow floatToRawIntBits on Android. As I don’t have hardware I can’t do timing tests myself. If someone with hardware is willing to test please try timing “getExponent” against “timeMe”.

Has the following methods: getExponent, getSignificand, scalb and a first pass isqrt. These were thrown together so probably have bugs and isqrt is a quick pass of making a reference version (included) not completely dog-slow.


  /**
   * Float value x = f 2<sup>k</sup>, where 1 &le; f &lt; 2 and k is an integer.
   * <p>
   * Returns k.
   * <p>
   * Differs from JDK version in that the exponents of denormals
   * are returned as if representable.
   * @see #getSignificand(float)
   */

  public static int getExponent(float x)
  {
    int bs = 31;

    if (x < 0)
      x = -x;
    
    if (x < Float.POSITIVE_INFINITY) {
      if (x >= 1) {
        while (x > 0x1p30) {
          x  *= 0x1p-30;
          bs += 30;
        }
      }
      else {
        if (x == 0) return 0;

        do {
          x  *= 0x1p30;
          bs -= 30;
        } while(x < 1);
      }
      
      return bs-Integer.numberOfLeadingZeros((int)x);
    }

    // x is +/-infinity or NaN
    return 128;
  }
  


  /**
   * Float value x = f 2<sup>k</sup>, where 1 &le; f &lt; 2 and k is an integer.
   * <p>
   * Returns abs(f).
   * @see #getExponent(float)
   */
  public static float getSignificand(float x)
  {    
    if (x < 0) x = -x;
    
    // infinity and NaN
    
    if (x >= 1) {
      if (x >= 0x1p30) {
        if (x == Float.POSITIVE_INFINITY) return -1;
        do {
          x *= 0x1p-30;
        } while (x > 0x1p30);
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x *= 0x1p30;
      } while(x < 1);
    }
    
    if (x < 2) return x;
    
    // scale to [1,2)
    int bs = Integer.numberOfLeadingZeros((int)x);
    x  *= (1<<bs) * 0x1p-31f;
    
    return x;
  }

  /**
   * Returns: x * 2<sup>n</sup>
   */
  public static float scalb(float x, int n)
  {
    if (n >= 0) {
      while(n > 30) {
        x *= 0x1p30f;
        n -= 30;
      }
      return x * (1 << n);
    }
    
    while(n < -30) {
      x *= 0x1p-30f;
      n += 30;
    }
    
    return x * (1<<(31+n)) * 0x1p-31f;
  }

  // constants for 1st order inverse sqrt guess
  private static final float a0 =  0x1.439cfep+0f;
  private static final float a1 = -0x1.253f20p-2f;

  // constants for 2nd order inverse sqrt guess
  private static final float b0 =  0x1.946338p+0f;
  private static final float b1 = -0x1.7605f6p-1f;
  private static final float b2 =  0x1.2e76d0p-3f;
  
  /** sqrt(2)/2 */
  private static final float sqrt2o2 = 0x1.6a09e6p-1f;
  
  // reference version: shows what's happening in isqrt.
  public static float isqrtRef(float x)
  {
    // get same info as Float.floatToRaw bits
    // (but in fp instead of integer)
    float man = getSignificand(x);
    int   exp = getExponent(x);
    
    float r;

    // make an initial guess:  1 <= man < 2
    // two examples:
  //r =  a0 + a1 * man;
    r =  b0 + man*(b1 + man*b2);
   
    // add in the 1/sqrt(2^(k/2)) term
    if (exp != 0) {
      r = scalb(r,-(exp>>1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }
    
    float hx  = x * 0.5f;

    // perform some number of N-R steps
    r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
    
    return r;
  }
  
  // temp hack for unoptimized isqrt
  private static float scalb_temp_hack(float x, int n)
  {
    if (n >= 0)
      return x * (1 << n);
  
    return x * (1<<(31+n)) * 0x1p-31f;
  }
  
  public static float isqrt(float y)
  {
    float x  = y;
    int   bs = 31;

    // scale if needed to bring into integer range
    if (x >= 1) {
      while (x > 0x1p30) {
        x  *= 0x1p-30;
        bs += 30;
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x  *= 0x1p30;
        bs -= 30;
      } while(x < 1);
    }

    // isolate exponent and significand
    int   lzc = Integer.numberOfLeadingZeros((int)x);
    int   exp = bs-lzc;
    float man = x;
    float hx = 0.5f*y;
    
    if (x >= 2)
      man  *= (1<<lzc) * 0x1p-31f;
    
    // make an initial guess
     float r =  a0 + a1 * man;
   //float r =  b0 + man*(b1 + man*b2);

    // add in the 1/sqrt(2^(k/2)) term
    if (exp != 0) {
      r = scalb_temp_hack(r, -(exp >> 1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }

    // perform some number of N-R steps
    r = r * (1.5f - hx * r * r);
    //r = r * (1.5f - hx * r * r);
    //r = r * (1.5f - hx * r * r);
    
    return r;
  }
   
  @SuppressWarnings("boxing")
  public static float timeSqrt()
  {
    float x,r0=0;
    x = 65536;

    long t0,t1;

    // test some large values
    x  = 0x1.133321p122f;
    t0 = System.nanoTime();
    
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
      r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p122f * 2);

    t1 = System.nanoTime();
    
    System.out.printf("time (big)   = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
    
    // test some small values
    x  = 0x1.133321p-122f;
    t0 = System.nanoTime();
    
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
      r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p-122f * 2);

    t1 = System.nanoTime();
    
    System.out.printf("time (small) = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
    
    return r0;
  }

  public static float timeMe(float f)
  {
    return Float.intBitsToFloat(Float.floatToRawIntBits(f) + 1);
  }

It took me a bit to figure out the 1.6 Math.nextUp call was causing Android to throw VerifyError!

Here are the milliseconds I got on the G1 with 10000 iterations (passing the i to the method):

Maybe FloatMath.sqrt is the fastest we are going to get? I don’t have any proof it is a bottleneck.

[quote]FloatMath.sqrt: 49.102783
isqrt: 134.52148

FloatMath.sqrt: 45.98999
isqrt: 139.4043
[/quote]
What do you guys think of these methods? I forgot where I scrounged them up.

	/**
	 * Fixed point multiply.
	 */
	static public int multiply (int x, int y) {
		long z = (long)x * (long)y;
		return ((int)(z >> 16));
	}

	/**
	 * Fixed point divide.
	 */
	static public int divide (int x, int y) {
		return (int)(((((long)x) << 32) / y) >> 16);
	}

Not sure why you wouldn’t just do:

	/**
	 * Fixed point divide.
	 */
	static public int divide (int x, int y) {
		return (int)((((long)x) << 16) / y);
	}

Not that I tend to favour 16.16 fix-point anyway. 24.8 is usually good enough and most of the time it allows you to fit intermediate results in 32 bits.
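A 24.8 variant of the helpers above might look like this (a sketch; the class and constant names are mine — the long widening in mul is kept for the full range, even though moderate 24.8 operands often fit an int intermediate):

```java
public class Fixed24_8 {
    public static final int ONE = 1 << 8; // 1.0 in 24.8

    // Fixed point multiply (24.8).
    public static int mul(int x, int y) {
        return (int) (((long) x * y) >> 8);
    }

    // Fixed point divide (24.8), same shape as the simplified 16.16 version.
    public static int div(int x, int y) {
        return (int) ((((long) x) << 8) / y);
    }

    public static void main(String[] args) {
        int three = 3 * ONE;
        int half  = ONE / 2;
        System.out.println(mul(three, half));      // 384, i.e. 1.5 in 24.8
        System.out.println(div(three, 2 * ONE));   // 384, i.e. 1.5 in 24.8
    }
}
```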

Thanks pjt33. Your method seems to work just as well and is simpler. :slight_smile:

In general, I haven’t found using fixed point to be any faster than just using floats, even though the G1 doesn’t have an FPU. I only use fixed point with OpenGL ES, which is 16.16. I found any real number crunching is going to have to be native code. I didn’t try using fixed point in native code as it doesn’t seem to be a bottleneck.