Arithmetic Performance

About sqrt:

As for me, it is mainly used to normalize vectors, so I use a lookup table of ints where lookup[n] = 2147483647 / sqrt(n*k) // k is greater than 1 and allows scaling the input range
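Roughly, the idea is something like this (a sketch only; K, the table size and the method names are illustrative, not the actual code):

    // lookup[n] ~ 2147483647 / sqrt(n * K): indexing with lenSq / K gives
    // ~ (2^31 - 1) / length, so a multiply and a shift normalise a component
    // without any floating-point math at runtime.
    static final int K = 64;                 // K > 1, widens the squared-length range the table covers
    static final int TABLE_SIZE = 1 << 16;
    static final int[] lookup = new int[TABLE_SIZE];
    static
    {
        lookup[0] = Integer.MAX_VALUE;       // callers should special-case zero-length vectors
        for (int n = 1; n < TABLE_SIZE; n++)
            lookup[n] = (int) (2147483647.0 / Math.sqrt((double) n * K));
    }

    /** Normalises (x, y) to roughly unit length, result in 16.16 fixed point. */
    static void normalise(int x, int y, int[] out)
    {
        long lenSq = (long) x * x + (long) y * y;
        int idx = (int) Math.min(lenSq / K, TABLE_SIZE - 1);
        int inv = lookup[idx];                        // ~ (2^31 - 1) / length
        out[0] = (int) (((long) x * inv) >> 15);      // (x / length) in 16.16
        out[1] = (int) (((long) y * inv) >> 15);
    }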

There is also an iterative algorithm using a dichotomic (bisection) approach, something like:

    static double approxSqrt(double n, double error)
    {
        double r = n / 2;     // initial guess
        double r2 = r * r;
        double d = r * 0.5;   // step size, halved every iteration
        int loop = 0;
        do
        {
            System.out.println("(" + (loop++) + ") r=" + r);
            if (r2 > n)
                r -= d;
            else
                r += d;
            r2 = r * r;
            d *= 0.5;
        }
        while (Math.abs(r2 - n) > error);
        return r;
    }

output for approxSqrt(100,0.1)

EDIT:

and the same for int:

static int intSQRT(int input)
{
    int r = input >> 1;   // initial guess
    int r2 = r * r;
    int d = r >> 1;       // step size, halved every iteration
    int loop = 0;
    do
    {
        System.out.println("(" + (loop++) + ") r=" + r + " r*r=" + (r * r));
        r += (r2 > input) ? -d : d;
        r2 = r * r;
        d >>= 1;
    }
    while (r2 != input && d != 0);
    return r;
}

output for intSQRT(100)

This is only the basic idea; the first value of r should be optimised using a lookup table or some other way to make it smarter (see the sketch below).
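One possible way to seed r, as a sketch only (using Java's Integer.numberOfLeadingZeros; illustrative, not the optimised version referred to above): start from the first power of two at or above sqrt(input) instead of input/2, so the bisection begins much closer to the root.

    static int intSqrtSeeded(int input)
    {
        if (input < 2) return input;
        // half the bit length of input, rounded up, so that (1 << shift) >= sqrt(input)
        int shift = (32 - Integer.numberOfLeadingZeros(input - 1) + 1) >> 1;
        int r = 1 << shift;
        int d = r >> 1;
        long r2 = (long) r * r;   // long, to dodge overflow for large inputs
        // same bisection as above; like the original it can end one step away from the exact floor
        while (r2 != input && d != 0)
        {
            r += (r2 > input) ? -d : d;
            r2 = (long) r * r;
            d >>= 1;
        }
        return r;
    }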

A better sin minimax example; the maximum error is 0x1.8p-23 at x ~ 1.466838 (again on +/- pi/2):


    float x2, r;   // x assumed already reduced to [-pi/2, pi/2]
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

OK, but then the original Java cos error is greater, no? Also, the version I usually use is based on double, not float (which, on the computer I have tested, runs faster than float).

EDIT :

System.out.println("Math.cos(Math.PI*0.5))="+ Math.cos(Math.PI*0.5)  );

outputs this:
Math.cos(Math.PI*0.5)=6.123233995736766E-17

and it should be 0.0

The “error” you’re seeing is an argument reduction issue: Math.PI/2 is not pi/2. It differs from the true pi/2 by about 6.1e-17, and since cos has slope -1 there, Math.cos(Math.PI*0.5) returns roughly that difference, which matches the 6.123e-17 you printed. (I agree your version makes more sense.) Example:

double d = Math.PI*0.5;
System.out.printf("%s\n", Double.toHexString(Math.cos(d)));
System.out.printf("%s\n", Double.toHexString(Math.cos(Math.nextUp(d))));

Most hardware will execute floats faster than doubles (notable counter-example is intel-a-likes using x87 instead of SSE).

Note, I’m only attempting to compare truncated power series vs. other polynomial approximation methods. A truncated power series ignores finite precision and is centered on a single point, while most numerical-approximation methods take finite precision into account and target a range. I’m ignoring stuff like argument reduction and dealing with any special values (NaN, +/-Infinity, denormals, -zero).
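To make the contrast concrete, here is a sketch (harness and names are mine) of the truncated-power-series counterpart of the minimax polynomial above: the coefficients are the exact 1/n! values rounded to float, so the fit is essentially perfect at 0 and worst at the ends of +/- pi/2, whereas the minimax coefficients are chosen to spread the maximum error evenly over the whole range.

    // Degree-11 truncated Taylor (Maclaurin) series for sin(x) in Horner form,
    // for comparison with the minimax coefficients quoted above.
    static float sinTaylor(float x)
    {
      float x2 = x * x;
      float r = -1.0f / 39916800.0f;   // -1/11!
      r = r * x2 + 1.0f / 362880.0f;   //  1/9!
      r = r * x2 - 1.0f / 5040.0f;     // -1/7!
      r = r * x2 + 1.0f / 120.0f;      //  1/5!
      r = r * x2 - 1.0f / 6.0f;        // -1/3!
      r = r * x2 + 1.0f;
      return r * x;
    }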

Stuck a sqrt(x) and 1/sqrt(x) here: http://www.java-gaming.org/index.php/topic,20997.0.html

Sorry, but I don't agree; IMHO double is probably faster or at least the same speed (and for sure will become faster).

You can check a nice discussion on GameDev about that (back in 2005): http://www.gamedev.net/community/forums/topic.asp?topic_id=343479

According to that discussion, for example, double is native on the Xbox 360.

Most of these comments are ill informed. For example, an SSE machine can perform 4 float ops faster than the 2 double ops that will fit in the same registers. Example: dividing 4x4 floats has a throughput of 36 cycles and 2x2 doubles is 62 (for CPUID 69), so 9 cycles/divide for float and 31 cycles/divide for double. I know of no hardware in which doubles are faster than floats and don’t expect to see it in my lifetime (at least for consumer hardware).

I played with Riven's lookup version and tried a little improvement in accuracy by adding linear interpolation on the results; it is twice as slow but a little more accurate.

I just replaced:

 public static final float cos(float rad)
   {
      return cos[(int) (rad * radToIndex) & SIN_MASK];
   }

by :

public static final float cos(float rad)
   {
      int idx = (int)(rad * radToIndex * 256f);
      float r = (idx & 0xFF) * 0.00390625f;   // fractional part of the index (1/256 steps)
      idx >>= 8;
      float c1 = cos[idx & SIN_MASK];
      float c2 = cos[(idx + 1) & SIN_MASK];
      return c1 * (1f - r) + c2 * r;
   }

and re-ran the benchmark code.

It gives the following results:

before :
time between 290 to 300 us
Riven: -0.78881806, avg.error=6.071037300951922E-4, max.error=0.0022978310394030435

after adding linear interpolation :
time between 550 to 560 us
Riven: 0.9108251, avg.error=4.901244049417835E-4, max.error=7.731781248493941E-4

EDIT: maybe not really useful :slight_smile: but it makes it act as a continuous function rather than a sampled one

Right… forgetting the fact that Java has no SIMD? Besides that, the statement ‘has a throughput of 36 cycles’ makes no sense, at all.

Benchmark to show that float / double performance is nearly identical:


 // needs: import java.util.Arrays; import java.util.Random;
 public static void main(String[] args)
   {
      int elems = 1024;
      int runs = 16;

      Random r = new Random();

      for (int k = 0; k < 8; k++)
      {
         float[] f1 = new float[elems];
         float[] f2 = new float[elems];
         float[] f4x4 = new float[16];
         for (int i = 0; i < 16; i++)
            f4x4[i] = r.nextFloat();
         long float_ms = benchFloat(f1, f2, f4x4, elems, runs);
         System.out.println("float performance: " + float_ms / 1000000L + "ms (midpoint)");

         double[] d1 = new double[elems];
         double[] d2 = new double[elems];
         double[] d4x4 = new double[16];
         for (int i = 0; i < 16; i++)
            d4x4[i] = r.nextDouble();
         long double_ms = benchDouble(d1, d2, d4x4, elems, runs);
         System.out.println("double performance: " + double_ms / 1000000L + "ms (midpoint)");
      }
   }

   public static long benchFloat(float[] f1, float[] f2, float[] mat, int elems, int runs)
   {
      long[] ts = new long[runs];
      for (int i = 0; i < ts.length; i++)
      {
         long a = System.nanoTime();
         for (float t1 = 0.0f; t1 < 1.0f; t1 += 0.01f)
         {
            for (float t2 = 0.0f; t2 < 1.0f; t2 += 0.02f)
            {
               fiddleFloat(t1, t2, elems, f1, f2, mat);
            }
         }
         long b = System.nanoTime();
         ts[i] = b - a;
      }
      Arrays.sort(ts);
      return ts[ts.length / 2];
   }

   public static long benchDouble(double[] d1, double[] d2, double[] mat, int elems, int runs)
   {
      long[] ts = new long[runs];
      for (int i = 0; i < ts.length; i++)
      {
         long a = System.nanoTime();
         for (double t1 = 0.0; t1 < 1.0; t1 += 0.01f)
         {
            for (double t2 = 0.0; t2 < 1.0; t2 += 0.02f)
            {
               fiddleDouble(t1, t2, elems, d1, d2, mat);
            }
         }
         long b = System.nanoTime();
         ts[i] = b - a;
      }
      Arrays.sort(ts);
      return ts[ts.length / 2];
   }

   public static float fiddleFloat(float t1, float t2, int elems, float[] op1, float[] op2, float[] m4x4)
   {
      float sum = 0.0f;
      for (int i = 0; i < elems; i++)
      {
         float f1 = op1[i];
         float f2 = op2[i];
         float diff1 = f2 - f1;
         float f3 = t1 * diff1 + f1;
         float diff2 = f3 - f2;

         sum += t2 * diff2 + f2;
         sum -= m4x4[0] * f1 + m4x4[1] * f2 + m4x4[2] * f3 + m4x4[3];
         sum += m4x4[4] * f1 + m4x4[5] * f2 + m4x4[6] * f3 + m4x4[7];
         sum -= m4x4[8] * f1 + m4x4[9] * f2 + m4x4[10] * f3 + m4x4[11];
      }
      return sum;
   }

   public static double fiddleDouble(double t1, double t2, int elems, double[] op1, double[] op2, double[] m4x4)
   {
      double sum = 0.0;
      for (int i = 0; i < elems; i++)
      {
         double f1 = op1[i];
         double f2 = op2[i];
         double diff1 = f2 - f1;
         double f3 = t1 * diff1 + f1;
         double diff2 = f3 - f2;

         sum += t2 * diff2 + f2;
         sum -= m4x4[0] * f1 + m4x4[1] * f2 + m4x4[2] * f3 + m4x4[3];
         sum += m4x4[4] * f1 + m4x4[5] * f2 + m4x4[6] * f3 + m4x4[7];
         sum -= m4x4[8] * f1 + m4x4[9] * f2 + m4x4[10] * f3 + m4x4[11];
      }
      return sum;
   }


float performance: 51ms (midpoint)
double performance: 52ms (midpoint)
float performance: 75ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 51ms (midpoint)
float performance: 50ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 50ms (midpoint)
float performance: 50ms (midpoint)
double performance: 51ms (midpoint)
float performance: 50ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 72ms (midpoint)

There’s a wide misconception that doubles are somehow ‘better’ than floats, and this is simply not the case. They simply have more precision, and you pay for that in terms of memory usage and (usually) execution time (some ops will have the same execution time for both). It’s true that for some usages the extra precision is required, but the same is true of doubles themselves: sometimes even they aren’t enough. I find this somewhat strange because you don’t find people thinking that 64-bit integers are somehow ‘better’ than lower-bit-width integers; they will happily use the one which is most appropriate for the given usage.

Not at all, they are simply not exposed at the high level. It’s the compiler’s job to use these instructions (the runtime, in the case of a VM). I’d expect vectorized support to improve, as runtimes like Mono have been working on this front. It’ll probably require a new compiler framework, as Sun’s seems to be showing its age (I haven’t paid attention to what the next generation is up to). Both vmkit and Shark/Zero are based on LLVM, so they have “promise” in this respect (although LLVM needs work on auto-vectorization as well). If we had a VM that performed a reasonable amount of vectorization, there would be a huge speed difference between floats and doubles on SIMD hardware. But ignoring SIMDifying, scalar float operations are generally faster than double (and never slower) on all CPU architectures that I’m familiar with. Couple this with the ever-increasing speed gap between main memory and the CPU, and moving data becomes more and more of an issue.

Not sure what doesn’t make sense. These are a measure of the time required for the execution unit to complete the computation.

Your example appears to be memory bound. As a simple counter-example: the speed difference between float and double multiply is much narrower than for divide, but if you take my last minimax sin example and simply change from float to double you’re likely to see a speed difference of 1.2-2.0x depending on your hardware (and assuming the Sun VM).

The question of the performance of float vs. double is really a hardware question (and the indicated thread was about hardware), and my original statement was to that effect. My follow-up was in reply to “IMHO double is probably faster or at least same speed (and for sure will become faster)”, which I claim is incorrect. Is it possible to have two otherwise identical routines where the double version will be faster than the float one? Yes, but that will be in the case where stalls in the double version are being hidden by some other stall (probably a memory read or write) more than in the float version. This will be a rare occurrence and tightly coupled to an exact hardware configuration. This is more of a desktop issue, for CPUs with many functional units.

I’d be happy to be proven wrong with a counter-example or pointer to CPU specification! :wink:

@SIMD

FYI, the Sun JVM doesn’t have any SIMD code. Maybe in System.arraycopy() but that’s it.

Because ‘throughput’ cannot be expressed in ‘cycles’. Throughput is the inverse of the number of cycles it takes to perform some operation.

Appears. Yeah. Vague handwaving is a perfectly solid foundation for an argument. To make all the data fit in cache, reduce the ‘elems’ variable and test again… :slight_smile: I hope I’m not asking too much.

Further, more than 75% of the operations are on [local variables + a 4x4 matrix], if that is going to be memory bound…

Running with 256 ‘elems’ (2x float[256] = 2K RAM and 2x double[256] = 4K RAM)


float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 13ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)

[quote]I’d be happy to be proven wrong with a counter-example or pointer to CPU specification!
[/quote]
I will try to when I get some time :stuck_out_tongue:

The transition from float to double is just the same story as the old transition (before the FPU was born) of the CPU word from 16 bit to 32 bit, and the present/near future is the transition from 32 to 64 bit.

15/20 years ago accessing a 16-bit word was faster, and today it has become slower because all architectures have now reached, and are designed around, 32 bit.

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

The sin core is between 1.5-1.8 times faster in float.
The ortho polynomial is between 1.2-1.5 times faster.


public class Foo
{
  public static float approxF(float x)
  {
    float x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double approxD(double x)
  {
    double x2, r;
    
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double evalD(int n, double x)
  {
    double r = 1;
    double a = 1;
    double b = x;
    int    i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static float evalF(int n, float x)
  {
    float r = 1;
    float a = 1;
    float b = x;
    int   i;
    
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
    
    if (n == 0)
      return a;
    
    return b;
  }

  public static void main(String[] args)
  {
    double x = 1.5707;
    int    i;

    try {
      if (args.length != 0)
	x = Double.parseDouble(args[0]);
    }
    catch(Exception e) {}

    float  f = approxF((float)x);
    double d = approxD(x);
    int    e = 100000;

    long t0 = System.nanoTime();

    for(i=0; i<e; i++)
      d = approxD(d);

    long t1 = System.nanoTime();
    
    for(i=0; i<e; i++)
      f = approxF(f);

    long t2 = System.nanoTime();

    t0 = t1-t0;
    t1 = t2-t1;
    
    System.out.printf("double = %d (x %f)\n", t0, (1.f*t0)/t1);
    System.out.printf("float  = %d\n", t1);

    e = 20000;
    d = evalD(10, d);
    f = evalF(10, f);

    t0 = System.nanoTime();

    for(i=10; i<e; i++)
      d += evalD(i, .332);
    
    t1 = System.nanoTime();
    
    for(i=10; i<e; i++) 
      f += evalF(i, .332f);
    
    t2 = System.nanoTime();
    
    t0 = t1-t0;
    t1 = t2-t1;

    System.out.printf("double = %d (x %f)\n", t0, 1.0*t0/t1);
    System.out.printf("float  = %d\n", t1);

    System.out.printf("%f\n", f);
    System.out.printf("%f\n", d);
  }
}

@DzzD: I started programming in the 8-bit days, but that isn’t my problem. I use doubles and multi-precision elements all the time…when needed.

There are three important trends that form my opinion that doubles are unlikely to outperform floats on consumer hardware in my lifetime. The first two have already been mentioned: SIMD and the speed gap with main memory. The other is die size reduction. Making the channels narrower increases energy consumption (and therefore heat), so it is becoming more and more important to shut down subsystems which are not in use, to reduce heat (and battery drain in the case of notebooks). Computation on doubles requires wider data paths and more stages (to oversimplify slightly).

In real life I do a fair amount of low-level and optimization work, so I read a fair number of CPU design specs and semi-keep up with hardware design research, and I am seeing nothing to contradict this opinion. However, I’ll admit that I never expected to see anything like the new decimal formats from IEEE 754-2008 either.

Do you notice you are mixing doubles and floats here??

Oops. I’d like to say that was on purpose, but it wasn’t. It doesn’t change the timing though.

Should that read IEEE 754-2008?

Yeap, that should be 754. Probably more commonly known by its working title of 754r.

For sqrt, I tried Float.floatToRawIntBits, but it is slow on Android. I also played with DzzD’s code, but as he indicated, it needs some smarts to be fast enough. Here is Riven’s benchmark code, modified to show DzzD’s sqrt algorithm:

	public static void testSqr (boolean log) {
		int count = 50000;

		long s, e;

		float[] numbers = new float[count];
		for (int i = 0; i < count; i++)
			numbers[i] = (float)(Math.random() * 65536) + 1;
// for (int i = 0; i < count / 2; i++)
// numbers[i] = (float)(Math.random() * -65536);

		// ensure the JVM doesn't optimize those silly calls away!!
		double[] java_dst = new double[count];
		float[] dzzd_dst = new float[count];
		float[] roquen_dst = new float[count];

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			java_dst[i] = Math.sqrt(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("FloatMath.sqrt:  " + (e - s) / 1000 + "us");

		s = System.nanoTime();
		for (int i = 0; i < count; i++)
			roquen_dst[i] = sqrtDzzD(numbers[i]);
		e = System.nanoTime();
		if (log) System.out.println("sqrtDzzD: " + (e - s) / 1000 + "us");

		if (log) {
			double dzzdAvgErr = 0.0;
			double dzzdMaxErr = 0.0;
			for (int i = 0; i < count; i++) {
				double dzzdErr = Math.abs(Math.sqrt(numbers[i]) - sqrtDzzD(numbers[i]));

				dzzdAvgErr += dzzdErr;
				if (dzzdErr > dzzdMaxErr) dzzdMaxErr = dzzdErr;
			}
			dzzdAvgErr /= count;

			System.out.println("Input: " + numbers[3]);
			System.out.println("DzzD: " + sqrtDzzD(numbers[3]) + ", avg.error=" + dzzdAvgErr + ", max.error=" + dzzdMaxErr);
			System.out.println("FloatMath: " + Math.sqrt(numbers[3]));
			System.out.println("~~~prevent opt. ~~~" + dzzd_dst[13] + "~~~" + java_dst[13] + "~~~" + roquen_dst[13]);
			System.out.println();
		}
	}

	static public float sqrtDzzD (float n) {
		if (n == 0) return 0;
		float r = n * 0.5f;
		float r2 = r * r;
		float d = r * 0.5f;
		float error;
		do {
			r += r2 > n ? -d : d;
			r2 = r * r;
			d *= 0.5f;
			if (d == 0) return r;
			error = r2 - n;
		} while ((error < 0 ? -error : error) > 1);
		return r;
	}
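For reference, a Float.floatToRawIntBits-based version is usually some variant of the classic bit-trick reciprocal square root; this is a sketch only (not necessarily the exact code that was timed on Android), with sqrt(x) then being x * invSqrt(x):

    // Bit-trick 1/sqrt(x) (the well-known 0x5f3759df magic-constant method)
    // plus one Newton-Raphson refinement. Ignores NaN, infinities, zero and
    // negative inputs.
    static public float invSqrt (float x) {
        int i = Float.floatToRawIntBits(x);
        i = 0x5f3759df - (i >> 1);            // crude guess from the exponent/mantissa bits
        float y = Float.intBitsToFloat(i);
        y = y * (1.5f - 0.5f * x * y * y);    // one N-R step; add another for more accuracy
        return y;
    }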

On sqrt I have a couple of questions, and maybe I can come up with a different initial guess for an N-R based method (a sketch of the kind of thing I mean follows the list).

  1. How many N-R steps can you perform before they equal the speed of the default sqrt? and 1/sqrt?
  2. If Math.getExponent and Math.scalb exist (and are usable on Android), they could be used instead of bit inspection, but I’d expect them to be slow as well (worth checking)
  3. Speed of conversion: fp to int and int to fp.
  4. Speed of either leading or trailing zero count of 32-bit ints.
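On point 2: Math.getExponent and Math.scalb do exist (since Java 6); whether Android implements them quickly is exactly what would need measuring. A sketch of the kind of guess-plus-N-R routine these questions are pointing at (names and step count are illustrative only):

    // Initial guess 2^(-ceil(e/2)) is within a factor of sqrt(2) of 1/sqrt(x),
    // which is inside the Newton-Raphson convergence basin; each step then
    // refines it. Ignores x <= 0, NaN and infinities.
    static float invSqrtScalb(float x, int nrSteps) {
      int e = Math.getExponent(x);                  // unbiased exponent: x = m * 2^e, m in [1, 2)
      float y = Math.scalb(1.0f, -((e + 1) >> 1));  // 2^(-ceil(e/2))
      for (int i = 0; i < nrSteps; i++)
        y = y * (1.5f - 0.5f * x * y * y);          // N-R step for y -> 1/sqrt(x)
      return y;
    }

Question 1 is then just how large nrSteps can be before Math.sqrt (or 1.0/Math.sqrt) wins.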

Doesn’t change the bytecode - javac will create double entries in the constant pool even though the constants themselves were floats, because they’re extended before they’re used.

Trailing zero count is pretty fast. Code is adapted from http://graphics.stanford.edu/~seander/bithacks.html

private static final int[] trailLut = new int[] {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };

public static int countTrailingBits(int x) {
  // de Bruijn multiply-and-lookup from the page above; note x == 0 maps to 0 here
  return trailLut[((x & -x) * 0x77cb531) >>> 27];
}

Finding the integer log base 2 of an integer is also covered in the page linked, although it’s a bit slower.
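For completeness, the integer log-2 from the same page looks like this in Java (smear the top bit down, then the same de Bruijn multiply-and-lookup; the table and constant are taken from the bithacks page). Integer.numberOfLeadingZeros / numberOfTrailingZeros have also existed since Java 5 and may be intrinsified by the VM, so they are worth benchmarking against any hand-rolled version:

    private static final int[] log2Lut = new int[] {
        0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
        8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };

    // floor(log2(x)) for x > 0: fill everything below the highest set bit,
    // then de Bruijn multiply + lookup.
    public static int log2(int x) {
      x |= x >> 1;
      x |= x >> 2;
      x |= x >> 4;
      x |= x >> 8;
      x |= x >> 16;
      return log2Lut[(x * 0x07C4ACDD) >>> 27];
    }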

I finally got ‘unlazy’ and did a quick web search on Android devices. All that I saw are based on the ARM-11, so leading-zero counting is a hardware instruction and should be fast (unless someone forgot to hook it up).

I didn’t get any idea about float support. Not being able to find tech specs is a big peeve of mine. The ARM-11 does not have an FPU, but ARM provides various FPUs as coprocessors.

I’ll try to throw together a zero-count-based guess version. (On the log-2: I guess you see where I’m going with the zero counting.)