Floating-point performance

I tried implementing frame scaling with floats, just to see how it would perform. It was funny: I disabled it after a few tenths of a second and tried it with doubles. That was still 2x slower than packed signed bytes in ints, but it was considerably faster than float. I looked at some of my old benchmarks, and wow, float arithmetic is slower than double. It doesn’t matter to me; doubles are much more precise than floats and more useful for calculations, but I remember something from the XITH3D manual: “We are using floats, because they are faster…”
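A quick illustration of the precision gap, in case anyone wants to see it (just a sketch):

```java
public class PrecisionDemo {
    public static void main(String[] args) {
        // a float has a 24-bit significand, so 2^24 + 1 = 16777217
        // is not representable and rounds to 16777216
        float f = 16777217f;
        // a double has a 53-bit significand and stores it exactly
        double d = 16777217d;
        System.out.println(f == 16777216f); // true: the float lost the last bit
        System.out.println(d == 16777216d); // false: the double kept it exact
    }
}
```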

So I wonder: is there a reason for float’s performance, or is it just a leftover in 5.0?

Most CPUs these days support double operations in silicon, but not float operations. Therefore float operations get converted to doubles internally and the results converted back to floats, which takes some time (not much, though).

The only factor in favour of floats these days is the lower memory consumption for applications with large floating point arrays.
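A back-of-the-envelope sketch of that difference (element payload only; array object headers and alignment add a bit more):

```java
public class ArrayMemoryDemo {
    public static void main(String[] args) {
        int n = 1000000;
        // per-element payload only: 4 bytes per float, 8 bytes per double
        long floatBytes  = (long) n * (Float.SIZE / 8);
        long doubleBytes = (long) n * (Double.SIZE / 8);
        System.out.println("float[]  of " + n + ": " + floatBytes + " bytes");
        System.out.println("double[] of " + n + ": " + doubleBytes + " bytes");
    }
}
```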

Also, with a platform-independent system like Java, I wouldn’t bet on any implementation-specific performance paradigms such as ‘always use floats’ (or ‘always use doubles’, for that matter, or the ‘avoid object creation at all costs’ which was popular some while ago). The various VM implementations are always good for bizarre surprises performance-wise, and things that work splendidly on one platform fail abysmally on another.

[quote]Most CPUs these days support double operations in silicon, but not float operations.
[/quote]
:o

… really? Doubles are 64 bits.

The main reason for using floats is bus bandwidth when you’re blasting things down to a graphics card. For calculations, doubles may well be faster on current VM/CPU architectures. Dunno about the Mac, mind you.
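To put numbers on that: a rough sketch of the buffer sizes involved when packing vertex data as floats (no actual GL calls here, just the NIO side that GL bindings typically consume; the names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class VertexBufferDemo {
    public static void main(String[] args) {
        int vertices = 1000;
        // 3 coordinates per vertex, 4 bytes per float:
        // half the bytes a double-based buffer would push over the bus
        ByteBuffer bb = ByteBuffer.allocateDirect(vertices * 3 * 4)
                                  .order(ByteOrder.nativeOrder());
        FloatBuffer fb = bb.asFloatBuffer();
        for (int i = 0; i < vertices; i++) {
            fb.put(i * 0.1f).put(i * 0.2f).put(0f); // made-up coordinates
        }
        System.out.println(fb.position() * 4 + " bytes queued"); // 12000 bytes
    }
}
```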

Cas :slight_smile:

Interesting!

Seems like my last remaining preemptive optimisation (“Of course I’ll speed up my app by using floats!”) has been shot down as well.

Actually, Intel’s CPUs use 80 bits internally. They have floating-point registers aliased with the MMX registers, arranged in a revolver-like (stack) structure. (This doesn’t mean you could use assembly and get a full 80 bits of precision; the lowest bits are just there to avoid bad rounding errors.) So conversion to and from floats shouldn’t be as bad as 120/40, or is it?

This reminds me that I should look up FP numbers on the NVIDIA FX5700 somewhere. It has some internal support for them, but I forgot their precision and maximum range. Are they in 10 - 6 format, or some variation on the FP32 format?
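If they follow the usual half-float layout (s10e5: 1 sign, 5 exponent, 10 mantissa bits), the maximum representable value works out to 65504, which is easy to check (sketch):

```java
public class HalfFloatDemo {
    public static void main(String[] args) {
        // assuming the common s10e5 half-float format:
        // largest normal value = (2 - 2^-10) * 2^15
        double maxHalf = (2.0 - Math.pow(2, -10)) * Math.pow(2, 15);
        System.out.println(maxHalf); // 65504.0
    }
}
```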

Average taken over 100 loops of a test run of 1,000,000 calculations of:

result = value * result
result = value / result

Tested with two values (2 runs): 0.00001 and 0.234323, of the type being tested. There was no difference in timing between the two values. Times are in milliseconds on an AMD 2600.

Java 1.4.2

** Float Test **
1000000 divisions: average time 12 ms
1000000 multiplications: average time 6 ms

** Double Test **
1000000 divisions: average time 13 ms
1000000 multiplications: average time 6 ms

Java 1.5

** Float Test **
1000000 divisions: average time 11 ms
1000000 multiplications: average time 6 ms

** Double Test **
1000000 divisions: average time 13 ms
1000000 multiplications: average time 6 ms

The results seem clear to me… Even if doubles were faster (which, at least in Java, doesn’t seem to be the case: division was slower and multiplication was the same), floats would still be better for 3D graphics anyway, because they are half the amount of data to transfer over the bus.

In C/C++ the preference for floats over doubles comes strictly from the bus overhead of transferring doubles around. A lot of the push comes from nVidia, which is a strong proponent of floats over doubles (or even half-floats over floats). The performance point seems to be even more true in Java.

Sorry, but I don’t believe your results.
Not because you are intentionally trying to give bad results, but because microbenchmarks can’t be trusted when so little information is given about them.

I’m going to try my own results later on to see if I can get faster float performance vs double performance on my A64 3K+.

[quote]Sorry, but I don’t believe your results.
Not because you are intentionally trying to give bad results, but because microbenchmarks can’t be trusted when so little information is given about them.

I’m going to try my own results later on to see if I can get faster float performance vs double performance on my A64 3K+.
[/quote]
It’s not a [edit]bad[/edit] microbenchmark. I did 100 million calculations in each loop before determining the final times using another run of 100 million calcs, ran the test several times, and varied the values to calculate, with identical results… but run your own.

I’m seeing some weird floating point benchmark results on my Athlon 1.4 GHz using Java 5 -server. I’m benchmarking a very simple loop using floats, doubles and fixed point. The fixed point loop is as fast as similar C code compiled with GCC or Visual Studio, but the floating point math is running at half the speed of similar C code. That is the same speed as when “Strict” floating point is turned on in Visual Studio. Here’s the loop I’m benchmarking:


private static final float x = 0.7456f;
private static final float y = 0.97543f;
private static final int count = 100000;
private static float f1;

void runTest() {
    float a = x;
    float b = y;

    for (int i = 0; i < count; i++) {
        b = a * b + a;
    }

    f1 = b;
}

The same code using doubles is even slower. Running the benchmark in JRockit produces the same results as compiled C code, i.e. about twice as fast as Hotspot server. Why can’t Hotspot optimize the Java code as well as a C compiler?

100k iterations is very few. The overhead of compiling that into native code (if it even bothers doing so… enable some profiling) will take up a large percentage of the total time spent in that loop.

point being:
It quite possibly already compiles that loop into native code as fast as a C compiler, but your microbenchmark is broken.
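A common fix is an explicit warm-up phase so the JIT has compiled the loop before you start timing; roughly like this (a sketch, with made-up iteration counts):

```java
public class WarmupDemo {
    static float f1;

    static void runTest() {
        float a = 0.7456f, b = 0.97543f;
        for (int i = 0; i < 100000; i++) {
            b = a * b + a;
        }
        f1 = b;
    }

    public static void main(String[] args) {
        // warm-up: run the code enough times that the JIT compiles it
        // before any measurement is taken
        for (int i = 0; i < 1000; i++) {
            runTest();
        }
        // only now start the clock
        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            runTest();
        }
        System.out.println((System.nanoTime() - start) / 1000 + " ns per call");
    }
}
```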

Doubles have higher performance.
Replace the variables used in the calculation with ‘float’ to bench the float performance.

Float performance = 22 seconds.
Double performance = 13 seconds.

I ran the bench a few times for consistency.

Athlon64 3000+ @ 2109.34MHz, 1GB RAM PC 3600, Nforce 3 mobo.


private static final double x = 0.7456f;
private static final double y = 0.97543f;
private static final long count = 1000000000;
private static double f1;

/**
 * @param args
 */
public static void main(String[] args) {
    Main main = new Main();
    double fps, start;

    start = System.nanoTime();
    main.runTest();
    fps = (System.nanoTime() - start) / 1e9;

    System.out.println(fps);
}

void runTest() {
    double a = x;
    double b = y;

    for (int i = 0; i < count; i++) {
        b = (a * b + a) / 2;
    }

    f1 = b;
}

[quote]Doubles have higher performance.
Replace the variables used in the calculation with ‘float’ to bench the float performance.

Float performance = 22 seconds.
Double performance = 13 seconds.

I ran the bench a few times for consistency.
[/quote]
Try starting your app with -Xcompile and see what happens to your results…
You are obviously measuring a difference in Hotspot’s handling of floats vs. doubles here, not a real performance advantage of doubles.

With -Xcompile:

Float: 12.944601059 seconds.
Double: 13.299814082 seconds.

[quote]100k iterations is very few. The overhead of compliling that into native code (if it even bothers doing so… enable some profiling) will take up a large percentage of the total time spent in that loop.

point being:
It quite possibly already is compiling that loop into native code as fast as a c compiler, but your microbenchmark is broken.
[/quote]
It’s not broken; on the contrary, I’m quite certain that my benchmark is correct. The code I posted is the loop I’m benchmarking, not the entire benchmark application. I do a 10 s warm-up and then 10 s of benchmarking.

My tests seem to indicate that there is a flaw in the Hotspot optimizer.

s/your microbenchmark is broken/the code you posted is broken/

:wink:

When I run a simple micro benchmark on a P4, double performance is consistently slightly slower after warm-up.

It could be that this is an Athlon-only issue though (I’ve seen cases before where some numeric operations run slower than on a P4 because of some P4 specific shortcuts which are impossible on an Athlon), but I have the feeling the results are misleading somehow. I have to test at home (where I have an Athlon too).

Could you post the entire benchmark?

[quote] Why can’t Hotspot optimize the Java code as well as a C compiler?
[/quote]
It can. As a matter of fact, I once converted a little C/ASM fractal program to Java, and the Java version ran as fast as the ASM version of the program and even faster than the C-compiled one. It surprised me almost as much as the author of the original program, who wanted to show how much faster C is compared to Java. :slight_smile:

Ok, I’ve put together the entire benchmark into a single class (see below). On Java 5 -server I get:

Float: 776 iterations / s
Double: 642 iterations / s
Fixed point: 1793 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

With JRockit I get:

Float: 1572 iterations / s
Double: 1575 iterations / s
Fixed point: 1800 iterations / s
2.9308174, 2.8822784, 2.9308173801885786

Note the 2x speed improvement in the float and double benchmarks. I get the same scores for a similar C test compiled with GCC or Visual Studio.

It’s a VERY simple loop to optimize (fmul, fadd, jump in the assembler output from GCC), so it’s really surprising that Hotspot can’t optimize it properly. You’re more than welcome to try to tweak the code to make it run fast in Hotspot.


public class MathTest {
  private static final float x = 0.7456f;
  private static final float y = 0.97543f;
  private static final int count = 100000;
  private static float f1;
  private static float f2;
  private static double f3;

  private static void run(String text, Runnable runnable) {
    long time = System.currentTimeMillis();

    while (System.currentTimeMillis() - time < 10000) {
      runnable.run();
    }

    time = System.currentTimeMillis();
    long count = 0;

    while (System.currentTimeMillis() - time < 10000) {
      runnable.run();
      count++;
    }

    System.out.println(text + ": " + count * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
  }

  public static void main(String[] args) {
    run("Float", new Runnable() {
      public void run() {
        float a = x;
        float b = y;

        for (int i=0; i<count; i++) {
          b = a * b + a;
        }

        f1 = b;
      }
    });
    run("Double", new Runnable() {
      public void run() {
        double a = x;
        double b = y;

        for (int i=0; i<count; i++) {
          b = a * b + a;
        }

        f3 = b;
      }
    });
    run("Fixed point", new Runnable() {
      public void run() {
        int a = (int) (x * 65536.0f);
        int b = (int) (y * 65536.0f);

        for (int i=0; i<count; i++) {
          b = (a >> 8) * (b >> 8) + a;
        }

        f2 = (float) b / 65536.0f;
      }
    });

    System.out.println(f1 + ", " + f2 + ", " + f3);
  }
}
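(Side note on the fixed-point loop: the “(a >> 8) * (b >> 8)” pattern is the usual way to multiply two 16.16 fixed-point numbers without overflowing 32 bits; it approximates “(a * b) >> 16”. A minimal sketch:)

```java
public class FixedPointDemo {
    public static void main(String[] args) {
        // 16.16 fixed point: the real value times 65536, stored in an int
        int a = 3 << 16;            // represents 3.0
        int b = 2 << 16;            // represents 2.0
        // shifting each operand right by 8 before multiplying keeps the
        // product within 32 bits; the result is again in 16.16 format
        int product = (a >> 8) * (b >> 8);
        System.out.println(product / 65536.0); // 6.0
    }
}
```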

Your benchmark gave these results on my Athlon 2200 on the server VM:
Float: 2230 iterations / s
Double: 941 iterations / s
Fixed point: 2553 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

A slightly modified version of your benchmark gave these results (after some warm-up):
Float : 6690 iterations / s
Double : 3975 iterations / s
Fixed : 27322 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

So whatever is causing this difference, I don’t know, but the only thing you can conclude from your benchmark is that JRockit optimizes it better (for whatever that’s worth).

The modified (uglified ;D) version:


public class MathTest2 {
    private float x = 0.7456f;
    private float y = 0.97543f;
    private int count = 100000;
    private float f1;
    private float f2;
    private double f3;

    public void runFloat() {
        long time = System.currentTimeMillis();
        int iterations = 0; // local counter; must not shadow the 'count' field the inner loop reads

        while (System.currentTimeMillis() - time < 10000) {
            float a = x;
            float b = y;

            for (int i = 0; i < count; i++) {
                b = a * b + a;
            }

            f1 = b;
            iterations++;
        }
        System.out.println(
            "Float : " + iterations * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
    }

    public void runDouble() {
        long time = System.currentTimeMillis();
        int iterations = 0;

        while (System.currentTimeMillis() - time < 10000) {
            double a = x;
            double b = y;

            for (int i = 0; i < count; i++) {
                b = a * b + a;
            }

            f3 = b;
            iterations++;
        }
        System.out.println(
            "Double : " + iterations * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
    }

    public void runFixed() {
        long time = System.currentTimeMillis();
        int iterations = 0; // counts outer iterations; incrementing the 'count' field here would corrupt the loop bound

        while (System.currentTimeMillis() - time < 10000) {
            int a = (int) (x * 65536.0f);
            int b = (int) (y * 65536.0f);

            for (int i = 0; i < count; i++) {
                b = (a >> 8) * (b >> 8) + a;
            }

            f2 = (float) b / 65536.0f;
            iterations++;
        }
        System.out.println(
            "Fixed : " + iterations * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
    }

    public static void main(String[] args) {
        MathTest2 m = new MathTest2();

        for (int i = 0; i < 10; i++) {
            m.runFloat();
            m.runDouble();
            m.runFixed();

            System.out.println(m.f1 + ", " + m.f2 + ", " + m.f3);

            m.f1 = m.f2 = 0;
            m.f3 = 0;
        }

        System.out.println("----------------");

        m.runFloat();
        m.runDouble();
        m.runFixed();

        System.out.println(m.f1 + ", " + m.f2 + ", " + m.f3);
    }
}