So now we're on a microbenchmarking spree

After reading the bad trig scores, I tried to create a little class for faster math using look-up tables and floats (currently only sin, cos, tan). And I tried to benchmark it in different ways: using the client and server VMs, with -Xcomp and without.

The results in ms:

Server with -Xcomp:
TOTAL: 35501

Server without -Xcomp:
TOTAL: 19258

Client with -Xcomp:
TOTAL: 22102

Client without -Xcomp:
TOTAL: 20350

I found these results very surprising indeed.
The results are a mix between java.lang.Math and my own FMath class. Interestingly, java.lang.Math executes faster on the client than on the server JVM here but my FMath class is faster on the server.

Now before you all tell me “it’s a microbenchmark and you have to know what you’re measuring”, you’re probably right, although I don’t rule out that I’m dealing with one or more deficiencies, especially in the server VM.
So the real question here is: what am I measuring? Is there a way to know at all, without knowing the intimate dirty secrets of HotSpot?
I mean, if there isn’t, microbenchmarking is absolutely useless unless you’re a HotSpot developer (which is probably old news). Which is too bad, because microbenchmarking can IMHO be very useful for measuring your own code.
Even macro benchmarks seem hard to draw any reasonable conclusions from then…

(I’ll post the code in a minute)

Erik


public class Test {

    public static void main(String[] args) {
        FMath.init();
        
        long totalStart = System.currentTimeMillis();
        
        for (int ii = 0; ii < 20; ii++) {
              System.out.println("FMath");
              long start = System.currentTimeMillis();
              
              float result = 0;
              
              for (int i = 0; i < 1000000; i++) {
                  float r = (float)(i / 1000000f) * FMath.PI_2;
                  result += FMath.sin(r);
                  result += FMath.cos(r);
                  result += FMath.tan(r);
              }
              
              System.out.println(result);
              System.out.println(System.currentTimeMillis() - start);
              
              System.out.println("Math");
              
              start = System.currentTimeMillis();
              
              result = 0;
              
              for (int i = 0; i < 1000000; i++) {
                  float r = (float)(i / 1000000f) * FMath.PI_2;
                  result += Math.sin(r);
                  result += Math.cos(r);
                  result += Math.tan(r);
              }
              
              System.out.println(result);
              System.out.println(System.currentTimeMillis() - start);   

              
        }
        System.out.println("***TOTAL: " + (System.currentTimeMillis() - totalStart));
    }
}

public class FMath {
    
    public static int PRECISION = 0x100000;
    public static final float PI = (float)java.lang.Math.PI;
    
    public static final float PI_2 = PI*2;
    
    private static float RAD_SLICE = PI_2 / PRECISION;
    
    private static float[] sinTable;
    private static float[] cosTable;
    private static float[] tanTable;
        
    public static void init() {
        RAD_SLICE = PI_2 / PRECISION;
        sinTable = new float[PRECISION];
        cosTable = new float[PRECISION];
        tanTable = new float[PRECISION];
        for (int i = 0; i < PRECISION; i++) {
            float rad = (float)i * RAD_SLICE;
            sinTable[i] = (float)java.lang.Math.sin(rad);
            cosTable[i] = (float)java.lang.Math.cos(rad);
            tanTable[i] = (float)java.lang.Math.tan(rad);
        }
    }
    
    private static final int radToIndex(float radians) {
        // Relies on PRECISION being a power of two so the mask
        // wraps the index into [0, PRECISION).
        //return (int)(((radians % PI_2)/PI_2) * PRECISION);
        return (int)((radians / PI_2) * (float)PRECISION) & (PRECISION-1);
    }
    
    public static float sin(float radians) {
        return sinTable[radToIndex(radians)];
    }

    public static float cos(float radians) {
        return cosTable[radToIndex(radians)];
    }
    
    public static float tan(float radians) {
        return tanTable[radToIndex(radians)];
    }
    
}
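For anyone trying this out, a quick way to see how much accuracy the table trades away is to scan a whole period and record the worst-case difference against java.lang.Math. This is a standalone sketch: the class name TableAccuracy and the smaller table size are invented here, but the index mapping mirrors FMath's.

```java
// Standalone sketch: measures worst-case error of an FMath-style
// sin lookup table against java.lang.Math.sin over one period.
class TableAccuracy {
    static final int PRECISION = 0x10000;            // must be a power of two
    static final float PI_2 = (float) (Math.PI * 2);
    static final float[] SIN = new float[PRECISION];

    static {
        for (int i = 0; i < PRECISION; i++) {
            SIN[i] = (float) Math.sin(i * (PI_2 / PRECISION));
        }
    }

    static float sin(float radians) {
        // Same index mapping as FMath: scale into table range, mask to wrap.
        return SIN[(int) ((radians / PI_2) * PRECISION) & (PRECISION - 1)];
    }

    static float maxError(int samples) {
        float worst = 0f;
        for (int i = 0; i < samples; i++) {
            float r = (i / (float) samples) * PI_2;
            float err = Math.abs(sin(r) - (float) Math.sin(r));
            if (err > worst) worst = err;
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max error: " + maxError(1000000));
    }
}
```

With a 65536-entry table the worst-case error should land around the table step (roughly 1e-4), which is plenty for most game code.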

But again, the real question is not what’s wrong with the code but is there a way to learn how you should draw conclusions from java benchmarks…

If you’ve got the time would you write a JNI version of FMath that calls #asm functions and see how that works out? :wink:

Oh and here’s something that might explain the server result - once upon a time when I wrote some terrain demo or other it ran incredibly slowly and I couldn’t figure out why. Turns out that it was using strictfp but I can’t see any way from the command line of telling the VM which maths to use.

Cas :slight_smile:

[quote]If you’ve got the time would you write a JNI version of FMath that calls #asm functions and see how that works out?
[/quote]
Well, it’s better to fix performance problems ourselves than to wait for Sun for faster math. Sun’s got fair reasons for Math to be dog slow but unfortunately that doesn’t help us at all. We don’t need the accuracy we just want raw speed. So maybe, yeah I will. LWJBM (Light Weight Java BadMath)? ;D

But I’m wondering how I should benchmark it :wink: :slight_smile:

Suggestions…

  1. Run a “warm up” phase…
    This performs the benchmark >50,000 times (I use 100,000 when I can wait) exactly as it will run in the final benchmark. Usually, to guarantee this, the benchmark meat is put in a method that is called in a loop.

  2. Create NO objects in the benchmark. Unless of course you are benchmarking object creation/GC stuff.

  3. To find true deltas between alternate tests, first time the test with no operations (or the minimum) and use that as the baseline cost of calling the method and moving data. If this baseline ends up being 0, then it was optimized out and the benchmark is too simple.

  4. Pre-compute test data and store it in a table. This keeps the test as focused as possible and prevents the data-generation code from being optimized out, because the compiler cannot know how the table will change, so it must fetch the data each time.

  5. Use the unofficial high-resolution timer, sun.misc.Perf, for timing.

  6. Make SURE nothing else is running on your system. Alternatively, set the Java process priority to highest.

Remember the goal is to test your code, not the GC or JIT.

If I think of more… :slight_smile:

  7. Don’t System.out until the end of each benchmark. System.out.println generates garbage for collection. Sometimes you can get by because the GC won’t happen during the benchmark, but be aware that it could.

  8. Wait/Sleep. This is one I do, but I’m not positive it’s needed now. Perhaps Jeff or others can verify.
  I put waits/sleeps in between tests so the VM/GC/JIT gets execution time for whatever it wants, so it doesn’t have to grab it in the middle of some other library call. Of course the VM can’t just block your code anywhere, but in large benchmarks with libraries you never know where a pause is lying in wait, so I give the VM plenty of time and chance to do what it wants outside of my tests.
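Suggestions 1–3 above can be folded into a tiny harness. This is a sketch, not a definitive implementation: the Workload interface and class names are invented here, and it uses System.nanoTime as a stand-in for the unofficial sun.misc.Perf counter (nanoTime is the supported high-resolution timer in Java 5 and later).

```java
// Sketch of a harness applying suggestions 1-3: warm up first, allocate
// nothing in the timed loop, and subtract a do-nothing baseline.
class MicroBench {
    // "Workload" is invented for this sketch; real code would call
    // FMath.sin etc. directly.
    interface Workload { float run(float x); }

    static float sink; // keeps results live so the JIT can't discard the loop

    static long time(Workload w, int iterations) {
        float acc = 0f;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            acc += w.run(i * 1e-6f);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc; // defeat dead-code elimination
        return elapsed;
    }

    static long measure(Workload w, int warmup, int iterations) {
        time(w, warmup);                      // 1. warm-up, same code path as the real run
        Workload nop = new Workload() {       // 3. baseline: loop + dispatch cost only
            public float run(float x) { return x; }
        };
        time(nop, warmup);
        long baseline = time(nop, iterations);
        long test = time(w, iterations);
        return Math.max(0L, test - baseline); // delta attributable to the workload itself
    }

    public static void main(String[] args) {
        Workload sin = new Workload() {
            public float run(float x) { return (float) Math.sin(x); }
        };
        System.out.println("Math.sin delta: " + measure(sin, 100000, 1000000) + " ns");
    }
}
```

Note the static `sink` field: writing the accumulated result somewhere visible is the cheap way to stop the JIT from proving the loop dead and removing it (suggestion 3's "baseline of 0" symptom).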

Thanks. Those seem good suggestions. :slight_smile:

BTW, anyone got a clue why in this test:
a) java.lang.Math performs faster on the client? (might have something to do with strictfp, but why does this then only affect the server?)
b) -Xcomp slows things down? (especially on the server)
…or so it seems anyway ;)…

Anyway, I’ll do some more testing using Shawn’s suggestions, and if I can really pin the problem down to a server deficiency in Math (which it now looks like), I’ll file a bug report.

Hmmm… Just tested the benchmark at home using 1.4.1_01 and the results were as expected: the server being slightly faster than the client (without -Xcomp, didn’t test with it).

Could maybe someone else with 1.4.2_03 run this benchmark and look if the Math output on the server is also slower? Maybe it’s some weird local problem on my machine although that machine is just 3 days old and everything freshly installed…

Erik

on 1.4.1, -Xcomp on the server still seriously degrades performance in my benchmark… :-/ (>twice as slow)

These don’t surprise me at all; it looks exactly like what I would expect. Am I missing something?

I assume bigger is better. If not, your results look 100% backwards and I would check that assumption.

-Xcomp is always better than without it, on either VM.

WITHOUT -Xcomp, the client hits its compile threshold sooner and thus produces better numbers.

Just shows the real need to properly warm up your VM.

[quote]These don’t surprise me at all; it looks exactly like what I would expect. Am I missing something?

I assume bigger is better. If not, your results look 100% backwards and I would check that assumption.

-Xcomp is always better than without it, on either VM.

WITHOUT -Xcomp, the client hits its compile threshold sooner and thus produces better numbers.

Just shows the real need to properly warm up your VM.
[/quote]
Huh? Maybe you are reading the numbers backwards - lower is better (right?)
WITHOUT -Xcomp it reaches the compile threshold SOONER?? I thought with -Xcomp it hit the compile threshold IMMEDIATELY.

The thing that was weird is:

Server with -Xcomp:
TOTAL: 35501

is slower than

Server without -Xcomp:
TOTAL: 19258

And that the server VM is slower than the client VM with -Xcomp… unless the actual compile times are taking up (too much) time in the benchmark… then I would expect the server VM to take more time optimizing than the client VM… and yet the code likely won’t come out that different.

Could that be it or am I missing something?

Sorry, I wasn’t clear. If both server and client are run without -Xcomp, the client will hit its threshold sooner and perform better on short-run tests. That was my point.

As I say, if bigger is better the numbers are exactly the relationship I expect. If smaller is better the numbers are entirely BACKWARDS, which makes me question that assumption.

Guess I should try to spring some cycles to look at the benchmark itself if I can.

Hmm, you’re right about the measurement. That does make it odd, but I’m sure it’s explainable.

First thing is to take the loop out of main. Having the loop in main means on-stack replacement (OSR), which can cause odd things to happen benchmark-wise, as we’ve already seen.

Second thing to do is to call the test multiple times. I wouldn’t trust any number until you’ve seen it settle into returning the same number (or within a few hundred ms of the same number) a few times. This will ensure you aren’t seeing any compile times or other warm-up issues in your numbers.

The test is already running 20 times. After 3 runs or so, the numbers become stable. I’ll try to get the test out of main and see if it helps.

Ok, changed the Test to this:


public class Test {
    
    static final int PRECISION = 1000000;
    
    private void fmath() {
        System.out.println("FMath");
        long start = System.currentTimeMillis();
        
        float result = 0;
        
        for (int i = 0; i < 1000000; i++) {
            float r = (float)(i / 1000000f) * FMath.PI_2;
            result += FMath.sin(r);
            result += FMath.cos(r);
            result += FMath.tan(r);
        }
        
        System.out.println(result);
        System.out.println(System.currentTimeMillis() - start);
        
    }
    
    public void math() {
        System.out.println("Math");
        
        long start = System.currentTimeMillis();
        
        float result = 0;
        
        for (int i = 0; i < 1000000; i++) {
            float r = (float)(i / 1000000f) * FMath.PI_2;
            result += Math.sin(r);
            result += Math.cos(r);
            result += Math.tan(r);
        }
        
        System.out.println(result);
        System.out.println(System.currentTimeMillis() - start);   

        
    }
    
    public void benchmark() {
        long totalStart = System.currentTimeMillis();
        
        fmath();
        fmath();
        fmath();
        math();
        math();
        math();
        
        
        for (int ii = 0; ii < 17; ii++) {
            fmath();
            math();
        }
        System.out.println("***TOTAL: " + (System.currentTimeMillis() - totalStart));
    }

    public static void main(String[] args) {
        FMath.init();
        Test test = new Test();
        test.benchmark();
    }
}

With the above code, the numbers start to make somewhat more sense.

server:
FMath: 70
Math: 1232
***TOTAL: 26098

server with -Xcomp:
FMath: 70
Math: 1242
***TOTAL: 26498

client:
FMath: 211
Math: 1031
***TOTAL: 25036

client with -Xcomp:
FMath: 210
Math: 1031
***TOTAL: 24915

Observations:

  • client is surprisingly a little bit faster than the server in the total score.
  • FMath is (as expected) a lot faster on the server
  • Math surprisingly performs slower on the server.
  • -Xcomp doesn’t make a notable difference anymore, which I suppose is as expected in a little benchmark like this.

Ooh! FMath looks nice & fast :slight_smile: I may have to use that (as long as you don’t mind Erik)

I will try to get around to adding a sqrt/rsqrt and see if it is any faster as well, and I’ll post the results if it works.

You can also increase precision on the result by interpolating between the nearest two table entries, although this may slow it down to nearly the original math speed.
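That interpolation might look something like this (a sketch only: the class name LerpTrig and the table size are invented here, and negative angles are left out for simplicity). It blends the two nearest table entries, wrapping back to entry 0 at the end of the period:

```java
// Sketch: sin lookup with linear interpolation between the two nearest
// table entries, trading a couple of extra operations for precision.
class LerpTrig {
    static final int PRECISION = 0x1000;             // power of two, far smaller than FMath's table
    static final float PI_2 = (float) (Math.PI * 2);
    static final float[] SIN = new float[PRECISION];

    static {
        for (int i = 0; i < PRECISION; i++) {
            SIN[i] = (float) Math.sin(i * (PI_2 / PRECISION));
        }
    }

    static float sin(float radians) {               // assumes radians >= 0
        float pos = (radians / PI_2) * PRECISION;
        int i = (int) pos;
        float frac = pos - i;                        // distance past the lower entry, in [0,1)
        int lo = i & (PRECISION - 1);
        int hi = (i + 1) & (PRECISION - 1);          // wraps to entry 0 at the table end
        return SIN[lo] + (SIN[hi] - SIN[lo]) * frac;
    }

    public static void main(String[] args) {
        System.out.println(LerpTrig.sin(1.0f) + " vs " + (float) Math.sin(1.0));
    }
}
```

With interpolation the error shrinks roughly with the square of the table step, so a 4096-entry table can be more accurate than a much larger nearest-entry table, at the cost of one extra lookup, a subtract, and a multiply per call.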

  • Dom