Floating point performance

[quote]When I run a simple microbenchmark on a P4, double performance is consistently slightly slower after warm-up.

It could be that this is an Athlon-only issue though (I’ve seen cases before where some numeric operations run slower than on a P4 because of some P4 specific shortcuts which are impossible on an Athlon), but I have the feeling the results are misleading somehow. I have to test at home (where I have an Athlon too).

Could you post the entire benchmark?
[/quote]
The reason double performance on Athlons (and P3s, for that matter) is worse than on P4s is that we use SSE2-style registers (which are not available on the Athlon/P3). Those double registers greatly speed up performance.

10 seconds? I’ve seen the server VM take 5 minutes to really get going…

Need a quick clarification on this if you don’t mind…
The Athlon64s do have SSE2 support, don’t they? My 2.0GHz socket 939 Winchester 3200+ appears to have it for sure. I’ve been benching a P4 1.6GHz Willamette against the Winchester 3200+, and the P4 doesn’t seem to be doing that badly comparatively in particle tracking systems involving lots of double-based number crunching.
Thanks

For microbenchmarking I advise you tweak -XX:CompileThreshold=500 or so. Hotspot doesn’t need to collect very much information to optimise this kind of stuff.

Cas :slight_smile:
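A minimal microbenchmark skeleton along these lines might look like the following (the class name, loop sizes, and flag value are all just illustrative — the point is to do a warm-up pass before timing):

```java
// Minimal microbenchmark sketch: warm up first so HotSpot compiles the
// hot loop, then time it. Run with e.g. -server -XX:CompileThreshold=500
// to make the JIT kick in earlier (the threshold value is an example).
public class FmulBench {
    // Simple fmul/fadd-style loop, similar to the benchmarks in this thread.
    static double work(int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            sum += i * 1.0000001;
        }
        return sum;
    }

    public static void main(String[] args) {
        work(1_000_000); // warm-up pass: trigger JIT compilation of work()
        long t0 = System.nanoTime();
        double result = work(10_000_000);
        long t1 = System.nanoTime();
        System.out.println("Result: " + result); // print result so the loop isn't dead code
        System.out.println("Time: " + (t1 - t0) / 1e9 + " s");
    }
}
```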

Yes, the Athlon64s have SSE2, but I’d need more info to say why you’re not seeing a large performance difference. If I were to guess, I’d say the generated code might be slightly different, and that might be causing the performance anomaly.

OK! So with the Athlon64s - with SSE2 support - can I take it that the JVM will (indeed) use SSE2-style registers, so that the double performance of the Athlon64s will be comparable to P4s?

In general, is the SSE2 performance of the Athlon64 as good as the P4’s? And specifically, when running Java apps with the -server option?

And for the couple of reasons mentioned earlier - 1) some CPUs use extended 80-bit precision and 2) some do all the computations in doubles and reconvert to floats (the IBM RS6000 workstation, IIRC, used to do that) - is it worth the trouble to stick to floats for speed benefits if memory size is not a consideration?

[quote]With XCompile:

Float: 12.944601059 seconds.
Double: 13.299814082 seconds.
[/quote]
Believe me now :wink:

No difference between using doubles and floats there.
And since people don’t run apps with -XCompile, using doubles is faster in practice.

[quote]Your benchmark gave these results on my Athlon 2200 on the server VM:
Float: 2230 iterations / s
Double: 941 iterations / s
Fixed point: 2553 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

A slightly modified version of your benchmark gave these results (after some warm-up):
Float : 6690 iterations / s
Double : 3975 iterations / s
Fixed : 27322 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

So whatever is causing this difference, I don’t know, but the only thing you can conclude from your benchmark is that JRockit optimizes it better (for whatever that’s worth).
[/quote]
I was a bit confused about why your code gave such a big speed improvement (10x faster fixed point); then I noticed that you use the ‘count’ variable both for the iteration count and the inner loop count. This is a bug, right?

Fixing the code I get the following results with Java 5 -server:

Float : 417 iterations / s
Double : 596 iterations / s
Fixed : 2253 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

JRockit:

Float : 1697 iterations / s
Double : 1699 iterations / s
Fixed : 1940 iterations / s
2.9308174, 2.8822784, 2.9308173801885786

Now JRockit is 4 times faster than Hotspot on the float benchmark! :o Remember that this is a simple fmul, fadd loop. Something weird is happening here…
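For what it’s worth, the count-variable bug described above would look something like this (a hypothetical reconstruction - the original benchmark source isn’t posted in this thread):

```java
// Hypothetical reconstruction of the benchmark-loop bug: reusing the same
// 'count' variable for both the outer iteration count and the inner loop
// bound makes the total work scale quadratically instead of as intended.
public class LoopBug {
    // Buggy version: inner loop bound accidentally reuses 'count'.
    static long buggyWork(int count) {
        long ops = 0;
        for (int i = 0; i < count; i++)       // outer: iteration count
            for (int j = 0; j < count; j++)   // BUG: should be a separate bound
                ops++;
        return ops;
    }

    // Fixed version: iteration count and inner loop bound are independent.
    static long fixedWork(int iterations, int innerCount) {
        long ops = 0;
        for (int i = 0; i < iterations; i++)
            for (int j = 0; j < innerCount; j++)
                ops++;
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(buggyWork(1000));       // 1000 * 1000 ops
        System.out.println(fixedWork(1000, 100));  // 1000 * 100 ops
    }
}
```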

I thought -Xcompile just makes the VM do its optimizations on the first pass, so you don’t need to wait for the JIT to wind up? If so, then it just means your microbenchmark isn’t running optimized like a real-world app would be.

If not, what does it do?

No idea what XCompile does.
In the end floats are converted to doubles.
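Even where intermediates get widened internally, float and double still round their stored results differently, so the two types do give different answers. A quick illustration of that (plain Java, nothing VM-specific):

```java
// Floats keep ~7 significant decimal digits, doubles ~15-16, regardless
// of whether the FPU widens intermediate results internally: each stored
// value is rounded back to the declared type's precision.
public class PrecisionDemo {
    public static void main(String[] args) {
        float fsum = 0.0f;
        double dsum = 0.0;
        for (int i = 0; i < 10; i++) {
            fsum += 0.1f; // each partial sum is rounded to float precision
            dsum += 0.1;  // each partial sum is rounded to double precision
        }
        System.out.println(fsum); // prints 1.0000001
        System.out.println(dsum); // prints 0.9999999999999999
    }
}
```

Neither sum is exactly 1.0 (0.1 has no exact binary representation), but the error shows up at different digits - which is why the float and double columns in the benchmark output above differ in their last decimal places.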

[quote]No idea what XCompile does.
[/quote]
It forces HotSpot to compile all methods before using them. That way, you can make sure that you are not measuring compile time instead of execution time, and that you aren’t suffering from different compilation behaviours for whatever reason.

Edit: For those who are interested in what HotSpot compiles and when: start your app with -XX:+PrintCompilation
Combine that with -Xcompile and see what happens…
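For example (note the flag is actually spelled -Xcomp on recent HotSpot releases, and the class name here is just a placeholder):

```shell
# See what HotSpot compiles, and when (MyBench is a placeholder class):
java -XX:+PrintCompilation MyBench

# Force up-front compilation of every method, then compare the output:
java -Xcomp -XX:+PrintCompilation MyBench
```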

[quote]OK! So with the Athlon64s - with SSE2 support - can I take it that the JVM will (indeed) use SSE2-style registers, so that the double performance of the Athlon64s will be comparable to P4s?

In general, is the SSE2 performance of the Athlon64 as good as the P4’s? And specifically, when running Java apps with the -server option?

And for the couple of reasons mentioned earlier - 1) some CPUs use extended 80-bit precision and 2) some do all the computations in doubles and reconvert to floats (the IBM RS6000 workstation, IIRC, used to do that) - is it worth the trouble to stick to floats for speed benefits if memory size is not a consideration?
[/quote]
First a little intro on how the VM works:

The JVM has two sections of code (basically): platform-independent and platform-dependent code. The platform-independent stuff is everything that operates on the bytecodes and the IR, and then the optimizations (parsing, constant folding, loop opts, register allocation, etc.).

The platform-dependent stuff is basically match rules for instructions. So if the VM requires a multiply node (MulNode), the VM matches that to the appropriate rule for the particular architecture. Now, this matching part is where the AMD64 hasn’t been fully optimized. It’s mostly there, but there are parts missing, things we don’t do, etc. So yes, the VM uses SSE2 on AMD64 machines, but we might be doing a few things suboptimally.

I’ve also heard that the Athlons (XP and 64) have slower SSE performance compared to P4s. It may no longer be true in later revisions of the chip, etc. Heck, I may have heard incorrectly as well. But anyway, as far as the JVM is concerned the AMD64 is just another chip; most of the optimizations are platform independent.

Oh, don’t forget the AMD64 VM is a 64-bit VM, while the x86 VM is a 32-bit VM. Internally, that means the 64-bit VM has to handle larger pointers, etc. On the other hand, the VM gains 8 extra registers on the AMD64, so overall there is a performance win.

Ok, that’s what I thought it did.

Firstly, it’s great to have you around here :).

Hmm… so maybe I wasn’t way off in suspecting that the AMD64 wasn’t giving as good a performance boost as I thought it would, going by the gaming benchmarks dished out by the hardware review sites. And yes, I’ve also heard that AMD’s SSE implementation still lags Intel’s. There is a C-based benchmark called ScienceMark (http://www.sciencemark.org), developed by Dr. Wilkens, who I understand is currently with AMD, which can be used for testing, among other things, the SSE and SSE2 performance of the CPU. You’ve probably heard of it.

Hopefully you folks could get around to implementing the optimized SSE2 for the AMD64 too. What would it take to do that? An RFE perhaps? That takes time, doesn’t it? And with AMD’s Venice core scheduled to be released this or next quarter, there will be interest in SSE3-type optimizations too.

Thanks

An RFE is not needed; we know about this, and several people (including myself) are working on it. Granted, it’s not high priority, as other work is taking up our time. But we’ll get to it. I did some research into SSE3, and those instructions don’t seem to be well suited to the VM. The only instruction I could think of might be FISTTP, but I haven’t fully explored that area yet, so I’m not sure what else might come in handy.

What’s RFE stand for?

RFE == Request for enhancement

I reran the tests and the result was 30 s for floats, 9 s for doubles, on a Prescott Celeron D underclocked to 1.8 GHz. JRockit showed similar relative performance, just 3x as slow overall. It seems they didn’t do SSE2 optimizations.
I might try some ASM programs; I’d just need to know how to set the FP precision in assembly, as I’ve never needed to go below doubles.

Re FP16:
NVIDIA views FP16 as a number in the range -1.0 to 1.0. This isn’t necessarily very compatible with other FP formats; raster drawing is a somewhat limited target.

I tried it with scimark and I get nearly 2 GFLOPS in double precision and more than 2.4 GFLOPS in single precision. It’s underclocked to 1.8 GHz.
Quite puzzling.

BTW, why does the JVM use MSVC 6.0 for compiling? Isn’t there an MSVC 7.1 (2003) toolkit?