floating-point performance, Take II

hello.
opened a new thread, as the old one drifted away and grew too long…

about the float/double benchmarks:
most benchmarks i've seen do the same "calculate something with three numbers" thing. that's the wrong way to benchmark, i guess, because no one has to "calculate" over the same few numbers millions of times (except fractal iterations, maybe)

with only three numbers, all the values fit into vm registers / real cpu registers rather than real memory. that gives a big advantage to double, which is the "native cpu format". the fact that it is twice the data has little impact, as all fpu operations are 80 bit on intel, as someone wrote.

i'm just redoing some of those tests now, but so far it seems that float is nearly twice as fast as double on my old machine (Athlon 1st generation, 750mhz).

why these "strange" / "incompatible" results? i think it's just because i use some thousands of floats, e.g. in arrays…
and maybe some real program (ok, maybe very rarely) uses more than three numbers…
this may seem strange to some people, but my programs, game programs in fact :slight_smile: do exactly that.
that's why i extended my ram from the old 1k to now 256mb… :stuck_out_tongue:
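to show what i mean by an array benchmark rather than a three-number one, here is a minimal sketch (class name, array size and fill values are just placeholders, not my actual test):

```java
// Rough float-vs-double array benchmark sketch: sum a large array of each
// type and compare wall-clock times. Sizes and values are arbitrary.
public class FloatVsDouble {
    static final int N = 1_000_000;

    static float sumFloats(float[] a) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    static double sumDoubles(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        float[] f = new float[N];
        double[] d = new double[N];
        for (int i = 0; i < N; i++) { f[i] = 1.0f; d[i] = 1.0; }

        // warm up so HotSpot compiles both loops before we time them
        sumFloats(f); sumDoubles(d);

        long t0 = System.nanoTime();
        float fs = sumFloats(f);
        long t1 = System.nanoTime();
        double ds = sumDoubles(d);
        long t2 = System.nanoTime();

        System.out.println("float  sum=" + fs + " time=" + (t1 - t0) + "ns");
        System.out.println("double sum=" + ds + " time=" + (t2 - t1) + "ns");
    }
}
```

the point is that N elements stream through the cache, so the float version moves half the memory of the double version.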

exact results will follow
& greets
Paul

[quote] as all cpu-actions are 80bit at intel as someone wrote.
[/quote]
The traditional FPU only has 80-bit registers; however, it also has modes that indicate how many bits to retain in computation (e.g. float, double, extended). In the case of Intel CPUs this never seemed to make much difference; AMD, however, may have done more.

Recent CPUs also support SSE and perhaps SSE2, which provide float- and double-width computation respectively. The SSE instructions are much faster, even for scalars. The latest JVM will use SSE and (I think) SSE2 where available.

Correct, we use SSE/SSE2 instructions where useful/available. The biggest advantage, though, is the extra registers. Intel is definitely register-starved with 8 registers, and if you're doing FP calculations, those extra XMM registers relieve register pressure and prevent spills.

Using SSE with floats can give you a huge performance boost by performing 4 float operations at once. When performing the same operation across a float array in memory, HotSpot could align the array to a 16-byte boundary and auto-vectorize your code, making it several times faster using SSE.
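the kind of loop that is a candidate for this sort of auto-vectorization might look like the following (just a sketch; whether HotSpot actually vectorizes it depends on the JVM version and flags):

```java
// A simple elementwise loop over float arrays: no branches and no
// cross-iteration dependencies, so a vectorizer could process 4 floats
// per SSE instruction instead of one at a time.
public class Saxpy {
    static void saxpy(float alpha, float[] x, float[] y, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = alpha * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {10f, 20f, 30f, 40f};
        float[] out = new float[4];
        saxpy(2f, x, y, out);
        System.out.println(java.util.Arrays.toString(out)); // [12.0, 24.0, 36.0, 48.0]
    }
}
```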

GCC 4 will do auto-vectorization of relatively simple loops; don't know about HotSpot's plans, though.

Btw, SSE2 allows the same optimizations on integer maths using 8-, 16-, 32- or 64-bit integers. And you can run MMX in parallel with SSE2, improving integer performance even further.
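the integer case looks the same at the source level; a trivially vectorizable loop would be something like this sketch (names made up):

```java
// Elementwise add over int arrays: branch-free, independent iterations,
// so an SSE2-capable compiler/JIT could process 4 ints per instruction.
public class IntAdd {
    static void addArrays(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] out = new int[4];
        addArrays(a, b, out);
        System.out.println(java.util.Arrays.toString(out)); // [11, 22, 33, 44]
    }
}
```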