Yet another speed comparison, weird server results

On my (athlon) machine:


 * 1.5.0 beta 2:
 * Runtime ms=9372 44.9614922108408 MegaIters per second       (client, double)
 * Runtime ms=13844 30.43767010979486 MegaIters per second       (server, double)
 * Runtime ms=9531 44.20146875 MegaIters per second             (client, float)
 * Runtime ms=3470 121.407546875 MegaIters per second       (server, float)
 * 
 * 1.4.2_03
 * Runtime ms=9815 42.9321553744269 MegaIters per second      (client, double)
 * Runtime ms=13752 30.641296175101804 MegaIters per second (server, double)
 * Runtime ms=10154 41.48948046875 MegaIters per second            (client, float)
 * Runtime ms=3608 116.7639140625 MegaIters per second            (server, float)
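For reference, the MegaIters figure is just the total number of inner-loop iterations divided by wall-clock time. A minimal sketch of that arithmetic; the iteration count below is a made-up round number for illustration, not taken from FFFF's output:

```java
public class Throughput {
    // Mega-iterations per second = (iterations / seconds) / 1e6
    static double megaItersPerSecond(long totalIters, long runtimeMs) {
        return totalIters / (runtimeMs / 1000.0) / 1.0e6;
    }

    public static void main(String[] args) {
        // roughly 421 million iterations in 9372 ms lands close to the 44.96 figure above
        System.out.println(megaItersPerSecond(421000000L, 9372));
    }
}
```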

So 1.5.0 beta 2 seems to be as fast as, or a fraction faster than, 1.4.2_03. The server’s float performance looks good.
It’s interesting to see that on the client, performance with doubles is no lower than with floats.

Bodes well for games programming doesn’t it :slight_smile: And very poorly for Java3D with its heavy reliance on doubles…

Cas :slight_smile:

[quote]Bodes well for games programming doesn’t it :slight_smile: And very poorly for Java3D with its heavy reliance on doubles…

Cas :slight_smile:
[/quote]
On an Athlon, yeah. But on a P4, Java3D should perform excellently, since doubles on the server are even faster than floats.

EDIT: correction, not faster than floats, but the difference between server and client is larger when dealing with doubles.

IIRC, there was a (regression) bug associated with proper alignment of doubles. Maybe this never got fixed on server (assuming, of course, that this is not a prerequisite for SSE2 which does appear to work).

Just a minor question…why do you have:


for (i = 0; i < maxi; i++) {
    zx2 = zx * zx;
    zy2 = zy * zy;
    if ((zx2 + zy2) > 4)
        break;
    zy = 2 * zx * zy;
    zx = zx2 - zy2;
    zx += cx;
    zy += cy;
}

instead of:


for (i = 0; i < maxi; i++) {
    zx2minuszy2 = (zx + zy) * (zx - zy);
    if (zx2minuszy2 > 4)
        break;
    zy = 2 * zx * zy;
    zx = zx2minuszy2;
    zx += cx;
    zy += cy;
}

you have 4 muls and 2 adds instead of 3 muls and 2 adds…? Just going off the top of my head from when I did a fast 3D rotating mandelbrot 6 years ago, so I’m probably making some silly mistake here :(.

IIRC, back then reducing by one the number of FP muls was noticeable (this would have been on 1.1.x JVM’s)

The comparison is of x² + y².

However, if we actually wanted x² - y², then the obvious computation may actually be faster, because the two multiplications can start one clock cycle apart, and when they finish some 20 clock cycles later all that remains is the subtraction. By contrast, in the alternative expression (x+y)*(x-y) the multiplication can’t start until both the addition and the subtraction have completed. The total time taken will then be very similar. A more important consideration today may be the accuracy of the result.
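As a concrete check of the algebra (my own sketch, not from the thread's benchmark): both forms compute x² - y²; the factored one trades a multiply for an add, but serialises the add and subtract before its single multiply can start.

```java
public class DiffOfSquares {
    // two multiplies and a subtract; the multiplies can overlap in the pipeline
    static double direct(double x, double y) {
        return x * x - y * y;
    }

    // one multiply, but it must wait for both the add and the subtract
    static double factored(double x, double y) {
        return (x + y) * (x - y);
    }

    public static void main(String[] args) {
        System.out.println(direct(3.0, 2.0));   // 9 - 4 = 5.0
        System.out.println(factored(3.0, 2.0)); // 5 * 1 = 5.0
    }
}
```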

[quote]Just a minor question…why do you have:
[/quote]
It’s not my program. It’s a Java port of a fun little program called ‘FFFF’ (you can find it on SourceForge), ported from the original C source by the original author just to compare Java’s performance to the C version. I just changed it a little to make it not fully static. I didn’t even look at the algorithm (apart from comparing it with the original sources).
I figured if I would try to optimize the algorithm, the benchmark would become invalid.

[quote]The comparison is of xx + yy.
[/quote]
…but the Mandelbrot condition is |a + bj| <= 2, no? In which case, that’s (a² - b²) + (2ab)j <= 4? I just remember that a difference of squares was in there somewhere ;)…

Thanks. As I said, IIRC it used to have a significant effect (presumably not-very-good JITing); I see what you mean with pipelining. Might this change have a significant effect on the absence/presence of SSE/3DNow optimizations?

You want a benchmark that actually computes the correct value! :wink:
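To spell out why the factored loop breaks the benchmark (a sketch with illustrative names): the escape test needs the sum of squares zx² + zy², i.e. |z|², while (zx + zy) * (zx - zy) only gives the difference of squares used in the update step.

```java
public class Escape {
    // |z|^2 for z = zx + zy*i; the bail-out test is |z|^2 > 4, i.e. |z| > 2
    static boolean escaped(double zx, double zy) {
        return zx * zx + zy * zy > 4.0;
    }

    public static void main(String[] args) {
        System.out.println(escaped(2.0, 1.0)); // 4 + 1 = 5 > 4  -> true
        System.out.println(escaped(1.0, 1.0)); // 1 + 1 = 2 <= 4 -> false
    }
}
```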

Pipelines and superscalar execution certainly make estimating performance difficult. They ought to make use of a JIT very attractive if one wants optimal performance out of the Pentium III, Pentium 4, Athlon XP, Athlon 64, Via Eden, Transmeta (Crusoe, Efficeon), etc.

If only the VMs actually did JITing based on the processor. As it is it seems that it’s SSE2 or nothing.

Cas :slight_smile:

The server VM does check for SSE as well as SSE2, but yes support for more processor dependent features (or performance differences) would be welcome.

The bug report has been approved.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5070061

(It may take a day or two to show up).

Make sure to vote for this bug if you feel it affects your work.

Hi,
The reason the server is slower than the client is the way it handles FP; it’s just inefficiently coded. That’s why P4s run the mandelbrot faster: the extra available registers (the 8 SSE registers) reduce register pressure. Basically it was decided that SSE is the future, and the server was optimized for that. There really isn’t much to be gained from improving the FP code.

[quote]Basically it was decided that SSE is the future, and the server was optimized for that. There really isn’t much to be gained from improving the FP code.
[/quote]
I’m thinking about the here and now.
This isn’t about improving FP code, it’s about fixing a rather serious performance bug in the server. Remember the majority of people don’t have SSE2 and are affected by this bug.
They could at least copy and paste the FP code from the client version methinks :slight_smile:

ajiva, based on your comments I’m guessing you are a Sun engineer? Am I right? If so, what do you do at Sun?

First of all, yes, I work for Sun, working on the server compiler (4 years here so far). Secondly, it’s not easy to just copy and paste the code. There were very different methodologies used to get each of the compilers to where they are now. Client is faster for many reasons, and when it came time to speed up FP code for the server it was decided that we should optimize for SSE, because that really is the future. SSE provides several benefits, and yes, even Athlons will get them too.

  • 8 FP registers: this is the biggie. x86 has way too few registers, and if you’re doing FP work, you’ll need these registers.
  • Even if you have an Athlon (or a P3 for that matter), just stick to using floats and you’ll use the single-precision SSE instructions (vs. the double-precision SSE2 ones). Now, if your code requires double precision, well, go buy a P4 (or an Opteron) :slight_smile:

Really, we were thinking about the future here; the fact of the matter is that all new processors will be SSE/SSE2 enabled, and we felt it was better to work towards that than to try to minimally improve FP for older processors.
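Along the lines of that advice, here is a float version of the thread's inner loop (a sketch; the method name and wrapper are mine, not FFFF's):

```java
public class FloatLoop {
    // Same mandelbrot inner loop as earlier in the thread, but in float, so the
    // server compiler can use single-precision SSE on an Athlon XP or P3 as well.
    static int iterate(float cx, float cy, int maxi) {
        float zx = 0f, zy = 0f;
        int i;
        for (i = 0; i < maxi; i++) {
            float zx2 = zx * zx;
            float zy2 = zy * zy;
            if (zx2 + zy2 > 4f)
                break;
            zy = 2f * zx * zy + cy; // uses the old zx, as in the original loop
            zx = zx2 - zy2 + cx;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(iterate(0f, 0f, 100)); // origin never escapes: 100
        System.out.println(iterate(2f, 2f, 100)); // escapes on the second test: 1
    }
}
```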

Now you’ve revealed yourself you will never escape us :slight_smile:

What’s the scoop on

  • escape analysis
  • two-phase compilation (i.e. the merging of client and server VMs)
  • Structs (see RFE…)

Cas :slight_smile:

I can’t talk about future work…

Fair enough…
Can you talk about what IS in the 1.5 betas? In other words, can you confirm that escape analysis, for instance, isn’t in the current 1.5 betas… that way at least you aren’t talking about “future” work, and we can get an idea of what (not) to expect.

Basically it would be great to get an idea of the performance enhancements that are going into the 1.5 VMs. Hopefully there are some :).

[quote]- 8 FP registers: this is the biggie. x86 has way too few registers, and if you’re doing FP work, you’ll need these registers.
[/quote]
[offtopic-rant]
The Intel processor architecture sucks big time - there is no debating that… it hasn’t come very far from the ancient calculator it was based on as far as that goes :slight_smile: That’s one of the many reasons I went with a Mac.

Sparc, PowerPC, MIPS, Alpha… all superior. All but Sparc suffering because MS wouldn’t continue to keep NT working on them - I suspect the MS developers had trouble with machines that had more registers than they had fingers. Not having an operating system kinda makes your processor less popular :).
[/offtopic-rant]

The Opteron has lots of registers (that aren’t SSE), doesn’t it? I remember reading something that indicated it was a decent processor: 32 or so general-purpose registers, instead of the 5 or 6 special-purpose ones, or whatever Intel has these days.

SSE is mostly about vector instructions anyway, isn’t it? Exploiting that really gets you some speed if you can do it. The extra registers are just good in general. I keep suggesting that the bits of native code in the JRE use optimized assembly with SSE/SSE2; for things like JPEG loaders and image blitting loops the improvements would be huge. All of Java2D’s AlphaComposite rules could be implemented with SSE2 and they would scream.
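As an illustration (scalar Java with illustrative names, not Java2D’s actual internals), here is the Porter-Duff SRC_OVER rule on packed premultiplied ARGB pixels; it’s exactly this per-channel arithmetic that hand-written SSE2 could process several channels at a time:

```java
public class SrcOver {
    // SRC_OVER on premultiplied ARGB: out = src + dst * (1 - srcAlpha/255)
    static int srcOver(int src, int dst) {
        int sa  = src >>> 24;
        int inv = 255 - sa;
        int a = sa                    + ((dst >>> 24)         * inv) / 255;
        int r = ((src >>> 16) & 0xFF) + (((dst >>> 16) & 0xFF) * inv) / 255;
        int g = ((src >>> 8) & 0xFF)  + (((dst >>> 8) & 0xFF)  * inv) / 255;
        int b = (src & 0xFF)          + ((dst & 0xFF)          * inv) / 255;
        return (a << 24) | (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        // opaque red over anything stays opaque red
        System.out.println(Integer.toHexString(srcOver(0xFFFF0000, 0xFF00FF00)));
        // a fully transparent source leaves the destination unchanged
        System.out.println(Integer.toHexString(srcOver(0x00000000, 0xFF123456)));
    }
}
```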