Yet another speed comparison, weird server results

On my (athlon) machine:


 * 1.5.0 beta 2:
 * Runtime ms=9372 44.9614922108408 MegaIters per second       (client, double)
 * Runtime ms=13844 30.43767010979486 MegaIters per second       (server, double)
 * Runtime ms=9531 44.20146875 MegaIters per second             (client, float)
 * Runtime ms=3470 121.407546875 MegaIters per second       (server, float)
 * 
 * 1.4.2_03
 * Runtime ms=9815 42.9321553744269 MegaIters per second      (client, double)
 * Runtime ms=13752 30.641296175101804 MegaIters per second (server, double)
 * Runtime ms=10154 41.48948046875 MegaIters per second            (client, float)
 * Runtime ms=3608 116.7639140625 MegaIters per second            (server, float)
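For reference, the MegaIters figure is just the total number of inner-loop iterations divided by wall-clock time. A minimal sketch of that arithmetic; the iteration count below is a made-up round number for illustration, not taken from FFFF's output:

```java
public class Throughput {
    // Mega-iterations per second = (iterations / seconds) / 1e6
    static double megaItersPerSecond(long totalIters, long runtimeMs) {
        return totalIters / (runtimeMs / 1000.0) / 1.0e6;
    }

    public static void main(String[] args) {
        // roughly 421 million iterations in 9372 ms lands close to the 44.96 figure above
        System.out.println(megaItersPerSecond(421000000L, 9372));
    }
}
```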

So 1.5.0 beta 2 seems to be as fast as, or a fraction faster than, 1.4.2_03. The server’s float performance looks good.
It’s interesting to see that on the client, performance with doubles is no lower than with floats.

Bodes well for games programming doesn’t it :slight_smile: And very poorly for Java3D with its heavy reliance on doubles…

Cas :slight_smile:

[quote]Bodes well for games programming doesn’t it :slight_smile: And very poorly for Java3D with its heavy reliance on doubles…

Cas :slight_smile:
[/quote]
On an Athlon, yeah. But on a P4, Java3D should perform excellently, since doubles on the server are even faster than floats.

EDIT: correction, not faster than floats, but the difference between server and client is larger when dealing with doubles.

IIRC, there was a (regression) bug associated with proper alignment of doubles. Maybe this never got fixed on server (assuming, of course, that this is not a prerequisite for SSE2 which does appear to work).

Just a minor question…why do you have:


for (i = 0; i < maxi; i++) {
    zx2 = zx * zx;
    zy2 = zy * zy;
    if ((zx2 + zy2) > 4)
        break;
    zy = 2 * zx * zy;
    zx = zx2 - zy2;
    zx += cx;
    zy += cy;
}

instead of:


for (i = 0; i < maxi; i++) {
    zx2minuszy2 = (zx + zy) * (zx - zy);
    if (zx2minuszy2 > 4)
        break;
    zy = 2 * zx * zy;
    zx = zx2minuszy2;
    zx += cx;
    zy += cy;
}

you have 4 muls and 2 adds instead of 3 muls and 2 adds…? Just going off the top of my head from when I did a fast 3D rotating mandelbrot 6 years ago, so I’m probably making some silly mistake here :(.

IIRC, back then reducing by one the number of FP muls was noticeable (this would have been on 1.1.x JVM’s)

The comparison is of x² + y².

However, if we actually wanted x² - y², then the obvious computation may actually be faster, because the two multiplications can start one clock cycle apart, and when they finish some 20 clock cycles later all that remains is the subtraction. By contrast, in the alternative expression (x+y)*(x-y) the multiplication can’t start until both the addition and the subtraction have completed. The total time taken will then be very similar. A more important consideration today may be the accuracy of the result.
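As a concrete check of the algebra (my own sketch, not from the thread's benchmark): both forms compute x² - y²; the factored one trades a multiply for an add, but serialises the add and subtract before its single multiply can start.

```java
public class DiffOfSquares {
    // two multiplies and a subtract; the multiplies can overlap in the pipeline
    static double direct(double x, double y) {
        return x * x - y * y;
    }

    // one multiply, but it must wait for both the add and the subtract
    static double factored(double x, double y) {
        return (x + y) * (x - y);
    }

    public static void main(String[] args) {
        System.out.println(direct(3.0, 2.0));   // 9 - 4 = 5.0
        System.out.println(factored(3.0, 2.0)); // 5 * 1 = 5.0
    }
}
```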

[quote]Just a minor question…why do you have:
[/quote]
It’s not my program. It’s a Java port of a fun little program called ‘FFFF’ (you can find it on SourceForge), ported from the original C source by the original author just to compare Java’s performance to the C version. I just changed it a little to make it not fully static. I didn’t even look at the algorithm (apart from comparing it with the original sources).
I figured if I would try to optimize the algorithm, the benchmark would become invalid.

[quote]The comparison is of xx + yy.
[/quote]
…but the Mandelbrot condition is |a + bj| <= 2, no? In which case, that’s (a² - b²) + (2ab)j <= 4? I just remember that a difference of squares was in there somewhere ;)…

Thanks. As I said, IIRC it used to have a significant effect (presumably not-very-good JITing); I see what you mean with pipelining. Might this change have a significant effect on the absence/presence of SSE/3DNow optimizations?

You want a benchmark that actually computes the correct value! :wink:
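To spell out why the factored loop breaks the benchmark (a sketch with illustrative names): the escape test needs the sum of squares zx² + zy², i.e. |z|², while (zx + zy) * (zx - zy) only gives the difference of squares used in the update step.

```java
public class Escape {
    // |z|^2 for z = zx + zy*i; the bail-out test is |z|^2 > 4, i.e. |z| > 2
    static boolean escaped(double zx, double zy) {
        return zx * zx + zy * zy > 4.0;
    }

    public static void main(String[] args) {
        System.out.println(escaped(2.0, 1.0)); // 4 + 1 = 5 > 4  -> true
        System.out.println(escaped(1.0, 1.0)); // 1 + 1 = 2 <= 4 -> false
    }
}
```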

Pipelines and superscalar execution certainly make estimating performance difficult. They ought to make use of a JIT very attractive if one wants optimal performance out of the Pentium III, Pentium 4, Athlon XP, Athlon 64, Via Eden, Transmeta (Crusoe, Efficeon), etc.

If only the VMs actually did JITing based on the processor. As it is it seems that it’s SSE2 or nothing.

Cas :slight_smile:

The server VM does check for SSE as well as SSE2, but yes support for more processor dependent features (or performance differences) would be welcome.

The bug report has been approved.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5070061

(It may take a day or two to show up).

Make sure to vote for this bug if you feel it affects your work.

Hi,
The reason the server is slower than the client is the way it handles FP; it’s just inefficiently coded. That’s why P4s run the mandelbrot faster: the extra available registers (the 8 SSE registers) reduce register pressure. Basically it was decided that SSE is the future, and the server was optimized for that. There really isn’t much to be gained from improving the FP code.

[quote]Basically it was decided that SSE is the future, and the server was optimized for that. There really isn’t much to be gained from improving the FP code.
[/quote]
I’m thinking about the here and now.
This isn’t about improving FP code, it’s about fixing a rather serious performance bug in the server. Remember the majority of people don’t have SSE2 and are affected by this bug.
They could at least copy and paste the FP code from the client version methinks :slight_smile:

ajiva, based on your comments I’m guessing you are a Sun engineer? Am I right? If so, what do you do at Sun?

First of all, yes, I work for Sun, working on the server compiler (4 years here so far). Secondly, it’s not easy to just copy and paste the code. There were very different methodologies used to get each of the compilers to where they are now. Client is faster for many reasons, and when it came time to speed up FP code for the server it was decided that we should optimize for SSE, because that really is the future. SSE provides several benefits, and yes, even Athlons will get them too.

  • 8 FP registers: this is the biggie. x86 has way too few registers, and if you’re doing FP work, you’ll need these registers.
  • Even if you have an Athlon (or a P3 for that matter), just stick to using floats and you’ll use the single-precision SSE instructions (vs. the double-precision SSE2 ones). Now, if your code requires double precision, well, go buy a P4 (or an Opteron) :slight_smile:

Really, we were thinking about the future here; the fact of the matter is that all new processors will be SSE/SSE2 enabled, and we felt it was better to work towards that than to try to minimally improve FP for older processors.
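Along the lines of that advice, here is a float version of the thread's inner loop (a sketch; the method name and wrapper are mine, not FFFF's):

```java
public class FloatLoop {
    // Same mandelbrot inner loop as earlier in the thread, but in float, so the
    // server compiler can use single-precision SSE on an Athlon XP or P3 as well.
    static int iterate(float cx, float cy, int maxi) {
        float zx = 0f, zy = 0f;
        int i;
        for (i = 0; i < maxi; i++) {
            float zx2 = zx * zx;
            float zy2 = zy * zy;
            if (zx2 + zy2 > 4f)
                break;
            zy = 2f * zx * zy + cy; // uses the old zx, as in the original loop
            zx = zx2 - zy2 + cx;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(iterate(0f, 0f, 100)); // origin never escapes: 100
        System.out.println(iterate(2f, 2f, 100)); // escapes on the second test: 1
    }
}
```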

Now you’ve revealed yourself you will never escape us :slight_smile:

What’s the scoop on

  • escape analysis
  • two-phase compilation (i.e. the merging of client and server VMs)
  • Structs (see RFE…)

Cas :slight_smile:

I can’t talk about future work…

Fair enough…
Can you talk about what IS in the 1.5 betas? In other words, can you confirm that escape analysis, for instance, isn’t in the current 1.5 betas… that way at least you aren’t talking about “future” work, and we can get an idea of what (not) to expect.

Basically it would be great to get an idea of the performance enhancements that are going into the 1.5 VMs. Hopefully there are some :).

[quote]- 8 FP registers: this is the biggie. x86 has way too few registers, and if you’re doing FP work, you’ll need these registers.
[/quote]
[offtopic-rant]
The Intel processor architecture sucks big time - there is no debating that… it hasn’t come very far from the ancient calculator it was based on as far as that goes :slight_smile: That’s one of the many reasons I went with a Mac.

Sparc, PowerPC, MIPS, Alpha… all superior. All but Sparc suffering because MS wouldn’t continue to keep NT working on them - I suspect the MS developers had trouble with machines that had more registers than they had fingers. Not having an operating system kinda makes your processor less popular :).
[/offtopic-rant]

The Opteron has lots of registers (that aren’t SSE), doesn’t it? I remember reading something that indicated it was a decent processor: 32 or so general-purpose registers, instead of the 5 or 6 special-purpose ones, or whatever Intel has these days.

SSE is mostly about vector instructions anyway, isn’t it? Exploiting that really gets you some speed if you can do it. The extra registers are just good in general. I keep suggesting that the bits of native code in the JRE use optimized assembly with SSE/SSE2; for things like JPEG loaders and image blitting loops the improvements would be huge. All of Java2D’s AlphaComposite rules could be implemented with SSE2 and they would scream.
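As an illustration (scalar Java with illustrative names, not Java2D’s actual internals), here is the Porter-Duff SRC_OVER rule on packed premultiplied ARGB pixels; it’s exactly this per-channel arithmetic that hand-written SSE2 could process several channels at a time:

```java
public class SrcOver {
    // SRC_OVER on premultiplied ARGB: out = src + dst * (1 - srcAlpha/255)
    static int srcOver(int src, int dst) {
        int sa  = src >>> 24;
        int inv = 255 - sa;
        int a = sa                    + ((dst >>> 24)         * inv) / 255;
        int r = ((src >>> 16) & 0xFF) + (((dst >>> 16) & 0xFF) * inv) / 255;
        int g = ((src >>> 8) & 0xFF)  + (((dst >>> 8) & 0xFF)  * inv) / 255;
        int b = (src & 0xFF)          + ((dst & 0xFF)          * inv) / 255;
        return (a << 24) | (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        // opaque red over anything stays opaque red
        System.out.println(Integer.toHexString(srcOver(0xFFFF0000, 0xFF00FF00)));
        // a fully transparent source leaves the destination unchanged
        System.out.println(Integer.toHexString(srcOver(0x00000000, 0xFF123456)));
    }
}
```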