Yet another speed comparison, weird server results

On the previous thread about the article with those microbenchmarks comparing Java with C++, somebody on that ‘Comments’ board posted some real code (a Mandelbrot generator) which he had converted from C to Java. On 1.4.2_04 client, it was about 8% slower.
I made a small change without altering the algorithms in any way (I just made everything non-static) and got almost exactly the same performance as the C version.
I did another test on 1.5.0 beta 2 client and behold: the Java version is even faster than the C version.

But: when I test on the server (both 1.4.2 and 1.5.0), the results are very disappointing. The server performs the test ~30% slower than the client!
As a matter of fact, I’ve seen this kind of bad performance from the server VM very often (for example in a program I wrote for a customer where huge text files are converted to even larger XML documents).
This particular test runs 100 times (which means 100 times 15 seconds total), but performance doesn’t get any better over time.

Does anybody have an idea?

I know it’s yet another benchmark, but the fact that I keep seeing the server VM perform so badly kind of worries me :-/


public class JMandel {

      int[] rowBuffer_ = new int[500];
      int w_ = 500;
      int h_ = 500;
      int maxi_ = 9999;
      double ax_ = -2.0f;
      double ay_ = -1.5f;
      double ex_ = 1.0f;
      double ey_ = 1.5f;
      double sx_ = (ex_ - ax_) / ((double) w_);
      double sy_ = (ey_ - ay_) / ((double) h_);

      long msStart_;
      long msEnd_;
      long itersTotal_;

      private void run() {
            for (int i = 0; i < 100; i++) {
                  itersTotal_ = 0;
                  msStart_ = System.currentTimeMillis();
                  for (int y = 0; y < h_; y++) {
                        calcPixelRow(y, maxi_);
                  }
                  msEnd_ = System.currentTimeMillis();
                  printResults();
                  long msTotal = msEnd_ - msStart_;
                  double its = ((double) itersTotal_) / (double) msTotal;
                  System.out.println(
                        "Runtime ms=" + msTotal + " " + its / 1000.0 + " MegaIters per second");
            }
      }

      private void printResults() {
            for (int i = 0; i < rowBuffer_.length; i++) {
                  System.out.print(rowBuffer_[i]);
            }
      }

      private boolean calcPixelRow(int row, int maxi) {
            // C row calculation routine
            // Calc vars
            double cx = ax_;
            double cy = ay_ + sy_ * ((double) row);
            double zx, zy;
            double zx2, zy2;

            for (int x = 0; x < w_; x++) {
                  // Calc Pixel
                  zx = cx;
                  zy = cy;
                  int i;
                  for (i = 0; i < maxi; i++) {
                        zx2 = zx * zx;
                        zy2 = zy * zy;
                        if ((zx2 + zy2) > 4)
                              break;
                        zy = 2 * zx * zy;
                        zx = zx2 - zy2;
                        zx += cx;
                        zy += cy;
                  }
                  cx += sx_;
                  itersTotal_ += i;
                  rowBuffer_[x] = i;
            }

            return true;
      }

      public static void main(String[] args) {
            JMandel jMandel = new JMandel();
            jMandel.run();
      }
}

Some quick sample results… I haven’t looked at the test case in detail, but is the msTotal value alone indicative of the speed difference (lesser is better)? Hope I have these right!

server:
Runtime ms=6750 62.42653407407407 MegaIters per second

client:
Runtime ms=23500 17.93102574468085 MegaIters per second

Edit: …on j2sdk1.4.2_03

Yeah, lesser is better.
I don’t get it! Why am I not getting results like that?! :o
I also tested on 1.4.2_03 and on 1.5.0 beta 2 but I’m getting far worse numbers on the server VM. :’(
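
For what it's worth, the posted numbers suggest that msTotal alone is enough to compare runs: multiplying each MegaIters figure by its runtime gives the same iteration count (about 421,379,105 per pass) for both VMs, so the work per pass is fixed. A minimal sketch of that arithmetic follows. The class and method names are made up for illustration, and the iteration count is back-computed from the figures above, not measured.

```java
class Throughput {
    // Same formula as in run(): iterations per millisecond, divided by 1000,
    // gives millions of iterations per second.
    static double megaItersPerSecond(long iters, long ms) {
        return ((double) iters) / (double) ms / 1000.0;
    }

    public static void main(String[] args) {
        // Assumption: iteration count per pass, derived from the posted results.
        long itersPerPass = 421379105L;
        System.out.println("server: " + megaItersPerSecond(itersPerPass, 6750));
        System.out.println("client: " + megaItersPerSecond(itersPerPass, 23500));
    }
}
```

Since the numerator never changes, the MegaIters figure is just a rescaled 1/msTotal, and comparing runtimes directly gives the same ranking.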

I ran it on Windows - server’s SSE2 possibly makes a difference here !?!

hmyeah, maybe. I run on an Athlon XP and those don’t have SSE2 do they?
Still not sure why the server would be slower than the client though.
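
When comparing client and server numbers, it can help to have the program itself report which VM it is running on, so a result is never attributed to the wrong one. A small sketch using standard system properties follows. The class name VmInfo and the describeVm helper are hypothetical and not part of the benchmark.

```java
class VmInfo {
    // Builds a one-line description of the running VM from standard
    // system properties (java.vm.name typically names the client or server VM).
    static String describeVm() {
        return System.getProperty("java.vm.name")
                + " (java.version=" + System.getProperty("java.version")
                + ", java.vm.version=" + System.getProperty("java.vm.version")
                + ", os.arch=" + System.getProperty("os.arch") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describeVm());
    }
}
```

Printing this line once before the timing loop would make pasted results self-describing.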

Could you do a re-run with the following:


public class JMandel {

      int[] rowBuffer_ = new int[500];
      int w_ = 500;
      int h_ = 500;
      int maxi_ = 9999;
      double ax_ = -2.0f;
      double ay_ = -1.5f;
      double ex_ = 1.0f;
      double ey_ = 1.5f;
      double sx_ = (ex_ - ax_) / ((double) w_);
      double sy_ = (ey_ - ay_) / ((double) h_);

      long msStart_;
      long msEnd_;
      long itersTotal_;
      long rendertime;

      private void run() {
            for (int i = 0; i < 100; i++) {
                  itersTotal_ = 0;
                  rendertime = 0;
                  msStart_ = System.currentTimeMillis();
                  for (int y = 0; y < h_; y++) {
                        calcPixelRow(y, maxi_);
                        rendertime += printResults();
                  }
                  msEnd_ = System.currentTimeMillis() - rendertime;
                  long msTotal = msEnd_ - msStart_;
                  double its = ((double) itersTotal_) / (double) msTotal;
                  System.out.println(
                        "\n\nRuntime ms=" + msTotal + " " + its / 1000.0 + " MegaIters per second");
            }
      }

      private long printResults() {
            long start = System.currentTimeMillis();
            for (int i = 0; i < rowBuffer_.length; i++) {
                  System.out.print(rowBuffer_[i]);
            }
            return System.currentTimeMillis() - start;
      }

      private boolean calcPixelRow(int row, int maxi) {
            // C row calculation routine
            // Calc vars
            double cx = ax_;
            double cy = ay_ + sy_ * ((double) row);
            double zx, zy;
            double zx2, zy2;

            for (int x = 0; x < w_; x++) {
                  // Calc Pixel
                  zx = cx;
                  zy = cy;
                  int i;
                  for (i = 0; i < maxi; i++) {
                        zx2 = zx * zx;
                        zy2 = zy * zy;
                        if ((zx2 + zy2) > 4)
                              break;
                        zy = 2 * zx * zy;
                        zx = zx2 - zy2;
                        zx += cx;
                        zy += cy;
                  }
                  cx += sx_;
                  itersTotal_ += i;
                  rowBuffer_[x] = i;
            }

            return true;
      }

      public static void main(String[] args) {
            JMandel jMandel = new JMandel();
            jMandel.run();
      }
}

It prints all the output, just to make sure, and the time spent printing is subtracted from the final time.
It outputs a lot, so I suggest redirecting the output to a file, like java -server JMandel > mandel.log
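
One possible alternative to printing every row (a sketch only, not something tested in this thread) is to fold each row into a running checksum and print just that value at the end of a pass. The results are still consumed, which should make it less likely that the JIT treats the calculation as dead code, and console I/O stays out of the timed region. The RowChecksum class and its checksum method are hypothetical names.

```java
class RowChecksum {
    // Folds one row of iteration counts into a running checksum.
    // Passing the previous value back in chains rows together.
    static long checksum(long previous, int[] row) {
        long sum = previous;
        for (int i = 0; i < row.length; i++) {
            sum = sum * 31 + row[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] rowBuffer = {1, 2, 3}; // stand-in for rowBuffer_ after calcPixelRow
        long sum = checksum(0L, rowBuffer);
        System.out.println("checksum=" + sum);
    }
}
```

In JMandel this would replace the call to printResults() inside the row loop, with a single println of the checksum after msEnd_ is taken.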

I think my cygwin+bash+vim didn’t like the output format from the program. OK ! didn’t actually run thru’ the entire loop of 100, but here are the first several (system I’m using here is P4 1.6GHz):

client:
Runtime ms=44652 9.43695926274299 MegaIters per second
Runtime ms=44897 9.38546239169655 MegaIters per second
Runtime ms=45103 9.342595947054521 MegaIters per second
Runtime ms=45086 9.346118639932573 MegaIters per second
Runtime ms=36438 11.564276442175752 MegaIters per second
Runtime ms=44075 9.560501531480432 MegaIters per second
Runtime ms=43911 9.596208353260003 MegaIters per second
Runtime ms=44377 9.49543919147306 MegaIters per second
Runtime ms=42126 10.002827351279494 MegaIters per second
Runtime ms=32172 13.097696910356833 MegaIters per second
Runtime ms=27847 15.1319389880418 MegaIters per second
Runtime ms=32283 13.052662546851284 MegaIters per second
Runtime ms=23225 18.143341442411195 MegaIters per second
Runtime ms=44006 9.575492091987456 MegaIters per second
Runtime ms=46699 9.023300391871345 MegaIters per second

server:
Runtime ms=13246 31.811800166087878 MegaIters per second
Runtime ms=13342 31.582903987408187 MegaIters per second
Runtime ms=13438 31.35727824080964 MegaIters per second
Runtime ms=13543 31.114162667060473 MegaIters per second
Runtime ms=13736 30.67698784216657 MegaIters per second
Runtime ms=13335 31.599482939632548 MegaIters per second
Runtime ms=13485 31.24798702261772 MegaIters per second
Runtime ms=13437 31.359611892535536 MegaIters per second
Runtime ms=13631 30.91329359548089 MegaIters per second
Runtime ms=13410 31.422752050708425 MegaIters per second
Runtime ms=12652 33.30533552007587 MegaIters per second

Edit: erikd: The above results are for version 2 of JMandel.java

Just tried the first version on Windows JRE 1.5 beta2

Client = 9 MegaIters
Server = 31 MegaIters

Can’t explain why you aren’t getting better performance on the Server VM. I’ll try this on the Mac later just for kicks.

Hmmm, on my laptop I also get good results on the server.
Seems like a bug in the server VM to me. Does anyone else use an Athlon and get similar results?

So, compared to the original version, on my laptop the benchmark runs about as fast as the FPU ASM version and almost 3.5 times as fast as the pure C version. Quite amazing, really.
The asm versions that use SSE instructions still beat the crap out of the java version, but still the results are far better than I expected.

Heavy float usage often stuffs many ‘cheap’ C compilers (VC6, GCC); the Intel and Visual Studio.net compilers do a much better job, but even so they don’t usually do this:

Nice! I wish the people who make the JITs would be more open about what the JIT can do - this sort of info would help convince developers (hey, get your free SSE2 optimisations over here!).

  • Dom

Now I want 3DNow! support too to help us poor AthlonXP owners :smiley:

I think I’m going to report a performance bug regarding the worse than client performance of the server VM running on an Athlon and see what happens.

[quote]Now I want 3DNow! support too to help us poor AthlonXP owners :smiley:

I think I’m going to report a performance bug regarding the worse than client performance of the server VM running on an Athlon and see what happens.
[/quote]
Yes.

Your Mandelbrot on an AMD Athlon XP2500+ Barton :
java version “1.4.2_04”
Client : Runtime ms=10696 39.40 MegaIters per second
Server: Runtime ms=15121 27.87 MegaIters per second

Sigh, indeed, Sun’s server VM currently doesn’t use Athlons in the right way. :frowning: Please file a bug report.

P.S. Just a few weeks ago the IT press reported that, for the first time, AMD is selling as many desktop CPUs as Intel. Another reason for Java to use Athlons in the right way.

It’s done. I’ll keep you guys updated.

Results of running JMandel on Windows XP

Pentium II 400 MHz <<
512 MB ram

Java HotSpot™ Server VM (build 1.5.0-beta2-b51, mixed mode)

  Runtime ms=57152 7.372954664753639 MegaIters per second  :o
  Runtime ms=57152 7.372954664753639 MegaIters per second  :o
  Runtime ms=56832 7.414469049127252 MegaIters per second  :o

Java HotSpot™ Client VM (build 1.5.0-beta2-b51, mixed mode, sharing)

  Runtime ms=34910 12.070441277570897 MegaIters per second
  Runtime ms=34910 12.070441277570897 MegaIters per second

J2RE 1.4.1 IBM Windows 32 build cn1411-20040301a (JIT enabled)

  Runtime ms=22893 18.406460708513517 MegaIters per second  :)
  Runtime ms=22863 18.430612999168964 MegaIters per second  :)

C++ version of JMandel compiled with g++:

  Runtime ms =[b]22692[/b] 18.531 MegaIters per second
  Runtime ms =[b]22753 [/b]18.481 MegaIters per second

Hmm, now it seems the very bad server VM performance is not Athlon specific, but affects any CPU not supporting SSE2, so the problem is far more serious than I thought. It seems now that currently the server VM performs far worse than the client on most systems! Well, in this particular case that is.

I’m wondering what happens with the performance difference between the client and server (on non-SSE2 CPUs) if the test is converted to float instead of double precision.

[quote]if the test is converted to float instead of double precision.
[/quote]
Ok this is just being used as a test, but the Mandelbrot is a case where precision makes a significant difference to the result.
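
To see where precision starts to matter, one could run the same escape-time loop in both float and double for a single point and compare the iteration counts. This is a rough sketch. The EscapeCount class, its method names and the sample point are all made up for illustration, and only the recurrence is taken from calcPixelRow.

```java
class EscapeCount {
    // Escape-time loop from calcPixelRow for one point, in double precision.
    static int escapeDouble(double cx, double cy, int maxi) {
        double zx = cx, zy = cy;
        int i;
        for (i = 0; i < maxi; i++) {
            double zx2 = zx * zx;
            double zy2 = zy * zy;
            if (zx2 + zy2 > 4) break;
            zy = 2 * zx * zy + cy;
            zx = zx2 - zy2 + cx;
        }
        return i;
    }

    // Same loop with every variable narrowed to float.
    static int escapeFloat(float cx, float cy, int maxi) {
        float zx = cx, zy = cy;
        int i;
        for (i = 0; i < maxi; i++) {
            float zx2 = zx * zx;
            float zy2 = zy * zy;
            if (zx2 + zy2 > 4) break;
            zy = 2 * zx * zy + cy;
            zx = zx2 - zy2 + cx;
        }
        return i;
    }

    public static void main(String[] args) {
        // Arbitrary sample point near the edge of the set (an assumption,
        // chosen only because boundary points are the most sensitive).
        double cx = -0.75, cy = 0.1;
        int maxi = 9999;
        System.out.println("double: " + escapeDouble(cx, cy, maxi));
        System.out.println("float:  " + escapeFloat((float) cx, (float) cy, maxi));
    }
}
```

Points far from the boundary should agree in both precisions. Any differences would be expected close to the edge of the set, where rounding errors accumulate over many iterations.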

Of course but I’m not trying to alter the test in order to make it quicker.
When we use float instead of double, we’re no longer comparing against the double precision version of the original program; we should compare against the SSE version of the original instead (but only when we’re running the test on an SSE supporting CPU).
But my reason for converting to float is that I want to see what happens to the server performance compared to the client, so we can maybe narrow down the cause of the problem.
If, when using floats, the server has acceptable performance compared to the client (on an AthlonXP), then I can conclude that there is probably a bug regarding SSE2 optimizations (in case those are not possible).
If not, the problem lies elsewhere and we can begin to doubt the usability of the server VM in its current state on possibly even the majority of x86 platforms… :-/
Which I already do anyway, given my own personal (generally bad) experiences with the server VM on my Athlon.
I’ll do the test when I get home.

I have a Mandelbrot program hanging around, which I’ve tested a bit too.

The inner loop is pretty much identical (actually I changed it to be identical to yours) except I’m using floats here.

On my Athlon XP, 1.4.2_04 server is twice as fast as 1.4.2_04 client. 1.5.0-b2 client is slightly slower than 1.4.2 client.

[quote]On my Athlon XP, 1.4.2_04 server is twice as fast as 1.4.2_04 client. 1.5.0-b2 client is slightly slower than 1.4.2 client.
[/quote]
Be sure to report performance regressions! Especially since you have a nice simple test case. Catch it now while 1.5 is still beta.

What about 1.5 server??