Server and client optimizations

When I created a little 128-bit library, I ran a lot of benchmarks.
So, a few results about the Java 5 beta. ~_^

The client compiler doesn’t remove array bounds checks, or it does so much less effectively than the server compiler.
The client doesn’t inline even getters and setters.
The server behaves strangely with -Xcomp / -Xbatch. It looks like something is running in the background and slowing down the first few instructions in a method.
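A minimal sketch of the kind of microbenchmark that exposes the getter-inlining difference (the class name and loop count are made up for illustration; absolute timings will vary by VM and hardware):

```java
public class GetterBench {
    private int value = 1;

    // A trivially inlinable getter: one field load, no side effects.
    int getValue() { return value; }

    public static void main(String[] args) {
        GetterBench b = new GetterBench();
        long t0 = System.currentTimeMillis();
        long sum = 0;
        for (int i = 0; i < 100000000; i++) {
            sum += b.getValue(); // hot call site
        }
        long t1 = System.currentTimeMillis();
        // If the getter is inlined, the loop body is a single add;
        // if not, every iteration pays a full method call.
        System.out.println(sum + " in " + (t1 - t0) + " ms");
    }
}
```

Comparing the reported time under `java -client` and `java -server` is the crude version of what such a benchmark measures.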

Also, I’m not sure whether the server compiler could compile

int some = (int) (someLong & constForConversionToInt);
to
mov eax, [low 32 bits of the long]
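For reference (using 0xFFFFFFFFL to stand in for the constant above), masking the long with the low 32 bits before the narrowing cast is equivalent to the cast alone, since a narrowing conversion already keeps only the low word. That is why a single 32-bit load of the low half of the long would be a valid compilation of the whole expression:

```java
public class NarrowingCast {
    public static void main(String[] args) {
        long someLong = 0x123456789ABCDEF0L;
        int masked = (int) (someLong & 0xFFFFFFFFL); // explicit low-word mask
        int plain  = (int) someLong;                 // narrowing cast alone
        // Both keep only the low 32 bits of the long.
        System.out.println(masked == plain);         // prints "true"
    }
}
```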

Do you have more experience with the differences between client and server, in terms of code generation?

Basically, Java 6 would need multiple compilation levels, and to transfer a lot of features from the server compiler to the client one. Namely method inlining and array bounds check removal.

That’s right - the client can only inline non-virtual methods, because anything more would require the ability to de-optimize code, which only the server compiler has.
Furthermore, it does not remove array bounds checks, since that does not help much for common applications (read: Swing-based client programs).

And why would Java 6 “need” this wonder?
The reason it’s not in the client is to keep the JVM size small (read: footprint) and to keep compile time down - I would laugh if you sold your customers a program that had function inlining and bounds-check removal but took ages to start up. Btw, the source is downloadable - grab yours and implement it - I think Sun would be very happy about your contribution.

Btw, multiple compilation modes are planned for Dolphin (java-7.0), where methods are first compiled with client and later with server optimizations, as IBM’s JVM currently does (which also does not help a lot).
Personally I think a JIT cache would be more useful (since almost all optimizations of the client JVM can be shared, unlike server-generated code).

lg Clemens

AIUI the intent for the next release is to fold client back into server so you have a single compiler that does multi-stage compilation. The first pass would be more or less equivalent to client today; it would then go back and do the server stuff on hotspots.

…finally delivering on the “as fast as C++” promise :wink:

Cas :slight_smile:

[quote]AIUI the intent for the next release is to fold client back into server so you have a single compiler that does multi-stage compilation. The first pass would be more or less equivalent to client today; it would then go back and do the server stuff on hotspots.
[/quote]
Oh coolness! :smiley:
If a JIT cache would be added too, we’d have a perfect VM for games :wink:

[quote]And why should Java6 “need” this wonder?
[/quote]
Since we’re all doing game stuff, most of us would benefit from a good optimizing JVM. I also have the feeling the client VM is used a lot more for games than for Swing stuff.
For example, on the current client VM my own project JEmu2 runs quite slowly, while on the server VM it performs more than twice as fast. Even on the ancient MS VM it performs about 30-40% better. On the IBM VM it performs about as well as Sun’s server VM (although the last time I checked that was IBM 1.3 vs Sun 1.4 server), but without the start-up sluggishness of the Sun VM.
I mostly blame JEmu2’s bad performance on the client VM on the lack of bounds check removal, but there might be more reasons. The client VM is quite OK for Swing stuff, but it doesn’t seem very fast for games.
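For illustration (a sketch of the typical emulator memory-access pattern, not actual JEmu2 code): the index into emulated memory comes from emulated state, so every read pays an array bounds check unless the compiler can track value ranges and prove the masked index always fits the array.

```java
public class EmulatedMemory {
    private final byte[] mem = new byte[0x10000]; // 64 KB address space

    // Hot path: called for every emulated memory access. The mask keeps
    // addr in [0, 0xFFFF], so a compiler that tracks integer ranges could
    // drop the bounds check; one that doesn't pays it on every call.
    int read(int addr) {
        return mem[addr & 0xFFFF] & 0xFF;
    }

    public static void main(String[] args) {
        EmulatedMemory m = new EmulatedMemory();
        System.out.println(m.read(0x12345)); // address wraps to 0x2345
    }
}
```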

The 2-stage VM would do far more than just optimise games. Swing would be faster, Eclipse would be faster, everything would just be faster. All those promises and all those nerds with statements containing the words “theoretically” and “potentially” and “C++” will finally, after 10 years, be a reality instead of something that people snigger about.

And 2-stage will more or less render JIT caching totally redundant. Trust me on that one. The client VM compiles so fast that it’ll be far faster than a cache, and the whole point of the server VM was that it only bothered with genuine hotspots anyway. A few command-line parameter tweaks to tune it and you’re sorted.

Cas :slight_smile:

It’s not exactly difficult to cache pieces of HotSpot-optimized code.
A JIT cache would be nice, but some kind of caching of frequently used classes, compiled in a similar manner to -Xcomp, might be even better, especially if they were prepared for fast reading.
Of course, how much it would be slowed down by code verification, and how well it would play with a shared VM, is another question. BTW, how would a shared VM implement a firewall between applications?

As for your recommendation to try to compile the source code: I’m on dial-up. While I could download the Java SDK in 2-3 hours, I don’t have enough money to download that source code. Whether I get broadband depends on whether my neighbours would allow me to run a simple cable through the stairs. It looks like I would need to wait a few weeks to months. Not to mention that I’m currently working on adding 128-bit numbers to Java.

That’s wonderful; now just add a startup cache (for fast bytecode loading), and lift the memory limit (and add adaptive memory allocation), and it would be even nicer.

There seems to be quite a bit of misinformation on this thread but I’ll just mention that the Java HotSpot client compiler has supported deoptimization and inlining through virtual calls via class hierarchy analysis since 1.4.

I am not so optimistic when looking at large Java applications. They all have the same problem when running on the client JVM: a very large codebase (Swing) with an almost flat profile - and very strict responsiveness needs.
And in my eyes the biggest problem is the compilation threshold, which has to be reached every time, even with n-stage compilation. If a lot of commonly compiled methods could be cached in a way that makes them usable from other callers too (no very hard optimizations), most of the code could run at about 80% of client-compiled performance, while the more critical methods (real hotspots) could be given to the 2-stage compilation engine. This would also have the benefit that the first 1500 (or whatever) invocations would already run compiled code.

However I don’t know what this means in terms of footprint increase :frowning:

lg Clemens

Ken,

It would be good for you to, just once, give us the misinformation list. As a genuine insider into what’s going on (Ken’s part of the HotSpot VM team, folks), I think everyone would appreciate it and benefit from it.

Jeff

I wholeheartedly agree.
There are far too many misconceptions being thrown around, a list would be really nice to have.

Personally I want information about what and how the JVM optimises.

Starting from the beginning of the thread: the client JVM has the same inlining capabilities as the server JVM (i.e., class hierarchy-based inlining with support for dynamic deoptimization). It is true that the client compiler does not currently eliminate range checks.

The server compiler maintains integer ranges for values and should be able to perform the optimization of loading only an integer value, but I think it currently does not. See src/share/vm/opto/mulnode.cpp in the Mustang HotSpot workspace, and feel free to file an RFE or even to implement and contribute this.

Implementing a cache for dynamically compiled code is pretty complicated, and as far as I’ve heard, BEA recently moved away from caching compiled code in JRockit, which would support this assertion. Multi-stage compilation seems to be a better and more maintainable approach.

I’m surprised the 1.1-era Microsoft VM would be faster than the HotSpot client JVM at any purely computational task, and I wonder whether any 1.1-style library calls are being used which are simply better optimized in the Microsoft stack. If you have a concrete test case which runs faster on the MS VM, please provide it.
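As an illustration of the range-check point (a sketch of the general idea, not HotSpot’s actual analysis): when the loop bound is the array’s own length, a compiler can prove every index is in range and hoist the check out of the loop; an index that comes from external data gives it nothing to prove, so the check stays.

```java
public class RangeChecks {
    // Bounds-check friendly: i is provably in [0, a.length), so the
    // per-element check can be eliminated or hoisted out of the loop.
    static long sumAll(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Bounds-check hostile: indices come from another array, so each
    // access must keep its check (and may legitimately throw).
    static long gather(int[] a, int[] indices) {
        long s = 0;
        for (int idx : indices) {
            s += a[idx];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        System.out.println(sumAll(a));                  // prints 10
        System.out.println(gather(a, new int[]{3, 0})); // prints 5
    }
}
```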

For general information about HotSpot please see the white papers at http://java.sun.com/products/hotspot/ . For release-specific information please see the documentation for each individual release. There are also several presentations on various portions of HotSpot in archived JavaOne talks at http://java.sun.com/javaone/ .

Well, multi-stage compilation will do the following:

  • More time will go into compilation
  • Responsiveness like client, but performance like server

However, it won’t solve the problem larger Swing apps suffer from today:
performance immediately after startup. It’s not really funny that I need to tell my customer to drag the table header around a few times until it stops feeling sluggish, or explain why scrolling this or that panel is slow right after start.
At least saving some profiling data would make the whole scenario much better.

It’s really hard to share/cache code that has been optimized with closed-world optimizations; however, generically compiled code can be shared easily and replaced by more optimized methods if required.
The approach BEA used was to cache fully optimized code - which is hard, of course. GIJ is able to cache “JIT” code, and it works quite well (~ the speed of the client JVM).

lg Clemens

Is this with the client VM??

Yes, it’s with the client VM.

I have a listener on a table which only sets the text of six JTextFields, and it feels sluggish on an 800 MHz Athlon for the first 50 selections.
Another example: reordering columns in a JTable really feels sluggish (my TableRenderer is pretty optimized), but after some moving around it’s really fast.

That’s very odd - have you profiled it?

A good way to see if it’s somehow not getting compiled soon enough is to force everything to compile at start-up (I believe the option is -Xcompile).

-Xcomp :slight_smile:

I think it makes sense that GUI apps feel this way. The compiler appears to be designed to optimize a method once it has been executed a certain number of times and is taking a certain proportion of the execution time. If you consider methods that respond to UI events, such as the user dragging a table heading or scrolling, those methods are not executed in tight loops. So it stands to reason that they only hit the compile threshold after a bit of user interaction… so the user’s initial experience is that the application is slow.
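One knob for experimenting with this (the threshold value and the application class name below are illustrative, not recommendations) is HotSpot’s compile threshold, which can be lowered so event handlers compile after fewer invocations, at the cost of more compilation work at startup:

```shell
# Compile methods after 100 invocations instead of the default
# (~1500 for the client VM); "MyGuiApp" is a placeholder class name.
java -client -XX:CompileThreshold=100 MyGuiApp
```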

The sad thing is that the CPU is often idle while waiting for user input; too bad there isn’t a way to take advantage of that idle time to do some basic compiling… but I think that would involve mind reading :slight_smile:

Could just run the compiler constantly in the background in an idle-priority thread…

Cas :slight_smile: