GC Implementation Specifics

I think MaxInlineSize is the cutoff for the total number of bytecodes to inline (i.e., it includes sub-functions too).

The FreqInlineSize explanation doesn’t make much sense, but I guess it means that HotSpot will never inline a function bigger than this value (in bytecodes).

Much appreciate the info folks.

Is the -Xcomp option equivalent to setting the CompileThreshold size to 0?

If I want to use the server VM for the SSE and SSE2 boosts (the client
doesn’t have them, right?), and at the same time want my GUI to
come to life as eagerly as with the client, then would setting the
threshold to something like the client’s get me the best of both
worlds?

How do I figure out the bytecode size to specify as input to
MaxInlineSize (gulp… especially if it has to include the
sub-functions’ sizes too!)? And thanks, really, for the caveat
about going overboard in setting this size.

And something I tripped on in an IBM site:
Quote: …specifying -Xcomp will result in somewhat less efficient
machine instructions being generated by the JIT, since the
interpreter is preempted from running. Unquote. If such is the
case, would running (micro)benchmarks with that option be in the
interest of projecting Java favorably?

TIA

[quote]And something I tripped on in an IBM site:
Quote: …specifying -Xcomp will result in somewhat less efficient
machine instructions being generated by the JIT, since the
interpreter is preempted from running. Unquote. If such is the
case, would running (micro)benchmarks with that option be in the
interest of projecting Java favorably?

TIA
[/quote]
Ah, of course: the VM will have no runtime profiling info, so it won’t be able to make good guesses as to which branches are more or less likely to be taken, for instance. (I think some processors have branch instructions that hint whether a branch is likely to be taken, to help the processor pre-fetch optimally.)

I guess it is best to use warmup periods when benchmarking.
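A minimal sketch of what a warmup period looks like (the class and method names are made up, and the iteration counts are just illustrative; a real harness would do much more):

```java
// Sketch: warm a method past HotSpot's CompileThreshold before timing it.
public class WarmupBench {
    // The hypothetical workload under test.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    public static void main(String[] args) {
        // Warmup: invoke work() far more times than the compile threshold,
        // so the timed run below measures JIT-compiled code, not the interpreter.
        long sink = 0;
        for (int i = 0; i < 20_000; i++) sink += work(1_000);
        System.out.println("warmup sink=" + sink); // keep warmup from being dead code

        long start = System.nanoTime();
        long result = work(1_000_000);
        long elapsed = System.nanoTime() - start;
        System.out.println("result=" + result + " elapsed(ns)=" + elapsed);
    }
}
```

The printed result is deterministic (the sum of squares 0..999999); only the elapsed time varies between runs.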

[quote]Much appreciate the info folks.

Is the -Xcomp option equivalent to setting the CompileThreshold
size to 0 ?
[/quote]
AIUI yes this is exactly what it does.

It would get you closer. The issue is that compiling all that code can be a big start-up load. Client handles this by not being as deep an optimizer as server. It leaves the last 10% or so on the table in order to cut the time it takes to compile by quite a bit.

So -Xcomp with server is likely to take longer to start up than even -Xcomp with client, but the result will run faster. You shouldn’t see GUI “ramp-up”, but it will be longer until your GUI comes up at all.

Erm. Well, it starts with the size of the instruction cache your CPU has. You basically want an entire inlined loop to fit in the cache, or you will end up thrashing the cache as you go around the loop.

Beyond that you’d have to ask a real bit-twiddler, which I’m not. I generally trust Hotspot to do it right for me.

Interesting. IBM’s VM technology is uniquely theirs. I don’t think there’s much we do with optimization generation and profile info; rather, we take the most aggressive option and then back off during run-time if need be.
But for their VM, yes, this might be an issue.
-Xcomp is kind of a shortcut. It’s probably always better in terms of accuracy to really warm up the VM, unless you intend to -Xcomp your actual program.

Of course, the accuracy of any microbenchmark is generally so suspect anyway that this is sort of gilding the lily.

[quote]Erm. Well, it starts with the size of the instruction cache your CPU has. You basically want an entire inlined loop to fit in the cache, or you will end up thrashing the cache as you go around the loop.
[/quote]
With cache sizes what they are these days, I imagine you could get away with a fair bit.

The best approach is to try a whole bunch of different values, and compare the speed of the resulting code. The bigger the MaxInlineSize, the longer the compilation takes, too - by quite a significant amount. I’ve found that 16 is a good compromise.

In fact I run Eclipse with this configuration:

-server -XX:CompileThreshold=1500 -XX:MaxInlineSize=16 -XX:FreqInlineSize=32 -XX:+UseParallelGC -Xms128m -Xmx192m

which gives me, after a short while, a very fast IDE that doesn’t constantly thrash the swapfile and starts up quickly too.

Cas :slight_smile:

Muchas gracias again!

OK! Though I arrived at the options of -server with
a CompileThreshold of 1500 (something
princec already seems to be using while I was taking
my hand around my head to reach my nose), in
retrospect I think my logic is possibly faulty somewhere.

Let’s see the “facts” that I have gleaned - correct me
if I’m wrong:

o Server is a “deeper” optimizer than client, i.e.,
server will take a longer time to compile a block of
code than client.

o Hotspot will not compile a block of code unless
that block is hot, i.e., it thinks that compiling
a non-hot block is not worth the effort, and that
letting the block run in interpreted mode may actually
be faster than spending the time to compile it.

o the threshold of client is 1500 and that of
server is 10000.

Given the above, on first impulse I would actually
have surmised that the threshold of server would be
smaller than that of the client, if application performance is
all that matters once the application is up and running. IOW, if
I have already made the decision to use the server VM
for what it is, then why keep its default threshold
higher than that of the client? Is it because
10000 units in server mode are not directly equivalent
to that many in client mode?

Again, given the above, if I want to use the server
option, and assuming that the numbers 1500 and 10000
are sacrosanct, I think the threshold
should optimally be a number greater than 1500
and less than 10000, with the twin objectives of
getting a start-up time comparable to that of client
(noting again that the server is a deeper and slower
optimizer) while at the same time being much more
compilation-aggressive than the default of
10000 would permit.

Agreed, the best numbers may have to be determined
by trying out some values as princec says,
but I just want to make sure that my understanding
of the various aspects is correct.

Also, I gather that setting the Xms and Xmx values to be
the same might be easier on HotSpot, but I haven’t
tried to examine the effect on overall
performance.

TIA

Pretty much all correct.

The reason server’s threshold is higher is because it does go “deeper” and thus only wants to do that on code where it really matters. Look at it this way: the deeper optimizing raises the bar as to how many times the interpreter would have to run the code to equal the cost of compilation, see?

Server VM was actually the first VM. It evidenced a performance problem on GUI apps due to its higher threshold, because GUIs would come up and run first interpreted. Client was invented to handle this. The client threshold was reduced so the GUIs would be compiled right away. With a lower threshold, it can’t afford to go as deep.

See?
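That trade-off can be sketched with a toy break-even model. All the numbers below are invented purely for illustration; they are not real HotSpot costs:

```java
// Toy model: a deeper compile costs more up front, so it needs more
// invocations before it pays for itself -- hence a higher threshold.
public class BreakEven {
    public static void main(String[] args) {
        double interpNsPerCall = 1000;   // interpreted cost per invocation (made up)
        double clientJitNsPerCall = 150; // client-compiled cost (made up)
        double serverJitNsPerCall = 100; // server-compiled cost (made up)
        double clientCompileNs = 1.5e6;  // cheap, shallow compile (made up)
        double serverCompileNs = 2.0e7;  // expensive, deep compile (made up)

        // Invocations needed before compiling beats staying interpreted:
        long clientBreakEven = (long) Math.ceil(
                clientCompileNs / (interpNsPerCall - clientJitNsPerCall));
        long serverBreakEven = (long) Math.ceil(
                serverCompileNs / (interpNsPerCall - serverJitNsPerCall));

        System.out.println("client break-even ~" + clientBreakEven + " calls");
        System.out.println("server break-even ~" + serverBreakEven + " calls");
    }
}
```

Even though server-compiled code runs faster per call, its much larger compile cost pushes its break-even point far past the client’s, which is the shape of the 1500-vs-10000 defaults.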

Jeff wrote:

Alright, I see now how the equation is balanced.

OK! Thanks again, and I would suggest that Sun’s performance FAQ include more of these technical details, especially the one on the difference between client and server. One of the 3D applications I’m developing has a particle-tracking module that does millions of polymorphic calls and floating-point operations. With the server option, the performance is virtually doubled. Wish the SSE boosts were available in client mode too!

Much appreciate all the deep enlightenment :slight_smile:

I just noticed this in the 1.5.0 beta 1 release notes…

from http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html

Should be quite useful for games if they work well. These work with the parallel collector which will adapt generation sizes automatically to try to meet the requested goals. Cool.

Depending on how accurate and reliable it is, the MaxGCPauseMillis could be great for real-time applications.
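For reference, a command line using those ergonomics flags might look like the following. The jar name is a placeholder and the values are examples, not recommendations:

```shell
# Ask the parallel collector to aim for <=20ms pauses while spending
# no more than 1/(1+19) = 5% of total time in GC.
java -XX:+UseParallelGC \
     -XX:MaxGCPauseMillis=20 \
     -XX:GCTimeRatio=19 \
     -jar mygame.jar
```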

This is one of the best bits of news I’ve heard about the JVM in ages! I think we were clamouring for this about 3 years ago when JGO just started. It’s especially helpful to me doing soft realtime TV graphics.

Cas :slight_smile:

Nice. I was asking for this too but the VM guys never told me they had actually gone and done it!

Of course, now we are going to get about 165 CN posts about "I set my VM for a max pause of 0 because I don’t want GC pauses. Why does it now give me Out Of Memory??? "

“No one ever lost money underestimating the intelligence of the average American.” – anonymous, sometimes attributed to P.T. Barnum

As I read the notes, I think it doesn’t really guarantee that GC won’t take more time. It says it’s a hint that GC should attempt to take less than the desired time. Doesn’t that mean that if it is on the verge of throwing out-of-memory exceptions, it will take more time anyway (which would be a Good Thing)? An attempt is not guaranteed to succeed, is it?

Still nice though :slight_smile:

It claims to be able to auto adjust itself depending on how you tune it. So, basically, it would appear that it’s highly tuneable, and experimentation per application should yield the desired result.

It’s still not an excuse to go creating tons of objects in inner loops yet - I’d wait for escape analysis to make an appearance here - but it lessens the amount of tuning and coding that the developer has to do to get reliable constant framerates, and that’s a good thing.

Cas :slight_smile:

Anyone notice any significant performance change between
1.4.2_* and 1.5.0 with respect to pre-1.5.0 language features? I don’t seem to find anything markedly different in the few tests that I have run, but I haven’t really given the new one a good shakedown yet.