Programming to the cache

On a Mac, Java 6 is only implemented as a 64-bit server VM, so -server does nothing. There is no Java 6 client VM on a Mac. As I understand it, Apple decided to implement only a 64-bit JVM since they are heading towards 64-bit throughout OS X, and the only 64-bit VM available to port was the server VM.
Conversely, there is no server VM for Java 1.5 on a Mac. The -server flag does adjust some VM parameters to values more suitable for a server (like compile thresholds and heap settings), but it is still running on the 1.5 client VM. Also, Java 1.5 on a Mac uses a 32-bit VM. I’m surprised at the large difference you saw with the -server option on the Mac.

Scott

Disabling bounds checks is out of the question I suppose :slight_smile:

Sometimes it would be nice to have some known cases where bounds checks for random array access are guaranteed to be eliminated.
For example, a case where:

  • the length of the array is a power of 2,
  • the array is declared as final (power of 2 and final just to make the check for bounds-check elimination quicker),
  • the index into the array is AND-masked with array.length - 1.

The AND operation would be cheaper than a bounds check, right?

final int[] myArray = new int[0x1000];

public int getMyValue(int index) {
   return myArray[index & 0xfff];
}

This currently doesn’t work, unfortunately, and adding the ‘& 0xfff’ just makes it slower.

How about

Object[] myArray = new Object[65536];
char i = 12;
myArray[i] = new Object();

;D

I guess that should have been: new Object[65535] :stuck_out_tongue:

Anyway, do not overestimate the overhead of array bound checks.

With sun.misc.Unsafe you can have array access without bounds checks, and it’s hardly any faster, if at all!

Some people forget that there is a null-check for every reference that is followed, even for arrays. So that silly value[p] involves both a bounds check and a null-check!

In most cases, the JVM just removes all those silly checks, even for non-trivial array access. For truly random access (still ‘random’ after the JVM has recursively inlined the code and looked for limits on your ‘random’ index that it could use to remove the checks), yes, you’ll have that slow loop… and then it’s sun.misc.Unsafe or JNI to the rescue!

I guess not :stuck_out_tongue:


Object[] myArray = new Object[65535];
char i = 65535;
myArray[i] = new Object();

barfs an ArrayIndexOutOfBoundsException
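Right: char covers 0..65535, so it’s the original new Object[65536] that accepts every possible char index, while a length of 65535 is one element short. A quick check:

```java
public class CharIndexCheck {
    public static void main(String[] args) {
        Object[] full = new Object[65536];   // one slot per possible char value
        char max = Character.MAX_VALUE;      // 65535
        full[max] = new Object();            // fine: highest valid index is 65535

        Object[] oneShort = new Object[65535];
        boolean threw = false;
        try {
            oneShort[max] = new Object();    // index 65535, length 65535
        } catch (ArrayIndexOutOfBoundsException e) {
            threw = true;
        }
        System.out.println(threw);           // true
    }
}
```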

IIRC the thing with Unsafe is that the indices are byte offsets, which is a bit of a pain when retrieving ints or something.
I’ve done tests with JEmu2, and when using Unsafe it definitely is noticeably faster, but I had to rewrite a lot of the algorithms to get the most out of Unsafe: avoiding offset <<= 2 and such to convert from index to byte offsets, keeping track of just one pointer instead of a pointer and an index, etc.
In the end I disabled the ‘Unsafe’ code because of its non-standard nature. It’s too hacky, and who knows whether it’s going to be there in all JVM implementations, working exactly the same way.
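To illustrate the byte-offset bookkeeping, a minimal sketch (sun.misc.Unsafe is unsupported API, and grabbing it via the internal theUnsafe field is exactly the kind of hack being warned about):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeIntRead {
    public static void main(String[] args) throws Exception {
        // Grab the internal Unsafe instance reflectively (non-standard, may break).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        int[] data = { 10, 20, 30, 40 };

        // Unsafe addresses array elements by BYTE offset, not by index:
        long base  = unsafe.arrayBaseOffset(int[].class); // offset of element 0
        long scale = unsafe.arrayIndexScale(int[].class); // bytes per element (4 for int[])

        int index = 2;
        // the equivalent of data[index], i.e. the 'offset <<= 2' mentioned above
        int value = unsafe.getInt(data, base + index * scale);
        System.out.println(value); // 30
    }
}
```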

Can you tell us more about that? Is there a -XX switch to adjust that limit? What options have to be set on the debug command line (does -XX:+PrintCompilation also show that effect)?

[quote]Let’s look at this loop:
Code:

long[] data = new long[1024];
long xor = 0;
for (int i = 0; i < data.length; i++)
    xor ^= data[i];

It’s obvious that this loop basically spends most of its time waiting. You can do quite a bit meanwhile, in the same thread.

it’s actually so bad, that you can do this, in exactly the same time:
Code:

long[] data = new long[1024];
long xor = 0;
for (int i = 0; i < data.length; i++)
    xor ^= swap64(data[i]);

public static final long swap64(long val)
{
    long swapped = 0;
    swapped |= (val & 0x00000000000000FFL) << 56;
    swapped |= (val & 0x000000000000FF00L) << 40;
    swapped |= (val & 0x0000000000FF0000L) << 24;
    swapped |= (val & 0x00000000FF000000L) << 8;
    swapped |= (val & 0x000000FF00000000L) >> 8;
    swapped |= (val & 0x0000FF0000000000L) >> 24;
    swapped |= (val & 0x00FF000000000000L) >> 40;
    swapped |= (val & 0xFF00000000000000L) >>> 56;
    return swapped;
}

We’re adding maybe 32 instructions per loop-iteration, and we have no speed-penalty at all.
[/quote]
Hmm, I actually just tested this, but I’m seeing something else:
The code with ‘swap64’ is about 15 times as slow on my PC (which I call quite a speed penalty :P), on both -client and -server (Java 1.6.0 update 7), running on a 32-bit dual-core Intel.
I could imagine things like this vary a lot between CPUs and/or RAM speeds and such. Riven, on what CPU and Java version did you test this?

64 bit OS (Vista)
64 bit JVM (1.6, Sun, server vm)

I bet you get really poor performance on long primitives in 32 bit JREs.

You might want to create an example that works with ints, and a swap32 method

Yes, that’s what I tried at first but that gave a similar result. The difference was less severe than with 64 bit, but it was still significantly slower…
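Side note for anyone benchmarking this: swap64 is a plain byte swap, so it should agree bit-for-bit with the JDK’s Long.reverseBytes (which HotSpot treats as an intrinsic), making it easy to verify the harness before timing anything:

```java
public class Swap64Check {
    public static final long swap64(long val) {
        long swapped = 0;
        swapped |= (val & 0x00000000000000FFL) << 56;
        swapped |= (val & 0x000000000000FF00L) << 40;
        swapped |= (val & 0x0000000000FF0000L) << 24;
        swapped |= (val & 0x00000000FF000000L) << 8;
        swapped |= (val & 0x000000FF00000000L) >> 8;
        swapped |= (val & 0x0000FF0000000000L) >> 24;
        swapped |= (val & 0x00FF000000000000L) >> 40;
        swapped |= (val & 0xFF00000000000000L) >>> 56;
        return swapped;
    }

    public static void main(String[] args) {
        long v = 0x0123456789ABCDEFL;
        // The hand-rolled swap and the intrinsic must agree bit-for-bit.
        if (swap64(v) != Long.reverseBytes(v)) throw new AssertionError();
        System.out.println(Long.toHexString(swap64(v))); // efcdab8967452301
    }
}
```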

[quote]Anyway, do not overestimate the overhead of array bound checks.
[/quote]
One thing I did in JEmu2, in the Sega Y-Board driver, that really gave me a nice performance boost was to load the sprite ROMs into long[] arrays. Because sprite pattern data is stored in 8-byte blocks (for 16 pixels), I only had to do one array access to retrieve 16 pixels and shift-mask the pixels out of the long value. That really made a big difference (>10% overall increase in FPS).
TBH I’m not 100% sure it’s entirely attributable to fewer bounds checks, but fewer bounds checks were the reason for packing the data into long[] :slight_smile:
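The shift-mask part looks roughly like this (a sketch with hypothetical names, assuming 4 bits per pixel so that 8 bytes hold 16 pixels):

```java
public class SpriteUnpack {
    // Hypothetical: extracts 16 four-bit pixels from one long, leftmost pixel first.
    static int[] unpackPixels(long block) {
        int[] pixels = new int[16];
        for (int i = 0; i < 16; i++) {
            // One (bounds-checked) long[] read replaces eight byte[] reads.
            pixels[i] = (int) (block >>> (60 - 4 * i)) & 0xF;
        }
        return pixels;
    }

    public static void main(String[] args) {
        int[] p = unpackPixels(0x0123456789ABCDEFL);
        System.out.println(p[0] + " " + p[1] + " " + p[15]); // 0 1 15
    }
}
```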

I’ll try to see if I can make a small benchmark that shows the behaviour I described before.

But well, you know how it is… you only have to bump your PC case and the JVM takes another optimisation path :wink:

[quote]…64 bit OS (Vista)
64 bit JVM (1.6, Sun, server vm)…
[/quote]

[quote] …running on a 32bit dual core Intel…
[/quote]
I think the difference in performance comes from the use of long: longs are slow on a 32-bit CPU and as fast as ints on a 64-bit CPU, and that’s why you see a performance penalty. Replacing everything, including the array, with int should do the job. Riven’s sample only requires trivial register ops when the value size fits the architecture (int on 32-bit, long on 64-bit), and those should be a lot faster than array accesses, which require bounds checks plus memory access.

Have you read the last 3 posts? :wink:

oops… :slight_smile:

… forgot to read the end of your post

[quote]I bet you get really poor performance on long primitives in 32 bit JREs.
[/quote]

Just reran the 32-bit version of the test: with swap32(int) it was about 9 times as slow as without.
So yeah, using long on a 32-bit CPU skewed the results somewhat, but I’m still not seeing what Riven described.

I’m really busy these days. Excuse me for the delay - I’ll post a benchmark (for review) ASAP.

You also need to prove that x doesn’t escape. As for ‘not very far’/‘per function’: the further away the ‘restricting’ code is, the harder it is to guarantee that it doesn’t escape.

Also, it was not until Java 6/7 that escape analysis was really added (and enabled by default) or completed. Undoubtedly, as that analysis evolves, it can more often give guarantees, which in turn can be used to improve other optimisations. Given those entanglements and the correlated ‘lock-step’ evolution, it is perhaps not that strange that some optimisations work and others don’t.

Add to that that it’s programmed in C/C++ and they sure get my pity. Maxine seems to be pretty sweet in that respect.

ps. To do the thing Riven talked about and see the output, have a look here: http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html

[quote]My impression is that the JVM specs guarantee literally nothing about any sort of contiguous allocation - by my reading, an array of floats could very well be scattered all across the heap. Apparently even the stack is usually not contiguous in Sun’s VMs.
[/quote]
I’m pretty sure java.nio.MappedByteBuffer is contiguous. I use it for volatile ringbuffers.
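A minimal sketch of how such a mapping is created (file name and sizes made up; the point is that the mapped region is one contiguous range of virtual address space, which is what a ring buffer cares about):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRing {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("ring", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // Map 4 KiB of the file into memory as one contiguous buffer.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putInt(0, 0xCAFEBABE);          // write through the mapping
            System.out.println(Integer.toHexString(buf.getInt(0))); // cafebabe
        }
    }
}
```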

True, though in that example it’s a bit easier since x is a float: it’s bound to its declared scope and can’t escape via a function call or a reference stored away somewhere. That’s what’s so nice about having objects with value semantics in a language: it’s dead simple to allocate them on the stack instead of the heap.

Right on. I’ve been screaming bloody murder about how desperately we could use escape analysis for quite a while now, and I’m really hoping that they decide to turn it on by default and actually use it to optimize compiled code (assuming, that is, that all of the other options for dealing with these things, which are better IMO but would require language changes, such as value semantics as mentioned above, are discarded). As far as I am aware, none of the current JVM builds use escape analysis for stack allocation or inlining, which is a shame. Supposedly it is used for lock coarsening, but based on that link even that seems to be hit or miss, only helping in fairly trivial situations. Hopefully things will improve over the next few versions…

I know I’m late coming to this thread, but: Wha…? Since when? Proof?

Not really proof, but in a piece of code I had, I replaced x*2 with x<<1 and it decreased the CPU usage by a noticeable amount, something like 10-15%.
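For what it’s worth, the two forms are interchangeable for int in Java (the arithmetic wraps, so even negatives and overflow agree), and this strength reduction is one the JIT normally performs itself, which makes a measured difference surprising. A quick check:

```java
public class ShiftVsMul {
    public static void main(String[] args) {
        // Strength reduction: x * 2 and x << 1 are identical for int.
        for (int x = -1000; x <= 1000; x++) {
            if ((x << 1) != x * 2) throw new AssertionError("mismatch at " + x);
        }
        System.out.println((-7 << 1) + " " + (-7 * 2)); // -14 -14
    }
}
```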