Programming to the cache

On a Mac, Java 6 is only implemented as a 64-bit server VM, so -server does nothing. There is no Java 6 client VM on a Mac. As I understand it, Apple decided to implement only a 64-bit JVM since they are heading towards 64-bit throughout OS X, and the only 64-bit VM available to port was the server VM.
Conversely, there is no server VM for Java 1.5 on a Mac. The -server flag does adjust some VM parameters to values more suitable for a server (like compile thresholds and heap settings), but it is still running on the 1.5 client VM. Also, Java 1.5 on a Mac uses a 32-bit VM. I’m surprised at the large difference you saw with the -server option on the Mac.

Scott

Disabling bounds checks is out of the question I suppose :slight_smile:

Sometimes it would be nice to have some known cases where bounds checks for random array access are guaranteed to be eliminated.
For example, a case where:

  • the length of the array is a power of 2,
  • the array is declared as final (power of 2 and final just to make the check for bounds-check elimination quicker),
  • the index into the array is AND-masked with array.length - 1.

The AND operation would be cheaper than a bounds check, right?

final int[] myArray = new int[0x1000];

public int getMyValue(int index) {
   return myArray[index & 0xfff];
}

This currently doesn’t work, unfortunately, and adding the ‘& 0xfff’ just makes it slower.

How about

Object[] myArray = new Object[65536];
char i = 12;
myArray[i] = new Object();

;D

I guess that should have been: new Object[65535] :stuck_out_tongue:

Anyway, do not overestimate the overhead of array bound checks.

With sun.misc.Unsafe you can have array access without bounds checks, and it’s hardly any faster, if at all!

Some people forget that there is a null-check for every reference that is followed, even for arrays. So that silly value[p] involves both a bounds check and a null-check!

In most cases, the JVM just removes all those silly checks, even for non-trivial array access. For truly random access (still ‘random’ after the JVM has recursively inlined the code and looked for limits on your ‘random’ index that it could use to remove the checks), yes, you’ll have that slow loop… and then it’s sun.misc.Unsafe or JNI to the rescue!

I guess not :stuck_out_tongue:


Object[] myArray = new Object[65535];
char i = 65535;
myArray[i] = new Object();

barfs an ArrayIndexOutOfBoundsException
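Right: char covers 0..65535, so it’s the original new Object[65536] that accepts every possible char index, while a length of 65535 is one element short. A quick check:

```java
public class CharIndexCheck {
    public static void main(String[] args) {
        Object[] full = new Object[65536];   // one slot per possible char value
        char max = Character.MAX_VALUE;      // 65535
        full[max] = new Object();            // fine: highest valid index is 65535

        Object[] oneShort = new Object[65535];
        boolean threw = false;
        try {
            oneShort[max] = new Object();    // index 65535, length 65535
        } catch (ArrayIndexOutOfBoundsException e) {
            threw = true;
        }
        System.out.println(threw);           // true
    }
}
```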

IIRC the thing with Unsafe is that the indices are byte offsets, which is a bit of a pain when retrieving ints or something.
I’ve done tests with JEmu2, and when using Unsafe it definitely is noticeably faster, but I had to rewrite a lot of the algorithms to get the most out of Unsafe: avoiding offset <<= 2 and such to convert from index to byte offsets, keeping track of just one pointer instead of a pointer and an index, etc.
In the end I disabled the ‘Unsafe’ code because of its non-standard nature. It’s too hacky, and who knows whether it’s going to be there in all JVM implementations, working exactly the same way.
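To illustrate the byte-offset bookkeeping, a minimal sketch (sun.misc.Unsafe is unsupported API, and grabbing it via the internal theUnsafe field is exactly the kind of hack being warned about):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeIntRead {
    public static void main(String[] args) throws Exception {
        // Grab the internal Unsafe instance reflectively (non-standard, may break).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        int[] data = { 10, 20, 30, 40 };

        // Unsafe addresses array elements by BYTE offset, not by index:
        long base  = unsafe.arrayBaseOffset(int[].class); // offset of element 0
        long scale = unsafe.arrayIndexScale(int[].class); // bytes per element (4 for int[])

        int index = 2;
        // the equivalent of data[index], i.e. the 'offset <<= 2' mentioned above
        int value = unsafe.getInt(data, base + index * scale);
        System.out.println(value); // 30
    }
}
```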

Can you tell us more about that? Is there a -XX switch to adjust that limit? What options have to be set on the debug command line (does -XX:+PrintCompilation also show that effect)?

[quote]Let’s look at this loop:
Code:

long[] data = new long[1024];
long xor = 0;
for (int i = 0; i < data.length; i++)
    xor ^= data[i];

It’s obvious that this loop basically spends most of its time waiting. You can do quite a bit meanwhile, in the same thread.

it’s actually so bad, that you can do this, in exactly the same time:
Code:

long[] data = new long[1024];
long xor = 0;
for (int i = 0; i < data.length; i++)
    xor ^= swap64(data[i]);

public static final long swap64(long val)
{
    long swapped = 0;
    swapped |= (val & 0x00000000000000FFL) << 56;
    swapped |= (val & 0x000000000000FF00L) << 40;
    swapped |= (val & 0x0000000000FF0000L) << 24;
    swapped |= (val & 0x00000000FF000000L) << 8;
    swapped |= (val & 0x000000FF00000000L) >> 8;
    swapped |= (val & 0x0000FF0000000000L) >> 24;
    swapped |= (val & 0x00FF000000000000L) >> 40;
    swapped |= (val & 0xFF00000000000000L) >>> 56;
    return swapped;
}

We’re adding maybe 32 instructions per loop-iteration, and we have no speed-penalty at all.
[/quote]
Hmm, I actually just tested this, but I’m seeing something else:
The code with ‘swap64’ is about 15 times as slow on my PC (which I call quite a speed penalty :P), on both -client and -server (Java 1.6.0 update 7), running on a 32-bit dual-core Intel.
I could imagine things like this vary a lot between CPUs and/or RAM speeds and such. Riven, on what CPU and Java version did you test this?

64 bit OS (Vista)
64 bit JVM (1.6, Sun, server vm)

I bet you get really poor performance on long primitives in 32 bit JREs.

You might want to create an example that works with ints, and a swap32 method

Yes, that’s what I tried at first but that gave a similar result. The difference was less severe than with 64 bit, but it was still significantly slower…
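Side note for anyone benchmarking this: swap64 is a plain byte swap, so it should agree bit-for-bit with the JDK’s Long.reverseBytes (which HotSpot treats as an intrinsic), making it easy to verify the harness before timing anything:

```java
public class Swap64Check {
    public static final long swap64(long val) {
        long swapped = 0;
        swapped |= (val & 0x00000000000000FFL) << 56;
        swapped |= (val & 0x000000000000FF00L) << 40;
        swapped |= (val & 0x0000000000FF0000L) << 24;
        swapped |= (val & 0x00000000FF000000L) << 8;
        swapped |= (val & 0x000000FF00000000L) >> 8;
        swapped |= (val & 0x0000FF0000000000L) >> 24;
        swapped |= (val & 0x00FF000000000000L) >> 40;
        swapped |= (val & 0xFF00000000000000L) >>> 56;
        return swapped;
    }

    public static void main(String[] args) {
        long v = 0x0123456789ABCDEFL;
        // The hand-rolled swap and the intrinsic must agree bit-for-bit.
        if (swap64(v) != Long.reverseBytes(v)) throw new AssertionError();
        System.out.println(Long.toHexString(swap64(v))); // efcdab8967452301
    }
}
```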

[quote]Anyway, do not overestimate the overhead of array bound checks.
[/quote]
One thing I did in JEmu2, in the Sega Y-Board driver, that really gave me a nice performance boost was to load the sprite ROMs into long[] arrays. Because sprite pattern data is stored in 8-byte blocks (for 16 pixels), I only had to do one array access to retrieve 16 pixels and shift-mask the pixels out of the long value. That really made a big difference (>10% overall increase in FPS).
TBH I’m not 100% sure it’s entirely attributable to fewer bounds checks, but fewer bounds checks were the reason for packing the data into long[] :slight_smile:
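The shift-mask part looks roughly like this (a sketch with hypothetical names, assuming 4 bits per pixel so that 8 bytes hold 16 pixels):

```java
public class SpriteUnpack {
    // Hypothetical: extracts 16 four-bit pixels from one long, leftmost pixel first.
    static int[] unpackPixels(long block) {
        int[] pixels = new int[16];
        for (int i = 0; i < 16; i++) {
            // One (bounds-checked) long[] read replaces eight byte[] reads.
            pixels[i] = (int) (block >>> (60 - 4 * i)) & 0xF;
        }
        return pixels;
    }

    public static void main(String[] args) {
        int[] p = unpackPixels(0x0123456789ABCDEFL);
        System.out.println(p[0] + " " + p[1] + " " + p[15]); // 0 1 15
    }
}
```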

I’ll try to see if I can make a small benchmark that shows the behaviour I described before.

But well, you know how it is… you only have to bump your PC case and the JVM takes another optimisation path :wink:

[quote]…64 bit OS (Vista)
64 bit JVM (1.6, Sun, server vm)…
[/quote]

[quote] …running on a 32bit dual core Intel…
[/quote]
I think the difference in performance comes from the use of long: longs are slow on a 32-bit CPU and as fast as ints on a 64-bit CPU, and that’s why you see a performance penalty. Replacing everything, including the array, with int should do the job. Riven’s sample only requires trivial register ops when the value size fits the architecture (int on 32-bit, long on 64-bit), and those should be a lot faster than array accesses, which require bounds checks plus memory access.

Have you read the last 3 posts? :wink:

oops… :slight_smile:

… forgot to read the end of your post

[quote]I bet you get really poor performance on long primitives in 32 bit JREs.
[/quote]

Just reran the 32-bit version of the test: with swap32(int) it was about 9 times as slow as without.
So yeah, using long on a 32-bit CPU skewed the results somewhat, but I’m still not seeing what Riven described.

I’m really busy these days. Excuse me for the delay - I’ll post a benchmark (for review) ASAP.

You also need to prove that x doesn’t escape. As for ‘not very far’/‘per function’: the further away the ‘restricting’ code is, the harder it is to guarantee that it doesn’t escape.

Also, it was not until Java 6/7 that escape analysis was really added (and enabled by default) or completed. Undoubtedly, as that analysis evolves, it can more often give guarantees, which in turn can be used to improve other optimisations. Given those entanglements and the correlated ‘lock-step’ evolution, it is perhaps not that strange that some optimisations work and others don’t.

Add to that that it’s programmed in C/C++ and they sure get my pity. Maxine seems to be pretty sweet in that respect.

ps. To do the thing Riven talked about and see the output, have a look here: http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html

[quote]My impression is that the JVM specs guarantee literally nothing about any sort of contiguous allocation - by my reading, an array of floats could very well be scattered all across the heap. Apparently even the stack is usually not contiguous in Sun’s VMs.
[/quote]
I’m pretty sure java.nio.MappedByteBuffer is contiguous. I use it for volatile ringbuffers.
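A minimal sketch of how such a mapping is created (file name and sizes made up; the point is that the mapped region is one contiguous range of virtual address space, which is what a ring buffer cares about):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRing {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("ring", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // Map 4 KiB of the file into memory as one contiguous buffer.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putInt(0, 0xCAFEBABE);          // write through the mapping
            System.out.println(Integer.toHexString(buf.getInt(0))); // cafebabe
        }
    }
}
```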

True, though in that example it’s a bit easier since x is a float: it’s bound to its declared scope and can’t escape via a function call or a reference stored away somewhere. That’s what’s so nice about having objects with value semantics in a language: it’s dead simple to allocate them on the stack instead of the heap.

Right on. I’ve been screaming bloody murder about how desperately we could use escape analysis for quite a while now, and I’m really hoping that they decide to turn it on by default and actually use it to optimize compiled code (assuming, that is, that all of the other options for dealing with these things, which are better IMO but would require language changes, such as value semantics as mentioned above, are discarded). As far as I am aware, none of the current JVM builds use escape analysis for stack allocation or inlining, which is a shame. Supposedly it is used for lock coarsening, but based on that link even that seems to be hit or miss, only helping in fairly trivial situations. Hopefully things will improve over the next few versions…

I know I’m late coming to this thread, but: Wha…? Since when? Proof?

Not really proof, but in a piece of code I had, I replaced x*2 with x<<1 and it decreased the CPU usage by a noticeable amount, something like 10-15%.
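For what it’s worth, the two forms are interchangeable for int in Java (the arithmetic wraps, so even negatives and overflow agree), and this strength reduction is one the JIT normally performs itself, which makes a measured difference surprising. A quick check:

```java
public class ShiftVsMul {
    public static void main(String[] args) {
        // Strength reduction: x * 2 and x << 1 are identical for int.
        for (int x = -1000; x <= 1000; x++) {
            if ((x << 1) != x * 2) throw new AssertionError("mismatch at " + x);
        }
        System.out.println((-7 << 1) + " " + (-7 * 2)); // -14 -14
    }
}
```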