Quiet in here these days...

… guess that means it’s fast enough now.

Cas :slight_smile:

well 1.6 IS really nice - now we just need world+dog to upgrade…

Good observation, wrong conclusion :slight_smile:

All VM performance topics have been beaten to death…

What we need is:

  • {primitive}Buffer performance equal to {primitive}[] performance - in all cases.

SIMD:

  • the Intel Core 2 Duo handles 4 values in 1 operation
  • all other SIMD-enabled CPUs handle 4 values in 2 operations
  • the Sun Java VM can only handle 4 values in 4 operations

Java 6.0 is nice, indeed, but due to changes in the CPU implementations, the gap is widening…!

But why start yet-another-topic about it?

Now that Hotspot is opensource, no one prevents you from implementing it :stuck_out_tongue:

1.) I have to admit that this would be beneficial to some kinds of applications; however, lobbying for this as if it were one of the few things still missing in Java is a bit unrealistic.

2.) Don’t you think that, especially for buffers, almost everything possible has already been done to achieve high performance in such a critical area?

Well, 4 values in 4 instructions. But that does not mean e.g. a Core 2 Duo will need 4 cycles to process the data, since it is able to issue four integer instructions at once.

lg Clemens

It is able, yes, but the VM isn’t sending it the right instructions to do so, so even the Core 2 Duo is stuck at 4 values in 4 instructions, instead of 4 in 1.

To show what I mean:
Doing math in pure Java was 2.4x slower (40% of that due to pointer arithmetic) than using JNI to invoke a native method that used SIMD.
On a Core 2 Duo that difference might very well be 4.4x.

I’m talking about simple loops like this:


for(int i=0; i<n; i++)
   (*dst++) = (*op1++) * (*op2++); // C-compiler turns this into SIMD

vs

for(int i=0; i<n; i++)
   dst[i] = op1[i] * op2[i];
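To make the comparison concrete, here is a minimal, self-contained sketch of the kind of microbenchmark being talked about: timing that exact element-wise multiply loop over plain arrays on the server VM. The class name, array sizes and warmup scheme are my own choices, not from this thread.

```java
// Times the element-wise multiply loop over plain int[] arrays.
// This is the loop HotSpot would need to turn into SIMD instructions.
public class MulLoopBench {
    static void mul(int[] dst, int[] op1, int[] op2) {
        for (int i = 0; i < dst.length; i++)
            dst[i] = op1[i] * op2[i];
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        int[] dst = new int[n], op1 = new int[n], op2 = new int[n];
        for (int i = 0; i < n; i++) { op1[i] = i; op2[i] = 2; }

        // Warm up so the server VM compiles the loop before we measure.
        for (int k = 0; k < 100; k++) mul(dst, op1, op2);

        long t0 = System.nanoTime();
        for (int k = 0; k < 1000; k++) mul(dst, op1, op2);
        long t1 = System.nanoTime();
        System.out.println("ns per loop pass: " + (t1 - t0) / 1000);
    }
}
```

Swapping the body for `FloatBuffer.get(i)`/`put(i, …)` calls gives the buffer-vs-array comparison discussed above.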

An RFE has been filed about this: 6509032. Check it out in Sun’s bug database. If this were implemented then it should completely eliminate the performance difference between Buffer get()/put() operations and array operations. Please vote for it. Right now it doesn’t even have a responsible engineer assigned to it.

Work is ongoing in the Java HotSpot VM to enable better use of SIMD instructions. Stay tuned.

In my recent brush with JVM reference tricking, I had some assumptions about how the VM handles its reads/writes for arrays and Buffers.

for an array:


array[i] = 5; // array is simply a pointer

thus: WRITE 5 AT ((int)array + i)

for a buffer:


buffer.put(i, 5); // buffer is simply a pointer, with 'base' at offset N, thus:

WRITE 5 AT (FETCH((int)buffer + base_field_offset) + i)

So how can a Buffer (pointer-to-a-pointer + offset) ever get as fast as an array (pointer + offset)?

Again, these are ‘just’ assumptions, don’t be too harsh if I’m way off. :slight_smile:

I voted for 6509032, read the description, and couldn’t see how this was taken into account.

Hm… the base pointer could be cached by the VM, of course, in the case of loops.

But the solution mentioned in 6509032 certainly does not address that.
It seems to assume the direct ByteBuffer (object!) is allocated in direct memory.
Further, it only (seems to) equalize the performance of heap and direct buffers, not buffers vs. arrays.

As you pointed out, the base pointer of Buffers (direct or non-direct) is immutable, so in the case of loops or rapidly repeated method calls (which will likely be inlined) you won’t have to do the additional dereference of the Buffer because the base pointer’s value will be fetched once at the top of the loop. This should make both heap-based and direct Buffers as fast as array accesses in all cases.
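For heap buffers you can already do that hoist by hand today, using the real `FloatBuffer.hasArray()`/`array()`/`arrayOffset()` API: fetch the backing `float[]` once and loop over it with plain array accesses. Direct buffers have no backing array, so this only helps the heap case; the loop below is just an illustrative sum, not anything from this thread.

```java
import java.nio.FloatBuffer;

public class BufferHoist {
    static float sum(FloatBuffer buf) {
        if (buf.hasArray()) {
            float[] a = buf.array();     // fetched once, like the hoisted base pointer
            int off = buf.arrayOffset();
            float s = 0f;
            for (int i = 0; i < buf.limit(); i++)
                s += a[off + i];         // plain array access inside the loop
            return s;
        }
        float s = 0f;
        for (int i = 0; i < buf.limit(); i++)
            s += buf.get(i);             // fallback for direct buffers
        return s;
    }

    public static void main(String[] args) {
        FloatBuffer b = FloatBuffer.wrap(new float[]{1f, 2f, 3f});
        System.out.println(sum(b)); // prints 6.0
    }
}
```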

Blink twice if it’s autovectorization and cough once if it is going to be in se7.

Better use of SIMD? I wasn’t aware that it used it at all, when did that sneak in?

Java 1.4 (see the performance notes, somewhere out there on sun.com)

Let me add two questions:
Is it just me who is waiting every week for the next JDK 7 build, hoping it’ll come with tiered compilation?
Anyone tried the IBM 6 pre-release? It has a not yet documented but nonetheless mentioned feature called “Data sharing between JVMs: Ahead Of Time (AOT) compiled code” - sounds interesting…

I finally found out why my float[] vs FloatBuffer results always had a different winner in similar benchmarks…

I just finished my best performance benchmark to date, because each test case is run inside its own VM. Everything is warmed up properly by the server VM before the measurements begin.

Java 6.0 server VM:

http://www.songprojector.com/temp/jvm_bench_int_math.PNG

http://www.songprojector.com/temp/jvm_bench_float_math.PNG

Benchmark
Calculate the cross-product of 2 data-sets

Results
1/3rd of the benchmarks are 20-50% slower, while the data-set differs only 4K in size
1/12th of the benchmarks are 60-75% slower, while the data-set differs only 4K in size
The 3 types seem to have a distinct ‘phase offset’, causing the ‘winner’ to be fairly predictable for a certain data-set size.

After a certain data-set size, the int[]/float[] versions lose 50% performance, and lose their ‘spikes’.

Question
Whose fault is this? HotSpot? The OS? The CPU? The RAM?
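For reference, a sketch of the kind of size sweep behind charts like those: time the same element-wise float multiply at growing data-set sizes. The sizes and iteration counts are my own guesses, not the poster’s harness; the point is that once the working set no longer fits in cache, the time per element jumps.

```java
// Sweeps the data-set size for an element-wise multiply and reports
// ns/element, which exposes cache-capacity cliffs.
public class SizeSweep {
    static void mul(float[] dst, float[] a, float[] b) {
        for (int i = 0; i < dst.length; i++) dst[i] = a[i] * b[i];
    }

    public static void main(String[] args) {
        for (int n = 1 << 10; n <= 1 << 22; n <<= 2) {
            float[] dst = new float[n], a = new float[n], b = new float[n];
            for (int k = 0; k < 50; k++) mul(dst, a, b); // warmup
            long t0 = System.nanoTime();
            for (int k = 0; k < 200; k++) mul(dst, a, b);
            long t1 = System.nanoTime();
            System.out.printf("n=%d  ns/element=%.2f%n", n, (t1 - t0) / (200.0 * n));
        }
    }
}
```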

If they just gave developers vector primitives, and gave the VM the appropriate backend compilers, most of the work would be done. With vector primitives the VM could compile to scalar code if the proper instruction set isn’t there, to SSE2 on x86 machines, AltiVec on Power, and VIS on SPARC, all the while allowing developers to do the vectorization themselves (and relatively easily at that). Because frankly, from what I’ve seen, autovectorization is never likely to perform well enough.
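To illustrate the idea, here is an entirely hypothetical sketch of such a vector primitive: a `Float4` value the VM could map to one SSE/AltiVec/VIS instruction, or fall back to the scalar code below when no vector unit is available. Nothing like this existed in the JDK at the time; the class name and API are my invention.

```java
// Hypothetical 4-wide float vector primitive, scalar fallback shown.
public class Float4 {
    public final float a, b, c, d;

    public Float4(float a, float b, float c, float d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }

    // On SSE2 hardware a VM could compile this to a single MULPS;
    // here it is just the scalar code the VM would otherwise emit.
    public Float4 mul(Float4 o) {
        return new Float4(a * o.a, b * o.b, c * o.c, d * o.d);
    }
}
```

The developer writes `x.mul(y)` once; whether that becomes one instruction or four stays the VM’s problem, which is exactly the split being argued for above.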

I can only speak about GCC and the Intel C compiler, but it’s quite tricky to get autovectorization for loops that do little more than one operation over a large set of data, which often makes the code more complicated than it would be with vector primitives.

I think autovectorization is a great tool for C2 to generate even better code, because there may be loops out there which trigger this enhancement, but for programmers I guess explicit vector routines would be best. At least the programmer would have direct control over what happens, and would not have to rely on some optimization magic to get stuff built the way it’s intended.

lg Clemens

SSE2 instructions have been used in the HotSpot server VM and more recently the client VM for floating-point operations, but only with scalar values so far. What I meant by “better use” is using these instructions in vector form.

Sounds like a cache issue to me – that in some situations data that needs to be in the cache is being evicted and reloaded from main memory.

I’m not involved with the development at all, so no comment from me, but I’ve pointed the responsible engineer at this thread so maybe we’ll hear it from the source.

That would be really great :slight_smile: