Vectorization-Optimization RFE accepted

Hi there,

I recently opened an RFE for adding vectorization optimizations to the HotSpot server compiler, which should help especially for games and other throughput computing on modern processors.
In fact it's today one of the two reasons why we still use C for our heaviest number-crunching stuff: the whole framework is written in Java, but the numbers are crunched in C about 45% faster than with Java :frowning:

So if you're interested, give it a vote at: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6340864

lg Clemens

A problem with floating point vectorization may be maintaining Java's relatively strict rules on FP results. This prevents the use of the multiply-accumulate instruction available on some CPUs (it uses extra precision for the intermediate result, which Java forbids).
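To make that concrete, here's a tiny sketch (Math.fma only arrived much later, in Java 9, but it shows exactly the variation Java's rules forbid):

public class FmaDemo {
    public static void main(String[] args) {
        // Java's rules: the product 0.1 * 10.0 is rounded to exactly 1.0
        // first, so subtracting 1.0 gives 0.0.
        double separate = 0.1 * 10.0 - 1.0;       // 0.0

        // A fused multiply-add keeps the exact product internally and
        // rounds only once, exposing the representation error of 0.1.
        double fused = Math.fma(0.1, 10.0, -1.0); // ~5.551115123125783E-17

        System.out.println(separate + " vs " + fused);
    }
}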
There was a JSR aimed at relaxing some of these rules, but it was withdrawn. :-[

http://jcp.org/en/jsr/detail?id=84
This JSR proposes extensions to the Java™ Programming Language and Java Virtual Machine that support more efficient execution of floating point code.

Withdrawn 2002.03.01. Due to the general absence of interest in the community, the Specification lead withdrew the JSR.

Only 45% faster! I don’t get out of bed for gains that small. :wink:

Seriously though, for me interesting performance gains start at a minimum factor of 2.

I fear such an optimisation will only be done in a few undefined cases, where the JIT sees it as a possibility, so you've got to do trial and error and see when the JIT 'accepts' your code and does this optimisation.

Wouldn't it be a heck of a lot easier to use the native direct buffers we've got now, make a little API in C/C++ and a wrapper around it? You're normally processing lots of vectors at once, so you could do a single JNI call, reducing your 45% performance improvement to 44.9%. The advantage is that it's guaranteed, maxed-out performance.
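A minimal sketch of what I mean (the class, native library and kernel are all made up for illustration):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class SimdKernels {                          // hypothetical wrapper
    static { System.loadLibrary("simdkernels"); }  // assumed native library

    // One JNI call processes n elements on the native side, where the
    // C/C++ implementation is free to use SSE/AltiVec and is guaranteed
    // to do so, independent of what the JIT decides.
    static native void mul(FloatBuffer a, FloatBuffer b, FloatBuffer dst, int n);

    // Direct buffers so the native side can read the data without copying.
    static FloatBuffer alloc(int n) {
        return ByteBuffer.allocateDirect(n * 4)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }
}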

Some RFEs for FMA still seem to be "in progress"… but I'm not sure of the level of interest…

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4851642

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4919337

I for one see a 45% performance gain as pretty significant, especially in realtime or game development.

I also think so, especially keeping in mind that there were some parts even Intel's C compiler was not able to vectorize.
So it's basically the same as with C compilers: of course you have to write code the runtime can digest; it has never been different.
You can't use primitive wrappers all over your algorithmic code and then blame the JVM for its speed - sure it works, but it's slow by design. It's the same with bounds-check removal: statements that are too complex cannot be optimized, of course.
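A quick sketch of my own to illustrate the difference (nothing JVM-specific here, just the two shapes of code):

import java.util.List;

class LoopShapes {
    // JIT-friendly: a simple counted loop over a.length lets HotSpot
    // eliminate the per-access bounds checks and leaves a clean
    // candidate for SIMD vectorization.
    static float sumFast(float[] a) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++)
            sum += a[i];
        return sum;
    }

    // Slow by design: every iteration unboxes, adds, and boxes a new
    // Float object - no amount of JIT cleverness makes this fast.
    static Float sumSlow(List<Float> a) {
        Float sum = 0f;
        for (Float v : a)
            sum += v;
        return sum;
    }
}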

So why do we have optimizing runtimes at all? We could simply use an interpreting VM and write all the time-critical stuff in C/C++ using JNI.
Hmm, sorry … I do not want to lose platform independence, nor do I want to waste my time with JNI.

This may help by as little as 10% in your code. Adding escape-analysis-based stack allocation may add another 5-10%. You can't expect such a mature and heavily optimized runtime system as Java to make 2x jumps in performance anymore :wink:
lg Clemens

[quote=“Linuxhippy,post:7,topic:25050”]
I said native wrappers around the library implemented in C++. That's something else entirely. Think: JOGL or LWJGL style.

Of course it can be a general speed increase for all applications, but for complex games you need guaranteed optimisation. The only way to force SIMD is by using wrapped native code. I'm not saying this RFE is bad - it's good.

I understood your post; I was referring to the point that someone would have to write JIT-specific code - I just wanted to show that this is no different from today: even now you need to take care that your code performs well on your JVM.

native == assembler?
Here it also depends on which compiler the user has installed - especially on platforms where it's common for users to compile their programs themselves.

However, peace :wink:

lg Clemens

How about a JNI wrapper for the BLAS - http://www.netlib.org/blas/? Most Fortran supercompilers work by re-expressing loops as a sequence of linear-algebra operations that are then passed to hand-optimised libraries that can take account of tiling, prefetching, SIMD ops etc.

With reference to the thread about a pure Java physics engine - the kernels of physics engines are easily vectorisable like this (ODE's kernel is an iterative matrix decomposition).

Unfortunately, for efficiency, you would probably end up with the matrices being in (Byte/Int/Float/Double)Buffers, which are not so convenient to use from the Java side. Without the JNI overhead even a 3x3 matrix multiply would probably benefit from vectorisation, but the JNI overhead makes the minimum worthwhile matrix size rather larger.
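For illustration, a binding over direct buffers might look something like this (the class and library names are made up; cblas_sdot is the real single-precision dot product in the C BLAS interface):

import java.nio.FloatBuffer;

final class Blas {                                 // hypothetical binding class
    static { System.loadLibrary("javablas"); }     // assumed JNI glue library

    // Maps onto cblas_sdot: dot product of two single-precision vectors,
    // letting the underlying BLAS handle tiling, prefetching and SIMD.
    static native float sdot(int n, FloatBuffer x, int incX,
                                    FloatBuffer y, int incY);
}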

I seem to recall that some JVMs have some sort of fast-path alternative to JNI which is used for special operations such as implementing System.currentTimeMillis. I don't know what conditions are required for this to work.

[quote=“Mark Thornton,post:11,topic:25050”]
For natives which are known to the JVM there are plenty of possibilities for inlining. The function can be expanded directly into the calling method, or the call can be converted into a plain C call to a special non-JNI implementation instead of a Java call going through JNI. I don't think there is any way to get such a fast path for user libraries.

I think that requesting some SIMD operations on java.lang.Math or somewhere around it is a sane proposal. Implementing it on your own through JNI is probably a killer from a performance point of view.

The java.math package might be an appropriate location. Doing it as functions avoids the problem of Java’s rather strict rules for floating point math and doesn’t require any cleverness from the compiler.

For example, suppose you want the compiler to recognise an opportunity for SIMD in

float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++)
        sum += a[i] * b[i];   // each += must round separately under Java's FP rules
    return sum;
}

Then the PowerPC can't use its multiply-accumulate instruction. On the other hand, if you have a method


native float dotProduct(float[] a, float[] b);

Then we can happily implement that using the multiply-accumulate instruction, provided the method's contract allows that sort of variation (i.e. it doesn't require bit-identical results on all platforms).

Better than java.math would be to promote javax.vecmath to the core JDK and add some methods to the classes there. For example, imagine something like the Matrix4f.transform() method taking a FloatBuffer; the JVM/libraries could then push these through using the native CPU's SIMD instruction set. For things like skinning and some of the large-scale sci-viz applications, this would be a tremendous performance boost.
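A purely hypothetical sketch of the kind of API I mean (nothing like this exists in javax.vecmath today):

import java.nio.FloatBuffer;

// Hypothetical API sketch. The idea: hand the runtime a whole packed
// buffer of (x, y, z) points in one call, so the implementation can
// run the loop with native SIMD instructions.
interface BufferTransform {
    // Applies this transform to n points stored as consecutive
    // x,y,z floats in xyz, writing the results back in place.
    void transform(FloatBuffer xyz, int n);
}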

I am in favour of this… a unified vecmath library would do wonders. E.g., all scenegraphs having the same math library as JOODE (or something else) would be really nice…

However, I don't like vecmath a lot… so maybe we should file an RFE to improve the API?

DP

Vecmath is much too specialised for 3D graphics. If you wanted to do general vector/matrix operations, you wouldn't start from there.

I still think we need to look at a BLAS approach. OK, maybe JNI isn't the best way to go (though you probably don't need to work on many vectors at once to win over the interface overhead) - but following the discussion of moving vecmath into the core, the BLAS would be a far better choice, at which point you've only got a normal subroutine call overhead, which is insignificant.

A good example might be character skinning, where for each vertex you have an accumulating product with each bone influence and a scalar weight. If you set your data up correctly (I have classes derived from the native buffer that handle Vector3fBuffer etc.), then you can do your character skinning in one BLAS call.
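A deliberately simplified plain-Java sketch of the structure I mean (one weight per bone here; real skinning weights vary per vertex, but the shape of the loop is the same):

// out += w[b] * (M[b] * in), accumulated over all bone influences.
// Each pass over the packed vertex buffer is one "y += alpha*A*x"
// step, i.e. a single BLAS-style call - exactly what vectorises well.
static void skin(float[][] m /* row-major 3x3 per bone */, float[] w,
                 float[] in /* packed xyz */, float[] out) {
    for (int b = 0; b < m.length; b++) {       // one BLAS call per bone
        float[] mb = m[b];
        float wb = w[b];
        for (int v = 0; v < in.length; v += 3) {
            float x = in[v], y = in[v + 1], z = in[v + 2];
            out[v]     += wb * (mb[0] * x + mb[1] * y + mb[2] * z);
            out[v + 1] += wb * (mb[3] * x + mb[4] * y + mb[5] * z);
            out[v + 2] += wb * (mb[6] * x + mb[7] * y + mb[8] * z);
        }
    }
}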

As I mentioned previously, there are highly optimised BLAS routines for every processor architecture out there, including versions which will handle long/short vectors and support tiling, prefetching and other super-optimisations.

And the really nice thing is that you don’t need to work too hard in the compiler to analyse many loops into a sequence of equivalent BLAS calls (particularly if you are allowed to relax the IEEE rules on operation order). This is what the Cray Fortran compilers did back in the late 80s.

Relaxing the operation order of expressions and statements in Java is probably a non-starter. There is not enough interest in the potential benefits, and too many people find floating point baffling enough as it is.

I suppose that such work could be done somewhere at the driver level - emulating non-GPU-accelerated vertex shaders on the CPU. Vertex/pixel programs are a wonderful target for any kind of vector optimization. Additionally, you don't have to care about exact floating-point semantics - they are very loose on GPUs anyway.

Anyway, the current movement is rather the opposite - moving some non-graphics-related computations from the CPU to the GPU. You can get 1 or 2 orders of magnitude improvement with that, as opposed to 2-3x at most from using vector CPU instructions.

Sorry for the necro, but it looks like a HotSpot engineer has started working on this RFE. Better late than never? :o