SSE / SSE2 - SIMD wrapper

In several topics the performance of tight loops was discussed. The current implementation of the Sun JVM can't take advantage of the SIMD capabilities of SSE(2). When raw performance is needed, SIMD would be a nice solution. Of course these instructions cannot be wrapped directly, but a proper API around them would make them accessible.

Of course not all CPUs have SSE/SSE2, so to keep it platform- (or CPU-) independent, we'd have a fallback to plain Java instructions.


FloatMath.add(FloatBuffer a, FloatBuffer b);
FloatMath.sub(FloatBuffer a, FloatBuffer b);
FloatMath.mul(FloatBuffer a, FloatBuffer b);
FloatMath.madd(FloatBuffer a, FloatBuffer b);

FloatMath.lerp(FloatBuffer a, FloatBuffer b, float w, FloatBuffer r); // combination of several SSE instructions

The code would do four calculations at once until fewer than four elements remain in the buffers, and then finish the remainder one by one.
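For illustration, the native side of such a FloatMath.mul could look roughly like this - a hypothetical sketch using the standard SSE intrinsics (the function name and signature are made up, not part of any existing API):

#include <xmmintrin.h> /* SSE intrinsics; needs -msse on GCC */

/* Hypothetical back-end for FloatMath.mul(a, b): a[i] *= b[i].
 * Four floats per iteration, scalar loop for the remaining <4. */
void float_math_mul(float* a, const float* b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* load 4 unaligned floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(va, vb));
    }
    for (; i < n; i++) /* fewer than 4 elements left */
        a[i] *= b[i];
}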

This idea is so obvious that I feel I'm forgetting something. Also, I lack any C skills and have thus never done any projects involving JNI.
From my point of view it looks like this could be programmed in a rainy afternoon.

Could anybody shed some light on this?

Not exactly what you're after, but it's heading in the same direction:
http://java.sun.com/j2se/1.4.2/changes.html#vm

Sun’s JVM supports SSE(2), yes, but not the SIMD part of it. That’s the whole point :slight_smile:

The whole point of JVM optimisations like that is to code in patterns that the JIT will recognise and optimise.
How many of us changed a decrementing loop into an incrementing one in order to get rid of bounds checks? Most of those who needed to. This is the same principle. (By the way, is that still needed?)
Once we know the pattern the JIT looks for, our code gets optimised the same way without needing special tools. Hopefully those patterns are generally the ones of a correct object design.
IIRC, this was discussed here two years ago. We finally came to the idea that such operations should be done on big arrays of data, instead of operation by operation. So it would be the job of a special API rather than of special instructions recognised by the JIT.

I don’t really see your point :slight_smile:

We are both saying these kinds of optimisations can't (or shouldn't) be done by the JIT, but by some specialised API.
And we both agree it is only useful on large sets of data.
So it seems we share the same ideas :slight_smile:

How hard would it be for anybody to code this API? As I see it, it would be a very small effort.

Your idea looks interesting, but I doubt you will get much help here. Try posting this on an assembler forum, like FASM for example, and ask for help there:

http://flatassembler.net/

Or you could try to generate the machine code yourself into memory and then call the generated code from Java - I don't know if that can be done in Java. On this resource page you will find a link to the Intel docs describing their machine-code encoding:

http://board.flatassembler.net/topic.php?t=2530

Finally, if there is any way to extend the JVM's JIT with a DLL, that would be the best solution. The JIT would be trained to recognise certain specific classes like 'vector' and 'matrix' and automatically select SIMD instructions, if available, to do the math. This may be very hard or even impossible to do without any support from Sun.

What you described would be hugely ineffective. A better way would be something like a SIMD program ARB: write down some SSE3 instructions, add pipes, and be happy.

SIMDAbstractModule.addProgram(float[] pipes, SIMDProgramARB2 program, int IN_PLACE | LOCAL)

Of course the SIMD program would need Intel syntax, to make it easier to work with the Intel manuals and to decrease the amount of unnecessary locks.

instruction destination, source

{
    pipe uint p0
    pipe uint p1

    mov r0, p0
    mul p1, p1
    add p0, p1
    add p0, r0
}

{
    p0 data {int} // This doesn't imply any conversion; it's simply the Java way to describe the layering of the pipe.
    p1 data {int}
}

[quote]What you described would be hugely uneffective.
[/quote]
Why? These methods could be represented by intrinsics and would therefore run for "free".
Furthermore, the suggested API has a quite clear design and would be portable across the many architectures that support SIMD instructions, and even on older CPUs there could be several levels of fall-backs.
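To make the fall-back concrete, dispatching could look something like the sketch below. Note the assumptions: __builtin_cpu_supports requires a much newer GCC than the 3.4 used later in this thread, and all function names are made up:

#include <xmmintrin.h> /* SSE intrinsics; compile with -msse */

/* Plain C fall-back for CPUs without SSE. */
static void mul_scalar(float* a, const float* b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] *= b[i];
}

/* SSE path: 4 floats per iteration plus a scalar tail. */
static void mul_sse(float* a, const float* b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(a + i, _mm_mul_ps(_mm_loadu_ps(a + i),
                                        _mm_loadu_ps(b + i)));
    for (; i < n; i++)
        a[i] *= b[i];
}

/* Hypothetical entry point: pick the widest implementation
 * the CPU actually supports, checked at run time. */
void float_math_mul(float* a, const float* b, int n)
{
    if (__builtin_cpu_supports("sse"))
        mul_sse(a, b, n);
    else
        mul_scalar(a, b, n);
}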

I don't agree here. You completely lose platform independence, and furthermore, executing your own assembly would render the whole JVM concept useless: no sandbox could be trusted anymore, and the JIT's optimizations would have to take care of whatever nonsense happens in the assembly (how should the JIT know which registers are used or which CPU state has been changed?).

lg Clemens

I’m glad you said that, as I just couldn’t figure out how it could be that inefficient.

The C compiler should be smart enough to generate reasonably fast assembly from the xmmintrin.h intrinsics. I've tried it, but I get compiler errors from those very same header files. I fear this will require some effort from somebody who actually knows what he's doing, hehe, so not me.
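For what it's worth, a common cause of those errors on GCC is compiling without -msse, which is what enables xmmintrin.h in the first place. A minimal sketch of the intrinsic spellings of the __builtin_ia32_* calls that appear later in this thread (the function name is made up):

#include <xmmintrin.h> /* build with: gcc -msse -O3 ... */

/* Multiply four packed floats in place: a[0..3] *= b[0..3]. */
void mul4(float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);          /* __builtin_ia32_loadups */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_mul_ps(va, vb)); /* mulps + storeups       */
}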

If anybody is interested in doing it, I (and I suppose many others) would be delighted to hear it :slight_smile:

Simple benchmark - the regular disclaimers about benchmark implementation, coding mistakes etc. apply :slight_smile:

The code implements element-wise vector multiplication, A = A*B, for vectors A and B of length 4*1024*1024, populated with randomly distributed floats in [0,100).
The first set of numbers is with float arrays, the second with (direct) FloatBuffers and the third with a native implementation (direct ByteBuffers used to transfer the data).
The "Well..." lines don't really serve any purpose except possibly preventing the number crunching from being optimized away (?).

java version:
java version “1.6.0-rc”
Java™ 2 Runtime Environment, Standard Edition (build 1.6.0-rc-b68)
Java HotSpot™ Client VM (build 1.6.0-rc-b68, mixed mode, sharing)

gcc version:
gcc (GCC) 3.4.5 (Gentoo 3.4.5, ssp-3.4.5-1.0, pie-8.7.9)

command used to compile shared library:


gcc -pipe -msse -march=pentium4 -O3 simd_SimdTest.c -o libtestsimd.so -shared -ljvm -lverify -lnet -ljava -lnio -I/opt/sun-jdk-1.6.0/include -I/opt/sun-jdk-1.6.0/include/linux -L/opt/sun-jdk-1.6.0/jre/lib/i386 -L/opt/sun-jdk-1.6.0/jre/lib/i386/client -Wl,-soname,libtestsimd.so -lc  -fPIC  -D_REENTRANT  -W -Wall -Wno-unused -Wno-parentheses -Werror 


<snip>
Well...:                        7412,38965.
Array calc took 34631 us.
Well...:                        4627,65234.
Array calc took 35024 us.
Well...:                        1983,89075.
Array calc took 34717 us.
<snip>
Well...:                        578,43506.
Buffer calc took 34490 us.
Well...:                        5744,37891.
Buffer calc took 34895 us.
Well...:                        3716,52808.
Buffer calc took 33800 us.
<snip>
Well...:                        1946,09839.
SIMD calc took 12619 us.
Well...:                        657,79602.
SIMD calc took 12674 us.
Well...:                        3669,88257.
SIMD calc took 12666 us.

Java:


package simd;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class SimdTest {
	private static final int ITERS = 10;
	private static final int SIZE = 4*1024*1024;
    private static float[] dataA = new float[SIZE];
	private static float[] dataB = new float[SIZE];

	private static FloatBuffer bufferA = ByteBuffer.allocateDirect(SIZE*4).
										 order(ByteOrder.nativeOrder()).
										 asFloatBuffer();
	private static FloatBuffer bufferB = ByteBuffer.allocateDirect(SIZE*4).
	  									 order(ByteOrder.nativeOrder()).
	  									 asFloatBuffer();
	
	public static void main(String[] args) {
		
        // Test ARRAY based
        int random = (int) (SIZE*Math.random());
        for (int i = 0; i < ITERS; i++) {
			initArrays();
			long before = System.nanoTime();
			calcArrays();
			long after = System.nanoTime();
            System.out.printf("Well...:\t\t\t%4.5f.\n", dataA[random]);
			long diffInUs = (after - before) / 1000;
			System.out.printf("Array calc took %d us.\n", diffInUs);
		}
		
        // Test BUFFER based
		for (int i = 0; i < ITERS; i++) {
			initBuffers();
			long before = System.nanoTime();
			calcBuffers();
			long after = System.nanoTime();
            System.out.printf("Well...:\t\t\t%4.5f.\n", bufferA.get(random));
			long diffInUs = (after - before) / 1000;
			System.out.printf("Buffer calc took %d us.\n", diffInUs);
		}
		
        // Test SIMD based
        System.loadLibrary("testsimd");
        for (int i = 0; i < ITERS; i++) {
			initBuffers();
			long before = System.nanoTime();
			calcSIMD(bufferA, bufferB);
			long after = System.nanoTime();
			System.out.printf("Well...:\t\t\t%4.5f.\n", bufferA.get(random));
			long diffInUs = (after - before) / 1000;
			System.out.printf("SIMD calc took %d us.\n", diffInUs);
		}
	}

	private static void calcArrays() {
		for (int i = 0; i < SIZE; i++) {
			dataA[i] = dataA[i]*dataB[i];
		}
	}

	private static void calcBuffers() {
		for (int i = 0; i < SIZE; i++) {
			bufferA.put(i, bufferA.get(i)*bufferB.get(i));
		}
	}

	private static native void calcSIMD(FloatBuffer bufferA, FloatBuffer bufferB);
	
	private static void initArrays() {
		for (int i=0; i < SIZE; i++) {
            dataA[i] = (float) (Math.random() * 100);
            dataB[i] = (float) (Math.random() * 100);
		}
    }

	private static void initBuffers() {
		for (int i=0; i < SIZE; i++) {
            bufferA.put(i, (float) (Math.random() * 100));
            bufferB.put(i, (float) (Math.random() * 100));
		}
	}
}

C header:


/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class simd_SimdTest */

#ifndef _Included_simd_SimdTest
#define _Included_simd_SimdTest
#ifdef __cplusplus
extern "C" {
#endif
#undef simd_SimdTest_SIZE
#define simd_SimdTest_SIZE 4194304L
/*
 * Class:     simd_SimdTest
 * Method:    calcSIMD
 * Signature: (Ljava/nio/FloatBuffer;Ljava/nio/FloatBuffer;)V
 */
JNIEXPORT void JNICALL Java_simd_SimdTest_calcSIMD
  (JNIEnv *, jclass, jobject, jobject);

#ifdef __cplusplus
}
#endif
#endif

C-impl:


#include <jni.h>
#include <stdio.h>
#include "simd_SimdTest.h"

//typedef float v4sf __attribute__ ((mode(V4SF)));

JNIEXPORT void JNICALL Java_simd_SimdTest_calcSIMD(JNIEnv *env, jclass cls, jobject ba, jobject bb)
{
    /*
    // Float vectors of size 4
    v4sf a;
    v4sf b;
    v4sf result;
    */
    
    float* bufferA = (float*) (*env)->GetDirectBufferAddress(env, ba);
    int sizeA = (*env)->GetDirectBufferCapacity(env, ba);
    float* bufferB = (float*) (*env)->GetDirectBufferAddress(env, bb);
    
    // assumes the buffers are equal in size.
    int i = 0;
    for (; i < sizeA; i++) // i += 4
    {
        *bufferA = (*bufferA) * (*bufferB);
        bufferA++; bufferB++;
        
        /*
        // GCC 3.4 compiles above to same speed as the more
        // verbose version below.
        
        // Loads 4 floats from address bufferA
        a = __builtin_ia32_loadups(bufferA); 
        b = __builtin_ia32_loadups(bufferB); 
        // a * b
        result = __builtin_ia32_mulps (a, b);   
        // Stores 4 floats at address bufferA
        __builtin_ia32_storeups(bufferA, result); 
        bufferA = bufferA + 4;
        bufferB = bufferB + 4;
        */
    }
    
}

Thank you very much!

According to the results SIMD is about 2.7x faster, which is really worth it, IMHO.

I think the more complex the math done in native code, the bigger the speedup will be (up to 4x?) - like multiplying an array of vectors by one matrix, or lerping between two data sets and writing to a third (a * (1-w) + b * w).
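As a sketch of what that lerp could look like on the native side (hypothetical names again; the weight is broadcast into a register once, outside the loop):

#include <xmmintrin.h> /* SSE intrinsics; needs -msse on GCC */

/* Hypothetical back-end for FloatMath.lerp(a, b, w, r):
 * r[i] = a[i]*(1-w) + b[i]*w, four floats per iteration. */
void float_math_lerp(const float* a, const float* b, float w,
                     float* r, int n)
{
    __m128 vw  = _mm_set1_ps(w);        /* { w, w, w, w }         */
    __m128 viw = _mm_set1_ps(1.0f - w); /* { 1-w, 1-w, 1-w, 1-w } */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_mul_ps(_mm_loadu_ps(a + i), viw);
        __m128 vb = _mm_mul_ps(_mm_loadu_ps(b + i), vw);
        _mm_storeu_ps(r + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++) /* scalar tail */
        r[i] = a[i] * (1.0f - w) + b[i] * w;
}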

Was it much work? Would it be much work to add other operations? If not, could you wrap quite a few basic math operations in a native lib (DLL / SO)? :slight_smile:

That would be really great. :slight_smile:

awww… errr.
:-X
I didn’t see the ‘buffer’ part of the names of your variables. Yes, we agree.

Would something like this work too? I'm not too sure I got the pointer offsets right in the loop.

// cross
JNIEXPORT void JNICALL Java_simd_BufferMath_cross(JNIEnv *env, jclass cls, jobject a, jobject b, jobject c) {
    int size = (*env)->GetDirectBufferCapacity(env, a);

    float* pA = (float*) (*env)->GetDirectBufferAddress(env, a);
    float* pB = (float*) (*env)->GetDirectBufferAddress(env, b);
    float* pC = (float*) (*env)->GetDirectBufferAddress(env, c);
    
    size = size / 3;

    for (int i = 0; i < size; i++) {
       (*pC+0) = (*pA+1) * (*pB+2) - (*pA+2) * (*pB+1); // are these pointer-offsets valid?
       (*pC+1) = (*pA+2) * (*pB+0) - (*pA+0) * (*pB+2); // are these pointer-offsets valid?
       (*pC+2) = (*pA+0) * (*pB+1) - (*pA+1) * (*pB+0); // are these pointer-offsets valid?
       *pA = (*pA) + 3;
       *pB = (*pB) + 3;
       *pC = (*pC) + 3;
    }
}

Also, it would be good to pass only raw pointers to the function, so that the Java code can figure out the position & limit of the FloatBuffer and take them into account before entering the loop.

JNIEXPORT void JNICALL Java_simd_BufferMath_cross(JNIEnv *env, jclass cls, jlong a, jlong b, jlong c, jint vectors) {
    int size = (int) vectors * 3;

    float* pA = (float*) a;
    float* pB = (float*) b;
    float* pC = (float*) c;

    for (int i = 0; i < size; i++) {
       (*pC+0) = (*pA+1) * (*pB+2) - (*pA+2) * (*pB+1); // are these pointers valid?
       (*pC+1) = (*pA+2) * (*pB+0) - (*pA+0) * (*pB+2); // are these pointers valid?
       (*pC+2) = (*pA+0) * (*pB+1) - (*pA+1) * (*pB+0); // are these pointers valid?
       *pA = (*pA) + 3;
       *pB = (*pB) + 3;
       *pC = (*pC) + 3;
    }
}

I think this was how LWJGL handled this scenario.

Hmm, I think that should be:


*(pC+0) = *(pA+1) * ...
*(pC+1) = *(pA+2) * ...
*(pC+2) = *(pA+0) * ...

I.e. IIRC the dereferencing operator * has higher priority than + (disclaimer: I've only taken a course at uni, never done any real work in C/C++). The increments have the same problem, by the way: *pA = (*pA) + 3 adds 3 to the float that pA points at; to advance the pointer itself you want pA += 3.

EDIT: I really see no added benefit (only "trouble" :slight_smile: ) in passing raw pointers instead of direct ByteBuffers - the buffers should have adequate performance (and I believe LWJGL now uses only direct ByteBuffers as well).

EDIT2: I.e. you could use limit()/capacity()/etc. on the Java side and still send ByteBuffers plus extra integer "size" parameters, and not bother with pointers :slight_smile:
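For reference, a corrected scalar version of the cross loop might look like this (raw-pointer variant assumed; note that the jlong version above also loops vectors*3 times, although each pass already consumes one whole vector):

// r = a x b for 'vectors' packed xyz triplets.
void cross(const float* pA, const float* pB, float* pC, int vectors)
{
    int i;
    for (i = 0; i < vectors; i++) {
        pC[0] = pA[1] * pB[2] - pA[2] * pB[1];
        pC[1] = pA[2] * pB[0] - pA[0] * pB[2];
        pC[2] = pA[0] * pB[1] - pA[1] * pB[0];
        pA += 3; // advance the pointers themselves;
        pB += 3; // *pA = (*pA) + 3 would have changed
        pC += 3; // the float data, not the pointer.
    }
}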

Yes, I feared that.

My idea was to have this at the java-side:


public static final void mul(FloatBuffer a, FloatBuffer b, FloatBuffer c)
{
     int size = a.remaining(); // and check b, and c..
     mul(a, b, c, a.position() << 2, b.position() << 2, c.position() << 2, size);
}

private static native void mul(FloatBuffer a, FloatBuffer b, FloatBuffer c, int aOff, int bOff, int cOff, int size);

For a moment I was considering using the Unsafe class to get the pointer on the Java side, but that call is done through reflection, which means wasted performance, as it would get called very often.

This all means there will be quite a few method parameters (one offset int for every FloatBuffer, plus one size), but that will be hidden from the Java side.

Sounds like a plan? :slight_smile:

[quote]Why?
[/quote]


1.

for 0 .. buffer length
    mul a, b

for 0 .. buffer length
    add a, c

is significantly less efficient than

2.

for 0 .. buffer length
    mul a, b
    add a, c

and of course the better way is

3.

for 0 .. buffer length, step 4
    mov128     XMM0, [pointer to a + offset]
    mulf128D32 XMM0, [pointer to b + offset]
    addf128D32 XMM0, [pointer to c + offset]
    mov128     [pointer to a + offset], XMM0

pipe SIMDOperationMadInAcion(a, b, c, maxLength) {
    pipe a float
    pipe b float
    pipe c float

    a = a*b + c // or a *= b; a += c;

} (float[] a, float[] b, float[] c, length)

This would be easily converted into the 3rd case, four pipes at once. And of course it would get a nice bonus from working on a 256-bit SIMD register. If the CPU has no SIMD ability, it would simply fall back to one pipe at a time.

It might actually be more helpful to convert

for (int d = 0; ...) {
    a[d] = a[d] * b[d] + c[d]
}

into a JIT-recognized SIMD "intrinsic", under the assumption that none of the arrays is volatile, and enable this type of optimisation "automatically".
Of course the above pipe syntax has the advantage of being extremely clear to the compiler. Obviously such analysis should be done at bytecode compilation time, so a PIPE-like hint to the compiler could be encoded into the .class file and save the JIT compiler some time.

I didn't say compile SSE3 commands into the binary, I said use SSE3-like syntax for something like a "vertex program/GLSL"... Obviously SPARC has its own version of SSE (VIS), and it's not exactly difficult to convert from one restricted command set into another.

Regarding the many parameters:
I believe the SIMD versions of the math will only be relevant when the number of calculations is >> 1000, which seems to imply that even if you go crazy with the number of extra checks on the Java side / extra parameters you pass to the native side, it will only have a slight impact on performance (and probably a big impact on increased usability!).

Hopefully all of this will soon™ become irrelevant through the work of our Sun JIT gods ;). I agree, though, that it could still be a helpful library at the present time.

Raghar, your idea is clean and, above all, safe (from a sandbox POV), but it requires either a change to the JIT and the class spec, or dynamic compilation of SSE code by the app (which requires a C compiler!). Neither is going to happen in the next few years, while the alternative is so incredibly trivial, as AndersDahlberg showed us.

If I had to choose, I'd go for something now, instead of waiting for Sun to make (more or less) complex changes to their JVM - released in Java 7, I think.

AndersDahlberg, would you like to invest some time in this? If you want, I could write all the Java code and C code; I just can't manage to compile it, as my compiler doesn't like the header files required for SSE/SIMD. You might have to fix some trivial bugs, as I can't test it, but the code is so simple that won't be a problem.

I really hope we can get this thing off the ground 8)

I’ll take that as a “no” then… ? :-\

Anybody else? :-*