SSE with GCC vs Java Server VM

I have this simple C source code that is supposed to be
faster than Java, as the Java VM can’t do SIMD yet.

The problem is that the non-SIMD code in C (1100ms)
is faster than the SIMD code in C (1750ms),
and both are beaten by the Java Server VM (750ms).

Now, the JVM can’t possibly be faster, as the C version
is supposed to do 2-4x as much in every operation
(2x on P3/P4, 4x on C2D/C2Q).

So, where am I screwing up?
Even in C the SSE version is slower…


C initialization code:


#include <stdio.h>
#include <xmmintrin.h>
#include <time.h>
#include <sys/time.h>
#include <errno.h>
#include <windows.h>

__m128 a, b, c;

float f4a[4] __attribute__((aligned(16))) = { +1.2, +3.5, +1.7, +2.8 };
float f4b[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };
float f4c[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };

/* Windows equivalent of Java's System.currentTimeMillis():
   FILETIME counts 100ns intervals, so dividing by 10000 yields milliseconds. */
unsigned long long System_currentTimeMillis() {
    FILETIME t;
    unsigned long long c;
    GetSystemTimeAsFileTime(&t);
    c = (unsigned long long) t.dwHighDateTime << 32;
    return (c | t.dwLowDateTime) / 10000;
}
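
(Declarations of t0, t1, i and end are omitted above; end is the same iteration count as in the Java code below.)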

Normal (x86) C code: (takes 1100ms)


	t0 = System_currentTimeMillis();
	for(i = 0; i < end; i++)
	{
		f4c[0] = f4a[0] + f4b[0];
		f4c[1] = f4a[1] + f4b[1];
		f4c[2] = f4a[2] + f4b[2];
		f4c[3] = f4a[3] + f4b[3];

		f4a[0] = f4c[0] - f4b[0];
		f4a[1] = f4c[1] - f4b[1];
		f4a[2] = f4c[2] - f4b[2];
		f4a[3] = f4c[3] - f4b[3];

		f4c[0] = f4a[0] * f4c[0];
		f4c[1] = f4a[1] * f4c[1];
		f4c[2] = f4a[2] * f4c[2];
		f4c[3] = f4a[3] * f4c[3];
	}
	t1 = System_currentTimeMillis();
	printf("x86 took: %dms\n", (int)(t1-t0));

SIMD SSE code: (takes 1750ms)


	t0 = System_currentTimeMillis();
	a = _mm_load_ps(f4a);
	b = _mm_load_ps(f4b);
	c = _mm_load_ps(f4c);
	for(i = 0; i < end; i++)
	{
		c = _mm_add_ps(a, b);
		a = _mm_sub_ps(c, b);
		c = _mm_mul_ps(a, c);
	}
	_mm_store_ps(f4c,c);
	t1 = System_currentTimeMillis();
	printf("SSE took: %dms\n", (int)(t1-t0));

Java code: (takes 750ms)


   static void _mm_mul_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] * b[0];
      dst[1] = a[1] * b[1];
      dst[2] = a[2] * b[2];
      dst[3] = a[3] * b[3];
   }

   static void _mm_add_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] + b[0];
      dst[1] = a[1] + b[1];
      dst[2] = a[2] + b[2];
      dst[3] = a[3] + b[3];
   }

   static void _mm_sub_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] - b[0];
      dst[1] = a[1] - b[1];
      dst[2] = a[2] - b[2];
      dst[3] = a[3] - b[3];
   }

   static float[] run()
   {
      float[] a = { 1.2f, 3.5f, 1.7f, 2.8f };
      float[] b = { -0.7f, 2.6f, 3.3f, -4.0f };
      float[] c = { -0.7f, 2.6f, 3.3f, -4.0f };

      int end = 1024 * 1024 * 64;

      for (int i = 0; i < end; i++)
      {
         _mm_add_ps(a, b, c);
         _mm_sub_ps(c, b, a);
         _mm_mul_ps(a, c, c);
      }
      
      return c;
   }

I am compiling with:
M:\MinGW_C_compiler\bin\gcc -Wall -Wl,-subsystem,console -march=pentium3 -mfpmath=sse -fomit-frame-pointer -funroll-loops sse.c -o "sse.exe"

I haven’t used the gcc command line for ages, but you don’t appear to be compiling with optimisations? Try sticking in a -O3 (max optimisations) arg.

Without -O3, functions won’t be inlined, and your SSE version would appear to suffer more from that.
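
Something like this, just adding -O3 to your existing command line:

M:\MinGW_C_compiler\bin\gcc -O3 -Wall -Wl,-subsystem,console -march=pentium3 -mfpmath=sse -fomit-frame-pointer -funroll-loops sse.c -o "sse.exe"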

:persecutioncomplex: Damn! Silly me!

Java took: 750ms
C x86 took: 484ms
C SSE took: 297ms

Thanks for that!

Seriously, I expected Java to be far slower compared to pure SSE instructions.

Now try this testcase on a GPU for comparison :wink: (estimate: at least 20x faster than SSE on a mainstream card)

What x86 code does the Java compiler emit? Be interesting to see where it could be optimised. You could give this little benchmark to the VM team. (I assume there’s some use for this particular bit of code somewhere?)

Cas :slight_smile:

I’d also be interested what GCJ would do with and without bounds checks disabled. :slight_smile:

Mind you, this was run on a P4, with the Server VM of Java 1.6.

On a P4 @ 2.4GHz:

ClientVM took: 3840ms
ServerVM took: 750ms
C x86 took: 484ms
C SSE took: 297ms

On a Q6600 @ 2.4GHz:

ClientVM took: 3350ms
ServerVM took: 650ms
C x86 took: 328ms
C SSE took: 200ms

IIRC that requires the Debug JDK. I don’t have it, and it’s quite a hassle to get the assembler code out of it. And then, I never did any ASM, so I could only post it here - somebody with the Debug JDK should.

Anyway, the VM simply cannot use SIMD, as the float[] is not guaranteed to be aligned on 16 bytes, and even if it is, we’re dealing with offsets into the float[] that must be multiples of 4, AND there must be at least 3 more elements after them, AND you’d have to execute the same instructions on all 4. Pretty heavy stuff for the VM to figure out.

Further, I already filed an RFE about manual SIMD (a library) in the bugparade, but it was closed, mentioning ‘the JVM should be able to make this optimization itself’ - well, I guess that’s not going to happen in the next 10 years.

Last but not least, if you compare C x86 vs. ServerVM, I guess GCC uses SIMD behind the scenes, so it wouldn’t be a fair comparison.

Ironically, they implemented an SSE CPU pipeline behind the scenes for Decora, the backend for JavaFX’s graphical gimmicks.

Yeah, ironically… but that’s very specialized stuff - decoding video, nothing like taking your average bytecode (which may just as well be decoding XML) and letting an SSE-enabled JIT loose on it. I’m really sure we won’t see stuff like this any time soon: it’s just too hard for too little gain. Only vector math takes advantage of it (while webserver and database performance is where all the money is at), and the vector code there is normally so small that you can write a native lib for it.

Have you tried 1.7? Supposedly it can generate SIMD code. I’m not sure if you have to use a command line flag to turn it on.

If a C compiler can optimize to SIMD, Java should be able to do it even more easily.

You might notice rather a lot of very specific and fiddly-looking flags and macros that you need to use to allow SIMD optimisations in C. In other words, the C compiler isn’t really automatically creating SIMD code at all; you’re giving it a ton of specific hints (eg

float f4a[4] __attribute__((aligned(16)))

)

…in which case Java will be needing similar hints, and is therefore probably better off, as Riven reckons, with a library to deal with it explicitly.

Cas :slight_smile:

I just ran the (extremely simplistic) micro benchmark on non-debug JDK 1.7 EA, and I didn’t see any performance improvement. I think this simple piece of code doesn’t show any of the performance improvements in the VM.

At the risk of stating the obvious :persecutioncomplex:, you did warm up the VM prior to taking the measurement right?

Dmitri

Array objects are aligned at 8 bytes, but that also means the first element in the array is never aligned at 16 bytes. Ahh well.

What kind of library is it?

For example this

http://www.javaworld.com/javaworld/jw-12-2008/jw-12-year-in-review-2.html?page=5

kind of API could be built around a SIMD library, or even built into ParallelArray. It would nicely hide any technicalities, which is necessary for any RFE to go through. :stuck_out_tongue:
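
Something along these lines, purely as a hypothetical sketch (the names are made up, nothing official):

   // Purely hypothetical sketch of such an API - a fixed-size vector type
   // that the JIT could map straight onto SSE registers and ADDPS/SUBPS/MULPS.
   public final class Vec4f
   {
      public final float x, y, z, w;

      public Vec4f(float x, float y, float z, float w)
      {
         this.x = x; this.y = y; this.z = z; this.w = w;
      }

      // Element-wise operations, the obvious candidates for intrinsification.
      public Vec4f add(Vec4f o) { return new Vec4f(x + o.x, y + o.y, z + o.z, w + o.w); }
      public Vec4f sub(Vec4f o) { return new Vec4f(x - o.x, y - o.y, z - o.z, w - o.w); }
      public Vec4f mul(Vec4f o) { return new Vec4f(x * o.x, y * o.y, z * o.z, w * o.w); }
   }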

Yup, it runs the benchmark 16 times and prints how long each run took.

But again, this is a very simplistic benchmark, only doing + - * on the same float[]s
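
For reference, the harness is roughly this (a sketch - run() is the method posted earlier):

   // Roughly what the harness does: 16 timed runs of run(), printing each one.
   public static void main(String[] args)
   {
      for (int r = 0; r < 16; r++)
      {
         long t0 = System.currentTimeMillis();
         float[] c = run();
         long t1 = System.currentTimeMillis();

         // print an element too, so the result isn't dead code
         System.out.println("run " + r + " took: " + (t1 - t0) + "ms (c[0]=" + c[0] + ")");
      }
   }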

If I’m reading the source correctly, I think if you use the 1.7 debug build and the flag -XX:TraceSuperWord, then you can see what HotSpot is doing behind the scenes. If I find some time later today I might give that a try.

The DebugVM doesn’t recognize that parameter and refuses to launch.

Removing that parameter and just running the VM instantly results in a nasty native crash. :slight_smile:

I guess I had a little better luck than you. The command line flag is actually -XX:+TraceSuperWord; I left out the plus before. I got a dump from the SuperWord opto. I don’t really have any idea what it did, but the JVM seems to recognize that it should generate SIMD instructions for that pattern. I’d like to take a look at the JIT’ed code, but I’ve never done that before. I tried running the program with UseSSE=0 and UseSSE=3 and there was a whole 2ms difference between them, 618ms (no SSE) to 616ms (SSE3). Small enough to be noise, I would think.
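
For anyone who wants to try it, the invocation was along these lines (debug build’s java; the class name is just a placeholder):

java -XX:+TraceSuperWord -XX:UseSSE=3 Bench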