SSE with GCC vs Java Server VM

I have this simple C source code that is supposed to be
faster than Java, as the Java VM can’t do SIMD yet.

The problem is that the non-SIMD code in C (1100ms)
is faster than the SIMD code in C (1750ms),
and both are beaten by the Java Server VM (750ms).

Now, the JVM can’t possibly be faster, as the C version
is supposed to do 2-4x as much in every operation
(2x on P3/P4, 4x on C2D/C2Q).

So, where am I screwing up?
Even in C the SSE version is slower…


C initialization code:


#include <stdio.h>
#include <xmmintrin.h>
#include <time.h>
#include <sys/time.h>
#include <errno.h>
#include <windows.h>

__m128 a, b, c;

float f4a[4] __attribute__((aligned(16))) = { +1.2, +3.5, +1.7, +2.8 };
float f4b[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };
float f4c[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };

/* Windows equivalent of Java's System.currentTimeMillis():
   FILETIME counts 100ns intervals, so dividing by 10000 yields milliseconds. */
unsigned long long System_currentTimeMillis() {
    FILETIME t;
    unsigned long long c;
    GetSystemTimeAsFileTime(&t);
    c = (unsigned long long) t.dwHighDateTime << 32;
    return (c | t.dwLowDateTime) / 10000;
}
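
(Declarations of t0, t1, i and end are omitted above; end is the same iteration count as in the Java code below.)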

Normal (x86) C code: (takes 1100ms)


	t0 = System_currentTimeMillis();
	for(i = 0; i < end; i++)
	{
		f4c[0] = f4a[0] + f4b[0];
		f4c[1] = f4a[1] + f4b[1];
		f4c[2] = f4a[2] + f4b[2];
		f4c[3] = f4a[3] + f4b[3];

		f4a[0] = f4c[0] - f4b[0];
		f4a[1] = f4c[1] - f4b[1];
		f4a[2] = f4c[2] - f4b[2];
		f4a[3] = f4c[3] - f4b[3];

		f4c[0] = f4a[0] * f4c[0];
		f4c[1] = f4a[1] * f4c[1];
		f4c[2] = f4a[2] * f4c[2];
		f4c[3] = f4a[3] * f4c[3];
	}
	t1 = System_currentTimeMillis();
	printf("x86 took: %dms\n", (int)(t1-t0));

SIMD SSE code: (takes 1750ms)


	t0 = System_currentTimeMillis();
	a = _mm_load_ps(f4a);
	b = _mm_load_ps(f4b);
	c = _mm_load_ps(f4c);
	for(i = 0; i < end; i++)
	{
		c = _mm_add_ps(a, b);
		a = _mm_sub_ps(c, b);
		c = _mm_mul_ps(a, c);
	}
	_mm_store_ps(f4c,c);
	t1 = System_currentTimeMillis();
	printf("SSE took: %dms\n", (int)(t1-t0));

Java code: (takes 750ms)


   static void _mm_mul_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] * b[0];
      dst[1] = a[1] * b[1];
      dst[2] = a[2] * b[2];
      dst[3] = a[3] * b[3];
   }

   static void _mm_add_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] + b[0];
      dst[1] = a[1] + b[1];
      dst[2] = a[2] + b[2];
      dst[3] = a[3] + b[3];
   }

   static void _mm_sub_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] - b[0];
      dst[1] = a[1] - b[1];
      dst[2] = a[2] - b[2];
      dst[3] = a[3] - b[3];
   }

   static float[] run()
   {
      float[] a = { 1.2f, 3.5f, 1.7f, 2.8f };
      float[] b = { -0.7f, 2.6f, 3.3f, -4.0f };
      float[] c = { -0.7f, 2.6f, 3.3f, -4.0f };

      int end = 1024 * 1024 * 64;

      for (int i = 0; i < end; i++)
      {
         _mm_add_ps(a, b, c);
         _mm_sub_ps(c, b, a);
         _mm_mul_ps(a, c, c);
      }
      
      return c;
   }

I am compiling with:
M:\MinGW_C_compiler\bin\gcc -Wall -Wl,-subsystem,console -march=pentium3 -mfpmath=sse -fomit-frame-pointer -funroll-loops sse.c -o "sse.exe"

I haven’t used the gcc command line for ages, but you don’t appear to be compiling with optimisations? Try sticking in a -O3 (max optimisations) arg.

Without -O3, functions won’t be inlined, and your SSE version would appear to suffer more from that.
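
Something like this, just adding -O3 to your existing command line:

M:\MinGW_C_compiler\bin\gcc -O3 -Wall -Wl,-subsystem,console -march=pentium3 -mfpmath=sse -fomit-frame-pointer -funroll-loops sse.c -o "sse.exe"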

:persecutioncomplex: Damn! Silly me!

Java took: 750ms
C x86 took: 484ms
C SSE took: 297ms

Thanks for that!

Seriously, I expected Java to be far slower compared to pure SSE instructions.

Now try this testcase on a GPU for comparison :wink: (estimate: at least 20x faster than SSE on a mainstream card)

What x86 code does the Java compiler emit? Be interesting to see where it could be optimised. You could give this little benchmark to the VM team. (I assume there’s some use for this particular bit of code somewhere?)

Cas :slight_smile:

I’d also be interested what GCJ would do with and without bounds checks disabled. :slight_smile:

Mind you, this was run on a P4, with the Server VM of Java 1.6.

On a P4 @ 2.4GHz:

ClientVM took: 3840ms
ServerVM took: 750ms
C x86 took: 484ms
C SSE took: 297ms

On a Q6600 @ 2.4GHz:

ClientVM took: 3350ms
ServerVM took: 650ms
C x86 took: 328ms
C SSE took: 200ms

IIRC that requires the Debug JDK. I don’t have it, and it’s quite a hassle to get the assembler code out of it. And then, I never did any ASM, so I could only post it here - somebody with the Debug JDK should.

Anyway, the VM simply cannot use SIMD, as the float[] is not guaranteed to be aligned on 16 bytes, and even if it is, we’re dealing with offsets into the float[] that must be multiples of 4, AND there must be at least 3 more elements after them, AND you’d have to execute the same instructions on all 4. Pretty heavy stuff for the VM to figure out.

Further, I already filed an RFE about manual SIMD (a library) in the bugparade, but it was closed, mentioning ‘the JVM should be able to make this optimization itself’ - well, I guess that’s not going to happen in the next 10 years.

Last but not least, if you compare C x86 vs. ServerVM, I guess GCC uses SIMD behind the scenes, so it wouldn’t be a fair comparison.

Ironically, they implemented an SSE CPU pipeline behind the scenes for Decora, the backend for JavaFX’s graphical gimmicks.

Yeah, ironically… but that’s very specialized stuff - decoding video, nothing like taking your average bytecode (which may just as well be decoding XML) and letting an SSE-enabled JIT loose on it. I’m really sure we won’t see stuff like this any time soon: it’s just too hard for too little gain. Only vector math takes advantage of it (while webserver and database performance is where all the money is at), and the vector code there is normally so small that you can write a native lib for it.

Have you tried 1.7? Supposedly it can generate SIMD code. I’m not sure if you have to use a command line flag to turn it on.

If a C compiler can optimize to SIMD, Java should be able to do it even more easily.

You might notice rather a lot of very specific and fiddly-looking flags and macros that you need to use to allow SIMD optimisations in C. In other words, the C compiler isn’t really automatically creating SIMD code at all; you’re giving it a ton of specific hints (eg

float f4a[4] __attribute__((aligned(16)))

)

…in which case Java will be needing similar hints, and is therefore probably better off, as Riven reckons, with a library to deal with it explicitly.

Cas :slight_smile:

I just ran the (extremely simplistic) micro benchmark on non-debug JDK 1.7 EA, and I didn’t see any performance improvement. I think this simple piece of code doesn’t show any of the performance improvements in the VM.

At the risk of stating the obvious :persecutioncomplex:, you did warm up the VM prior to taking the measurement right?

Dmitri

Array objects are aligned at 8 bytes, but that also means the first element in the array is never aligned at 16 bytes. Ahh well.

What kind of library is it?

For example this

http://www.javaworld.com/javaworld/jw-12-2008/jw-12-year-in-review-2.html?page=5

kind of API could be built around a SIMD library, or even built into ParallelArray. It would nicely hide any technicalities, which is necessary for any RFE to go through. :stuck_out_tongue:
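
Something along these lines, purely as a hypothetical sketch (the names are made up, nothing official):

   // Purely hypothetical sketch of such an API - a fixed-size vector type
   // that the JIT could map straight onto SSE registers and ADDPS/SUBPS/MULPS.
   public final class Vec4f
   {
      public final float x, y, z, w;

      public Vec4f(float x, float y, float z, float w)
      {
         this.x = x; this.y = y; this.z = z; this.w = w;
      }

      // Element-wise operations, the obvious candidates for intrinsification.
      public Vec4f add(Vec4f o) { return new Vec4f(x + o.x, y + o.y, z + o.z, w + o.w); }
      public Vec4f sub(Vec4f o) { return new Vec4f(x - o.x, y - o.y, z - o.z, w - o.w); }
      public Vec4f mul(Vec4f o) { return new Vec4f(x * o.x, y * o.y, z * o.z, w * o.w); }
   }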

Yup, it runs the benchmark 16 times and prints how long each run took.

But again, this is a very simplistic benchmark, only doing + - * on the same float[]s
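
For reference, the harness is roughly this (a sketch - run() is the method posted earlier):

   // Roughly what the harness does: 16 timed runs of run(), printing each one.
   public static void main(String[] args)
   {
      for (int r = 0; r < 16; r++)
      {
         long t0 = System.currentTimeMillis();
         float[] c = run();
         long t1 = System.currentTimeMillis();

         // print an element too, so the result isn't dead code
         System.out.println("run " + r + " took: " + (t1 - t0) + "ms (c[0]=" + c[0] + ")");
      }
   }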

If I’m reading the source correctly, I think if you use the 1.7 debug build and the flag -XX:TraceSuperWord, then you can see what HotSpot is doing behind the scenes. If I find some time later today I might give that a try.

The DebugVM doesn’t recognize that parameter and refuses to launch.

Removing that parameter and just running the VM instantly results in a nasty native crash. :slight_smile:

I guess I had a little better luck than you. The command line flag is actually -XX:+TraceSuperWord; I left out the plus before. I got a dump from the SuperWord opto. I don’t really have any idea what it did, but the JVM seems to recognize that it should generate SIMD instructions for that pattern. I’d like to take a look at the JIT’ed code, but I’ve never done that before. I tried running the program with UseSSE=0 and UseSSE=3 and there was a whole 2ms difference between them, 618ms (no SSE) to 616ms (SSE3). Small enough to be noise, I would think.
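
For anyone who wants to try it, the invocation was along these lines (debug build’s java; the class name is just a placeholder):

java -XX:+TraceSuperWord -XX:UseSSE=3 Bench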