HotSpot doesn’t generate any parallel computations, but it will use SSE instructions when available. You can grep the source for supports_sse & supports_sse2. There are a number of instructions which can have a big impact, even when computing a single result at a time.
One advantage of SSE is that the x87 registers are stack-based, while the SSE registers are directly addressable.
Isn’t code like _mm_mul_ss already SSE, or is that x87? The ASM dump is littered with instructions like MULSS.
Didn’t look at the ASM dump before now :persecutioncomplex: Yes, the code uses SSE. x87 instructions start with an f, like fmul, fadd, etc.
All of the _mm_* intrinsic functions are wrappers for SIMD opcodes (MMX onwards), and _mm_mul_ss is an SSE1 intrinsic. As for the asm dump, you mean yours? If so, then well, you told it to use SSE1!
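For reference, a minimal sketch (mine, not code from this thread) of what that intrinsic looks like in C; with SSE enabled it typically compiles down to a single MULSS on the low lane:

#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Multiplies only the lowest float of a and b; the upper three lanes of a
   pass through unchanged. Usually emitted as one MULSS instruction. */
__m128 mul_low(__m128 a, __m128 b)
{
    return _mm_mul_ss(a, b);
}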
@tom: reducing memory motion is part of the picture. In most cases the opcodes themselves will have a larger impact. Simple example:
Lowering f2i looks like this:
if (single_sse)
    emit(cvttss2si(..));  // single op, converts float reg to int reg
else if (double_sse)
    emit(cvttsd2si(..));  // single op, converts double reg to int reg
else {
    emit(fldcw(..));  // BAM!! Change control word: wait til everything is done and change internal hardware state.
    emit(fist(...));  // store result to memory
    emit(mov(...));   // load result into dst reg
    emit(fldcw(..));  // BAM!! What? Again? Darn, gotta restore it.
}
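To make the contrast concrete, here is a minimal C sketch (mine, not from the thread): the intrinsic path forces the single reg-to-reg conversion described above, while the plain cast is left to the compiler, which on pure x87 codegen becomes the fldcw/fist/fldcw dance.

#include <xmmintrin.h>  /* SSE1 intrinsics */

static inline int f2i_sse(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x)); /* truncating convert, reg to reg */
}

static inline int f2i_plain(float x)
{
    return (int)x; /* compiler picks: cvttss2si with SSE, fldcw+fist without */
}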
Well, we were trying to find out why the C code was slower than the JVM code (the math is still slower in C; I removed the branching bottleneck). Your and tom’s suggestion/insinuation was that the JVM uses SSE, but since the C version uses SSE too, the difference can’t be down to SSE alone. That’s what confused me, so I double-checked.
Anyway, thanks for your tips and suggestions, as they helped me get rid of floor() and look at the grad() function for alternative optimizations.
I missed this reply. My “guess” is that you’re on a machine newer than a P3 and the JIT is using newer instructions. I would be very curious to see what the JIT is producing. I’ve never run across a case where the VM was producing faster code.
This forum is ‘messed up’ – I miss a lot of replies too, I often have to manually check whether there are new replies… the RSS feed is kind of a solution…
I changed the “-march” to “pentium4” (I don’t know the names of modern archs) and it didn’t help. It could be that GCC is simply doing inefficient SSE: not reusing registers or converting intermediate SSE results to x87 when it inlines a method.
If I replace this code:
inline float lerp(float t, float a, float b)
{
    return a + t * (b - a);
}
with (to my knowledge) theoretically the same:
#define lerp(t, a, b) (a + t * (b - a))
the result is 10% slower, so GCC is not generating the same assembly code in this trivial example.
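Just as a sanity check (this is my sketch, not the code that was benchmarked): a fully parenthesized macro rules out operator-precedence surprises, and with plain variable arguments it expands to exactly the same expression as the inline function, so any speed difference comes from code generation rather than semantics.

#define lerp_m(t, a, b) ((a) + (t) * ((b) - (a)))

/* For simple variable arguments, lerp_m(t, a, b) and the inline lerp()
   above evaluate identically. */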
Have you tried running the C code natively and NOT via JNI and measured the time that way? I think what you are seeing is overhead from using JNI and NOT C. JNI is notoriously slow.
Edit: never mind, you said that you did run the C version without JNI in the original post, my bad. This is indeed a surprising result!
You could try profiling the C version with gprof
http://www.cs.duke.edu/~ola/courses/programming/gprof.html
“Basically, it looks into each of your functions and inserts code at the head and tail of each one to collect timing information”
That’s going to skew results so badly… perlin-noise is taking so few cycles that injecting timing-code in all (inlined) methods will render any outcome useless.
You had mentioned there was barely a difference between inline and not, I thought this was worth a shot as it is very simple/quick to try.
That’s because it always gets inlined, as visible in the ASM code.
Wildern: VTune (and similar) profilers are much more interesting.
The bigger question (in my mind) is why the JIT is doing so well. There are tons of things Riven could do if he wanted to improve the native performance.
(edit) And if he was really interested in high performance…this should be done on the GPU.
Right, care to provide some insight instead of vague hand waving?
Everything you suggested was either just as fast or much slower. My own optimization doubled the performance.
Use SIMD. Branch elimination. Add a static init method that determines hardware arch. Use closest arch match for computation.
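For what it’s worth, here is a minimal sketch of the init/dispatch idea in C, using GCC’s __builtin_cpu_supports on x86 (the noise function names and bodies are placeholders, not anyone’s actual code):

#include <stdio.h>

/* Placeholder implementations; a real version would have a scalar path and
   an SSE path containing the actual noise code. */
static float noise_scalar(float x) { return x * 0.5f; }
static float noise_sse(float x)    { return x * 0.5f; }

static float (*noise)(float) = noise_scalar;

/* "Static init method": pick the closest match for the machine we're on. */
static void noise_init(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        noise = noise_sse;
}

int main(void)
{
    noise_init();
    printf("%f\n", noise(3.0f));
    return 0;
}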
(edit) And use a better compiler.
Thanks. I’m not too confident about using SIMD in perlin noise, as it’s not that easy to vectorize: it branches everywhere. From what I read about other people’s attempts, the SIMD code is just as fast, until they add _mm_prefetch(), which makes it significantly faster.
One question: if I pass the compiler the arch of the CPU I’m running on, I expect the compiler to generate the ‘fastest possible’ code for that CPU (just as fast as writing code for that specific arch). So if passing a (more recent) arch flag to the compiler doesn’t result in faster code, would it be safe to assume that writing specific code for that arch won’t change the performance either?
I’m looking at those free versions of Visual Studio, and will post the result. I’m not expecting much, but who knows.
Admirable persistence. So you’re trying to replicate the performance using native code to beat or at least come close to the server VM? Pretty cool.
I never knew how many C experts there were here.
This seems like a good lesson for me in why I should not bother learning C. So much hassle, and it’s still slower.
I’ve been meaning to mention this, but keep forgetting. Changing the floor call to truncation will cause defects (if any of the inputs are negative). This should be choosing the lower coordinate of the cube (cell) which contains the evaluation point. You might want to try using this instead and then check the results (vs. performance):
gx = (x >= 0 ? (int)x : (int)(x-1)) & 0xff; // etc for gy,gz
(edit: Of course this is introducing 3 if’s. Leaving the code as-is is correct as long as the input space is limited to positive.)
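Another common way to write the same thing, as a sketch (mine, assuming the coordinates fit in an int); this matches floor() for negative fractional inputs and compilers often turn the comparison into a short branch-free sequence:

static inline int cell_coord(float x)
{
    int i = (int)x;             /* truncates toward zero */
    return (x < i) ? i - 1 : i; /* step down for negative fractional inputs */
}

/* usage: gx = cell_coord(x) & 0xff; and likewise for gy, gz */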
And another complete aside: Perlin presented a replacement for gradient noise, called simplex noise, sometime around 2000/2001, which has fewer defects and is faster (although maybe not in low dimensions).
I’m not familiar with the more recent versions of GCC. If memory serves there are multiple sets of options related to CPU-specific code gen (as opposed to general optimization). The first set basically specifies the allowed set of opcodes, and the second the specific hardware to target for code scheduling (-march and -mtune, I believe). So, for instance, you could allow only opcodes present in the P3, but schedule for a newer processor. I’m pretty sure they also added a basic switch that says “do the best you can for this machine”.
For SIMD, in addition to prefetching, since you’re array processing you could do non-temporal stores. Well, actually you could do non-temporal stores in the scalar code too… it just means directly writing the _mm_XXX intrinsics instead of the compiler doing it for you.
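A minimal sketch of what that looks like with intrinsics (my illustration, assuming the results are written sequentially to a 16-byte-aligned float array whose length is a multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>  /* SSE1: _mm_prefetch, _mm_stream_ps, _mm_sfence */

void write_results(float *dst, const float *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; i += 4) {
        _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0); /* pull inputs ahead of use */
        __m128 v = _mm_load_ps(src + i);  /* aligned load of 4 floats */
        _mm_stream_ps(dst + i, v);        /* MOVNTPS: non-temporal store, bypasses the cache */
    }
    _mm_sfence(); /* make the streamed stores visible before the data is read back */
}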
@CommaderKeith: If only this was more common. In theory runtime compiled code could virtually always be faster. (Go LLVM & Shark!)
The current C code is faster, but only because it uses another strategy (a switch-table in C for grad(), which is not as fast in Java). Besides that, maybe GCC is just producing inefficient machine code.
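For anyone curious what the switch-table grad() looks like, here is a sketch in the style of the standard improved-noise gradient set (not necessarily identical to the code that was benchmarked):

static float grad(int hash, float x, float y, float z)
{
    /* Gradient direction picked by the low bits of the hash; each case is a
       dot product with one of the cube-edge vectors. */
    switch (hash & 15) {
    case  0: return  x + y;
    case  1: return -x + y;
    case  2: return  x - y;
    case  3: return -x - y;
    case  4: return  x + z;
    case  5: return -x + z;
    case  6: return  x - z;
    case  7: return -x - z;
    case  8: return  y + z;
    case  9: return -y + z;
    case 10: return  y - z;
    case 11: return -y - z;
    case 12: return  y + x;
    case 13: return -y + z;
    case 14: return  y - x;
    case 15: return -y - z;
    }
    return 0.0f; /* unreachable */
}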