My na(t)ive PerlinNoise impl. is 3 times slower than server VM

“Basically, it looks into each of your functions and inserts code at the head and tail of each one to collect timing information”

That’s going to skew the results badly… Perlin noise takes so few cycles that injecting timing code into every (inlined) method will render any outcome useless.

You had mentioned there was barely a difference between inlined and not; I thought this was worth a shot, as it’s very simple and quick to try.

That’s because it always gets inlined, as visible in the ASM code.

Wildern: VTune (and similar) profilers are much more interesting.

The bigger question (in my mind) is why the JIT is doing so well. There are tons of things Riven could do if he wanted to improve the native performance.

(edit) And if he was really interested in high performance…this should be done on the GPU.

Right, care to provide some insight instead of vague hand waving?

Everything you suggested was either just as fast or much slower. My own optimization doubled the performance.

Use SIMD. Branch elimination. Add a static init method that determines hardware arch. Use closest arch match for computation.
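
For the static-init idea, here’s a minimal sketch (my illustration, not from the thread), assuming a GCC recent enough to provide __builtin_cpu_supports(); noise_scalar/noise_sse2 are hypothetical implementations:


  /* pick the closest matching implementation once, at startup */
  typedef float (*noise_fn)(float x, float y, float z);

  float noise_scalar(float x, float y, float z); /* portable fallback (hypothetical) */
  float noise_sse2(float x, float y, float z);   /* SSE2 path (hypothetical) */

  static noise_fn noise = noise_scalar;

  void noise_init(void)
  {
      __builtin_cpu_init();               /* initialize the feature checks (GCC >= 4.8) */
      if (__builtin_cpu_supports("sse2"))
          noise = noise_sse2;             /* closest arch match wins */
  }
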

(edit) And use a better compiler.

Thanks. I’m not too confident about using SIMD in Perlin noise, as it’s not that easy to vectorize since it branches everywhere. From what I read about other people’s attempts, the SIMD code is just as fast, until they add _mm_prefetch(), which makes it significantly faster.

One question: if I pass the compiler the arch of the CPU I’m running on, I’d expect the compiler to generate the ‘fastest possible’ code for that CPU (just as fast as writing code for that specific arch). So if passing a (more recent) arch flag to the compiler doesn’t result in faster code, would it be safe to assume that writing specific code for that arch won’t change the performance either?

I’m looking at those free versions of Visual Studio, and will post the results. I’m not expecting much, but who knows.

Admirable persistence. So you’re trying to use native code to beat, or at least come close to, the server VM? Pretty cool.

I never knew how many C experts there were here.

This seems like a good lesson for me in why I should not bother learning C. So much hassle, and it’s still slower.

I’ve been meaning to mention this, but keep forgetting. Changing the floor call to truncation will cause defects if any of the inputs are negative: this step should choose the lower coordinate of the cube (cell) which contains the evaluation point. You might want to try using this instead and then check the results (vs. performance):


  gx = (x >= 0 ? (int)x : (int)(x-1)) & 0xff; // etc. for gy, gz (the x-1 trick is still off by one at exact negative integers)

(edit: Of course this is introducing 3 ifs. Leaving the code as-is is correct as long as the input space is limited to positive values.)
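
For what it’s worth, here’s a branch-free alternative (my sketch, not from the thread): the cast truncates toward zero, and the comparison yields 1 exactly when truncation rounded up, so subtracting it gives floor() for anything that fits in an int. Compilers typically emit a setcc for the comparison rather than a branch.


  static int fastfloor(float x)
  {
      int i = (int)x;     /* truncates toward zero */
      return i - (x < i); /* subtract 1 only for negative non-integers */
  }
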

And another complete aside: Perlin presented a replacement for gradient noise, called simplex noise, sometime around 2000/2001, which has fewer defects and is faster (although maybe not in low dimensions).

I’m not familiar with the more recent versions of GCC. If memory serves, there are multiple sets of options related to CPU-specific code generation (as opposed to general optimization). The first set basically specifies the allowed set of opcodes to use, and the second the specific hardware to target for code scheduling. So, for instance, you could allow only opcodes present in the P3, but set the scheduling for a newer processor. I’m pretty sure they also added a basic “do the best you can for this machine” switch.
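
If I’m reading the current GCC manual right (worth double-checking), those map to -march (allowed opcodes, which also implies the tuning) and -mtune (scheduling only), and -march=native is the “do the best you can for this machine” switch:


  -march=pentium3 -mtune=generic   (P3 opcodes, scheduling for common CPUs)
  -march=native                    (everything the build machine supports)
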

For SIMD, in addition to prefetching, since you’re array processing you could do non-temporal stores. Well, actually you could do non-temporal stores in the scalar code too…it just means writing the _mm_XXX intrinsics directly instead of the compiler doing it for you.
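
Something along these lines, as a rough sketch with SSE intrinsics (noise4() is a hypothetical 4-wide noise routine; xs and out are assumed 16-byte aligned and n a multiple of 4):


  #include <stddef.h>
  #include <xmmintrin.h>

  __m128 noise4(__m128 x); /* hypothetical 4-wide noise evaluation */

  void fill_noise(const float *xs, float *out, size_t n)
  {
      for (size_t i = 0; i < n; i += 4) {
          /* pull upcoming inputs into the cache ahead of use */
          _mm_prefetch((const char *)(xs + i + 16), _MM_HINT_T0);
          /* non-temporal store: results bypass the cache entirely */
          _mm_stream_ps(out + i, noise4(_mm_load_ps(xs + i)));
      }
      _mm_sfence(); /* make the streaming stores globally visible */
  }
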

@CommaderKeith: If only this was more common. In theory runtime compiled code could virtually always be faster. (Go LLVM & Shark!)

The current C code is faster, but only because it uses another strategy (a switch table in C for grad(), which is not as fast in Java). Besides that, maybe GCC is just producing inefficient machine code.

I really don’t know why C has a reputation of being fast (Fortran is faster :P). It can be, after a lot of work, but out of the box it’s a rare thing. And dropping into asm is not using C; sure, that can also be fast, but at what labor cost? Basically, with the compilation unit stuck at the file+macro level and with poor reference semantics, the compiler simply can’t optimize to the same level. A little runtime data can make some optimizations easy (i.e. not NP-hard), and more importantly the “compilation unit” for the JVM is the fully linked code.

In my experience GCC is not bad. Intel’s compiler does produce faster code on Intel CPUs, and I try to get the department to shell out for it for the Intel clusters we have. But I wouldn’t expect double performance or anything; usually it’s more like 20-50%.

C has some pretty big advantages if you need to have some asm blocks etc. Case in point: there are quite a few highly optimized SSE asm implementations of Perlin noise.

Ran across Perlin’s “Improved Noise” article while looking something up and noticed that there is an error in the gradient computation (did not check if the conditional move version is correct). Correction (explicit vectors listed in Section 3):


  switch(hash & 0xF) {
    case 0x0: return  x+y;
    case 0x1: return -x+y;
    case 0x2: return  x-y;
    case 0x3: return -x-y;
    case 0x4: return  x+z;
    case 0x5: return -x+z;
    case 0x6: return  x-z;
    case 0x7: return -x-z;
    case 0x8: return  y+z;
    case 0x9: return -y+z;
    case 0xA: return  y-z;
    case 0xB: return -y-z;
    case 0xC: return  x+y;
    case 0xD: return -x+y;
    case 0xE: return -y+z;
    case 0xF: return -y-z;

    default:
      // never happens
      return 0;
  }


Thanks for the patch!

:-\

Your ‘patch’ had a bunch of bugs…

This one is correct:


  float grad(int hash, float x, float y, float z)
  {
      //float u = (h < 8) ? x : y;
      //float v = (h < 4) ? y : ((h == 12 || h == 14) ? x : z);
      //return ((h & 1) == 0 ? u : -u) + ((h & 2) == 0 ? v : -v);

      switch(hash & 0xF)
      {
          case 0x0: return  x + y;
          case 0x1: return -x + y;
          case 0x2: return  x - y;
          case 0x3: return -x - y;
          case 0x4: return  x + z;
          case 0x5: return -x + z;
          case 0x6: return  x - z;
          case 0x7: return -x - z;
          case 0x8: return  y + z;
          case 0x9: return -y + z;
          case 0xA: return  y - z;
          case 0xB: return -y - z;
          case 0xC: return  y + x;
          case 0xD: return -y + z;
          case 0xE: return  y - x;
          case 0xF: return -y - z;
          default:  return 0; // never happens
      }
  }

I rechecked it and it “looks” clean to me. This is a direct translation of the vectors given in Perlin’s paper that I linked above (in the noted section), and not an expansion of the “if” version.

I haven’t read the paper and don’t know the algorithm for Perlin noise, but as a gambling man I’ll put money on Riven for the following reason: in the version he put up each of x,y and z appears 5 times as a positive value and 5 times as negative. In the one you posted y is negative 6 times and positive only 4, which makes me feel that the resulting distribution is probably not uniform.

I viewed the document.

I think it’s safe to say that the “new gradient distributions” suffer from nasty patterns: you get a lot of straight lines (in the sample render I see 5) and rectangular shapes.

IMHO it is definitely not an improvement, so I’ll stick with the “old gradient distributions”, which are admittedly less random, but more irregular.

If speed is important, is simplex noise any better? Roquen mentioned it above.

Riven’s new version is correct, and it is the new gradient set. The only difference between the two is that cases 0xD & 0xE are swapped, which has no effect in theory, so there is no mathematical reason to prefer one over the other.

Another way to squeeze out some extra cycles would be to use the older ease (fade) formula, which shouldn’t produce noticeable defects for simple usages. The defects of the old ease function become more apparent in higher dimensions, in either the noise itself or its usage…see figures 1a & 1b in the paper above. The problem with the older fade is that it is only first-order continuous, where the newer one is second-order continuous. So you probably won’t notice any defects if the noise is simply being used to generate textures, but probably will if generating terrain (for example)…of course “beauty is in the eye of the beholder”, or in this case “the developer”. For reference, the older ease is:


(t*t*(3.f-(t+t)));
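
For comparison, the newer fifth-degree ease from the improved-noise reference (6t^5 - 15t^4 + 10t^3), which has continuous first and second derivatives at the cell boundaries, is:


  (t*t*t*(t*(t*6.f-15.f)+10.f));
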

Nate: In theory simplex noise in three and higher dimensions should be faster than gradient noise. There are a couple of issues though. The first is that the “look” of simplex noise isn’t the same, which could be a problem (see here for some examples). The second is that I know of no good implementations (either for reference or direct usage). Perlin’s original design targets hardware and his reference implementation reflects that; a software version would have to be significantly different to be “fast”. The paper “Simplex noise demystified” is reasonable for information, but the author’s implementation is very questionable IMHO. Specifically, he falls back to using permutation tables and actually performs dot products with explicit vectors (stored in tables), which misses part of the point of simplex noise given modern hardware. (To be fair, the author states the code is more for clarity than speed.) So simplex noise would require a little blood, sweat and tears (or maybe a lot).

(EDIT:)
To state the obvious to any of our dear readers: Always use the lowest dimensional noise you need to perform a given task.

Don’t use -O3; -O2 is often faster. GCC’s vectorizing optimizer is a bit shitty as well. I had a similar issue with some CPU skinning code. I compile my native code with this at the moment:

-c -Wall -O2 -mfpmath=sse -msse2

The skinning code is now around 10% faster than the equivalent Java code (all manually inlined, all plain old data types etc. etc.).

If you want really good performance use the Intel compiler as suggested earlier. It will make a huge difference.