My na(t)ive PerlinNoise impl. is 3 times slower than server VM

I really don’t know why C has a reputation of being fast (Fortran is faster :P). It can be, after a lot of work, but out of the box it’s a rare thing. Dropping to asm is not using C; sure, that can also be fast, but at what labor cost? Basically, with the compilation unit stuck at the file+macro level and with poor reference semantics, the compiler simply can’t optimize to the same level. A little runtime data can make some optimizations easy (i.e. not NP-hard), and more importantly the “compilation unit” for the JVM is the fully linked code.

In my experience gcc is not bad. Intel’s compiler does produce faster code on Intel hardware, and I try to get the department to shell out for it for the Intel clusters we have. But I wouldn’t expect double the performance or anything; usually it’s more like 20–50%.

C has some pretty big advantages if you need to drop in some asm blocks. Case in point: there are quite a few highly optimized SSE (and similar) asm implementations of Perlin noise.

Ran across Perlin’s “Improved Noise” article while looking something up and noticed that there is an error in the gradient computation (I did not check whether the conditional-move version is correct). Correction (the explicit vectors are listed in Section 3):


  switch(hash & 0xF) {
    case 0x0: return  x+y;
    case 0x1: return -x+y;
    case 0x2: return  x-y;
    case 0x3: return -x-y;
    case 0x4: return  x+z;
    case 0x5: return -x+z;
    case 0x6: return  x-z;
    case 0x7: return -x-z;
    case 0x8: return  y+z;
    case 0x9: return -y+z;
    case 0xA: return  y-z;
    case 0xB: return -y-z;
    case 0xC: return  x+y;
    case 0xD: return -x+y;
    case 0xE: return -y+z;
    case 0xF: return -y-z;

    default:
      // never happens
      return 0;
  }


Thanks for the patch!

:-\

Your ‘patch’ had a bunch of bugs…

This one is correct:


float grad(int hash, float x, float y, float z)
{
	// Original conditional version from the paper, for reference:
	// int h = hash & 15;
	// float u = (h < 8) ? x : y;
	// float v = (h < 4) ? y : ((h == 12 || h == 14) ? x : z);
	// return ((h & 1) == 0 ? u : -u) + ((h & 2) == 0 ? v : -v);

	switch (hash & 0xF)
	{
		case 0x0: return  x + y;
		case 0x1: return -x + y;
		case 0x2: return  x - y;
		case 0x3: return -x - y;
		case 0x4: return  x + z;
		case 0x5: return -x + z;
		case 0x6: return  x - z;
		case 0x7: return -x - z;
		case 0x8: return  y + z;
		case 0x9: return -y + z;
		case 0xA: return  y - z;
		case 0xB: return -y - z;
		case 0xC: return  y + x;
		case 0xD: return -y + z;
		case 0xE: return  y - x;
		case 0xF: return -y - z;
		default:  return 0; // never happens
	}
}

I rechecked it and it “looks” clean to me. This is a direct translation of the vectors given in Perlin’s paper that I linked above (in the noted section), and not an expansion of the “if” version.
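For anyone following along, that “direct translation” can also be written as an explicit vector table plus a dot product. This is a minimal sketch (the names GRAD3 and grad_dot are mine, and the ordering of the duplicated last four entries is one of several equivalent choices, as noted later in the thread):

```c
#include <assert.h>

/* The 16 gradient vectors from the "Improved Noise" paper: the 12 edge
   directions of a cube, with 4 of them repeated so the table size is a
   power of two. Ordered to match the switch version above. */
static const float GRAD3[16][3] = {
    { 1, 1, 0}, {-1, 1, 0}, { 1,-1, 0}, {-1,-1, 0},
    { 1, 0, 1}, {-1, 0, 1}, { 1, 0,-1}, {-1, 0,-1},
    { 0, 1, 1}, { 0,-1, 1}, { 0, 1,-1}, { 0,-1,-1},
    { 1, 1, 0}, { 0,-1, 1}, {-1, 1, 0}, { 0,-1,-1}
};

/* grad() as a dot product with the explicit gradient vector. */
static float grad_dot(int hash, float x, float y, float z)
{
    const float *g = GRAD3[hash & 0xF];
    return g[0] * x + g[1] * y + g[2] * z;
}
```

The switch version is just this table with the dot products expanded by hand, which is why each case is a sum of exactly two signed coordinates.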

I haven’t read the paper and don’t know the algorithm for Perlin noise, but as a gambling man I’ll put money on Riven for the following reason: in the version he put up, each of x, y and z appears 5 times as a positive value and 5 times as a negative one. In the one you posted, y is negative 6 times and positive only 4, which makes me feel that the resulting distribution is probably not uniform.

I viewed the document.

I think it’s safe to say that the “new gradient distributions” suffer from nasty patterns: you get a lot of straight lines (in the sample render I see 5) and rectangular shapes.

IMHO it is definitely not an improvement, so I’ll stick to the “old gradient distributions” which is admittedly less random, but more irregular.

If speed is important, is simplex noise any better? Roquen mentioned it above.

Riven’s new version is correct and uses the new gradient set. The only difference between the two is that cases 0xD & 0xE are swapped, which has no effect in theory, so there is no mathematical reason to prefer one over the other.

Another way to squeeze out some extra cycles would be to use the older ease (fade) formula, which shouldn’t produce noticeable defects for simple usages. The defects of the old ease function become more apparent in higher dimensions, in either the noise itself or its usage…see figures 1a & 1b in the paper above. The problem with the older fade is that it is only first-order continuous across cell boundaries, where the newer one is also second-order continuous. So you probably won’t notice any defects if the noise is simply being used to generate textures, but you probably will if generating terrain (for example)…of course “beauty is in the eye of the beholder”, or in this case “the developer”. For reference, the older ease is:


(t*t*(3.f-(t+t)));
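For comparison, here are both ease curves side by side, a minimal sketch (the polynomials are the ones from the two papers; the function names are mine):

```c
#include <assert.h>

/* Old ease curve: 3t^2 - 2t^3. Its second derivative, 6 - 12t, is
   non-zero at t = 0 and t = 1, so second derivatives of the noise
   jump across cell boundaries (only first-order continuous). */
static float fade_old(float t) { return t * t * (3.f - (t + t)); }

/* Improved ease curve: 6t^5 - 15t^4 + 10t^3. Both the first and
   second derivatives vanish at t = 0 and t = 1. */
static float fade_new(float t) { return t * t * t * (t * (t * 6.f - 15.f) + 10.f); }
```

Both curves map 0→0, 1→1 and pass through 0.5 at t = 0.5; the extra two multiplies of the new one buy the smoother derivatives.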

Nate: In theory simplex noise in three and higher dimensions should be faster than gradient noise. There are a couple of issues, though. The first is that the “look” of simplex noise isn’t the same, which could be a problem (see here for some examples). The second is that I know of no good implementations (either for reference or direct usage). Perlin’s original design targets hardware, and his reference implementation reflects that; a software version would need to be significantly different to be “fast”. The paper “Simplex noise demystified” is reasonable for information, but the author’s implementation is very questionable IMHO. Specifically, he falls back to using permutation tables and actually performs dot products with explicit vectors (stored in tables), which misses part of the point of simplex noise given modern hardware. (To be fair, the author states the code is more for clarity than speed.) So simplex noise would require a little blood, sweat and tears (or maybe a lot).

(EDIT:)
To state the obvious to any of our dear readers: Always use the lowest dimensional noise you need to perform a given task.

Don’t use -O3; -O2 is often faster. GCC’s vectorizing optimizer is a bit shitty as well. I had a similar issue with some CPU skinning code. I compile my native code with this at the moment:

-c -Wall -O2 -mfpmath=sse -msse2

The skinning code is now around 10% faster than the equivalent Java code (all manually inlined, all plain old data types, etc.).

If you want really good performance use the Intel compiler as suggested earlier. It will make a huge difference.

Your list of optimization switches is kinda short for GCC. You probably want to specify the architecture for scheduling, omit frame pointers and relax the FP rules.
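For instance, something along these lines (all standard GCC flags; note that -march=native ties the binary to the build machine and -ffast-math relaxes IEEE floating-point semantics, so treat this as a sketch, not a recommendation):

```shell
gcc -c -Wall -O2 -march=native -mfpmath=sse -msse2 \
    -fomit-frame-pointer -ffast-math perlin.c
```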

Over the holidays I wiped out my old Linux install and did a fresh install of a newer version, and to my dismay I couldn’t find the free version of the Intel compiler. So either my search skills are weak or they’ve discontinued it (big fat sadface).

I’m having a fit of work-avoidance-itis, so I’ll take a quick stab at a rough pass at a slightly usable 3D simplex noise.