I’ve been meaning to mention this, but keep forgetting. Changing the floor call to truncation will cause defects if any of the inputs are negative. The code should be choosing the lower coordinate of the cube (cell) that contains the evaluation point. You might want to try using this instead and then check the results (vs. performance):
gx = (x > 0 ? (int)x : (int)(x-1)) & 0xff; // etc for gy,gz
(edit: Of course this introduces three branches. Leaving the code as-is is correct as long as the input space is limited to positive values.)
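To make the difference concrete, here is a minimal sketch of a floor that avoids the library call. The names fast_floor and cell_index are made up for illustration and are not from the code under discussion. It truncates first and then subtracts one only when truncation rounded up, so exact negative integers such as -1.0 stay in their own cell. Whether the comparison compiles to a branch or a setcc-style instruction depends on the compiler, so measure before trusting it.

```c
#include <assert.h>

/* Hypothetical helper: floor(x) for values that fit in an int.
   (int)x truncates toward zero, so for negative non-integers it lands
   one cell too high; the comparison (0 or 1) corrects that. */
static int fast_floor(double x)
{
    int i = (int)x;              /* truncation toward zero */
    return i - (x < (double)i);  /* subtract 1 only if truncation rounded up */
}

/* Hypothetical helper: lower lattice coordinate wrapped to 0..255,
   mirroring the "& 0xff" masking used for gx, gy, gz. */
static int cell_index(double x)
{
    return fast_floor(x) & 0xff;
}
```

For x = -0.5, plain truncation gives cell 0 while the floor gives cell -1 (255 after masking), which is the mismatch that shows up as defects on the negative side of the input space.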
And another complete aside: Perlin presented a replacement for gradient noise, called simplex noise, sometime around 2000/2001, which has fewer defects and is faster (although maybe not in low dimensions).
I’m not familiar with the more recent versions of GCC. If memory serves, there are multiple sets of options related to CPU-specific code gen (as opposed to general optimization). The first set (-march) specifies the allowed set of opcodes, and the second (-mtune) is the specific hardware to target for instruction scheduling. So, for instance, you could allow only opcodes present in the P3, but set the scheduling for a newer processor. I’m pretty sure they also added a basic “do the best you can for this machine” switch (-march=native).
For SIMD, in addition to prefetching, since you’re processing arrays you could use non-temporal stores. Well, actually you could do non-temporal stores in the scalar code too… it just means writing the _mm_XXX intrinsics directly instead of letting the compiler do it for you.
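As a rough sketch of what a non-temporal store looks like with the SSE intrinsics (x86 only): scale_stream is a made-up helper, not part of the noise code, and it assumes the destination is 16-byte aligned and the element count is a multiple of 4. The idea is that output you won't read back soon bypasses the cache instead of evicting data you still need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: dst[i] = src[i] * scale, written with non-temporal stores.
   Assumptions: dst is 16-byte aligned, n is a multiple of 4. */
static void scale_stream(float *dst, const float *src, size_t n, float scale)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* _mm_stream_ps requires an aligned address */
    assert(n % 4 == 0);

    const __m128 s = _mm_set1_ps(scale);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), s);  /* unaligned load is fine for src */
        _mm_stream_ps(dst + i, v);                        /* store that bypasses the cache */
    }
    _mm_sfence();  /* make the streamed stores globally visible before returning */
}
```

This only pays off when the output array is large and not re-read right away. For small buffers that are consumed immediately, ordinary stores are usually the better choice, so it needs benchmarking on the actual workload.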
@CommaderKeith: If only this was more common. In theory runtime compiled code could virtually always be faster. (Go LLVM & Shark!)