HotSpot doesn’t generate any parallel computations, but it will use SSE instructions when available. You can grep the source for supports_sse & supports_sse2. There are a number of instructions which can have a big impact, even when computing a single result at a time.
One advantage of SSE is that the x87 registers are stack-based, while the SSE registers are directly addressable.
Isn’t code like _mm_mul_ss already SSE, or is that x87? The ASM dump is littered with instructions like MULSS.
Didn’t look at the ASM dump before now :persecutioncomplex: Yes, the code uses SSE. x87 instructions start with an f, like fmul, fadd, etc.
All of the _mm_* intrinsic functions are wrappers for SIMD opcodes (MMX onwards), and _mm_mul_ss is an SSE1 intrinsic. As for the asm dump, you mean yours? If so, then well, you told it to use SSE1!
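For reference, a minimal sketch (mine, not code from this thread) of what that intrinsic looks like in C; with SSE enabled it typically compiles down to a single MULSS on the low lane:

#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Multiplies only the lowest float of a and b; the upper three lanes of a
   pass through unchanged. Usually emitted as one MULSS instruction. */
__m128 mul_low(__m128 a, __m128 b)
{
    return _mm_mul_ss(a, b);
}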
@tom: reducing memory motion is part of the picture. In most cases the opcodes themselves will have a larger impact. Simple example:
Lowering f2i looks like this:
if (single_sse)
    emit(cvttss2si(..));  // single op, converts float reg to int reg
else if (double_sse)
    emit(cvttsd2si(..));  // single op, converts double reg to int reg
else {
    emit(fldcw(..));  // BAM!! Change control word: wait til everything is done and change internal hardware state.
    emit(fist(...));  // store result to memory
    emit(mov(...));   // load result into dst reg
    emit(fldcw(..));  // BAM!! What? Again? Darn, gotta restore it.
}
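To make the contrast concrete, here is a minimal C sketch (mine, not from the thread): the intrinsic path forces the single reg-to-reg conversion described above, while the plain cast is left to the compiler, which on pure x87 codegen becomes the fldcw/fist/fldcw dance.

#include <xmmintrin.h>  /* SSE1 intrinsics */

static inline int f2i_sse(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x)); /* truncating convert, reg to reg */
}

static inline int f2i_plain(float x)
{
    return (int)x; /* compiler picks: cvttss2si with SSE, fldcw+fist without */
}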
Well, we were trying to find out why the C code was slower than the JVM code (the math is still slower in C; I removed the branching bottleneck). Your and tom’s suggestion/insinuation was that the JVM uses SSE, but since the C version uses SSE too, the difference can’t be down to SSE alone. That’s what confused me, so I double-checked.
Anyway, thanks for your tips and suggestions, as they helped me get rid of floor() and look at the grad() function for alternative optimizations.
I missed this reply. My “guess” is that you’re on a machine newer than a P3 and the JIT is using newer instructions. I would be very curious to see what the JIT is producing. I’ve never run across a case where the VM was producing faster code.
This forum is ‘messed up’ – I miss a lot of replies too, I often have to manually check whether there are new replies… the RSS feed is kind of a solution…
I changed the “-march” to “pentium4” (I don’t know the names of modern archs) and it didn’t help. It could be that GCC is simply doing inefficient SSE: not reusing registers or converting intermediate SSE results to x87 when it inlines a method.
If I replace this code:
inline float lerp(float t, float a, float b)
{
    return a + t * (b - a);
}
with (to my knowledge) theoretically the same:
#define lerp(t, a, b) (a + t * (b - a))
the result is 10% slower, so GCC is not generating the same assembly code in this trivial example.
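Just as a sanity check (this is my sketch, not the code that was benchmarked): a fully parenthesized macro rules out operator-precedence surprises, and with plain variable arguments it expands to exactly the same expression as the inline function, so any speed difference comes from code generation rather than semantics.

#define lerp_m(t, a, b) ((a) + (t) * ((b) - (a)))

/* For simple variable arguments, lerp_m(t, a, b) and the inline lerp()
   above evaluate identically. */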
Have you tried running the C code natively and NOT via JNI and measured the time that way? I think what you are seeing is overhead from using JNI and NOT C. JNI is notoriously slow.
Edit: never mind, you said that you did run the C version without JNI in the original post, my bad. This is indeed a surprising result!
You could try profiling the C version with gprof
http://www.cs.duke.edu/~ola/courses/programming/gprof.html
“Basically, it looks into each of your functions and inserts code at the head and tail of each one to collect timing information”
That’s going to skew results so badly… perlin-noise is taking so few cycles that injecting timing-code in all (inlined) methods will render any outcome useless.
You had mentioned there was barely a difference between inline and not, I thought this was worth a shot as it is very simple/quick to try.
That’s because it always gets inlined, as visible in the ASM code.
Wildern: VTune (and similar) profilers are much more interesting.
The bigger question (in my mind) is why the JIT is doing so well. There are tons of things Riven could do if he wanted to improve the native performance.
(edit) And if he was really interested in high performance…this should be done on the GPU.
Right, care to provide some insight instead of vague hand waving?
Everything you suggested was either just as fast or much slower. My own optimization doubled the performance.
Use SIMD. Branch elimination. Add a static init method that determines hardware arch. Use closest arch match for computation.
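For what it’s worth, here is a minimal sketch of the init/dispatch idea in C, using GCC’s __builtin_cpu_supports on x86 (the noise function names and bodies are placeholders, not anyone’s actual code):

#include <stdio.h>

/* Placeholder implementations; a real version would have a scalar path and
   an SSE path containing the actual noise code. */
static float noise_scalar(float x) { return x * 0.5f; }
static float noise_sse(float x)    { return x * 0.5f; }

static float (*noise)(float) = noise_scalar;

/* "Static init method": pick the closest match for the machine we're on. */
static void noise_init(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        noise = noise_sse;
}

int main(void)
{
    noise_init();
    printf("%f\n", noise(3.0f));
    return 0;
}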
(edit) And use a better compiler.
Thanks. I’m not too confident about using SIMD in perlin noise, as it’s not that easy to vectorize: it branches everywhere. From what I read about other people’s attempts, the SIMD code is just as fast, until they add _mm_prefetch(), which makes it significantly faster.
One question: if I pass the compiler the arch of the CPU I’m running on, I expect the compiler to generate the ‘fastest possible’ code for that CPU (just as fast as writing code for that specific arch). So if passing a (more recent) arch flag to the compiler doesn’t result in faster code, would it be safe to assume that writing specific code for that arch won’t change the performance either?
I’m looking at those free versions of Visual Studio, and will post the result. I’m not expecting much, but who knows.
Admirable persistence. So you’re trying to replicate the performance using native code to beat or at least come close to the server VM? Pretty cool.
I never knew how many C experts there were here.
This seems like a good lesson for me in why I should not bother learning C. So much hassle, and it’s still slower.
I’ve been meaning to mention this, but keep forgetting. Changing the floor call to truncation will cause defects (if any of the inputs are negative). This should be choosing the lower coordinate of the cube (cell) which contains the evaluation point. You might want to try using this instead and then check the results (vs. performance):
gx = (x >= 0 ? (int)x : (int)(x-1)) & 0xff; // etc for gy,gz
(edit: Of course this is introducing 3 if’s. Leaving the code as-is is correct as long as the input space is limited to positive.)
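Another common way to write the same thing, as a sketch (mine, assuming the coordinates fit in an int); this matches floor() for negative fractional inputs and compilers often turn the comparison into a short branch-free sequence:

static inline int cell_coord(float x)
{
    int i = (int)x;             /* truncates toward zero */
    return (x < i) ? i - 1 : i; /* step down for negative fractional inputs */
}

/* usage: gx = cell_coord(x) & 0xff; and likewise for gy, gz */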
And another complete aside: Perlin presented a replacement for gradient noise, called simplex noise, sometime around 2000/2001, which has fewer defects and is faster (although maybe not in low dimensions).
I’m not familiar with the more recent versions of GCC. If memory serves there are multiple sets of options related to CPU-specific code gen (as opposed to general optimization). The first set basically specifies the allowed set of opcodes, and the second the specific hardware to target for code scheduling (-march and -mtune, I believe). So, for instance, you could allow only opcodes present in the P3, but schedule for a newer processor. I’m pretty sure they also added a basic switch that says “do the best you can for this machine”.
For SIMD, in addition to prefetching, since you’re array processing you could do non-temporal stores. Well, actually you could do non-temporal stores in the scalar code too… it just means directly writing the _mm_XXX intrinsics instead of the compiler doing it for you.
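A minimal sketch of what that looks like with intrinsics (my illustration, assuming the results are written sequentially to a 16-byte-aligned float array whose length is a multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>  /* SSE1: _mm_prefetch, _mm_stream_ps, _mm_sfence */

void write_results(float *dst, const float *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; i += 4) {
        _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0); /* pull inputs ahead of use */
        __m128 v = _mm_load_ps(src + i);  /* aligned load of 4 floats */
        _mm_stream_ps(dst + i, v);        /* MOVNTPS: non-temporal store, bypasses the cache */
    }
    _mm_sfence(); /* make the streamed stores visible before the data is read back */
}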
@CommaderKeith: If only this was more common. In theory runtime compiled code could virtually always be faster. (Go LLVM & Shark!)
The current C code is faster, but only because it uses another strategy (a switch-table in C for grad(), which is not as fast in Java). Besides that, maybe GCC is just producing inefficient machine code.
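For anyone curious what the switch-table grad() looks like, here is a sketch in the style of the standard improved-noise gradient set (not necessarily identical to the code that was benchmarked):

static float grad(int hash, float x, float y, float z)
{
    /* Gradient direction picked by the low bits of the hash; each case is a
       dot product with one of the cube-edge vectors. */
    switch (hash & 15) {
    case  0: return  x + y;
    case  1: return -x + y;
    case  2: return  x - y;
    case  3: return -x - y;
    case  4: return  x + z;
    case  5: return -x + z;
    case  6: return  x - z;
    case  7: return -x - z;
    case  8: return  y + z;
    case  9: return -y + z;
    case 10: return  y - z;
    case 11: return -y - z;
    case 12: return  y + x;
    case 13: return -y + z;
    case 14: return  y - x;
    case 15: return -y - z;
    }
    return 0.0f; /* unreachable */
}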