To get any serious results, use the server VM:
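For instance, on a HotSpot JVM the server compiler can be requested explicitly when launching the benchmark (Benchmark is just a placeholder class name here):

java -server Benchmark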
Self-contained, compilable source code, with the following results:
duration of last 8 runs:
---> arr: 264ms
---> buf: 1445ms
---> pnt: 240ms
arr = float[]-based
buf = FloatBuffer-based (~5.5x slower than float[])
pnt = pointer-based (~10% faster than float[])
Update:
The benchmark computes a simple weighted blend of two data sources and stores the result in a third:
c = a*x + b*(1-x)
In the code below, aMul = x and bMul = 1-x. I’ve tried to optimize all three variants for best performance, by trial and error.
float[]
// unrolling this loop makes it slower
for (int i = 0; i < a.length; i++)
   c[i] = aMul * a[i] + bMul * b[i];
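The fragment assumes three equally sized arrays and the two weights derived from x; a minimal setup sketch (the sizes and blend factor are arbitrary, names match the fragment):

int elements = 1024 * 3;          // e.g. 1024 vertices of 3 floats each
float[] a = new float[elements];
float[] b = new float[elements];
float[] c = new float[elements];
float x = 0.75f;                  // blend factor
float aMul = x;
float bMul = 1.0f - x;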
FloatBuffer
while (fbA.hasRemaining())
{
   fbC.put(aMul * fbA.get() + bMul * fbB.get());
   fbC.put(aMul * fbA.get() + bMul * fbB.get());
   fbC.put(aMul * fbA.get() + bMul * fbB.get());
}
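This fragment assumes direct, natively ordered buffers, and the threefold unrolling assumes the element count is a multiple of 3 (otherwise the last pass underflows). A possible setup sketch, with buffer names mirroring the fragment:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

int elements = 1024 * 3;
FloatBuffer fbA = ByteBuffer.allocateDirect(elements * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();
FloatBuffer fbB = ByteBuffer.allocateDirect(elements * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();
FloatBuffer fbC = ByteBuffer.allocateDirect(elements * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();

// relative get()/put() advance the position, so rewind all three before every timed pass
fbA.clear(); fbB.clear(); fbC.clear();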
pointer arithmetic
for (int i = -4; i < bytes;)
{
   unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
   unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
   unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
}
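Here a, b and c are raw base addresses rather than Java arrays, and bytes is assumed to be the byte size of one buffer; the sketch below allocates a few bytes of slack because the i < bytes bound appears to let the last unrolled group step slightly past the end. Obtaining sun.misc.Unsafe and allocating the native memory could look roughly like this (reflection, because Unsafe.getUnsafe() is restricted to bootstrap classes):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

Field field = Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
Unsafe unsafe = (Unsafe) field.get(null);

int elements = 1024 * 3;
int bytes = elements * 4;
long a = unsafe.allocateMemory(bytes + 12);   // base addresses of the three native buffers (+12 bytes of slack)
long b = unsafe.allocateMemory(bytes + 12);
long c = unsafe.allocateMemory(bytes + 12);

// ... run the loop above ...

unsafe.freeMemory(a);
unsafe.freeMemory(b);
unsafe.freeMemory(c);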
Update 2:
Using pointer arithmetic, cache misses kick in much later, i.e. only with larger data sets:
When processing 512 vertices (18KB) float[] gets 125M/s, pointers get 133M/s.
When processing 1024 vertices (36KB) float[] gets 125M/s, pointers get 133M/s.
When processing 2048 vertices (72KB) float[] gets 125M/s, pointers get 133M/s.
When processing 4096 vertices (144KB) float[] gets 92M/s, pointers get 133M/s. <--
When processing 8192 vertices (288KB) float[] gets 42M/s, pointers get 133M/s. <--
When processing 16384 vertices (576KB) float[] gets 42M/s, pointers get 42M/s.
When processing 32768 vertices (1152KB) float[] gets 31M/s, pointers get 31M/s.
Makes you wonder what happens under the hood… 
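For reference, a sketch of how such a sweep over working-set sizes could be driven, here for the float[] variant only. The original harness, warm-up and run counts aren't shown in the post, so the constants below are assumptions; each vertex is 3 floats, and with three buffers the working set is vertexCount * 36 bytes, which matches the KB figures above:

for (int vertexCount = 512; vertexCount <= 32768; vertexCount *= 2)
{
   int elements = vertexCount * 3;
   float[] a = new float[elements], b = new float[elements], c = new float[elements];
   float aMul = 0.75f, bMul = 1.0f - aMul;

   int runs = 10000;                        // assumed run count
   long t0 = System.nanoTime();
   for (int run = 0; run < runs; run++)
      for (int i = 0; i < a.length; i++)
         c[i] = aMul * a[i] + bMul * b[i];
   long t1 = System.nanoTime();

   double verticesPerSecond = (double) vertexCount * runs / ((t1 - t0) / 1e9);
   System.out.printf("%d vertices (%dKB): %.0fM vertices/s%n",
         vertexCount, vertexCount * 36 / 1024, verticesPerSecond / 1e6);
}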