The Server VM (and Client VM sometimes) is driving me nuts…
I’m putting the contents of a float[] into a strided data structure with a certain offset and width (like interleaved VBO data).
One element can look like: [x,x,X,X,X,X,x,x,x] (offset 2, width 4, stride 9; the pattern repeats itself for every element)
The code for this is:
public final void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset)
{
    int dataOffset = this.dataOffsetOfElement(elementOffset);
    int lastDataOffset = dataOffset + elementLength * stride;
    while (dataOffset < lastDataOffset)
    {
        data[dataOffset + 0] = array[arrayOffset + 0];
        data[dataOffset + 1] = array[arrayOffset + 1];
        data[dataOffset + 2] = array[arrayOffset + 2];
        data[dataOffset + 3] = array[arrayOffset + 3];
        arrayOffset += width; // if I replace 'width' (which is 4) with '4', performance gets cut in half
        dataOffset += stride;
    }
}
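For anyone who wants to try the method above in isolation, here is a stripped-down sketch of a class around it. Only data, stride, width, dataOffsetOfElement and putElements are names from my code; the class name StridedBuffer, the constructor, the offset field, the get helper and the body of dataOffsetOfElement are made up for illustration and assume that element i starts at i * stride.

```java
import java.util.Arrays;

// Hypothetical stand-in for the real class, reduced to the width-4 case.
final class StridedBuffer {
    private final float[] data;
    private final int stride; // floats per element, 9 in the example layout
    private final int offset; // where the attribute starts inside an element, 2 in the example
    private final int width;  // floats copied per element, 4 in the example

    StridedBuffer(int elementCount, int stride, int offset, int width) {
        if (width != 4 || offset < 0 || offset + width > stride)
            throw new IllegalArgumentException("this sketch only handles width 4 inside the stride");
        this.data = new float[elementCount * stride];
        this.stride = stride;
        this.offset = offset;
        this.width = width;
    }

    // Assumed layout: element i occupies data[i * stride .. (i + 1) * stride - 1].
    int dataOffsetOfElement(int element) {
        return element * stride + offset;
    }

    void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset) {
        int dataOffset = dataOffsetOfElement(elementOffset);
        int lastDataOffset = dataOffset + elementLength * stride;
        while (dataOffset < lastDataOffset) {
            data[dataOffset + 0] = array[arrayOffset + 0];
            data[dataOffset + 1] = array[arrayOffset + 1];
            data[dataOffset + 2] = array[arrayOffset + 2];
            data[dataOffset + 3] = array[arrayOffset + 3];
            arrayOffset += width;
            dataOffset += stride;
        }
    }

    // Read access, added only so the result can be inspected.
    float get(int index) {
        return data[index];
    }

    public static void main(String[] args) {
        // Two elements laid out as [x,x,X,X,X,X,x,x,x]
        StridedBuffer buf = new StridedBuffer(2, 9, 2, 4);
        buf.putElements(0, 2, new float[] {1, 2, 3, 4, 5, 6, 7, 8}, 0);
        System.out.println(Arrays.toString(buf.data));
    }
}
```

With this layout the source floats should land at indices 2..5 and 11..14 of data, and everything else stays untouched.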
Now the silly part is that it gets executed much faster with a switch statement around it:
    switch (width)
    {
        case 1:
            // this is the madness
            dataOffset += stride * elementLength;
            for (int e = elementLength - 1; e >= 0; e--)
                data[dataOffset -= stride] = array[arrayOffset + e];
            break;
        case 2:
            // code optimized for width of 2
            break;
        case 3:
            // code optimized for width of 3
            break;
        case 4: // running the code for width '4' inside the switch-statement is 21% faster than specialized method for width 4!! arg!
            while (dataOffset < lastDataOffset)
            {
                data[dataOffset + 0] = array[arrayOffset + 0];
                data[dataOffset + 1] = array[arrayOffset + 1];
                data[dataOffset + 2] = array[arrayOffset + 2];
                data[dataOffset + 3] = array[arrayOffset + 3];
                arrayOffset += width;
                dataOffset += stride;
            }
            break;
        case 5:
            // code optimized for width of 5
            break;
        default:
            // very very slow generic case
            for (int e = elementLength - 1; e >= 0; e--)
            {
                for (int i = width - 1; i >= 0; i--)
                    data[dataOffset + i] = array[arrayOffset + i];
                arrayOffset += width;
                dataOffset += stride;
            }
            break;
    }
}
Making tiny changes, like replacing ‘width’ with ‘4’, makes the HotSpot VM drop everything and create some slow execution path (maybe it requires an additional register, and it just ran out? Should that cause a performance loss of 50%?).
When I code for performance, I have to find the fastest way by trial and error. The difference can be up to a factor of 3 when applying lots of seemingly irrelevant changes, and sometimes putting a lot of never-to-be-executed code around a loop makes it 21% faster. And then that will only be the fastest loop on that particular version of that particular vendor’s VM.
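To make that kind of trial and error repeatable, something like the following could be used. It is a rough sketch and not the harness behind the numbers in this post: LoopTimer, bestNanos, copyByVariable and copyByLiteral are made-up names, the sizes and run counts are arbitrary, and the data offset is fixed at 0 to keep it short. The only idea it shows is to run each variant a number of times before timing it, so HotSpot has had a chance to compile the loop, and then to compare the best timed run of each.

```java
import java.util.Arrays;

// Hypothetical timing harness for comparing two variants of the same loop.
final class LoopTimer {

    // Runs 'task' warmupRuns times untimed, then returns the fastest of timedRuns timed runs.
    static long bestNanos(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++)
            task.run();
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (elapsed < best)
                best = elapsed;
        }
        return best;
    }

    // Variant A: advances the source index by the 'width' variable.
    static void copyByVariable(float[] data, int stride, int width, float[] array, int elements) {
        int dataOffset = 0;
        int arrayOffset = 0;
        int lastDataOffset = elements * stride;
        while (dataOffset < lastDataOffset) {
            data[dataOffset + 0] = array[arrayOffset + 0];
            data[dataOffset + 1] = array[arrayOffset + 1];
            data[dataOffset + 2] = array[arrayOffset + 2];
            data[dataOffset + 3] = array[arrayOffset + 3];
            arrayOffset += width;
            dataOffset += stride;
        }
    }

    // Variant B: the same loop with the literal 4 instead of 'width'.
    static void copyByLiteral(float[] data, int stride, float[] array, int elements) {
        int dataOffset = 0;
        int arrayOffset = 0;
        int lastDataOffset = elements * stride;
        while (dataOffset < lastDataOffset) {
            data[dataOffset + 0] = array[arrayOffset + 0];
            data[dataOffset + 1] = array[arrayOffset + 1];
            data[dataOffset + 2] = array[arrayOffset + 2];
            data[dataOffset + 3] = array[arrayOffset + 3];
            arrayOffset += 4;
            dataOffset += stride;
        }
    }

    public static void main(String[] args) {
        final int elements = 10_000, stride = 9, width = 4;
        final float[] array = new float[elements * width];
        for (int i = 0; i < array.length; i++)
            array[i] = i;
        final float[] a = new float[elements * stride];
        final float[] b = new float[elements * stride];

        long nsA = bestNanos(() -> copyByVariable(a, stride, width, array, elements), 2_000, 200);
        long nsB = bestNanos(() -> copyByLiteral(b, stride, array, elements), 2_000, 200);

        System.out.println("same result: " + Arrays.equals(a, b));
        System.out.println("'width' variable: " + nsA + " ns, literal 4: " + nsB + " ns");
    }
}
```

Running it under both the Client and the Server VM, and on more than one VM version, is the only way I know to see whether a difference holds up. Even then the numbers only say something about this one loop in this one harness.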
I’m going to implement this tiny loop in C and put it in a DLL; that way I’m certain the code is compiled properly to native code, whatever VM it runs on. It’s such a shame HotSpot isn’t predictable.