Originally posted in ‘Shared Code’ while discussing another subject; I thought it was more on-topic here.
~~
Compare these three ways of doing the same thing:
A (array-filler)
srcIndex = 0;
dstIndex = 0;
for (int i = 0; i < count; i++)
{
    dst[dstIndex++] = src[srcIndex++];
    dst[dstIndex++] = src[srcIndex++];
    dst[dstIndex++] = src[srcIndex++];
    dst[dstIndex++] = src[srcIndex++];
}
B (slightly adjusted, avoiding integer-increments)
srcIndex = 0;
dstIndex = 0;
for (int i = 0; i < count; i++)
{
    dst[dstIndex] = src[srcIndex];
    dst[dstIndex + 1] = src[srcIndex + 1];
    dst[dstIndex + 2] = src[srcIndex + 2];
    dst[dstIndex + 3] = src[srcIndex + 3];
    dstIndex += 4;
    srcIndex += 4;
}
C (as B, but unrolled loop)
srcIndex = 0;
dstIndex = 0;
for (int i = 0; i < count; i += 4)
{
    dst[dstIndex] = src[srcIndex];
    dst[dstIndex + 1] = src[srcIndex + 1];
    dst[dstIndex + 2] = src[srcIndex + 2];
    dst[dstIndex + 3] = src[srcIndex + 3];
    dst[dstIndex + 4] = src[srcIndex + 4];
    dst[dstIndex + 5] = src[srcIndex + 5];
    dst[dstIndex + 6] = src[srcIndex + 6];
    dst[dstIndex + 7] = src[srcIndex + 7];
    dst[dstIndex + 8] = src[srcIndex + 8];
    dst[dstIndex + 9] = src[srcIndex + 9];
    dst[dstIndex + 10] = src[srcIndex + 10];
    dst[dstIndex + 11] = src[srcIndex + 11];
    dst[dstIndex + 12] = src[srcIndex + 12];
    dst[dstIndex + 13] = src[srcIndex + 13];
    dst[dstIndex + 14] = src[srcIndex + 14];
    dst[dstIndex + 15] = src[srcIndex + 15];
    dstIndex += 16;
    srcIndex += 16;
}
Benchmark (Client 1.5 VM)
A: 1978.8 ms
B: 1179.7 ms
C: 973.7 ms
Benchmark (Server 1.5 VM)
A: 1525.5 ms
B: 918.3 ms
C: 718.7 ms
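For anyone who wants to reproduce numbers like these, here is a minimal timing sketch. The harness I used is not shown in this post, so everything apart from the two loop bodies is an assumption for illustration: the class name CopyBench, the int element type, the array size, and the warm-up and round counts are all hypothetical. It only covers variants A and B; C could be added the same way.

```java
import java.util.Arrays;

final class CopyBench {
    // Variant A: post-increment on both indices for every element.
    static void copyA(int[] src, int[] dst, int count) {
        int srcIndex = 0, dstIndex = 0;
        for (int i = 0; i < count; i++) {
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
        }
    }

    // Variant B: constant offsets, indices bumped once per iteration.
    static void copyB(int[] src, int[] dst, int count) {
        int srcIndex = 0, dstIndex = 0;
        for (int i = 0; i < count; i++) {
            dst[dstIndex] = src[srcIndex];
            dst[dstIndex + 1] = src[srcIndex + 1];
            dst[dstIndex + 2] = src[srcIndex + 2];
            dst[dstIndex + 3] = src[srcIndex + 3];
            dstIndex += 4;
            srcIndex += 4;
        }
    }

    // Runs one variant `rounds` times and returns the elapsed time in milliseconds.
    static double timeMs(char variant, int[] src, int[] dst, int count, int rounds) {
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            if (variant == 'A') copyA(src, dst, count);
            else copyB(src, dst, count);
        }
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) {
        int count = 1 << 16;                 // assumed size: each call copies 4 * count ints
        int[] src = new int[count * 4];
        int[] dst = new int[count * 4];
        for (int i = 0; i < src.length; i++) src[i] = i;

        // Warm-up rounds so the JIT has compiled both variants before timing starts.
        timeMs('A', src, dst, count, 200);
        timeMs('B', src, dst, count, 200);

        System.out.println("A: " + timeMs('A', src, dst, count, 2000) + " ms");
        System.out.println("B: " + timeMs('B', src, dst, count, 2000) + " ms");
        System.out.println("copied correctly: " + Arrays.equals(src, dst));
    }
}
```

Running it once with the -client flag and once with -server should give a comparison of the same kind as above, though absolute numbers will depend on the machine and VM version.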
First, what is causing the bottleneck when using increments (++)?
Second, why can’t the server VM optimize all three cases into equally fast code?
~~
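About the first question… I can imagine that the statements in A cannot be executed in parallel, since each one needs the index value left behind by the previous increment, while the statements in B and C can. However, I am running a non-HT P4 and don’t know whether such a CPU can execute anything like that in parallel.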
Note that this code isn’t meant to just copy arrays, as System.arraycopy(…) is even faster: 640 ms. It is only meant to show the (unexpected!) difference in performance caused by a very slight modification.
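For reference, the System.arraycopy call that covers the same range as variants A and B would look roughly like this. The class and method names (ArrayCopyDemo, copyAll) and the int element type are assumptions for illustration; only System.arraycopy itself is the real API.

```java
import java.util.Arrays;

final class ArrayCopyDemo {
    // Copies the first 4 * count elements, the same range the hand-written loops cover.
    static void copyAll(int[] src, int[] dst, int count) {
        System.arraycopy(src, 0, dst, 0, count * 4);
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] dst = new int[8];
        copyAll(src, dst, 2);               // count = 2 means 8 elements
        System.out.println(Arrays.toString(dst));
    }
}
```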