Passing values to/between arrays

Riven · May 27, 2005, 11:28am

As posted in ‘Shared Code’ while discussing another subject. I thought it more ontopic here.

~~

Compare these three ways of doing the same thing:

A (array-filler)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i++)
            {
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
            }

B (slightly adjusted, avoiding integer-increments)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i++)
            {
               dst[dstIndex] = src[srcIndex];
               dst[dstIndex + 1] = src[srcIndex + 1];
               dst[dstIndex + 2] = src[srcIndex + 2];
               dst[dstIndex + 3] = src[srcIndex + 3];
               dstIndex += 4;
               srcIndex += 4;
            }

C (as B, but unrolled loop)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i+=4)
            {
               dst[dstIndex] = src[srcIndex];
               dst[dstIndex + 1] = src[srcIndex + 1];
               dst[dstIndex + 2] = src[srcIndex + 2];
               dst[dstIndex + 3] = src[srcIndex + 3];
               dst[dstIndex + 4] = src[srcIndex + 4];
               dst[dstIndex + 5] = src[srcIndex + 5];
               dst[dstIndex + 6] = src[srcIndex + 6];
               dst[dstIndex + 7] = src[srcIndex + 7];
               dst[dstIndex + 8] = src[srcIndex + 8];
               dst[dstIndex + 9] = src[srcIndex + 9];
               dst[dstIndex + 10] = src[srcIndex + 10];
               dst[dstIndex + 11] = src[srcIndex + 11];
               dst[dstIndex + 12] = src[srcIndex + 12];
               dst[dstIndex + 13] = src[srcIndex + 13];
               dst[dstIndex + 14] = src[srcIndex + 14];
               dst[dstIndex + 15] = src[srcIndex + 15];
               dstIndex += 16;
               srcIndex += 16;
            }

Benchmark (Client 1.5 VM)
A: 1978.8ms
B: 1179.7ms
C: 0973.7ms

Benchmark (Server 1.5 VM)
A: 1525.5ms
B: 0918.3ms
C: 0718.7ms

First, what is causing the bottlenecks when using increments (++)

Second, why can’t the server VM optimize these 3 cases to equal optimized-code.

~~

About the first question… I can imagine that A cannot be executed in parallel, while B and C can. I am however running a non-HT P4, and don’t know whether such a cpu can do anything like that in parallel.

Note that this code isn’t meant to just copy arrays, as Sysstem.arraycopy(…) is even faster: 640ms. This code is only meant to show the (unexpected!) difference in performances with a very slight modification.

erikd · May 30, 2005, 10:24am

I tried it out too, but on my machine using 1.4.2, all 3 cases run exactly as fast on the server vm (and a multitude faster than on the client). The difference you described only happened to me on the client.
Why the speed differences on the client occur, I don’t know, but I never expect the client VM to optimize very well anyway.

K.I.L.E.R · May 30, 2005, 11:07am

[quote][…]I never expect the client VM to optimize very well anyway.
[/quote]
Why is that?
The client should be able to unroll loops and do the same optimisations as server.

princec · May 30, 2005, 11:50am

Nah, the client is a rather crappy optimiser compared to the server VM. Shame really, as they’ve hyped it far beyond its capabilities.

Cas

Jeff · May 30, 2005, 8:35pm

To answer your querstion…

My suspicion is that the increments sufficiently hide the value of the index from the optimzier by decoupling it from the loop. Thus, you have confused the optimizer sufficiently to defeat bounds check hoisting.

This is an excellent eample of why the clearest, cleanest code in Java is often the fastest-- its the easiest for the system to understand and optimize.

Ofcourse ALL of these are WRONG.

Use System.arraycopy() for fastest perfromance.

Riven · May 31, 2005, 3:55am

Hm… wrong? Depends on the situation I think…

Would you really write a lot of small arrays (size: 3, amount: >1000) to 1 big float[] with System.arraycopy(…) ?

erikd · May 31, 2005, 4:18am

[quote]To answer your querstion…

My suspicion is that the increments sufficiently hide the value of the index from the optimzier by decoupling it from the loop. Thus, you have confused the optimizer sufficiently to defeat bounds check hoisting.
[/quote]
Since when does the client VM do any bounds check removal at all? Maybe I missed something?
At first I had the same suspicion as you, but I don’t think this is the case here.
Doing the following takes (on the client VM) even the longest time to complete:


for (int i = 0; i < count; i++) 
{ 
  dst[i] = src[i];
}

And again, on my machine it all doesn’t make any difference on the server VM. All equally fast.

Riven · May 31, 2005, 4:21am

Takes longest? It does 4x less…

erikd · May 31, 2005, 4:30am

[quote] It does 4x less…
[/quote]
no it doesn’t, not in my slightly modified version of the original benchmark anyway

Jeff · June 2, 2005, 7:59pm

Hm. Im not sure where the breakpoint is on System.arraycopy to be honest. But keep thsi in mind, optimizing things that doeont matter doesnt matter and is a waste of your time.

I generally dont worry about array copy overhead unelsswe are talking about big arrays… thats the only time its likely to really matter.

Jeff · June 2, 2005, 8:00pm

Since when does the client VM do any bounds check removal at all?
[/qupte]

Im pretty sure its done bounds check hoisting for quite awjhile. Thats not a hard optimization. But I can ask someone to make sure.
[/quote]

ajiva · June 2, 2005, 8:46pm

No Client does not do any bounds checking hoisting or removal…

Jeff · June 2, 2005, 8:54pm

Really? Im rather surprised. On closer inspection though I just realized that its likely that none of these are being
hosited, so certainly your right thats not what he’s seeing.

Do you have a better explaination then for his micro-benchmark results?

Linuxhippy · June 13, 2005, 1:32pm

I also came across a funny conclusion, self-unrolled loopes are not always the best.

I unrolled a loop which look like:
while(–n > 0)
{
dest[n] = source[n] ^ … * …
}

to 20 calculations at once and tried several ways to optimized it, but the loop above still bet all when run on the server-jvm. On the client-jvm the hand-unrolled loops were faster of course…

lg Clemens