Micro optimizations

Below is what I’ve found to be the most consistent performance boost with minimal optimization effort on a micro level. Other people are encouraged to post their pet optimizations.

[b]Structure boolean expressions to evaluate against zero.[/b]

Java, like many other languages and CPUs, has special opcodes to test whether something is equal to zero, as opposed to the generic opcodes that test whether a == b. I find this most useful with for loops.
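To illustrate at the bytecode level (a sketch of typical javac output; the exact instructions depend on the compiler, so treat the comments as illustrative), comparing against zero uses the one-operand iflt/ifge family, while comparing two values needs the two-operand if_icmp* family plus the extra loads:

static long sumForward(int[] a) {
    long sum = 0;
    for (int i = 0; i < a.length; i++) {   // iload; aload; arraylength; if_icmpge
        sum += a[i];
    }
    return sum;
}

static long sumBackward(int[] a) {
    long sum = 0;
    for (int i = a.length - 1; i >= 0; i--) {  // iload; ifge/iflt -- one-operand compare against zero
        sum += a[i];
    }
    return sum;
}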

For example:

int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) {
// Do stuff
}

is slower than:

int a[] = new int[1000000];
for (int i = a.length-1; i >= 0; i--) {
// EDITED: Thanks to kenrod for noticing a bug, see his post below for details
// Do stuff
}

There is actually another optimization in the above example: the lookup of the length of a is only done once in the second version, which also improves the speed, probably more than the test against zero does.
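If you’d rather keep iterating forward, you can get the same single-lookup benefit by caching the length yourself (a sketch of the idea, not something I’ve benchmarked):

int a[] = new int[1000000];
for (int i = 0, n = a.length; i < n; i++) {
// Do stuff -- the test reads the local n instead of re-fetching a.length
}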

For the record, I consider the following to be even more optimized, but I’ve found it makes the code harder to read so I never use it.

int a[] = new int[1000000];
for (int i = a.length; --i >= 0;) {
// Do stuff
}

This saves one fetch from memory because the decrement is done on the same fetch of i that is used in the boolean expression, instead of the decrement and the test being separate steps.

Hope this is useful and if I’m wrong please correct me.

Good idea to post these kinds of things. Who knows, it might help.
But to make the tests comparable and put the numbers in perspective, we should know which JVM you tested this against, and whether the performance was measured in a full program or just a benchmark.

Ever measured the effect?

I don’t have any numbers. I’ve yet to find a way to reliably reproduce performance numbers in a micro benchmark. I will say I mentioned this to the author of idx3d back in 2000, and he thought it was enough of an improvement to mention me in the source code; he replied with an email basically saying “that was the single most effective optimization I’ve done.” At the time his target JVMs were the ones that shipped with Internet Exploder and Nutscrape 4.

I have yet to see a case where making your booleans test against zero is slower, so I’d say it can’t hurt performance. Like I said, it’s my pet optimization. You’re free to take it or leave it.

Are your two examples equivalent?

Code:
int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) {
// Do stuff
}

Code:
int a[] = new int[1000000];
for (int i = a.length; i >= 0; i--) {
// Do stuff
}

Given that the first loop does an iteration where ‘i’ equals ‘0’, won’t the second loop do an iteration where ‘i’ equals ‘a.length’?

And won’t, therefore, this break?

Doh! You’re right!

for (int i = a.length; i >= 0; i--)

will throw an ArrayIndexOutOfBoundsException on the first iteration. The correct code is:

for (int i = a.length-1; i >= 0; i--)
// Note the "-1" after the a.length

This is why people say optimization is dangerous. You think you’re being clever, but you’re really forgetting one important detail and breaking your code.

Thanks for pointing this out kenrod.

No problem. Interestingly, your other example (which you don’t like to use because it’s hard to read) looks like it would solve this…

int a[] = new int[1000000]; 
for (int i = a.length; --i >= 0;) { 
// Do stuff 
}   

…because the evaluation (and hence the decrement) gets done before the first time through the loop.

Anyway, thank you for the code - I’m busy changing all my loops now to see what impact it has :)

[quote]Anyway, thank you for the code - I’m busy changing all my loops now to see what impact it has :)
[/quote]
If you think you can generate reasonable performance numbers for before and after then please do post the results.
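Something like this minimal harness is what I’d start from (just a sketch: System.nanoTime() needs Java 5+, so on the 1.4-era VMs discussed in this thread you’d substitute J3DTimer or System.currentTimeMillis()):

// Minimal micro-benchmark sketch: warm up first so the JIT compiles the
// loop, then record several timed runs and print each one.
public class LoopBench {
    static final int WARMUP = 10, RUNS = 10;

    public static void main(String[] args) {
        int[] a = new int[1000000];
        for (int run = 0; run < WARMUP + RUNS; run++) {
            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = a.length - 1; i >= 0; i--) {
                sum += a[i];                  // the loop body under test
            }
            long t1 = System.nanoTime();
            if (run >= WARMUP) {              // discard warm-up runs
                System.out.println((t1 - t0) + " ns (sum=" + sum + ")");
            }
        }
    }
}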

I was under the impression that the counting-backwards trick was only useful on older 386/486 chips. Due to an unfortunate me-vs.-powertool accident last night, I had some extra time to run this test today. In my little tests, method A was faster than the other two methods. I didn’t do anything in the loop.

Timers are in nanoseconds as reported by J3DTimer with a resolution of 3ns. All methods were run 20 times with only the last ten runs being recorded.
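For reference, the three loop shapes under test look like this (a sketch with the empty bodies noted above; the A/B/C labels are my reading of the thread, since the exact code wasn’t posted):

int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) { }          // Method A (assumed): forward
for (int i = a.length - 1; i >= 0; i--) { }     // Method B (assumed): backward
for (int i = a.length; --i >= 0;) { }           // Method C (assumed): pre-decrement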

Method A
Average: 19662318
High: 20229735
Low: 19544787

Method B
Average: 23496394
High: 25086930
Low: 21526614

Method C
Average: 24127960
High: 26512731
Low: 19953093

I took the code from IDX3D and converted the for loops to Method A (they were already in Method B) and ran a few tests. You can download the code I used at: http://flick.nerdc.ufl.edu/IDXBench/ I gotta run so I can’t do the math, but here is enough output for you to play with. The output is the sample # and then FPS, taken every 25 seconds. The test was run on a Linux box with no other activity.

$ java -version
java version "1.4.1_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01-b01)
Java HotSpot(TM) Client VM (build 1.4.1_01-b01, mixed mode)

$ time java -cp . idx3d_TestApp > /dev/null
idx3dA.idx3d_Scene
0 29.99
1 32.25
2 30.68
3 30.94
4 32.42
5 32.0
6 31.77
7 32.25
8 32.32
9 31.18

real 4m12.390s
user 0m1.820s
sys 0m0.160s

$ time java -cp . idx3d_TestApp > /dev/null
idx3dB.idx3d_Scene
0 30.94
1 32.25
2 31.65
3 32.42
4 31.9
5 32.38
6 31.87
7 31.77
8 32.35
9 31.21

real 4m12.146s
user 0m1.960s
sys 0m0.120s

It looks like Method B is equally fast or slightly faster.

I played around with such optimizations some time ago for my own software renderer (similar to IDX3D, but better of course… :P) and I was unable to measure a positive effect from this optimization. One problem is that you usually access the array in the “do stuff” section, and it is far more cache friendly to do that from start to end than backwards. So I sometimes ended up slower after doing this “optimization”. Another idea I tried was to put “do stuff” in an endless loop and let Java’s range checking on arrays (which takes place anyway) terminate it by generating an exception, which I caught with a try-catch around the loop. Looks ugly, feels ugly, should be faster, but wasn’t.
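For the curious, the exception-terminated loop looked roughly like this (a sketch of the idea, not my exact code):

// Let the VM's mandatory bounds check end the loop instead of testing
// a condition ourselves. Measured no faster than a plain loop.
static long sumViaException(int[] a) {
    long sum = 0;
    int i = 0;
    try {
        while (true) {
            sum += a[i++];   // throws once i reaches a.length
        }
    } catch (ArrayIndexOutOfBoundsException e) {
        // expected: signals the end of the array
    }
    return sum;
}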

First, I’m not going to pretend I’ve done any timings on this BUT, in the general case, this code would seem to be optimal (addressing your concerns, EgonOlsen)…


int a[] = new int[1000000];
int j = -1;
for (int i = a.length; --i >= 0;) {
j++;
// Do stuff 
}

…where you use ‘j’ (not ‘i’) as your regular loop counter (then you don’t have to reverse your arrays or anything).

This seems like such a general case that I’d be tempted to think compilers could optimize it like this anyway, but then again they’re probably not sure whether ‘a.length’ will remain constant for every iteration of the loop (unless it was declared final or something).

I agree with leknor that most machine languages have special instructions for testing against zero, as opposed to loading a value into a register (which may itself take a fetch or two) and then checking it.

Mind you, these sorts of micro optimizations are pretty academic - they’re harder to read and not going to gain you anything over simply writing a better algorithm :)

Kenrod: I’ve read that newer JVMs can detect loops that won’t iterate past the array size and remove the bounds checking, which is a respectable speed gain.

I wouldn’t be surprised if the method you suggest above with the j++ obscures the array access pattern enough that the JIT chooses not to remove the bounds checking.

If someone has a link to a reasonably technical article on how JITs determine when they can skip the bounds check on an array access, I’d be interested.
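To make the concern concrete, here’s my guess at the two shapes involved (an assumption about typical JIT behavior, not something I’ve verified against any particular VM):

int[] a = new int[1000000];
long sum = 0;

// Index visibly bounded by a.length: a JIT can typically prove every
// access is in range and drop the per-access bounds checks.
for (int i = 0; i < a.length; i++) {
    sum += a[i];
}

// The j++ variant: the array is indexed by a second variable, so the
// JIT may not prove that j stays within [0, a.length) and the checks stay.
int j = -1;
for (int i = a.length; --i >= 0;) {
    j++;
    sum += a[j];
}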

I’ve been testing those loop methods on my image filter benchmark, and here is what came out:

test 1, using:

int nbpixels = 1024*1024;
for (int loop = 0; loop < nbpixels; loop++)
{
  // filtering pixel at offset loop
}

55 million pixels filtered per second ( on 10s, 30s and 60s tests consistently )

test 2, using:

int loop = 1024*1024 - 1;
for (; loop >= 0; loop--)
{
  // filtering pixel at offset loop
}

34 million pixels filtered per second. ( on 10s, 30s and 60s tests consistently )

test 3, using

int loop = 1024*1024 - 1;
int pos = 0;
for (; loop >= 0; loop--)
{
  // filtering pixel at offset pos++
}

50 million pixels filtered per second. ( on 10s, 30s and 60s tests consistently )

The conclusions I take from these tests:

  • read arrays from the beginning to the end
  • these optimisations don’t seem to be worth the effort.

The questions I have are: is the JVM testing against 0 using a special instruction, or does it use the standard one and load 0 into a register before testing? That would explain why there are no differences, if it can do that. We can also argue that the register load could take no time if done in another pipeline of the processor, blah blah blah… [paste the code for endless discussion here]

By the way, that was tested on a Sun 1.4 client JVM.

The JVM does not “use” a special instruction to test against 0. Although it has one, it’s turned quite readily into whatever the fastest way to test it is on current hardware. As far as I am aware, the current Pentium 2/3/4 chips execute nearly all of their basic instructions in a single cycle somewhere deep in a pipe. There is, effectively, no performance difference, excepting that Hotspot probably tries to do its bounds checking on the cheap by looking for the common case of counting from 0…n.

Not too long ago this was a valid performance optimisation, but not any more. Not too long ago it was faster to do maths using fixed point integer representations, but not any more. Unless you’re working on ARM chips of course :)

Cas :)

On the pipeline story: even if the register is loaded using another pipeline, you still have some overhead ((possibly) saving a register to the stack, loading the register, comparing, (possibly) reloading the register from the stack) compared to the normal case (just the compare).
Single-cycle instructions should not need to be pipelined, so the pipes are free for long operations. Thus, if the JIT doesn’t compare with special instructions, it will still take more time.
Would be nice to get the info from the JIT team.

By the way, what is the op for the ‘compare to 0’ on x86?

You don’t have to do an explicit compare to 0. Just decrement the register (the loop counter) and do a jnz (jump if not zero) or a jns (jump if not sign) back to the start of the loop. Once the register reaches 0 (or -1 for jns), the jump isn’t taken because the zero flag (or the sign flag) is set, and you are done.

Looks like we’ve got our x86 asm source. ;D

On x86, what is the register that is used for jumps?
As far as I remember, as opposed to the 68xxx, x86 has specialized registers. Are you sure this register would be usable for all the operations needed? If it has to be pushed to or popped from another one inside the loop, the gain would of course be null.

You can use the standard registers (eax, ebx, ecx, edx) without problems for this. The main problem is the very limited number of registers on x86. You have to use your registers cleverly to optimize such things if the code inside the loop gets more complex.

Cool, thanks a lot for the info.

Would be nice to have the native assembly versions of our routines as created by the JIT, by the way.