Micro optimizations

Below is what I’ve found to be the most consistent performance boost with minimal optimization effort on a micro level. Other people are encouraged to post their pet optimizations.

[b]Structure boolean expressions to evaluate against zero.[/b]

Java, like many other languages and CPUs, has special opcodes to test whether something is equal to zero, as opposed to the generic opcodes that test whether a == b. I find this most useful with for loops.
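To illustrate at the bytecode level (a sketch of typical javac output; the exact instructions depend on the compiler, so treat the comments as illustrative), comparing against zero uses the one-operand iflt/ifge family, while comparing two values needs the two-operand if_icmp* family plus the extra loads:

static long sumForward(int[] a) {
    long sum = 0;
    for (int i = 0; i < a.length; i++) {   // iload; aload; arraylength; if_icmpge
        sum += a[i];
    }
    return sum;
}

static long sumBackward(int[] a) {
    long sum = 0;
    for (int i = a.length - 1; i >= 0; i--) {  // iload; ifge/iflt -- one-operand compare against zero
        sum += a[i];
    }
    return sum;
}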

For example:

int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) {
// Do stuff
}

is slower than:

int a[] = new int[1000000];
for (int i = a.length-1; i >= 0; i--) {
// EDITED: Thanks to kenrod for noticing a bug, see his post below for details
// Do stuff
}

There is actually another optimization in the above example: the lookup of the length of a is only done once in the second version, which also improves the speed, probably more than the test against zero does.
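If you’d rather keep iterating forward, you can get the same single-lookup benefit by caching the length yourself (a sketch of the idea, not something I’ve benchmarked):

int a[] = new int[1000000];
for (int i = 0, n = a.length; i < n; i++) {
// Do stuff -- the test reads the local n instead of re-fetching a.length
}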

For the record, I consider the following to be even more optimized, but I’ve found it makes the code harder to read so I never use it.

int a[] = new int[1000000];
for (int i = a.length; --i >= 0;) {
// Do stuff
}

This saves one fetch from memory because the decrement is done on the same fetch of i that is used in the boolean expression, instead of the decrement and the test being separate steps.

Hope this is useful and if I’m wrong please correct me.

Good idea to post these kinds of things. Who knows, it might help.
But to make the tests comparable and put the numbers in perspective, we should know which JVM you tested this against, and whether the performance was measured in a full program or just a benchmark.

Ever measured the effect?

I don’t have any numbers. I’ve yet to find a way to reliably reproduce performance numbers in a micro benchmark. I will say I mentioned this to the author of idx3d back in 2000, and he thought it was enough of an improvement to mention me in the source code; he replied with an email basically saying “that was the single most effective optimization I’ve done.” At the time his target JVMs were the ones that shipped with Internet Exploder and Nutscrape 4.

I have yet to see a case where making your booleans test against zero is slower, so I’d say it can’t hurt performance. Like I said, it’s my pet optimization. You’re free to take it or leave it.

Are your two examples equivalent?

Code:
int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) {
// Do stuff
}

Code:
int a[] = new int[1000000];
for (int i = a.length; i >= 0; i--) {
// Do stuff
}

Given that the first loop does an iteration where ‘i’ equals ‘0’, won’t the second loop do an iteration where ‘i’ equals ‘a.length’?

And won’t, therefore, this break?

Doh! You’re right!

for (int i = a.length; i >= 0; i--)

will throw an ArrayIndexOutOfBoundsException on the first iteration. The correct code is:

for (int i = a.length-1; i >= 0; i--)
// Note the "-1" after the a.length

This is why people say optimization is dangerous. You think you’re being clever, but you’re really forgetting one important detail and breaking your code.

Thanks for pointing this out kenrod.

No problem. Interestingly, your other example (which you don’t like to use because it’s hard to read) looks like it would solve this…

int a[] = new int[1000000]; 
for (int i = a.length; --i >= 0;) { 
// Do stuff 
}   

…because the evaluation (and hence the decrement) gets done before the first time through the loop.

Anyway, thank you for the code - I’m busy changing all my loops now to see what impact it has :)

[quote]Anyway, thank you for the code - I’m busy changing all my loops now to see what impact it has :)
[/quote]
If you think you can generate reasonable performance numbers for before and after then please do post the results.
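Something like this minimal harness is what I’d start from (just a sketch: System.nanoTime() needs Java 5+, so on the 1.4-era VMs discussed in this thread you’d substitute J3DTimer or System.currentTimeMillis()):

// Minimal micro-benchmark sketch: warm up first so the JIT compiles the
// loop, then record several timed runs and print each one.
public class LoopBench {
    static final int WARMUP = 10, RUNS = 10;

    public static void main(String[] args) {
        int[] a = new int[1000000];
        for (int run = 0; run < WARMUP + RUNS; run++) {
            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = a.length - 1; i >= 0; i--) {
                sum += a[i];                  // the loop body under test
            }
            long t1 = System.nanoTime();
            if (run >= WARMUP) {              // discard warm-up runs
                System.out.println((t1 - t0) + " ns (sum=" + sum + ")");
            }
        }
    }
}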

I was under the impression that the counting-backwards trick was only useful on older 386/486 chips. Due to an unfortunate me-vs.-powertool accident last night, I had some extra time to run this test today. In my little tests, method A was faster than the other two methods. I didn’t do anything in the loop.

Timers are in nanoseconds as reported by J3DTimer with a resolution of 3ns. All methods were run 20 times with only the last ten runs being recorded.
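For reference, the three loop shapes under test look like this (a sketch with the empty bodies noted above; the A/B/C labels are my reading of the thread, since the exact code wasn’t posted):

int a[] = new int[1000000];
for (int i = 0; i < a.length; i++) { }          // Method A (assumed): forward
for (int i = a.length - 1; i >= 0; i--) { }     // Method B (assumed): backward
for (int i = a.length; --i >= 0;) { }           // Method C (assumed): pre-decrement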

Method A
Average: 19662318
High: 20229735
Low: 19544787

Method B
Average: 23496394
High: 25086930
Low: 21526614

Method C
Average: 24127960
High: 26512731
Low: 19953093

I took the code from IDX3D and converted the for loops to Method A (they were already in Method B) and ran a few tests. You can download the code I used at: http://flick.nerdc.ufl.edu/IDXBench/ I gotta run so I can’t do the math, but here is enough output for you to play with. The output is the sample # and then FPS, taken every 25 seconds. The test was run on a Linux box with no other activity.

$ java -version
java version "1.4.1_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01-b01)
Java HotSpot(TM) Client VM (build 1.4.1_01-b01, mixed mode)

$ time java -cp . idx3d_TestApp > /dev/null
idx3dA.idx3d_Scene
0 29.99
1 32.25
2 30.68
3 30.94
4 32.42
5 32.0
6 31.77
7 32.25
8 32.32
9 31.18

real 4m12.390s
user 0m1.820s
sys 0m0.160s

$ time java -cp . idx3d_TestApp > /dev/null
idx3dB.idx3d_Scene
0 30.94
1 32.25
2 31.65
3 32.42
4 31.9
5 32.38
6 31.87
7 31.77
8 32.35
9 31.21

real 4m12.146s
user 0m1.960s
sys 0m0.120s

It looks like Method B is equally fast or slightly faster.

I played around with such optimizations some time ago for my own software renderer (similar to IDX3D, but better of course… :P) and I was unable to measure a positive effect from this optimization. One problem is that you usually access the array in the “do stuff” section, and it is far more cache friendly to do that from start to end than backwards. So I sometimes ended up slower after doing this “optimization”. Another idea I tried was to put “do stuff” in an endless loop and let Java’s range checking on arrays (which takes place anyway) terminate it by generating an exception, which I caught with a try-catch around the loop. Looks ugly, feels ugly, should be faster, but wasn’t.
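For the curious, the exception-terminated loop looked roughly like this (a sketch of the idea, not my exact code):

// Let the VM's mandatory bounds check end the loop instead of testing
// a condition ourselves. Measured no faster than a plain loop.
static long sumViaException(int[] a) {
    long sum = 0;
    int i = 0;
    try {
        while (true) {
            sum += a[i++];   // throws once i reaches a.length
        }
    } catch (ArrayIndexOutOfBoundsException e) {
        // expected: signals the end of the array
    }
    return sum;
}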

First, I’m not going to pretend I’ve done any timings on this BUT, in the general case, this code would seem to be optimal (addressing your concerns, EgonOlsen)…


int a[] = new int[1000000];
int j = -1;
for (int i = a.length; --i >= 0;) {
j++;
// Do stuff 
}

…where you use ‘j’ (not ‘i’) as your regular loop counter (then you don’t have to reverse your arrays or anything).

This seems like such a general case that I’d be tempted to think compilers could optimize it like this anyway, but then again they’re probably not sure whether ‘a.length’ will remain constant for every iteration of the loop (unless it was declared final or something).

I agree with leknor that most machine languages have special instructions for testing against zero, as opposed to loading a value into a register (which may itself take a fetch or two) and then checking it.

Mind you, these sorts of micro optimizations are pretty academic - they’re harder to read and not going to gain you anything over simply writing a better algorithm :)

Kenrod: I’ve read that newer JVMs can detect loops that won’t iterate past the array size and remove the bounds checking, which is a respectable speed gain.

I wouldn’t be surprised if the method you suggest above with the j++ obscures the array access pattern enough that the JIT chooses not to remove the bounds checking.

If someone has a link to a reasonably technical article on how JITs determine when they can skip the bounds check on an array access, I’d be interested.
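To make the concern concrete, here’s my guess at the two shapes involved (an assumption about typical JIT behavior, not something I’ve verified against any particular VM):

int[] a = new int[1000000];
long sum = 0;

// Index visibly bounded by a.length: a JIT can typically prove every
// access is in range and drop the per-access bounds checks.
for (int i = 0; i < a.length; i++) {
    sum += a[i];
}

// The j++ variant: the array is indexed by a second variable, so the
// JIT may not prove that j stays within [0, a.length) and the checks stay.
int j = -1;
for (int i = a.length; --i >= 0;) {
    j++;
    sum += a[j];
}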

I’ve been testing those loop methods on my image filter benchmark, and here is what came out:

test 1, using:

int nbpixels = 1024*1024;
for (int loop = 0; loop < nbpixels; loop++)
{
  // filtering pixel at offset loop
}

55 million pixels filtered per second ( on 10s, 30s and 60s tests consistently )

test 2, using:

int loop = 1024*1024 - 1;
for (; loop >= 0; loop--)
{
  // filtering pixel at offset loop
}

34 million pixels filtered per second. ( on 10s, 30s and 60s tests consistently )

test 3, using

int loop = 1024*1024 - 1;
int pos = 0;
for (; loop >= 0; loop--)
{
  // filtering pixel at offset pos++
}

50 million pixels filtered per second. ( on 10s, 30s and 60s tests consistently )

The conclusions I take from these tests:

  • read arrays from the beginning to the end
  • these optimisations don’t seem to be worth the effort.

The questions I have are: is the JVM testing against 0 using a special instruction, or does it use the standard one and load 0 into a register before testing? That would explain why there are no differences, if it can do that. We can also argue that the register load could take no time if done in another pipeline of the processor, blah blah blah… [paste the code for endless discussion here]

By the way, that was tested on a Sun 1.4 client JVM.

The JVM does not “use” a special instruction to test against 0. Although it has one, it’s turned quite readily into whatever the fastest way to test it is on current hardware. As far as I am aware, the current Pentium 2/3/4 chips execute nearly all of their basic instructions in a single cycle somewhere deep in a pipe. There is, effectively, no performance difference, excepting that Hotspot probably tries to do its bounds checking on the cheap by looking for the common case of counting from 0…n.

Not too long ago this was a valid performance optimisation, but not any more. Not too long ago it was faster to do maths using fixed point integer representations, but not any more. Unless you’re working on ARM chips of course :)

Cas :)

On the pipeline story: even if the register is loaded using another pipeline, you still have some overhead ((possibly) saving a register to the stack, loading the register, comparing, (possibly) reloading the register from the stack) compared to the normal case (just the compare).
Single-cycle instructions should not need to be pipelined, so the pipes are free for long operations. Thus, if the JIT doesn’t compare with special instructions, it will still take more time.
Would be nice to get the info from the JIT team.

By the way, what is the op for the ‘compare to 0’ on x86?

You don’t have to do an explicit compare to 0. Just decrement the register (the loop counter) and do a jnz (jump if not zero) or a jns (jump if not sign) back to the start of the loop. Once the register reaches 0 (or -1 for jns), the jump isn’t taken because the zero flag (or the sign flag) is set, and you are done.

Looks like we’ve got our x86 asm source. ;D

On x86, what is the register that is used for jumps?
As far as I remember, as opposed to the 68xxx, x86 has specialized registers. Are you sure this register would be usable for all the operations needed? If it has to be pushed to or popped from another one inside the loop, the gain would of course be null.

You can use the standard registers (eax, ebx, ecx, edx) without problems for this. The main problem is the very limited number of registers on x86. You have to use your registers cleverly to optimize such things if the code inside the loop gets more complex.

Cool, thanks a lot for the info.

Would be nice to have the native assembly versions of our routines as created by the JIT, by the way.