strange performance observation

Hi all,

I was just toying around, trying to optimize some things in my emulator, and stumbled on something that apparently has quite an impact on performance.
In an inner loop somewhere in the rendering code, I do this:


                            vdp.pixels[p++] = ((b & 0x80) == 0x80) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x40) == 0x40) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x20) == 0x20) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x10) == 0x10) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x08) == 0x08) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x04) == 0x04) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x02) == 0x02) ? fc : bc;
                            vdp.pixels[p] = ((b & 0x01) == 0x01) ? fc : bc;

This renders one line (8 pixels) of a character, testing each bit to decide whether the pixel should be drawn in the foreground or the background color.

Then, as a test, I tried changing it to the following (guessing that comparing against 0 might somehow be slightly faster):


                            vdp.pixels[p++] = ((b & 0x80) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x40) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x20) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x10) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x08) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x04) != 0) ? fc : bc;
                            vdp.pixels[p++] = ((b & 0x02) != 0) ? fc : bc;
                            vdp.pixels[p] = ((b & 0x01) != 0) ? fc : bc;

I didn’t really expect much of a change (if any), but this code made the whole emulator a whopping 8% slower! Considering there’s much more going on in the emulator, and that this code wasn’t even the main bottleneck, this performance degradation struck me as quite extreme.
Furthermore, I think this little change in code should (in a perfect world) not have made any difference in performance at all, because IMHO HotSpot should be able to compile both versions to something equally optimal.
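For what it's worth, here is a minimal, self-contained sketch of the two variants (the names `fc`, `bc`, `renderEq`, and `renderNe` are placeholders for illustration, not the emulator's actual fields). Both should produce identical pixels for every input byte, so any timing difference comes purely from the code HotSpot generates for the two compares:

```java
public class BitTest {
    static final int fc = 0xFFFFFF; // placeholder foreground color
    static final int bc = 0x000000; // placeholder background color

    // variant 1: compare the masked value against the mask itself
    static int[] renderEq(int b) {
        int[] px = new int[8];
        for (int i = 0; i < 8; i++) {
            int mask = 0x80 >> i;
            px[i] = ((b & mask) == mask) ? fc : bc;
        }
        return px;
    }

    // variant 2: compare the masked value against zero
    static int[] renderNe(int b) {
        int[] px = new int[8];
        for (int i = 0; i < 8; i++) {
            px[i] = ((b & (0x80 >> i)) != 0) ? fc : bc;
        }
        return px;
    }
}
```

Since each mask is a single bit, the two tests are logically equivalent; only the generated compare-and-branch can differ.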

What do you think, should I create a (quite uncritical ;)) bug report, or is there a good explanation for this?

BTW, I tested this on the Java 6 server VM.

-edit-: the change in code made it not 8% slower but almost 10%.

Wow. This would be a great time to build the debug VM and see what code HotSpot is spitting out. I’m pretty sure there is a flag for that, but it is only enabled in the debug builds.

You might also notice a real performance boost if you replace the

    p++
    p++
    p++
    p++

by

    p+0
    p+1
    p+2
    p+3

    p+=7

for unknown reasons. :-\

To get rid of the two-million-and-one if statements (which are just slow…):


int[] bg_fg = new int[]{ bc, fc }; // index 0 = background, 1 = foreground; cache this somewhere, do not create a new int[] every time

vdp.pixels[p+0] = bg_fg[ (b >> 7) & 1 ];
vdp.pixels[p+1] = bg_fg[ (b >> 6) & 1 ];
vdp.pixels[p+2] = bg_fg[ (b >> 5) & 1 ];
vdp.pixels[p+3] = bg_fg[ (b >> 4) & 1 ];
vdp.pixels[p+4] = bg_fg[ (b >> 3) & 1 ];
vdp.pixels[p+5] = bg_fg[ (b >> 2) & 1 ];
vdp.pixels[p+6] = bg_fg[ (b >> 1) & 1 ];
vdp.pixels[p+7] = bg_fg[ (b >> 0) & 1 ];
p += 8;

But the client VM (in 1.5 and older) might insert array bounds checks there. Oh well :)
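The table lookup is a drop-in replacement for the ternaries, which is easy to sanity-check over all 256 bit patterns. A small sketch (color values and method names are placeholders):

```java
public class LookupCheck {
    static final int fc = 0xFFFFFF; // placeholder foreground color
    static final int bc = 0x000000; // placeholder background color
    static final int[] bg_fg = { bc, fc }; // index 0 = background, 1 = foreground

    // the original branching version, one ternary per bit
    static int[] branchy(int b) {
        int[] px = new int[8];
        for (int i = 0; i < 8; i++) {
            px[i] = ((b & (0x80 >> i)) != 0) ? fc : bc;
        }
        return px;
    }

    // the branch-free version: shift the bit down and index the 2-entry table
    static int[] table(int b) {
        int[] px = new int[8];
        for (int i = 0; i < 8; i++) {
            px[i] = bg_fg[(b >> (7 - i)) & 1];
        }
        return px;
    }
}
```

The design point is that an unpredictable data-dependent branch per pixel becomes a plain indexed load, which the CPU never has to speculate on.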

It’s actually not that unknown…

In the incremental method the next result depends on the previous one, so the CPU has a hard time executing the stores in parallel. Whereas in the second approach each line is independent of the others, so the stores can be executed out of order, in parallel, or otherwise scheduled in whatever manner suits the CPU best.
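To make the dependency-chain point concrete, here is a simplified stand-in (not the emulator's actual code): both methods write the same 8 values, but the first threads every address computation through the previous `p++`, while the second derives all eight addresses from the same `p`:

```java
public class StoreChain {
    // serial chain: each store's address depends on the previous increment
    static void dependent(int[] pixels, int p, int v) {
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p++] = v;
        pixels[p] = v;
    }

    // independent stores: all eight addresses computed from the same base p
    static void independent(int[] pixels, int p, int v) {
        pixels[p + 0] = v;
        pixels[p + 1] = v;
        pixels[p + 2] = v;
        pixels[p + 3] = v;
        pixels[p + 4] = v;
        pixels[p + 5] = v;
        pixels[p + 6] = v;
        pixels[p + 7] = v;
    }
}
```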

Actually, in my case p++ was faster than p+1, p+2, etc.
Creating the 2-length array of colors did give me a nice speedup, though. Now I can emulate 10 MSX machines at once on my laptop :)

Ah, nice. How many could you run before? :)

9.2, now it’s 10.3 :)
It turns out the rendering was a bigger performance bottleneck than I thought (based on -Xprof output, which seems to be plain incorrect, as it shows this part taking less than 10% of the CPU time).

Xprof (or any profiler) changes the running code, or has such a big overhead, that in my experience it is best to assume the percentages can be off by about 5% either way from what it claims. For me, the best way to profile really important code (that I know is the bottleneck) is to temporarily scatter profiling code through it and compare that against the time it took to render that frame.
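That hand-rolled approach can be as simple as bracketing the suspected hot spot with `System.nanoTime()` and accumulating a per-frame total. A hedged sketch (`renderScanline` and `renderNanos` are made-up names, and the loop body is a placeholder for the real rendering work):

```java
public class FrameTimer {
    static long renderNanos; // accumulated time spent in the hot section this frame

    static void renderScanline(int[] pixels) {
        long t0 = System.nanoTime();
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] = i; // placeholder for the real per-pixel work
        }
        renderNanos += System.nanoTime() - t0;
    }
}
```

At the end of each frame you would print `renderNanos` next to the total frame time and reset it, giving a percentage that isn't distorted by profiler instrumentation.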

The real downside of Xprof is that you can’t start/stop profiling whenever you want, so your results are always skewed (unless you run it for an extremely long time, doing the same thing).

The NetBeans profiler is worse than Xprof in that respect: Xprof gets rid of inlined methods, while NetBeans measures those too, increasing the overhead significantly.

[quote]Xprof (or any profiler) changes the running code in such a way, or has such a big overhead that in my experience it is best to assume the percentages are in the range of 5% more or less than it actually claims.
[/quote]
Now it seems to be off by much more than that. I mean, if it really took less than 10%, the maximum speedup I could get by removing it completely would be about 10%, but I have already achieved a bit more than that with only some minor tweaks.

Also, the main bottleneck is getting the software-rendered image to the video card and displaying it. According to -Xprof this takes up about 35%, but when I comment it out the emulator runs at 1600 fps instead of 500, so -Xprof should actually have measured it as taking about 70%.
Although I suppose it’s possible that HotSpot then optimizes the software rendering away completely too, because its result is no longer used in that case (I’ll check the profiler output)…
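One cheap way to rule out dead-code elimination when commenting out the blit is to keep consuming the rendered buffer through a volatile sink, so HotSpot cannot discard the rendering work. A sketch (the sink idea is standard micro-benchmarking practice, not anything from the emulator itself):

```java
public class RenderSink {
    static volatile int sink; // a volatile write cannot be optimized away

    static void consume(int[] pixels) {
        int acc = 0;
        for (int px : pixels) {
            acc ^= px; // cheap reduction touching every rendered pixel
        }
        sink = acc;    // forces the rendering result to remain observably "used"
    }
}
```

If the fps with the blit commented out drops back toward the original number once `consume` is called each frame, the 1600 fps figure was partly dead-code elimination rather than the true cost of the blit.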