Question: Branch Prediction Alleviating Performance Loss in Looped Conditionals?

A-Bomb · April 8, 2017, 11:18am

So I’ve had a question for a long time that I’ve just not yet found the answer to… or at least an answer that I’m %100 sure I’m interpreting correctly…
TL;DR at the bottom btw!

First off:
My latest framework for a software renderer I’m writing in Java uses a very fast sprite drawing system with efficient clipping, instantaneous orthogonal transformations and now fast integral alpha compositing built right into the draw function pretty seamlessly (I’m trying to waste as little cycles as I possibly can since it’s not hardware accelerated). The only thing is… I try not to repeat code as much as possible but when I first implemented the transform portion of the draw function I had to make some rather ugly looking exceptions in favor of speed (or perhaps more realistically, the illusion of speed? I don’t know, that’s what I’m here to find out). So basically (very basically) it’s set up like this at the moment:


public void drawSprite(blah... blah)
{
    //precalculate clipping offsets
    ...
    //interpret transforms
    ...
    switch(transform)
    {
         case 1:
         for(i)
             //precalculate y index stuff blah blah
             for(j)
                 //interesting stuff; actual blending and setting of pixels
         break;

         case 2:
         for(i)
             //precalculate y index stuff blah blah
             for(j)
                 //interesting stuff; actual blending and setting of pixels
         break;

         case 3:
         for(i)
             //precalculate y index stuff blah blah
             for(j)
                 //interesting stuff; actual blending and setting of pixels
         break;

         default:
         for(i)
             //precalculate y index stuff blah blah
             for(j)
                 //interesting stuff; actual blending and setting of pixels
         break;
    }//so basically the same exact stuff in each case but some expressions are
     //A) changed around and have different operations done on them and
    //B) results correlate DIRECTLY to the loop variables so a pre-loop
   //switch won't work
}

You can imagine how ugly it is every time I add functionality to this method because I have to add it 4 times. The only other way I could do it is if I put the switch inside the outer loop, then I’d only need to write the loops once and add to them whatever I need. The thing is that I’m afraid it will affect performance later. I’ve ran some tests and it doesn’t seem to have too big of a footprint with 64 24x24 sprites onscreen simultaneously, however I don’t know to what extent performance may degrade when there are bigger sprites and more of them as a result of those few extra couple thousand or so instructions.

TL;DR:
SO! My MAIN question to you, the Java community:

Do you think it is worth it to just stick with my current format and rewrite everything multiple times just to save on a few unnecessary cycles?
OR do you think that the technology I have at my disposal is intelligent enough to alleviate such an obvious bottleneck through hardware magic such as branch prediction? Specifically, wouldn’t this be an ideal situation in which the branch predictive properties of our modern CPUs would be able to assist? The branch predictor can detect long sequences of similar conditional results, right? Is my understanding of how that works even accurate? Any help to point me in the right direction is appreciated, thank you!

theagentd · April 8, 2017, 8:28pm

What you’re doing is probably the fastest solution, as in this case the switch statement only needs to run once per sprite instead of once per pixel. However, you’re right that branch prediction reduces the performance cost of branches if the branching is very coherent, which is obviously the case when every single pixel in the sprite takes the same path. However, branch prediction is far from perfect, and you can overwhelm the branch prediction “cache” which keeps track of the previous results branches, as the branch prediction hardware can only track a certain number of branches. This is extremely hardware dependent, so it’s hard to reason around best practices in that sense. My tip to you is simply to test it; try to simplify your drawSprite() function in the way you described (move the switch into the loop) and see what kind of effect it has on performance. Note that it is sometimes not worth doing a branch if it will only save you a couple of instructions, as the cost of the branch (even with branch prediction) can be higher than simply always executing the operations.

The whole thing is quite annoying indeed. It’d be really nice if the compiler could just deal with stuff like this. For example, let’s say that you have a loop that repeatedly calls a virtual function:


for(int i = 0; i < array.length; i++){
    interfaceImplementation.virtualFunction(array[i]);
}

AFAIK, in this case you will have to pay the cost of the overhead of calling a virtual function (where Java essentially does a switch() statement on the class of the object to determine which implementation of the function to use). It’d be nice if the compiler would realize that the same function is called repeatedly, looks up the function before the loop starts once, then just reuses that for the entire loop. In addition, you’d kinda want the compiler to be able to handle your case too (which one could argue is simpler). By coding:

public void drawSprite(blah... blah)
{
    //precalculate clipping offsets
    ...
    //interpret transforms
    ...

    for(i)
        //precalculate y index stuff blah blah
        for(j)
            switch(transform){
                 case 1:
                     //interesting stuff; actual blending and setting of pixels
                     break;

                 case 2:
                     //interesting stuff; actual blending and setting of pixels
                     break;

                 case 3:
                     //interesting stuff; actual blending and setting of pixels
                     break;

                 default:
                     //interesting stuff; actual blending and setting of pixels
                     break;

    }
}

you’d want the compiler to essentially compile the whole thing to the code that you wrote, essentially placing the switch statement outside the loop, then generating one loop for each of the possible branches. It’d be soooo cool if this was an actual thing… =/ I’m afraid I don’t have a good solution to your problem, but I’d probably choose the flexibility and maintainability of having the switch-statement inside the loop (so you can avoid duplicating the for-loops). I suspect the performance penalty will be relatively small.

TL;DR: Benchmark it (and post the results in this thread =P).

Roquen · April 8, 2017, 8:55pm

Sound like using a @FunctionalInterface might be the thing to do.

“let’s say that you have a loop that repeatedly calls a virtual function:”
If it can determine that the actual number of targets is one then it becomes a direct call, two or less it become an ‘if’ test between them, otherwise the standard read address and jump to it thing.

A-Bomb · April 9, 2017, 12:43am

Thanks a lot, I appreciate your in depth explanation. And I figured things might be as such. I also wish things could be that simple lol but I suppose I’ll do some more tests. Rather, I’ll go ahead and put the switch into the outer loop to condense the function for now; I’ll observe what I can as I include more functionality and move toward performing more stressful testing. This way, I can keep adding whatever functionality I require and making progress quickly and easily; and besides, after I’m done with it and the framework is nearing completion I could always simply revert it back to the way I had it before I use it or release it or decide whatever I’m going to do with it…! Anywho, thanks a bunch; I also really appreciate the insight into the theoretical and practical limitations of branch prediction!

theagentd · April 9, 2017, 5:10am

I actually did some benchmarking for this some time ago where I compared virtual function calls vs a switch statement used to enter different functions. Essentially, I needed it to implement a software command buffer for OpenGL where I queue up commands and then I can execute/replay that list of commands as many times as I want. In my test, I had four very simple “commands” that I stored in different ways and measured the time it took to execute the commands. The commands simply did one of four mathematical operations on an int (addition, subtraction, multiplication or division). I tried 4 different methods for “encoding” the commands:

The commands are stored as Command objects which hold the arguments for the command and a simple execute() function. To execute the commands, I simply loop over the Command[] and call execute() on each of the commands.
The commands are stored in an int[] as an int command ID followed by their arguments. To execute the commands, I loop over the int[], extracting the ID. Then I use a switch() statement on the int ID to go to the right function for that command, which in turn extracts its arguments from the int array.
The commands are stored as singleton Command objects (so only 4 Command objects exist in total) in a Command[], with the arguments for the commands in a separate int[]. To execute the commands, I loop over the Command[] and call a different execute() function which extracts its arguments from the int[] and then executes the command.
The commands are encoded as an int command ID followed by their arguments into a native memory pointer allocated and written to using Unsafe. To execute the commands I use a loop that extracts the next command ID. A switch() statement is used to go to the right function based on the command ID and the function itself extracts its arguments.

Here are the performance timings of executing 16 384 commands with the four techniques described above:

Object: 0.247459 ms
Int:    0.117624 ms
Poly:   0.239245 ms
Unsafe: 0.108612 ms

As you can see, the techniques using virtual functions are around half as fast as using a switch statement. Using a simple array of Command instances turned out to be the slowest solution due to worse memory coherency (not all the Command objects get laid out linearly in memory after each other, and this can get worse as the program runs and the GC moves things around). Even using only four singleton Command instances gave pretty bad performance, which tells us a lot of the raw overhead of virtual function calls.

The two techniques using switch() statements on a command ID were consistently over twice as fast as the two that use virtual function calls. Most likely this improved performance comes from the ability of the compiler to inline the non-virtual calls in each case of the switch() statement. The cost of picking the right function is the same, but the cost of CALLING the right function disappears thanks to inlining.

Roquen · April 9, 2017, 6:57am

Of course for perf it’s really hard to generalize. On average a switch will win when using it can allow optimizations that wouldn’t otherwise occur like not spilling registers between inside/outside the statement. So the amount of work inside/outside is a really bad guess.

A-Bomb · April 9, 2017, 6:16pm

I see. It’s funny that you mention your Command Object interface (using that as a general term) because what you described there is actually identical to the system I’m using to translate raw user input into game logic. There’s a big lookup table in the class for each supported type of device (sizes vary; for example, the mouse class uses only a size of 16); in my case, the keyboard holds an array of 65536 bytes. Each byte represents A) the state of the key and B) the Command index at which the previously assigned game command was stored in the global command table (for which a maximum of 128 commands can exist at one time). It’s actually pretty cool because not only is it an integral portion of my framework, it’s also an extremely versatile tool for debugging! I use it to debug literally everything in the framework! Because it works as simple as:

write a Command object
create an instance of it
assign it a key

And that’s all you gotta worry about! It gets automatically executed for you every frame based on whether or not the key you assigned it is pressed (or whether or not the command itself is ON in the case of a toggled command).