I’ve managed to get render passes in my graphics abstraction for Vulkan working. Now it’s time to implement software command buffers for OpenGL. I decided to do an experiment to find out what the best way of encoding OpenGL commands would be.
Method 1: The most obvious way is to use objects as commands. For each command added to the command buffer, I create an object for that specific command, store the arguments for the command in it and add it to a list. This has the advantage of being easy to handle: adding a new command simply means writing a new implementation of an interface. However, I’ve heard that calling virtual methods like that can be slow, and in my case there would be hundreds of different commands (although 95% would be used very rarely). To avoid generating an insane amount of garbage, I’d need to pool commands aggressively.
int value = 0;
for(int i = 0; i < NUM_COMMANDS; i++){
    value = objectCommands[i].process(value); //Each command is its own object
}
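A minimal sketch of what the pooling for method 1 could look like. Everything here is an assumption made for illustration: the `ObjectCommand` interface, the single `AddCommand` type, and the `PooledCommandBuffer` class are hypothetical names, and a real buffer would need one pool per command type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; the loop above only shows process(value) being called.
interface ObjectCommand {
    int process(int value);
}

// One concrete command type that carries its own argument.
final class AddCommand implements ObjectCommand {
    int arg;

    @Override
    public int process(int value) {
        return value + arg;
    }
}

// Records commands as objects and recycles them instead of leaving them for the GC.
final class PooledCommandBuffer {
    private final ArrayDeque<AddCommand> addPool = new ArrayDeque<>();
    private final List<ObjectCommand> commands = new ArrayList<>();

    void recordAdd(int arg) {
        AddCommand cmd = addPool.poll(); // returns null when the pool is empty
        if (cmd == null) {
            cmd = new AddCommand();
        }
        cmd.arg = arg;
        commands.add(cmd);
    }

    int execute(int start) {
        int value = start;
        for (int i = 0; i < commands.size(); i++) {
            value = commands.get(i).process(value); // virtual call per command
        }
        return value;
    }

    // Returns every recorded command to its pool and empties the buffer.
    void reset() {
        for (ObjectCommand cmd : commands) {
            if (cmd instanceof AddCommand) {
                addPool.push((AddCommand) cmd);
            }
        }
        commands.clear();
    }

    int pooledCount() {
        return addPool.size();
    }
}
```

After a few record/reset cycles the pooled objects are handed out in an order unrelated to their position on the heap, which is the situation the shuffled variant of the benchmark tries to simulate.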
Method 2: Secondly, I wanted to try encoding each command as a command ID number (an int), with the arguments to the command stored in the same int array right after the ID. To execute the command buffer, I’d read the command ID, do a switch() on it to find the correct function to call and then read arguments from the int array depending on the command. Floats would be encoded using Float.floatToRawIntBits() and stored in the array too. This is much more complicated to maintain, as adding a new command means manually encoding arguments to the int array, adding a new ID to the switch() statement and decoding the arguments again in the function. On the other hand, it has the potential to be much faster, since all commands are stored sequentially in memory instead of being spread out all over the heap in objects, and virtual method calls are avoided.
int value = 0;
for(int i = 0; i < NUM_COMMANDS; i++){
    int cmd = intCommands[(i<<1) + 0]; //Command ID
    int arg = intCommands[(i<<1) + 1]; //Argument stored right after the ID
    switch(cmd){
        case 0: value = ADD_COMMAND.process(value, arg); break; //Call singletons to calculate value.
        case 1: value = SUB_COMMAND.process(value, arg); break;
        case 2: value = MUL_COMMAND.process(value, arg); break;
        case 3: value = DIV_COMMAND.process(value, arg); break;
    }
}
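The benchmark loop only decodes one int argument per command. As a rough sketch of the encoding side of method 2, including a float argument stored via `Float.floatToRawIntBits()`, something like the following could work. The class name, the command IDs and the two example commands are hypothetical and not part of the benchmark.

```java
import java.util.Arrays;

// Stores commands as [id, arg] pairs in a growable int array.
final class IntEncodedCommands {
    static final int CMD_ADD = 0;   // followed by one int argument
    static final int CMD_SCALE = 1; // followed by one float argument as raw bits

    private int[] data = new int[16];
    private int size = 0;

    private void put(int v) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // grow when full
        }
        data[size++] = v;
    }

    void recordAdd(int amount) {
        put(CMD_ADD);
        put(amount);
    }

    void recordScale(float factor) {
        put(CMD_SCALE);
        put(Float.floatToRawIntBits(factor)); // bit-exact float storage
    }

    float execute(float start) {
        float value = start;
        int pos = 0;
        while (pos < size) {
            int cmd = data[pos++];
            switch (cmd) {
                case CMD_ADD:
                    value += data[pos++];
                    break;
                case CMD_SCALE:
                    value *= Float.intBitsToFloat(data[pos++]);
                    break;
                default:
                    throw new IllegalStateException("Unknown command ID " + cmd);
            }
        }
        return value;
    }
}
```

Each case decides how many ints it consumes, so commands with different argument counts can share one array. That is also where the maintenance cost shows up: the record method and the matching case have to agree on the layout by hand.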
Method 3: Thirdly, I wanted to try something in between 1 and 2. Instead of creating a new object for each command invocation, I would create a singleton for each possible command. When encoding a command, the singleton for that command would be placed in a list, and the arguments for it would be encoded into an int array again. This also has very good cache locality, as all the command arguments are stored sequentially in memory, while being a bit easier to maintain, as each command is again contained in its own class implementing an interface (although arguments still have to be encoded into and decoded from the int array).
int value = 0;
for(int i = 0; i < NUM_COMMANDS; i++){
    Command cmd = staticObjectCommands[i]; //References to just 4 singleton commands
    int arg = staticObjectArgs[i];
    value = cmd.process(value, arg);
}
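One way the singletons for method 3 could be set up is with an enum whose constants each implement the command interface. This is only a sketch under that assumption: `ArgCommand`, `Ops` and `SingletonCommandBuffer` are hypothetical names, and the benchmark itself may have used plain classes with static instances instead.

```java
import java.util.Arrays;

// Hypothetical interface matching the process(value, arg) call in the loop above.
interface ArgCommand {
    int process(int value, int arg);
}

// Each enum constant is a stateless singleton command.
enum Ops implements ArgCommand {
    ADD { @Override public int process(int value, int arg) { return value + arg; } },
    SUB { @Override public int process(int value, int arg) { return value - arg; } },
    MUL { @Override public int process(int value, int arg) { return value * arg; } },
    DIV { @Override public int process(int value, int arg) { return value / arg; } }
}

// Parallel arrays: one singleton reference and one int argument per recorded command.
final class SingletonCommandBuffer {
    private ArgCommand[] commands = new ArgCommand[16];
    private int[] args = new int[16];
    private int count = 0;

    void record(ArgCommand cmd, int arg) {
        if (count == commands.length) {
            commands = Arrays.copyOf(commands, count * 2);
            args = Arrays.copyOf(args, count * 2);
        }
        commands[count] = cmd;
        args[count] = arg;
        count++;
    }

    int execute(int start) {
        int value = start;
        for (int i = 0; i < count; i++) {
            value = commands[i].process(value, args[i]); // still a virtual call
        }
        return value;
    }
}
```

Recording allocates nothing per command, so no pooling is needed, but the call site in `execute` still dispatches through the interface for every command.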
My guess here was that method 1 would be significantly slower than the other two, as the command objects will end up all over the heap after pooling and virtual methods are used. Comparing methods 2 and 3, I assumed that they would perform almost identically, as Java should internally do essentially the same thing I’m doing by hand (method 2 does a switch over the command ID, while method 3 would internally do a switch over the command classes to pick the right function to call).
Actual results:
- Method 1: 16.204294 ms (29.375046 ms with shuffling)
- Method 2: 8.42079 ms
- Method 3: 15.7727375 ms
When cache locality is good, methods 1 and 3 are pretty much identical. If the command array of method 1 is shuffled (simulating what would happen after a few minutes of garbage collection and pooling), performance drops noticeably due to bad cache locality, taking almost twice as long as method 3 in this test. With good locality, performance is apparently limited by the virtual function call overhead. What surprised me was that manually doing a switch statement in method 2 was significantly faster than Java’s virtual method selection, almost 2x faster.
Something to remember is that this does not take into consideration command “encoding” or advanced decoding of arguments, and that it is a fairly synthetic benchmark with only simple functions. Encoding is less important, since command buffers can be encoded from multiple threads, but decoding will be extremely time critical, as it is done solely on the OpenGL thread. I think I’ll have to do more experiments before I decide on what to do…
EDIT: Hmm. Encoding command data into an int[] is pretty silly. Better to just use ByteBuffer or even a raw memory pointer to write data to. Something really awesome would be if I could “map” the buffer to a struct to give me a cleaner interface to it. Maybe it’s time to revisit the good old MappedObject stuff?
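A small sketch of what the ByteBuffer variant could look like, using only standard `java.nio` calls. The class name, the one-byte command IDs and the two example commands are hypothetical, and the struct-style mapping mentioned above is not attempted here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Writes a 1-byte command ID followed by typed arguments into a direct buffer.
final class ByteBufferCommands {
    static final byte CMD_ADD = 0;   // followed by one int
    static final byte CMD_SCALE = 1; // followed by one float

    private final ByteBuffer buf;

    ByteBufferCommands(int capacityBytes) {
        buf = ByteBuffer.allocateDirect(capacityBytes).order(ByteOrder.nativeOrder());
    }

    void recordAdd(int amount) {
        buf.put(CMD_ADD).putInt(amount);
    }

    void recordScale(float factor) {
        buf.put(CMD_SCALE).putFloat(factor); // no manual bit conversion needed
    }

    float execute(float start) {
        // duplicate() shares the contents but resets the byte order, so restore it.
        ByteBuffer in = buf.duplicate().order(buf.order());
        in.flip(); // read from 0 up to the write position
        float value = start;
        while (in.hasRemaining()) {
            byte cmd = in.get();
            switch (cmd) {
                case CMD_ADD:
                    value += in.getInt();
                    break;
                case CMD_SCALE:
                    value *= in.getFloat();
                    break;
                default:
                    throw new IllegalStateException("Unknown command ID " + cmd);
            }
        }
        return value;
    }

    void reset() {
        buf.clear(); // rewinds the write position; old contents get overwritten
    }
}
```

Compared with the int[] version, the typed put/get calls remove the manual float conversion and allow arguments narrower or wider than 32 bits. The buffer here has a fixed capacity, so recording past it throws a BufferOverflowException.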