I’ve spent some free time tonight working on the boss AI for the main rival in the game.
Mostly fixing bugs: some caused sound spamming, and others made his spear rotate strangely. They all turned out to be typos that took forever to find. Don’t you just love that?
I also did some GUI work to make it look cleaner and to make sure that elements requiring your attention flash appropriately to grab it.
Not nearly as cool as the stuff agentd did and I don’t have any cool numbers for you, but I can say that my AI is hyper lethal AND stable.
[quote]
The wrongly named transparency packing pass (i.e. the complete precomputation pass, to be renamed) went up by 0.1 ms, but the fast resolve pass shaved off 0.1 ms. The full resolve pass also got quite a bit faster, losing almost 0.3 ms. All in all, we’re down to just under 1.6 ms, a pretty amazing result considering the same process took 3.4 ms a few days ago.
[/quote]
With those precomputation optimizations, is the fast resolve pass + stencil marking still a win?
Yes. When only doing 2xTSRAA, performance is the same with and without stencil testing, so I’m considering disabling the stencil mask in that specific case. For 4x, 8x and 16x, the stencil mask is a huge win: the fast resolve shader is equally fast regardless of sample count, while the full resolve shader’s cost explodes.
I will double-check my results tomorrow and compare them more extensively.
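For anyone wondering what the stencil marking actually does: conceptually it’s just a classification pass that marks “complex” pixels in the stencil buffer, so that the expensive full resolve only runs on those pixels while the fast resolve handles the rest. Below is a very rough sketch of how such a pass could look; this is not my actual code, and the idTex name, the host-side setup in the comments and the classification criterion (all SRAA IDs agreeing) are just illustrative assumptions.
//Illustrative sketch of a stencil-marking pass, not the real shader.
//Host side (illustration only): clear stencil to 0, draw a fullscreen pass with
//glStencilFunc(GL_ALWAYS, 1, 0xFF) and glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE),
//then run the full resolve with glStencilFunc(GL_EQUAL, 1, 0xFF) and the
//fast resolve with glStencilFunc(GL_NOTEQUAL, 1, 0xFF).
uniform sampler2DMS idTex; //placeholder name for the SRAA ID buffer
void main(){
    ivec2 pixel = ivec2(gl_FragCoord.xy);
    float first = texelFetch(idTex, pixel, 0).r;
    for(int i = 1; i < SRAA_SAMPLES; i++){
        if(texelFetch(idTex, pixel, i).r != first){
            return; //IDs disagree = complex pixel, keep the fragment so the stencil gets marked
        }
    }
    discard; //all IDs agree = simple pixel, leave stencil at 0
}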
I lost my USB drive today, which has all my code on it (I prefer USB to cloud, just in case I ever need to go out and about with my laptop or another computer). I have a bad habit of not keeping my backups up to date. Luckily, the most recent one is only a week old. So yeah, not much…
What I would recommend is backing up to the cloud as well as keeping your backups on your USB drive. It’ll prevent situations like the one you’re in now, and source control isn’t very hard to use, especially when you’re the only one pushing changes.
Of course I just had to go and get a small idea while trying to sleep, and surprisingly it had a huge impact.
In the full resolve shader, for each color sample (5 from the current frame, 5 from the previous frame = 10 total color samples) I run the following code to sum up the color of all SRAA samples:
//float weight = reprojection weight based on motion vector length: 1.0 for the current frame and 0.0-1.0 for the previous frame
for(int i = 0; i < SRAA_SAMPLES; i++){
    float w = weight * float(ids[i] == color.a);
    sraaSamples[i] += vec4(color.rgb, 1) * w;
}
The idea here is to avoid a branch by casting the boolean ID-matching result to a float, which converts true to 1.0 and false to 0.0, exactly what I want to multiply the weight by. Looking at the assembly code on both AMD and Nvidia hardware, it was a mess. On AMD, the code was riddled with seemingly unnecessary MOVs that just shuffled data around, and it used ~25 registers. On Nvidia, it used a massive 36 vec4 registers while it should optimally be able to get by with fewer than 10. The compiler clearly didn’t do a good job there. Both the Nvidia and AMD code looked massively reorganized; the compilers had clearly reordered the instructions a lot, which seems to be what bumped up the register requirements and introduced the MOVs.
Register count is a funny quirk of shaders. Shader cores rely on the ability to hide latency to stay busy and get optimal throughput. Basically, when a shader core hits a texture read that’ll take some time to finish, it immediately switches the whole work group to something else so that it doesn’t have to sit idle. For that reason, each shader core has (compared to a CPU) an unusually large number of registers, so that it can keep many shader invocations loaded at once and work on whichever isn’t blocked. If your shader uses a lot of registers, it limits the number of invocations that can be resident at the same time, which can reduce performance in very unpredictable ways. Even worse, the register count is almost impossible to predict as it depends completely on the compiler. Anyway, since the main bottleneck of the full resolve shader seemed to be ALU performance (possibly amplified by the high register count) and that loop was pretty much the only ALU work in the entire shader, it seemed worth trying a few different things to speed it up.
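To make that concrete with some back-of-the-envelope arithmetic (the register file size below is a made-up round number for illustration, not the spec of any particular GPU):
//Purely illustrative occupancy arithmetic, not measured numbers:
//register file per shader core:           65536 registers
//shader using 32 registers/invocation  -> 65536 / 32 = 2048 invocations can be resident
//shader using 64 registers/invocation  -> 65536 / 64 = 1024 invocations can be resident
//Fewer resident invocations = fewer candidates to switch to while a texture fetch is in flight.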
Out of curiosity, I tried simply rewriting the code like this:
for(int i = 0; i < SRAA_SAMPLES; i++){
    sraaSamples[i] += vec4(color.rgb, 1) * weight * float(ids[i] == color.a);
}
In raw instruction count, this is actually slower. The original code multiplied together two floats (weight * float(…)), then multiplied this new float by a vec3 (color.rgb * w) for a total of 4 instructions. The new one-liner is actually a lot more instructions. First we do color.rgb * weight, which is 3 instructions; the 1 * weight is simply optimized away. After that, we do * float(…), which is another 4 instructions, for a total of 7. The funny part is that according to the AMD Shader Analyzer, this slower code should perform 50% faster! The assembly looks completely different, but in the end it didn’t matter: tests on both Nvidia and AMD hardware showed that performance was pretty much identical. The Shader Analyzer is most likely just using an older, quirkier GLSL compiler than my live drivers.
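Spelling that count out directly on the source (this is just how I count the source-level operations; the actual compiled output looks nothing like it):
//One-liner version:
sraaSamples[i] += vec4(color.rgb, 1) * weight * float(ids[i] == color.a); //3 MULs (1*weight folded) + 4 MULs = 7 in total
//Original two-line version:
float w = weight * float(ids[i] == color.a); //1 MUL
sraaSamples[i] += vec4(color.rgb, 1) * w;    //3 MULs (1*w folded), 4 in total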
I decided to try getting rid of my clever boolean-to-float cast optimization and use if() statements instead, just to see what would happen. The idea was that if the driver is smart enough, it’ll do the same thing I did, but maybe better optimized since it’s free to make more liberal optimizations. Plugging the code into the Shader Analyzer showed a grim future: throughput was predicted to drop to half the original value, but at least the assembly resembled the original source code more. On Nvidia, the assembly even had the exact same structure as the source code, which was a bit cool at least. I ran the code, expecting the shader to slow down to a crawl as I was doing 80 branches per pixel…
for(int i = 0; i < SRAA_SAMPLES; i++){
    if(ids[i] == color.a){
        sraaSamples[i] += vec4(color.rgb, 1) * weight;
    }
}
BAM! 34-50% better performance on Nvidia and ~10% better performance on AMD! Why is it faster? It seems to come down to register count. The register count on Nvidia at 8xAA dropped from 37 to 19, and I assume something similar happened on AMD.
Here’s the performance summary. Note that the GTX 770 is a much faster card, so the difference between AMD and Nvidia isn’t relevant.
The if-statement version seems to be faster in every case EXCEPT 16xAA on Nvidia hardware. In that specific case, performance with the if-statements just drops.
What does all this mean? Well, before the new optimizations the stencil mask always had a positive or no impact. With these optimizations, 2xAA is actually a tiny tiny bit slower. 8xAA still gets a small boost from the stencil mask, and in special cases like when you’re looking up into the sky the stencil mask obviously works wonders. For now, I think I will stick to having it disabled though.
Fun fact: AMD doesn’t support 16x MSAA, so I can’t test 16x on AMD. Nvidia doesn’t actually support 16x MSAA either; they just give you 2x2 OGSSAA + 4x MSAA. It’s even possible to force 32xAA through the Nvidia control panel, which is 2x2 OGSSAA + 8x MSAA. If you’re really crazy, you can get 64x SSAA by turning on SGSSAA and running 2x2 OGSSAA + 8xSGSSAA + 2x SLI supersampling. I’m not sure if it’s supported, but you could theoretically get 128x supersampling with 4 GPUs.
EDIT:
Bonus chart! This is the result of the cumulative optimizations that I’ve talked about in my last 3 posts:
The code is a bit messy right now, but the gist of it is a nested loop. On one hand, you have 10 color samples with different weights, and on the other hand you have N SRAA samples (2, 4 or 8), so we basically have 10*N iterations in total. This can be implemented in two ways: either the SRAA samples are the inner loop (what I have now) or the color samples are the inner loop (what I originally had). The data for the inner loop needs to be sampled before the nested loop runs, since the inner loop is run multiple times and we can’t afford doing 10+10*N samples instead of 10+N.
If the SRAA samples are the inner loop, it means we first read the N SRAA IDs (just 1 float each) at the start. We also need to accumulate a color and a total weight to divide by at the end, so in total we need (4+1)*N values stored. For 2, 4 and 8 samples, that’s 10, 20 and 40 values. Then, for each color sample (the outer loop), we add the color multiplied by its weight to all SRAA samples that it matches (the inner loop that I posted in my previous post).
If the color samples are the inner loop, we need all 10 color samples in memory (1 vec4 each for packed RGB+ID), meaning we need a constant 40 values stored. For each SRAA sample (the outer loop), we loop over all the color samples (the inner loop), check which color samples match, and sum them up multiplied by their respective weights.
To visualize all this with a simple code example…
//Version 1:
for(int i = 0; i < n; i++){
    vec4 outerData = texture(..., i); //n samples
    for(int j = 0; j < m; j++){
        vec4 innerData = texture(..., j); //n*m samples, BAD!
        //compare data...
    }
}
//Total of n + n*m samples

//Version 2:
//Cache inner loop data
vec4 innerLoopData[m];
for(int j = 0; j < m; j++){
    innerLoopData[j] = texture(..., j); //m samples
}
for(int i = 0; i < n; i++){
    vec4 outerData = texture(..., i); //n samples
    for(int j = 0; j < m; j++){
        vec4 innerData = innerLoopData[j];
        //compare data...
    }
}
//Total of n + m samples
As you can see, if you want to do only n+m samples, you need to cache the data of the inner loop. Choosing the N SRAA samples as the inner loop uses less memory, as N*5 <= 40 (N = 2, 4, 8). Like I said, I used to do it the other way around, but since I gained ~2-3x better performance by doing it the current way, I obviously decided to stick with it.
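To tie it back to the actual shader, the structure I ended up with looks roughly like this. This is a simplified sketch; the real fetching and weighting code is more involved, and idTex, fetchColorSample and computeWeight are placeholder names, not my real identifiers.
//Simplified sketch of the chosen nesting: SRAA samples as the inner loop.
//Assumes a sampler2DMS uniform called idTex (placeholder name) holding the SRAA IDs.
ivec2 pixel = ivec2(gl_FragCoord.xy);
float ids[SRAA_SAMPLES];
vec4 sraaSamples[SRAA_SAMPLES]; //rgb = weighted color sum, a = total weight
for(int i = 0; i < SRAA_SAMPLES; i++){
    ids[i] = texelFetch(idTex, pixel, i).r; //cache the N SRAA IDs up front
    sraaSamples[i] = vec4(0.0);
}
for(int c = 0; c < 10; c++){ //5 current-frame + 5 previous-frame color samples
    vec4 color = fetchColorSample(c); //rgb = color, a = ID (placeholder helper)
    float weight = computeWeight(c);  //1.0 for the current frame, 0.0-1.0 for the previous frame (placeholder helper)
    for(int i = 0; i < SRAA_SAMPLES; i++){
        if(ids[i] == color.a){
            sraaSamples[i] += vec4(color.rgb, 1) * weight;
        }
    }
}
//Each sraaSamples[i].rgb is then divided by the accumulated weight in sraaSamples[i].a.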
@theagentd
I am one of those guys who likes backends, so I don’t really care about OpenGL and shaders and stuff, but I am a fan of your explanations.
I enjoy reading your posts and the way you share the thoughts you had about the things you show.
I finally gave in to popular demand and ported RFLEX to Android using libGDX. Currently the game and menus are done; all that’s left to port to GDX is the editor (which is pretty much done) and the level select, which I haven’t even started :-).
Reactions among friends have ranged from “wow, that’s cool!” to “where do I get it?”. I am very happy with it.