Took another shot at anti-aliasing improvements today.
When using MSAA with deferred shading, it’s pretty much mandatory to have some kind of system that only runs lighting and complex shaders per sample on the pixels that require it. At 8x MSAA, shading every sample means literally 8x the amount of lighting work, which would grind anything to a halt. Traditionally, this has been done by analyzing the depth and normals of the 8 samples of each pixel and determining if the pixel can be shaded once or if all 8 samples need to be shaded for correctness. This test produces a stencil mask, and lighting is then done twice with two different shaders. The first pass only processes pixels that can be shaded once, and the second only processes pixels that need all 8 samples shaded. Branching in the shader to select per-pixel or per-sample shading generally leads to extremely bad performance and scheduling (one per-sample shaded pixel forces all neighboring pixels to run at per-sample resolution as well), while the stencil test allows for much better scheduling by the GPU. Newer compute shader based techniques like tile-based deferred shading work along similar lines, postponing all pixels that need per-sample shading to a second “pass” in the same compute shader.
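As a rough CPU-side illustration of the classification step described above (not actual shader code), the sketch below marks a pixel for per-sample shading when its samples disagree in depth or normal. The function names, the thresholds `depth_eps` and `normal_cos_min`, and the data layout are all assumptions made for the example; a real implementation would do this in a shader and write the result to the stencil buffer.

```python
def needs_per_sample(depths, normals, depth_eps=0.01, normal_cos_min=0.99):
    """Return True if the MSAA samples of one pixel differ enough that
    shading the pixel once would be incorrect.

    depths  : list of per-sample depth values
    normals : list of per-sample unit normals as (x, y, z) tuples
    The thresholds are hypothetical and would need tuning per engine.
    """
    d0, n0 = depths[0], normals[0]
    for d, n in zip(depths[1:], normals[1:]):
        # A depth discontinuity means the samples hit different surfaces.
        if abs(d - d0) > depth_eps:
            return True
        # A low dot product means the normals point in different directions.
        dot = n[0] * n0[0] + n[1] * n0[1] + n[2] * n0[2]
        if dot < normal_cos_min:
            return True
    return False


def build_stencil_mask(pixels):
    """pixels is a list of (depths, normals) pairs, one per pixel.
    Returns 1 for pixels that need per-sample shading, 0 otherwise."""
    return [1 if needs_per_sample(d, n) else 0 for d, n in pixels]
```

The lighting passes would then run once with the stencil test set to 0 (per-pixel shader) and once with it set to 1 (per-sample shader).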
My temporal SRAA implementation does not require per-sample shading; the scene is rendered as usual and the current and previous frames are used as input to the resolve shader. The resolve shader, however, is quite expensive. TSRAA works by matching an MSAA-resolution ID buffer to the shaded color samples of the previous and current frame to reconstruct an MSAA-resolution shaded image. 8x TSRAA essentially gives us 2x temporally supersampled shading with 0 risk of ghosting and 8x MSAA edge quality, while still only requiring a per-pixel shaded scene. The thing is that the upsampling process involves checking the neighboring shaded pixels as well, in an attempt to find color data for each MSAA ID sample. This means checking the center pixel and the 4 closest pixels for both the current and previous frame, i.e. 8 ID samples are matched against 10 color samples. In addition, 10 transparency samples need to be taken and overlaid on top of each color sample, and finally there’s a motion vector being sampled for temporal reprojection to the previous frame. All in all, the shader does a total of 29 texture samples and a large amount of ALU operations. At 2560x1440 and 8x TSRAA, this resolve pass took ~4.2 ms, over a quarter of the entire frame budget.
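To make the ID matching a bit more concrete, here is a heavily simplified sketch of resolving one pixel. It assumes the candidates arrive as `(id, color)` pairs already sorted by priority (center pixels first, then neighbors), takes the first match for each ID sample, and uses a caller-supplied `fallback` color when nothing matches. The names, the fallback, and the first-match rule are assumptions for illustration; transparency overlay, reprojection, and any blending between frames are left out.

```python
def resolve_pixel(sample_ids, candidates, fallback):
    """Average one color per MSAA ID sample.

    sample_ids : list of the pixel's MSAA ID samples (8 at 8x)
    candidates : list of (id, (r, g, b)) shaded samples, highest priority first
                 (hypothetical layout: center pixels before neighbors)
    fallback   : (r, g, b) used when no candidate carries a matching ID
    """
    total = [0.0, 0.0, 0.0]
    for sid in sample_ids:
        color = fallback
        for cid, ccolor in candidates:
            if cid == sid:
                color = ccolor  # first match wins, so priority order matters
                break
        for i in range(3):
            total[i] += color[i]
    n = len(sample_ids)
    return tuple(c / n for c in total)
```

For a pixel straddling an edge, with half its ID samples on a red object and half on a blue one, this yields an even mix of the two colors, which is the MSAA-quality edge the technique is after.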
Since the resolve pass is so expensive, I looked for ways of reducing its cost. I observed that probably 95% of all pixels are trivial to resolve. If all 8 ID samples are identical, we are pretty much guaranteed to have a color sample for all IDs in the center pixel of the current frame, and possibly in the previous frame as well, though the latter is not guaranteed. The center samples are prioritized, so if a center pixel match is found the neighboring pixels aren’t used. I decided to try generating a stencil mask to find the pixels where all samples have identical IDs and process them with an optimized, much faster shader. The optimized shader only samples the current and previous center pixels, the transparency data of those two pixels and the motion vector for reprojection, i.e. only 5 texture samples instead of 29. Ghosting is prevented by checking if the previous frame has the same ID as the current frame. A few minutes of playing with Shader Analyzer gives the following stats:
- Full resolve shader: 705 instructions, 29 texture samples, 37 registers, 1.19 pixels/clock, slightly ALU limited
- Fast resolve shader: 47 instructions, 5 texture samples, 5 registers, 9.6 pixels/clock, heavily texture limited
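The fast path can be sketched roughly like this, again as CPU-side Python rather than shader code. `all_ids_identical` stands in for the stencil marking test and `fast_resolve` for the cheap shader. Averaging the current and previous colors with equal weight is an assumption on my part (the text only says the previous frame is used when its ID matches), and the transparency overlay and motion-vector lookup are omitted.

```python
def all_ids_identical(sample_ids):
    """Stencil marking test: True if the pixel qualifies for the fast resolve."""
    first = sample_ids[0]
    return all(s == first for s in sample_ids)


def fast_resolve(cur_id, cur_color, prev_id, prev_color):
    """Resolve a trivial pixel from the two center samples only.

    The previous frame's color is only used if its ID matches the current
    one, which is what rules out ghosting. The 50/50 blend is an assumed
    weighting, not something stated in the text.
    """
    if prev_id == cur_id:
        return tuple((a + b) / 2 for a, b in zip(cur_color, prev_color))
    return cur_color
```

Pixels failing `all_ids_identical` would be left for the full resolve shader in a second pass.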
Red pixels are pixels that had the expensive resolve shader run. Imgur murdered the quality though… q.q
Now, for this scheme to be faster, the combined cost of running the stencil mark pass, the fast resolve pass and the full resolve pass needs to be lower than the cost of simply running the full resolve shader on the whole screen. Performance results woooh!
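The break-even condition can be written down as a tiny cost model. This is a back-of-the-envelope sketch that assumes each pass scales linearly with the number of pixels it touches, which real GPUs only approximate. The function and parameter names are made up for the example, and `fast_ms` and `full_ms` mean the cost of running that shader over the whole screen.

```python
def split_resolve_cost(mark_ms, fast_ms, full_ms, complex_fraction):
    """Estimated cost of mark + fast + full passes.

    complex_fraction is the share of pixels (0..1) that need the full shader.
    Assumes pass cost is proportional to the number of pixels processed.
    """
    return (mark_ms
            + (1.0 - complex_fraction) * fast_ms
            + complex_fraction * full_ms)


def break_even_fraction(mark_ms, fast_ms, full_ms):
    """Share of complex pixels at which the split scheme costs exactly as much
    as running the full shader everywhere. Below this, the split wins."""
    return (full_ms - mark_ms - fast_ms) / (full_ms - fast_ms)
```

With purely hypothetical numbers such as `mark_ms=0.5`, `fast_ms=0.5` and `full_ms=4.5`, the split scheme stays ahead as long as fewer than 87.5% of the pixels need the full shader.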
Before:
After:
The time of the resolve pass went from ~4.2 ms to a combined cost of ~2.7 ms. The performance gain depends on the scene of course, but this was a fairly “noisy” scene with lots of pixels that required the full resolve pass. In many scenes, the sky will take up a large part of the screen, and those pixels become extremely cheap. More often than not, the full resolve pass drops to under 1 ms. I have yet to find a scene that actually became more expensive due to the overhead of the stencil marking etc., so I will go with it permanently.