Temporal Subpixel Reconstruction Anti-Aliasing

TL;DR: http://screenshotcomparison.com/comparison/95587/picture:0

I've implemented a new anti-aliasing technique in We Shall Wake. It's based on SRAA, Subpixel Reconstruction Anti-Aliasing, but I've extended it with a temporal component. As far as I know, no one has done this before, and the result is surprisingly good. For those who don't care about the implementation, there are pretty pictures at the end. =3

Some relevant background

FXAA (Fast approXimate Anti-Aliasing) is a really cheap anti-aliasing filter that has become quite popular lately. For its super-low performance cost and easy integration into existing game engines, it does a decent job, but it has some heavy limitations. FXAA works by analyzing the colors on the screen to find sharp jagged edges, then blends pixels together to smooth them out. Its implementation is extremely clever and requires very few instructions and texture samples to do its job, hence it's really fast. Although it can eliminate some of the most glaring artifacts of rasterized graphics (staircase jaggies), it is still limited to the information present in the colors of the frame. A triangle that is too thin to be rasterized will not be rendered in the first place, and FXAA cannot do anything to reconstruct that information.

Temporal supersampling (TSSAA) is a technique that exploits the temporal coherence of the screen. By assuming that each rasterized frame won't change very much from the previous one, it makes sense to combine the previous and current frames to achieve a better rasterized approximation of the geometry at hand. By rendering every other frame with a sub-pixel offset, we can often double the coverage information we have to work with. In optimal cases the quality is equal to 2x spatial supersampling. Obviously, this technique has some glaring drawbacks. The scene is not static between frames, so motion between frames introduces ghosting. There are techniques that can reduce this, but none that can eliminate it in all cases. The technique is therefore quite unpopular, despite its potentially massive gains for almost no performance cost at all.
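The idea is easy to demonstrate with a toy 1-D "screen" (this is just an illustrative sketch, not engine code): jitter every other frame by half a pixel and average the two frames, and a static edge gets twice the coverage resolution.

```python
# Toy 1-D temporal supersampling: 4 pixels, a ping-pong half-pixel jitter.

def jitter_offset(frame_index):
    """Alternate the sample position by half a pixel every other frame."""
    return 0.25 if frame_index % 2 == 0 else -0.25

def render(scene, frame_index):
    """Sample the scene at each jittered pixel center."""
    return [scene(x + 0.5 + jitter_offset(frame_index)) for x in range(4)]

def temporal_resolve(current, previous):
    """Blend the two jittered frames: equivalent to 2x supersampling
    when nothing moved between them."""
    return [(c + p) / 2 for c, p in zip(current, previous)]

# A static scene: an edge at x = 2.5 (covered to its left, empty to its right).
scene = lambda x: 1.0 if x < 2.5 else 0.0

frame0 = render(scene, 0)                     # samples at x + 0.75
frame1 = render(scene, 1)                     # samples at x + 0.25
resolved = temporal_resolve(frame1, frame0)
print(resolved)  # [1.0, 1.0, 0.5, 0.0]
```

Neither frame alone sees that pixel 2 is half covered; the temporally resolved result does.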

MultiSample Anti-Aliasing (MSAA) is the de facto standard of anti-aliasing. MSAA computes multiple coverage samples for each pixel in hardware, but only runs the fragment shader once per pixel. These samples can then be written to RAM very efficiently using compression to save bandwidth. The great thing about MSAA is that it provides a large amount of extra coverage information: 4x MSAA provides four times as much information about the triangles we rasterized, so we can approximate their shape much better. Sadly, MSAA has a number of large drawbacks. It requires a large amount of extra video memory and often carries a large performance hit. It's not easy to use with deferred lighting, where the performance hit is even worse, and for indie developers it's also extremely difficult and time-consuming to implement, often requiring a completely different rendering pipeline.

SRAA (Subpixel Reconstruction Anti-Aliasing) builds on the idea of FXAA and attempts to improve on it. SRAA still relies on blending pixels together to reduce aliasing, but the way it decides how to blend is completely different. SRAA involves rendering the scene twice. First the scene is rendered and shaded as usual (just like for FXAA). In the second pass, the scene is rendered again to an MSAA render target to produce extra coverage information. In the final resolve pass, for each coverage sample we look through the available color samples and pick the best possible candidate, effectively upsampling our non-MSAA color buffer to an MSAA color buffer. Although it is still limited to the same color information as FXAA, the way it detects edges is identical to MSAA. This means that it can't handle sub-pixel geometry in the scene, but it CAN handle sub-pixel motion. This is a massive improvement over FXAA for scenes with any movement, but one of the most glaring drawbacks of FXAA is still there.

TSRAA

Temporal SRAA extends SRAA with a temporal component. The problem with SRAA is that it is limited by the amount of color information available in the non-MSAA color buffer. Temporal supersampling can double the amount of information we have access to essentially for free. The advantage of combining SRAA with temporal supersampling is that SRAA already gives us a way of identifying which color sample to fetch for a given coverage sample, so we can avoid ghosting by only sampling the previous frame if we are sure that we're sampling the exact same triangle as the one in the current frame. There are still cases where minor ghosting can occur, but together with the standard ghosting reduction techniques, it can be reduced to unnoticeable levels.
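The history-rejection rule boils down to a single comparison per pixel. A minimal sketch (illustrative names, scalar "colors" for brevity), assuming both frames carry a per-pixel triangle ID:

```python
def temporal_blend(cur_color, cur_id, prev_color, prev_id):
    # Reuse the previous frame's shading only when it provably came from
    # the same triangle; otherwise fall back to the current sample alone.
    if cur_id == prev_id:
        return 0.5 * (cur_color + prev_color)  # 2x temporal supersampling
    return cur_color  # no trusted history -> no ghosting

print(temporal_blend(1.0, 42, 0.5, 42))  # same triangle: blended, 0.75
print(temporal_blend(1.0, 42, 0.5, 7))   # different triangle: 1.0, no ghost
```

Disoccluded or newly revealed pixels simply fail the ID test and degrade to the current frame's sample instead of smearing stale colors.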

Screenshots

To ease comparison of the "subtle" effects of anti-aliasing, I have uploaded them to ScreenshotComparison. These scenes have been rendered at 1/4 resolution to better show the effect of the anti-aliasing. Note that this makes TSRAA less effective, as every triangle effectively covers far fewer pixels, making it harder to reconstruct the coverage of the scene.

NOTE There are 3 different screenshot comparisons in there!!! http://screenshotcomparison.com/comparison/95587/picture:0

Performance

Performance of this technique is excellent, as it does not require additional shading compared to no anti-aliasing. The only additional work required is the second pass (which of course can be prohibitive if the cost of processing each vertex twice is high) and the resolving pass, which I’ve managed to optimize quite a bit compared to the reference implementation.

[tr][td]Technique[/td][td]FPS[/td][td]Frame time[/td][/tr]
[tr][td]No anti-aliasing[/td][td]148 FPS[/td][td]6.76 ms[/td][/tr]
[tr][td]FXAA[/td][td]139 FPS[/td][td]7.19 ms[/td][/tr]
[tr][td]4x SRAA[/td][td]124 FPS[/td][td]8.06 ms[/td][/tr]
[tr][td]4x TSRAA[/td][td]120 FPS[/td][td]8.33 ms[/td][/tr]
[tr][td]8x TSRAA[/td][td]105 FPS[/td][td]9.52 ms[/td][/tr]

We get a cost of around 1.57 ms for 4x TSRAA at 1920x1080. A very unprofessional comparison reveals that Battlefield 4, which implements deferred MSAA, runs at 73 FPS (13.70 ms) without MSAA and 59 FPS (16.95 ms) with 4x MSAA enabled, meaning that 4x MSAA costs 3.25 ms in that game, more than twice what my technique uses.
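These costs follow directly from the FPS numbers, since frame time in ms is just 1000 / FPS and a technique's cost is the frame-time difference:

```python
# Reproducing the cost figures from the FPS numbers in the tables above.

def frame_time_ms(fps):
    return 1000.0 / fps

tsraa_cost = frame_time_ms(120) - frame_time_ms(148)  # 4x TSRAA vs no AA
msaa_cost = frame_time_ms(59) - frame_time_ms(73)     # BF4: 4x MSAA vs none

print(round(tsraa_cost, 2))  # ~1.58 ms
print(round(msaa_cost, 2))   # ~3.25 ms
```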

Memory usage is also very low since the G-buffer does not require MSAA. Here's a comparison table of how much memory the different techniques use (or would use if I implemented them); the per-buffer figures are bytes per pixel multiplied by the resolution.
[tr][td]Technique[/td][td]G-buffer memory usage[/td][td]Additional memory[/td][td]Total at 1920x1080[/td][/tr]
[tr][td]No anti-aliasing/FXAA[/td][td]26 * resolution[/td][td]None[/td][td]51.4 MB[/td][/tr]
[tr][td]4x MSAA[/td][td]104 * resolution[/td][td]4 * resolution (for resolving)[/td][td]213.6 MB[/td][/tr]
[tr][td]8x MSAA[/td][td]208 * resolution[/td][td]4 * resolution (for resolving)[/td][td]419.2 MB[/td][/tr]
[tr][td]4x SRAA[/td][td]26 * resolution[/td][td]24 * resolution[/td][td]98.9 MB[/td][/tr]
[tr][td]4x TSRAA[/td][td]26 * resolution[/td][td]32 * resolution[/td][td]114.7 MB[/td][/tr]
[tr][td]8x TSRAA[/td][td]26 * resolution[/td][td]56 * resolution[/td][td]162.2 MB[/td][/tr]
4x TSRAA uses around half the memory of 4x MSAA. The gap is even bigger between 8x TSRAA and 8x MSAA.
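The totals in the table can be reproduced as (G-buffer bytes per pixel + additional bytes per pixel) × pixel count, reported in MB (MiB):

```python
# Checking the totals in the memory table above.
PIXELS = 1920 * 1080
techniques = {
    "No AA / FXAA": (26, 0),
    "4x MSAA":      (104, 4),
    "8x MSAA":      (208, 4),
    "4x SRAA":      (26, 24),
    "4x TSRAA":     (26, 32),
    "8x TSRAA":     (26, 56),
}
totals = {name: (gbuf + extra) * PIXELS / (1024 * 1024)
          for name, (gbuf, extra) in techniques.items()}
for name, mib in totals.items():
    print(f"{name}: {mib:.1f} MB")
```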

Quality analysis

The important part here is how the techniques handle sub-pixel details and sub-pixel motion. SRAA itself cannot handle sub-pixel details, but temporal supersampling can. As long as triangles are thicker than 0.5 pixels, TSRAA will accurately reconstruct them, and as long as this criterion is met, TSRAA can represent coverage at the given sample rate. When it comes to sub-pixel motion of triangle edges, 4x TSRAA has a precision equal to 4x MSAA.

Further work

  • More anti-ghosting measures.
  • Tone mapping of each coverage sample before resolving instead of afterwards.

This is very impressive! How difficult is this to implement compared to MSAA?

Not extremely difficult. Since I already have code to allow for multiple views, it was simple to add the second SRAA coverage pass to the pipeline inside an if-statement.

A few months ago I tried doing the TSRAA resolve last, after tone mapping and everything, for maximum quality, but this turned out to produce horrible ghosting on transparent features like particles, and selectively turning off the temporal component just made the per-frame jitter cause massive flickering. If you're just using normal SRAA this isn't a problem of course, but in my opinion the temporal component is all but necessary to get quality comparable to MSAA. In the end, I do the TSRAA resolve in the middle of the pipeline, at the same point where I would've done an MSAA resolve. I have one massive shader which upsamples my low-res transparency (transparency is rendered at half resolution and upsampled to full resolution using a few tricks). If TSRAA is enabled, it also takes in the SRAA coverage buffer and the previous frame, does the transparency upsampling per pixel and outputs the anti-aliased scene.

So to summarize:

a. Apply subpixel jitter to the projection matrix.
b. Render the scene as usual.
c. Render the scene again to MSAA GL_R16F + depth textures.
d. Light the scene as usual.
e. Postprocess as usual.
f. In the transparency upsampling pass, either upsample as usual or resolve TSRAA and upsample transparency using the depth samples from the MSAA pass.
g. Tone mapping, FXAA (if enabled), copy to screen.

MSAA would require completely rewriting both the shaders and the OpenGL code behind steps e and f, which is essentially half the engine.

EDIT: I’m gonna improve the quality a bit today. Stay tuned (or not >__>)!

that's very interesting stuff!

[quote=“theagentd,post:3,topic:51443”]
thats sort of what i am chasing. figured, if we paint transparent objects, it’s always the trouble-child. how do you handle that ? with MSAA it seems to be not that bad.

not sure if i understand you - what you describe in step b., grabbing the depth information, is required to remove halos ?

what do you actually store in the R16F buffer ?

I'm not entirely sure what you're talking about when it comes to transparency, but like I said, I never render transparent stuff to the G-buffer in the first place. Transparency is done at half resolution (1/4th the memory usage) and is then upsampled using a bilateral filter. For each full-resolution pixel, I check the 4 closest transparency pixels and pick the one(s) with the most similar depth.

My G-buffer essentially looks like this:

Texture 1: GL_RGBA8: Diffuse R, G, B,
Texture 2: GL_RGBA16F: Packed normal X, Y, roughness, specularity
Texture 3: GL_RG16F: Motion vectors X, Y
Texture 4: GL_RGBA16F: emissive R, G, B, primitive ID.

Texture 4 allows models to glow in the dark. The alpha value of texture 4 holds a semi-unique primitive ID used for SRAA. The texture is then used as the accumulation buffer during lighting, with light contributions being blended into it (the alpha is of course unmodified). During the SRAA pass, the scene is rendered again, but the only thing written out is the same primitive ID as the one written to the G-buffer. Then in the final resolve pass, I simply match the primitive IDs in the SRAA buffer against the primitive IDs in the G-buffer to get lots of generated color samples.
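Per pixel, the resolve amounts to this (a CPU-side sketch with illustrative names and scalar colors, not the actual shader): each coverage sample carries only an ID, and we look for a shaded neighborhood color sample from the same primitive.

```python
def resolve_pixel(coverage_ids, neighborhood):
    """neighborhood: list of (primitive_id, color) pairs from the shaded
    non-MSAA pass, with the pixel's own sample first. Each coverage sample
    is matched to a color sample from the same primitive; if none matches,
    we fall back to the pixel's own color."""
    own_color = neighborhood[0][1]
    colors = []
    for cid in coverage_ids:
        match = next((col for pid, col in neighborhood if pid == cid),
                     own_color)
        colors.append(match)
    return sum(colors) / len(colors)

# A pixel on an edge: 2 coverage samples hit primitive 7 (bright), 2 hit
# primitive 3 (dark), and both primitives have a shaded color sample nearby.
pixel = resolve_pixel([7, 7, 3, 3], [(7, 1.0), (3, 0.0)])
print(pixel)  # 0.5 -> a properly anti-aliased edge
```

When all coverage IDs are unknown (the sub-pixel geometry failure case discussed later in the thread), every sample falls back to the pixel's own color and the pixel is effectively unantialiased.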

The original SRAA paper also mentions that it's possible to instead store the normal and depth in the SRAA buffer and upsample by matching normals and depths instead of IDs, but this requires thresholds and MUCH more math in the shader, so I think I'm going to skip that. The result I get right now is excellent, and I think using normals and depth would simply trade one kind of artifact for another.

err, i’m thinking of simple transparent objects like windows or floating ui elements whatnots. do you use upsampling with softparticles and non-hard-edge objects in general ?

maybe i’m just confused with the upsample of the msaa gbuffer during SR-resolve. O_o.

i did some testing around with primitive-id's too, but didn't find a useful state yet. you're using the glsl builtin [icode]gl_PrimitiveID[/icode] ? writing the ID like (i <3 macros) …

#define write8BitPrimitiveID()  ( frag_primitiveID = float(gl_PrimitiveID & 255) / 256 )
#define write16BitPrimitiveID() ( frag_primitiveID = float(gl_PrimitiveID & 65535) / 65536 )

seems to be alright, …

but then - it’s per draw-call, so for instance : ID zero can be seen multiple times on the screen, which could overlap, which would screw with the “edge detection”. no ?

i still dont know if i get the SRAA technique …

is it about :

  • draw non-msaa colors, shade, postprocess, etc.
  • draw again, msaa- depth/normal/ID/other G-buffer - but not color, no shading,
  • process non-msaa color again,
    – fetch msaa-gbuffer-data, compute kernel weights (?)
    – apply 3x3 kernel to non-msaa colors (?)

:expressionless:

about the temporal component, have you read about https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_sample_locations.txt yet ?

You’re rendering windows and UI to your G-buffer? o___O I’m talking about stuff like particles, transparent streak effects, etc.

@PrimitiveID
You can do a LOT better than that. First of all, I modify my IDs based on a per-object value to randomize them a lot better. Secondly, 16-bit floats are both signed and have floating-point precision, so the best way of doing this seems to be to take advantage of the sign bit and also distort the values using exp2() to adapt to the dynamic precision of floating point values.

@SRAA implementation

[quote=“theagentd,post:7,topic:51443”]
well, yes and no. generally no, tho’ sometimes you can achieve some very psychedelic effects when messing with G-buffers like that. but no, i render transparent stuff on top of solids, not at lower resolution.

sourcing the primitive-ID from outside seems to be the way to go i guess. was wondering if you got the buildin variable to be any useful for that matter.

i’ve read the paper, just doesn’t light my bulb yet. :emo:

I found a potentially optimal way of storing the primitive ID. It's possible to abuse the GLSL packing functions to do this.


float value = unpackHalf2x16(objectID + gl_PrimitiveID).x;

This compiles down to just a couple of assembly instructions. In theory it does the following:

  1. Pick out the first 16 bits (the AND instruction).
  2. Interpret these 16 bits as a float value and convert this value to 32-bit.
  3. When the value is written out to the 16-bit render target, the value is converted back to 16-bit, presumably without any loss in information.

This should guarantee that all 16 bits in the render target are used optimally, but I’m not sure if this happens in practice.
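Whether the round trip really is lossless can at least be checked on the CPU (a sketch assuming IEEE 754 binary16/binary32 conversions, which is what struct's 'e' format uses): every non-NaN 16-bit pattern should survive a half → float → half round trip bit-exactly.

```python
# Round-trip every 16-bit pattern through float32 and back.
import struct

def roundtrip(bits16):
    (h,) = struct.unpack('<e', struct.pack('<H', bits16))  # bits -> half -> float
    return struct.unpack('<H', struct.pack('<e', h))[0]    # float -> half -> bits

lossless = 0
for bits in range(65536):
    exponent, mantissa = (bits >> 10) & 0x1F, bits & 0x3FF
    if exponent == 0x1F and mantissa != 0:
        continue  # skip NaN payloads, which conversions need not preserve
    if roundtrip(bits) == bits:
        lossless += 1
print(lossless)  # 63490 = 65536 - 2046 NaN patterns: all non-NaNs survive
```

Whether the GPU's ROPs do the float-to-half conversion this cleanly is the open question; the arithmetic itself is safe.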

that sounds very nice! required [icode]#version 420[/icode] tho’. i hope i can get away with a 8-bit ID buffer.

too bad integer-textures are not always up to the msaa-sample-count of float/depth-textures :frowning: (laptops usually).

edit

another way could be

float id = uintBitsToFloat( (gl_PrimitiveID << 15) & 0x007FFFFFu | 0x3F800000u ) - 1.0;     //  0.0 .. 1.0 range

or

float id = uintBitsToFloat( (gl_PrimitiveID << 15) & 0x007FFFFFu | 0x3F800000u ) * 2 - 3.0; // -1.0 .. 1.0 range
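The bit trick in these two variants can be sanity-checked on the CPU by reinterpreting the bits the same way uintBitsToFloat does (a Python sketch; note that the << 15 shift leaves only the low 8 bits of gl_PrimitiveID inside the 23-bit mantissa, so this is effectively an 8-bit ID):

```python
# CPU check of the mantissa-packing trick (first variant, 0.0 .. 1.0 range).
import struct

def id_to_float(prim_id):
    # Put the ID into the mantissa of 1.0: result lies in [1.0, 2.0).
    bits = ((prim_id << 15) & 0x007FFFFF) | 0x3F800000
    (f,) = struct.unpack('<f', struct.pack('<I', bits))  # uintBitsToFloat
    return f - 1.0  # remap [1.0, 2.0) -> [0.0, 1.0)

print(id_to_float(0))    # 0.0
print(id_to_float(255))  # 255/256 = 0.99609375
# id 256 wraps back to 0.0, confirming only 8 bits of ID survive the shift.
```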

You really want the primitive ID to differ per primitive, so hardcoding it isn’t an option.

I forgot to mention that adding

#extension GL_ARB_shading_language_packing : require

makes it work on earlier GLSL versions on all vendors (even Intel o___o) as they all support it on their OGL3 GPUs and up.

I've identified a massive ghosting problem with the way I currently do things, and I have a fix that I'm gonna implement.

First I’ll explain a bit more about how SRAA works, with pictures! =D

These are the coverage samples the rasterizer generates when MSAA is disabled.

If we enable 4xMSAA, we instead get the following pattern. As you can see, the rasterizer tests 4 points in a clever pattern to improve the quality, and as long as any one of these 4 samples pass, the pixel will be shaded and the result will be written to all these samples.

How does 4xSRAA work? SRAA first renders the scene normally without MSAA, which does a coverage test at the center of the pixel (the dark yellow dot at the center), and then in the second SRAA pass it uses 4xMSAA, which runs 4 new coverage tests (the 4 white dots). These 4 samples store no color at all. They only contain a triangle ID that can be matched to a nearby color sample from the first pass.

TSRAA adds temporal anti-aliasing. This means that we get the color sample from the previous frame to play with too! By cleverly jittering each frame so that each triangle ends up at slightly different places, we can actually get two color samples at different locations (assuming the camera is stationary). 4xTSRAA therefore looks like the following picture. Notice that there are now two color samples at separate places in the pixel.

So what’s the problem?

The whole point of TSRAA is to be able to provide ghosting-free temporal supersampling in addition to SRAA. The problem occurs when an extremely finely tessellated model is being rendered. When such an object is far away, there's a risk that a different triangle covers every single sample 4xTSRAA takes (both color samples and all 4 coverage samples). In that case there's no good solution, since the coverage samples refer to triangles we have no color information for. We end up with a choice between two bad things.

  1. We can use only the current frame's color sample, but this reveals the per-frame jitter we're doing and results in flickering.
  2. We can use both the current and previous frame's color samples, but this can result in blatantly obvious ghosting.

This is the case where both SRAA and TSRAA completely fail, and we actually get a worse result than we would've gotten with no anti-aliasing at all. I modified my resolve shader to output a bright red color instead of picking either of the two options. The result was horrifying.

The scene is filled with pixels that either flicker or introduce ghosting. The overtessellated player models are the worst culprits, but small details in the distance can introduce the same problem.

Sounds bad. What’s the solution?

I actually stumbled upon the solution when I played around with the number of SRAA samples. 4xTSRAA was actually the worst-case scenario. 8xTSRAA has far fewer red pixels, as there are more coverage samples in total, so the chance that all 8 refer to unknown triangles is much lower. However, the solution lay with 2xTSRAA. With 2xTSRAA, there are no red pixels at all. The reason for this was actually a big coincidence: my two color samples ended up at exactly the same positions as my two coverage samples. Since the color and coverage samples have identical positions, 2xTSRAA is actually nothing more than temporal supersampling with the second MSAA pass being used to ensure that the previous frame had the same triangle ID.

So in essence, we need to ensure that our color samples are also backed by MSAA coverage samples to ensure that we can always at least match one or two of these. Now, I’m fairly sure that on OpenGL 3 hardware there’s no way to change where MSAA places its samples (it’s vendor dependent), but we can change the jitter to align our color samples with the closest MSAA coverage sample!

For 4xTSRAA we’d do the following.

For 8xTSRAA the following would be one solution:

With these changes, 4xTSRAA has the following quality properties:

  • When the triangle has color samples in both the previous and current frame, we get the quality of 2x temporal supersampling for shading and 4x MSAA for edges.
  • If the triangles are extremely small or thin, the coverage samples do not have enough color samples and quality drops to 2x temporal supersampling only.
  • Temporal supersampling is only used where it's guaranteed not to cause ghosting, so no ghosting will occur.

Now it’s time to try implementing this! xD

The jitter modification for color sample position snapping worked well, but there seem to be some minor precision problems at times. 2xTSRAA and 4xTSRAA work like a charm, but 8xTSRAA causes some very rare pixels to turn "red". I'm unsure what's causing this, but the problem never appears when the camera is static, so it doesn't pose a real problem: I can simply fall back to the current frame's sample, since the jitter isn't visible for isolated pixels like these while the camera is moving.

In addition, I’ve added a new feature to improve the quality for HDR. A problem in way too many games nowadays is that they do the resolve in the middle of the post-processing pipeline before tone mapping to avoid having to work on a multi-sampled image for expensive post processing. I need to do the resolve quite early as well, since I want to apply temporal supersampling before the transparent effects to avoid ghosting there. Since I do the resolve before the final tone mapping, there are problems where high-contrast edges meet.

Example:
We have 4 samples with the following (monochrome) colors:

  1. 1000
  2. 0
  3. 0
  4. 0

We average these 4 values together and get (1000 + 0 + 0 + 0) / 4 = 250. The problem is that 250 still ends up being completely white after tone mapping, as the first sample brings up the average so much that we won’t get a nice gradient. If we had tone mapped each individual sample instead, we’d have gotten this:

  1. 0.99999… something.
  2. 0
  3. 0
  4. 0

The average of this is much better, as we get ~0.25 instead. However, this is extremely hard to implement as it would require running the entire post-processing pipeline (if doing MSAA) per sample instead of per pixel. I’ve never seen a real graphics engine implement this without sacrificing post-processing quality.
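The numbers above can be reproduced with a simple stand-in operator (Reinhard x/(1+x) here; the engine's actual tone mapping operator is likely different, but the effect is the same):

```python
# Average-then-tonemap vs tonemap-then-average on a high-contrast edge.

def tonemap(x):
    return x / (1.0 + x)  # simple Reinhard operator as a stand-in

samples = [1000.0, 0.0, 0.0, 0.0]

average_then_tonemap = tonemap(sum(samples) / 4)             # ~0.996: nearly white
tonemap_then_average = sum(tonemap(s) for s in samples) / 4  # ~0.250: a gradient

print(round(average_then_tonemap, 3))  # 0.996
print(round(tonemap_then_average, 3))  # 0.25
```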

Now, I came up with a trick that allows me to do the resolve before tone mapping while still maintaining an acceptable gradient in all but the most extreme cases. I've created a cheaper approximate tone mapping function that is similar enough to the one I actually use for the final tone mapping. When resolving, I run this function on each sample to tone map it; when resolving is done, I run the inverse of the tone mapping function on the result to restore an HDR value. Although this isn't perfect (bloom and other effects might be added on top of this value, messing up the gradient slightly), it improves the quality to such an extent that we actually get a gradient. In addition, the bloom and other effects added on top will actually mask the aliasing as well, so the result is always better than just averaging the HDR values together.
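A sketch of the trick, again using Reinhard x/(1+x) as a stand-in for the cheap approximation (its inverse is y/(1-y)); the real engine uses its own operator pair:

```python
# Tone map each sample with a cheap invertible approximation, average,
# then invert so that later passes still receive an HDR value.

def tonemap_approx(x):
    return x / (1.0 + x)

def tonemap_approx_inverse(y):
    return y / (1.0 - y)

def resolve(samples):
    mapped = sum(tonemap_approx(s) for s in samples) / len(samples)
    return tonemap_approx_inverse(mapped)  # back to HDR for bloom etc.

hdr = resolve([1000.0, 0.0, 0.0, 0.0])
# Final tone mapping of this value now lands near 0.25 instead of ~1.0,
# preserving the gradient across the edge.
print(round(tonemap_approx(hdr), 3))  # 0.25
```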

Here's a screenshot comparison of doing the temporary tone mapping when resolving versus simply averaging the HDR values together.
http://screenshotcomparison.com/comparison/95931

What is your heuristic for triangleID? Is it really unique per triangle? How about triangles sharing vertices?

No, it would be impossible to make it 100% unique, as I only have 16-bit precision for storing them and far more than 65536 triangles. I use two components. First, a constant index for each model instance (kept constant for the life of the instance, not just for one frame, to allow temporal filtering). Secondly, I use gl_PrimitiveID, which gives a unique (incrementing) value to each triangle being drawn without having to use a geometry shader. Add the two together, run modulo on it, and pack it into as much precision as you can get out of that 16-bit floating point value. You don't need to convert them back to ints for the comparison in the resolve shader either.

http://screenshotcomparison.com/comparison/98800
Added jittered camera to get 2x supersampling like results. Not yet done with triangleId and only using simple depth based rejection.