GPU blur performance analysis

Blurring is a very useful effect in games and is used in numerous postprocessing effects, but it can also be very expensive. Optimizing it as far as possible is therefore often important to allow for big blur kernels.

In this article I have tested four different blur algorithms in an attempt to help people pick the best one for their particular use case. To keep the test simple I have used a box blur, but all of these techniques can be used for Gaussian blurs as well, and all four yield the exact same results (bar floating point/render target rounding errors). The four techniques tested are:

  • Naive blur: The naive NxN blur shader applies the full NxN kernel directly. It does exactly N^2 texture samples, but is the most straightforward way of doing a blur. It only requires a single pass over the data.
  • Separable blur: The separable NxN blur shader splits up the NxN kernel into two N-wide blur passes. It does exactly 2N texture samples, each of the two passes doing N texture samples.
  • Naive linear blur: The naive linear NxN blur shader works very similarly to the naive NxN blur shader, but uses hardware linear filtering to read up to four values per texture sample. It therefore requires fewer texture samples than the naive NxN blur shader, ((N+1)/2)^2 in total, and still only requires a single pass over the data.
  • Separable linear blur: The separable linear NxN blur shader splits up the NxN kernel into two N-wide blur passes, using linear texture filtering to read up to two values per texture sample. It does exactly N+1 texture samples, each of the two passes doing (N+1)/2 texture samples.
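To make the sample counts above easier to compare, here is a minimal sketch that just evaluates the four formulas for a given kernel size. The class and method names are hypothetical (they are not part of the benchmark code), and the sketch assumes N is odd, as it is for the 3x3, 5x5, ... kernels discussed here.

```java
// Hypothetical helper, not part of the original benchmark: it only evaluates
// the per-pixel texture sample counts quoted in the list above.
final class BlurSampleCounts {
    private BlurSampleCounts() {}

    // Assumption: kernels are odd-sized (3x3, 5x5, ...), so (N+1)/2 is exact.
    private static void checkKernel(int n) {
        if (n < 1 || n % 2 == 0) {
            throw new IllegalArgumentException("kernel size must be odd and positive: " + n);
        }
    }

    /** Single pass, one fetch per texel: N^2. */
    static int naive(int n) {
        checkKernel(n);
        return n * n;
    }

    /** Two passes of N fetches each: 2N. */
    static int separable(int n) {
        checkKernel(n);
        return 2 * n;
    }

    /** Single pass, each fetch covers up to 2x2 texels: ((N+1)/2)^2. */
    static int naiveLinear(int n) {
        checkKernel(n);
        int perAxis = (n + 1) / 2;
        return perAxis * perAxis;
    }

    /** Two passes of (N+1)/2 fetches each: N+1. */
    static int separableLinear(int n) {
        checkKernel(n);
        return 2 * ((n + 1) / 2);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 9; n += 2) {
            System.out.printf("%dx%d: naive=%d, separable=%d, naive linear=%d, separable linear=%d%n",
                    n, n, naive(n), separable(n), naiveLinear(n), separableLinear(n));
        }
    }
}
```

Note that these are only fetch counts per output pixel; they say nothing about the cost of the extra pass, which is what the benchmarks measure.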

There are 3 main points to take into consideration when choosing which blur algorithm to use:

  • Kernel size: As the kernel size increases, the naive blurs become less viable, since their sample count grows with the square of the kernel size and they fall behind the better-scaling separable blurs.
  • Render target bit-depth: The bit-depth of the render target affects the performance of the blur, mostly due to a higher write cost. This heavily affects the overhead of doing two passes for the separable blurs, while the texture sample cost isn’t as heavily impacted thanks to the texture cache.
  • Bilinear hardware acceleration: Some render targets (read: 32-bit float texture formats) are not able to do full-speed linear filtering, so the algorithms that rely on linear filtering may perform worse for those render targets.
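Since the linear variants depend on one filtered fetch standing in for two neighbouring texels, it may help to see how the tap positions and weights for one 1D pass could be derived. The following is a rough CPU-side sketch under my own assumptions, not the benchmark's shader code: the class LinearTaps and all of its methods are hypothetical, sampleLinear is a stand-in for a GL_LINEAR fetch with clamp-to-edge, and the kernel is assumed to be odd-sized with positive weights. Two adjacent weights a and b are merged into one tap with weight a+b, placed b/(a+b) of the way from the first texel to the second.

```java
import java.util.Arrays;

// Hypothetical sketch: merges adjacent 1D kernel weights into linearly
// filtered taps, and emulates the filtered fetch on the CPU to check it.
final class LinearTaps {
    final float[] offsets; // tap positions in texels, relative to the centre texel
    final float[] weights; // combined weight of each tap

    private LinearTaps(float[] offsets, float[] weights) {
        this.offsets = offsets;
        this.weights = weights;
    }

    /** Box kernel of n equal weights summing to one. */
    static float[] boxKernel(int n) {
        float[] k = new float[n];
        Arrays.fill(k, 1f / n);
        return k;
    }

    /** Pairs up texels left to right; an odd kernel leaves one unpaired tap at the end. */
    static LinearTaps fromKernel(float[] kernel) {
        int n = kernel.length;
        if (n % 2 == 0) throw new IllegalArgumentException("kernel size must be odd");
        int radius = (n - 1) / 2;
        int taps = (n + 1) / 2;
        float[] offsets = new float[taps];
        float[] weights = new float[taps];
        for (int t = 0; t < taps; t++) {
            int i = 2 * t;
            float first = i - radius; // offset of the first texel of the pair
            if (i + 1 < n) {
                float sum = kernel[i] + kernel[i + 1];
                weights[t] = sum;
                // Shift toward the second texel in proportion to its weight.
                offsets[t] = sum != 0f ? first + kernel[i + 1] / sum : first;
            } else {
                weights[t] = kernel[i];
                offsets[t] = first;
            }
        }
        return new LinearTaps(offsets, weights);
    }

    private static int clamp(int i, int length) {
        return Math.max(0, Math.min(length - 1, i));
    }

    /** CPU stand-in for a linearly filtered, clamp-to-edge fetch from a 1D row. */
    static float sampleLinear(float[] row, float position) {
        int i0 = (int) Math.floor(position);
        float frac = position - i0;
        return row[clamp(i0, row.length)] * (1f - frac) + row[clamp(i0 + 1, row.length)] * frac;
    }

    /** Blur of texel x using (N+1)/2 filtered fetches. */
    float blurAt(float[] row, int x) {
        float sum = 0f;
        for (int t = 0; t < weights.length; t++) {
            sum += weights[t] * sampleLinear(row, x + offsets[t]);
        }
        return sum;
    }

    /** Reference blur of texel x using N unfiltered fetches. */
    static float blurDirect(float[] row, int x, float[] kernel) {
        int radius = (kernel.length - 1) / 2;
        float sum = 0f;
        for (int i = 0; i < kernel.length; i++) {
            sum += kernel[i] * row[clamp(x + i - radius, row.length)];
        }
        return sum;
    }
}
```

For a box blur every pair has equal weights, so each merged tap simply sits halfway between two texels. The same pairing applied in both dimensions at once gives the up-to-four-values-per-sample behaviour of the naive linear blur.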

Since the performance depends on the render target format/bit-depth, I have run separate benchmarks for three different bit depths. The benchmarks also include a baseline, which is the cost of simply copying the render target once. This can be thought of as the minimum overhead of an additional pass.

32-bit render targets (GL_R11F_G11F_B10F / GL_RGB8 / GL_SRGB8)

A relatively small 32-bit render target with full-speed bilinear filtering means that the overhead of the extra pass of the separable blurs is small, so the separable linear blur dominates. The only exception is for 3x3 blurs, in which case the naive linear blur shader is much faster. This is because both read the same total number of samples, four, but the separable linear blur requires two passes, which adds too much overhead compared to the cost of the texture samples.

64-bit render targets (GL_RGB16F)

For the bigger 64-bit render targets, the overhead of the additional pass for the separable blurs becomes greater, favoring the naive blurs more for smaller blur kernels. In this case, the naive linear blur beats the separable blurs for both 3x3 and 5x5 blur kernels. Linear filtering is still done at full speed on all major GPUs.

128-bit render targets (GL_RGB32F)

If you for some crazy reason find yourself in a position where you need to blur a 32-bit floating point format, here you go. For 32-bit floating point render targets, bilinear filtering is done at half speed. The naive linear blur shader can still extract some performance despite that, as it reads up to four values per sample (although it’s still slower than the others), but the separable linear blur shader suffers since it only reads up to two values per sample, leaving it slower than the separable blur shader. In the end, the naive blur shader wins for 3x3 kernels, while the separable blur shader wins at 5x5 and above. You’re probably much better off doing some fancy compute shader blur if you really need a 32-bit float blur.

Result matrix

Format           3x3            5x5                Larger kernels
32-bit format    Naive linear   Separable linear   Separable linear
64-bit format    Naive linear   Naive linear       Separable linear
128-bit format   Naive          Separable          Separable
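If you want to pick the algorithm at runtime, the result matrix could be encoded as a small lookup. This is only a sketch with hypothetical names (BlurChooser, Blur, choose), and it assumes the three result columns correspond to 3x3, 5x5 and larger kernels, which is how the sections above describe the winners. The results come from my benchmarks, so treat them as a starting point to measure against rather than a guarantee for your hardware.

```java
// Hypothetical lookup that mirrors the result matrix above.
// Assumption: the matrix columns are 3x3, 5x5 and anything larger.
final class BlurChooser {
    enum Blur { NAIVE, SEPARABLE, NAIVE_LINEAR, SEPARABLE_LINEAR }

    private BlurChooser() {}

    /**
     * @param bitsPerPixel render target size per pixel: 32, 64 or 128
     * @param kernelSize   odd kernel width N of the NxN blur, at least 3
     */
    static Blur choose(int bitsPerPixel, int kernelSize) {
        if (kernelSize < 3 || kernelSize % 2 == 0) {
            throw new IllegalArgumentException("kernel size must be odd and >= 3: " + kernelSize);
        }
        switch (bitsPerPixel) {
            case 32:
                // Full-speed filtering, cheap extra pass: naive linear only wins at 3x3.
                return kernelSize == 3 ? Blur.NAIVE_LINEAR : Blur.SEPARABLE_LINEAR;
            case 64:
                // The extra pass costs more, so naive linear holds on up to 5x5.
                return kernelSize <= 5 ? Blur.NAIVE_LINEAR : Blur.SEPARABLE_LINEAR;
            case 128:
                // Half-speed filtering: the non-linear variants win.
                return kernelSize == 3 ? Blur.NAIVE : Blur.SEPARABLE;
            default:
                throw new IllegalArgumentException("no benchmark data for " + bitsPerPixel + "-bit targets");
        }
    }
}
```

For example, BlurChooser.choose(64, 5) returns NAIVE_LINEAR, matching the 64-bit row of the matrix.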

Really interesting! Are you happy to share the benchmark code?

Sure. There are quite a few dependencies on my own utility classes (Texture2D, ShaderProgram, etc.), but they shouldn’t be too difficult to replace, I guess.

Java code: