Voxel Rendering Techniques

KaiHH · October 18, 2018, 8:07pm

I wanted to just find some time to do a quick write-up of all the rendering techniques I’ve been exploring the last month for voxel rendering and light transport for such scenes.

My approach to voxel rendering can be decomposed into two parts:

Solving light transport in the scene (how much light does each voxel receive (i.e. irradiance))
Actually efficiently rendering the voxels onto the screen

Let’s start with 1.:

I started out with irradiance caching techniques, basically digesting all the papers/presentations from: http://cgg.mff.cuni.cz/~jaroslav/papers/2008-irradiance_caching_class/ and others. This basically means, that light transport is not constantly recomputed for each eye ray when it hits a surface, but instead those computations are cached in some form of spatial data structure which is then indexed-into during rendering.
Since I only store irradiance on the surface of voxels and not at all possible (regular) positions within the scene, for my purposes, this data structure can actually be the voxels themselves.
First, I used a single sample per voxel side, but that quickly and obviously resulted in ugly light discontinuities between nearby voxels and very unrealistic (as in “not at all”) shadows at edges between two adjacent voxels. Even Minecraft looked better than that. So, to facilitate nice shade interpolation including soft shadows at edges between two voxels, I ended up with storing individual irradiance values at the four corners of each of the six faces of a voxel.
In my implementation, the irradiance is also not computed along with on-screen rendering, but is computed completely off-screen and then only sampled during rendering. Soon came the realization that there needed to be some form of LOD (level of detail) to avoid computing all those irradiance sampels at equal frequencies. For a typical 256x256x64 voxel cluster, which in the end could have around 32.000 visible voxels (yes, in the voxel generation phase I already removed voxels having other adjacent voxels at all sides), that meant computing 32.000 * 6 * 4 = 768.000 irradiance samples. And that is only one sample per vertex per face per voxel. A believable scene needed around 32 samples per vertex, so a total of around 25 million samples, and it was still very noticeable when another sample accumulated to that over time.
But clearly, there was no necessity to spend equal amounts of compute resources to all voxels alike. We could sample voxels nearer to the viewer more frequently. So, this is what I did as well. But still, this was still too much work per frame and required some seconds of pre-processing where only irradiances were sampled for the voxel cluster the player would start in.
Next came other LOD mechanisms which simply distributed the cost of a single dispatch compute over more frames:

sampling only every N-th voxel
sampling only a single voxel face
sampling only visible faces (faces not adjacent to another voxel) (I will come later to why this actually is not a performance improvement when done naively!)
While this decreased frame time significantly, it also effectively slowed down irradiance sampling and made changes in lighting very noticeable whenever new samples accumulated slowly. We needed other ways to reduce computation. But more about that later.

Let’s just quickly continue with part 2. from above: (actually rendering the voxels):

In addition to optimizing irradiance caching there was also the other part of rendering, which was… well… actually rendering the voxels. This started off with completely ray-traced cubes using a AABB-based bounding volume hierarchy with Morton Code sorting as spatial acceleration structure, and then switching to a kd-tree with surface-area heuristics which improved rendering time.
Then there came the jcgt paper “A Ray-Box Intersection Algorithm and Efficient Dynamic Voxel Rendering” which made me change the method of generating primary rays from ray tracing to a hybrid “point-sprite rendering and AABB/ray intersection test rendering” which yet again improved rendering performance. Other experiments have been done using the geometry shader to generate the convex hull polygon of a screen-space projected box, which again improved rendering times for close voxels.
Since rendering now was done using a rasterization algorithm, there were other typically rasterizer-friendly optimizations that became applicable now, such as frustum culling (which is really easy to do when already having a spatial acceleration structure) and occlusion culling.
The former I implemented with the kd-tree and glMultiDrawArrays. The latter is more involved and I started with a Hi-Z occlusion culling pass and a GPU-driven rendering approach using glMultiDrawArraysIndirectCount, using a compute shader to perform hierarchical culling driven by the kd-tree nodes using the previously generated Hi-Z mip chain to do occlusion culling. That part is currently still in work.

Now let’s get quickly back to the first rendering part: irradiance caching. Distributing the work to sample irradiance for all voxels over multiple frames and doing that more often for nearby voxels is all good and well, but still too much to do. The optimal solution would be to determine the actual set of visible voxels, which by experimentation in a typical scene with the player standing on the ground can be around 10 - 6.000 voxels, depending on the view distance and the terrain. That on average is far better than the initial potentially visible 32.000 voxels of the whole voxel cluster. So, how do we determine the set of visible voxels?

The idea is to use the rasterizer and a simple render pass (at possibly half the resolution - assuming a visible voxel needs to conservatively account for at least 4 fragments) to first render the scene with depth write/testing enabled and using a simple uint render buffer to store the ID of the voxel for each fragment. Z-testing will ensure that only the IDs of the closest visible voxels will remain in the render buffer. Then, we do a full-screen quad render pass with a simple fragment shader sampling the voxel ID from the render buffer/texture and writing a flag/byte into a Shader Storage Buffer Object (SSBO) at the position/index of the sampled ID. As the voxel ID we simply use the gl_VertexID to avoid having to drag in an additional uint vertex attribute for each voxel.
Since the SSBO we write to only needs to be as large as there are actual voxels in the voxel cluster currently being processed, we only need roughly a 32KB SSBO.
After the full-screen quad pass, and a memory barrier of course, we end up with an SSBO which contains a 1 at every index of a visible voxel and a 0 everywhere else. Here you can ask: Wait, why do we need a full-screen pass to finally sample and write the visible voxels’ IDs. Can’t we do that in the initial voxel rendering fragment shader by using early-fragment tests and letting the fragment shader only execute for visible voxels? Actually, no. First, we would need to sort the voxels from front to back, which would be doable, and then the fragment shader would only execute for actually visible fragments. But sadly, the point sprite rasterization approach needs to kill/discard fragments and also write to gl_FragDepth, so early-fragment tests cannot be enabled. This is also sad in particular because later we cannot use the Turing extension GL_NV_representative_fragment_test to have the fragment shader only execute for a single voxel fragment. So, we end up with a final full-screen quad pass which fills the SSBO with the visibility flags.

This can then be used to efficiently answer the question: “Is voxel with ID X visible or not?”
But we need an efficient answer to another question: “What is the set of voxels that are visible?”
Why do we pose the question this way? We could just loop over all 0…N voxel IDs and ask the first question at every such index.
This has to do with how we are going to dispatch a compute grid to only sample irradiance for those voxels. While we could launch a compute grid over all voxels in the cluster and branch out for invisible voxels (which we query from the SSBO using the first question), this will in the worst case not improve performance in any way compared to the “just compute irradiance for all voxels”, because of reduced occupancy of the thread groups. If for a warp of 32 threads only 1 thread is used to compute one visible voxel, due to the lock-step SIMT execution, the thread group will finish in the same time as it would, had all 32 threads been computed. And since we blocked 31 of the 32 threads of the warp from doing other useful work, such as computing irradiance for other actually visible voxels, this approach is bad.

A better way is to collect the IDs of the voxels in a compact way into a buffer. So, we need to filter our list of all voxels to only retain the visible voxels. Given a “predicate” list of flags that tell whether a voxel at index X is visible (in our SSBO), and given the list of all voxels, filter the list of voxels to only contain those voxels for which predicate(X) == true/1.
There is a popular algorithm that allows us to do this in parallel, which is called Parallel Prefix Sum or simply “Scan”. For more information on that and the article which I used to base my implementation off of, see: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
Actually, my implementation is an adaptation of the ideas mentioned in the section “39.2.5 Further Optimization and Performance Results” optimized for filtering with 8-bit inputs (loading 32 elements per thread instead of only 8 as has been done by David Lichterman) together with the ideas mentioned in “39.2.4 Arrays of Arbitrary Size” to allow computations over multiple thread groups.

Parallel Prefix Sums can be used to implement filtering and stream compaction. I use two compute shaders for that where the predicates buffer is filled by the mentioned full-screen rendering pass writing a byte to the SSBO at the index of a visible voxel. The first compute shader pass then computes the prefix sum for all thread groups, writing the partial per-thread-group results into a temporary SSBO and writing the total per-thread-group sum into another temporary SSBO. The second compute program then performs another prefix sum over the total per-thread-group sums and writes the final filtered voxels at their corresponding positions into an output result SSBO.
This result SSBO then contains only the visible voxels tightly packed. This allows us to launch a compute grid over only those visible voxels!

This might have been a pretty quick fly-over over the ideas and algorithms so far, and alot of important stuff hasn’t even been mentioned, such as:

how to actually allocate and order the voxels for efficient glMultiDrawArrays rendering when traversing the kd-tree
how to make use of vendor-specific OpenGL extensions to improve memory footprint and shader runtime
how I actually used directional irradiance cached Hemispherical Harmonics to implement view-dependent specular lighting
how to do ray tracing against a kd-tree without a stack
how to build a kd-tree with surface-area heuristics
why atomic bitwise operations are not a good idea to save bandwidth
I will try to detail this more in followup posts.

landocalrissian · October 18, 2018, 10:28pm

I’m wondering if you have looked into voxel splat rending similar to what Media Molecule has done in Dreams?

KaiHH · October 20, 2018, 12:27pm

Thanks for mentioning those slides. Was certainly an interesting read.

Yesterday I spent some time to massively improve performance of the k-d tree tracing shader making >1000 FPS (< 1ms./frame) 4K rendering possible (only the tree traversal and the box/ray intersector, without any shading currently). Here are some things which will immediately improve your shader too :

always combine multiple scalar SSBO struct members into vectors and use swizzle access (this was the biggest perf. improvement!)
never pack a big struct with all the attributes you are probably going to need in your spatial acceleration structure into a big interleaved SSBO, but separate the attributes by frequency of use (depends on the algorithm of course) into different SSBO bindings (my voxels and k-d tree currently consists of six individual SSBOs which contain information accessed in different phases at varying frequencies of the tree traversal)
always check whether different ordering of SSBO struct members can improve overall struct size due to better offsets for memory alignments
only use 8/16-bit GLSL types for storage (as SSBO struct member types) and never for actual computations in the shader where frequent costly casts then become necessary

Here is some 4K video showing some debug colors of the k-d tree traversals (red channel = number of “ropes” followed, green channel = number of descends from internal node to leaf nodes). Further videos will now be in 4K and that is somewhat my target resolution for high-end systems:

v0R49NLx7mM

Next I will have a look at the NV_gpu_program5 assembly text for the shader and try to reorder/optimize the GLSL code for better instruction usage before actually going down the road and use assembly text myself.