Performance dropped due to the overhead of the per-tile computations, which I still need to optimize. For simplicity I currently recalculate all the per-tile data for every pixel in the tile, so the constant per-tile overhead is 256x as high as it should be.
Noticed that you can approximate high-frequency AO using the normalized normal map's z length. This is basically free. Combined with baked vertex AO and temporally smoothed SSAO, I have three different frequencies of AO all playing nicely together.
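The gist of it, as a little Java-shaped sketch rather than actual shader code (the function names and the multiplicative blend are just my shorthand for the idea):

```java
// Sketch only: "high-frequency AO" read straight off the tangent-space normal-map sample.
// The z of the renormalized normal is 1.0 on flat texels and drops on steep bumps, which
// works as a crude occlusion term. (The exact term and blend may differ in the engine.)
public class CheapAO {

    public static float highFrequencyAO(float nx, float ny, float nz) {
        float len = (float) Math.sqrt(nx * nx + ny * ny + nz * nz);
        return len > 0.0f ? Math.max(0.0f, nz / len) : 1.0f;
    }

    // One simple way to combine the three AO frequencies is a plain product.
    public static float combinedAO(float highFreqAO, float bakedVertexAO, float ssao) {
        return highFreqAO * bakedVertexAO * ssao;
    }
}
```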
This is just a test program for my engine, but it's being used by WeShallWake.
My engine is using deferred shading. That means that lighting is split up into two passes. The first pass renders all the models to write the diffuse color, normal, specular intensity, roughness and glow color into a massive G-buffer (4 render targets + a depth buffer). In the second pass, I figure out which pixels are affected by which lights, read the lighting data stored in the first pass for those pixels and accumulate lighting for each pixel.
Traditionally the second pass has been done by rendering light geometry and using the depth bounds test to cull pixels that are too far away from the light to be affected by it. For a point light, I render a sphere. This is what the light geometry looks like. You can clearly see where the light sphere is intersecting the world.
This has two main problems. The first has to do with the depth test. Consider the worst case scenario where you're looking straight up into the sky and standing inside the light volume. The volume is covering the whole screen, but not a single pixel will pass the depth test. That's 2 million wasted depth tests for a 1920x1080p screen. Basically it doesn't matter if there is actually any geometry intersecting the light volume; you still have to run the depth test to figure that out, which although fast isn't free. Note that the screenshot above does not show the pixels that failed the depth test, only the pixels that actually ran light computations. Around 2/3rds of the pixels were wastefully filled and failed the depth test. This can lead to extremely bad performance, especially when you're standing inside multiple lights, regardless of how small they are.
The second problem has to do with overdraw. The lighting data needs to be read from the G-buffer and unpacked every time a light covers a pixel and passes the depth bounds test. I also need to reconstruct the eye space position for each pixel. If 10 lights affect the same pixel, this data is read and unpacked 10 times, again leading to possibly bad performance. It'd be better if we only had to unpack this data once for all lights. What we're essentially doing is this:
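In rough Java-flavored pseudocode (the helper names are made up for illustration, not actual engine code):

```java
// Classic light-volume deferred shading: the outer loop is over lights,
// so the G-buffer is read and unpacked once per light per covered pixel.
for (Light light : lights) {
    for (Pixel pixel : pixelsCoveredBy(light.volume)) {
        if (!depthTestPasses(pixel)) {
            continue;                                  // wasted work when the test fails
        }
        GBufferData data = unpackGBuffer(pixel);       // re-read and re-unpacked for EVERY light
        Vec3 eyePos = reconstructEyeSpacePosition(pixel);
        accumulate(pixel, shade(light, data, eyePos));
    }
}
```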
The solution to these two problems is tile based deferred shading. Instead of rendering light volumes, we split up the screen into 16x16 tiles. We upload a list of lights to the GPU and have the GPU compute a frustum for each tile based on the minimum and maximum depth of that tile. We can then check which lights affect a certain tile and only compute lighting for those lights. The resulting light overdraw looks like this:
You can see that it detects the overall same light shape as the light volume technique, so why is this more efficient? First of all, we test visibility per tile, not per pixel. Before, we ran a depth test per covered pixel, which could be anywhere from 0 to 2 million pixels, with the worst case scenario happening when you stand inside the light volume (a very common case). With tiles, we do a constant number of tests, 1920x1080/(16x16) = 8100 tests per light, regardless of how much of the screen the light covers. It's clear that the worst case scenario is much better here. Secondly, we only need to unpack the pixel data once when loading in the tile. The pseudocode now looks like this:
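Again in rough Java-flavored pseudocode with made-up helper names:

```java
// Tiled deferred shading: lights are culled once per 16x16 tile, and the
// G-buffer is read and unpacked only once per pixel.
for (Tile tile : tiles) {                                      // 8100 tiles at 1920x1080
    Frustum frustum = buildTileFrustum(tile, minDepth(tile), maxDepth(tile));
    List<Light> visibleLights = cullLights(lights, frustum);   // per-tile culling
    for (Pixel pixel : tile.pixels()) {
        GBufferData data = unpackGBuffer(pixel);               // unpacked ONCE per pixel
        Vec3 eyePos = reconstructEyeSpacePosition(pixel);
        Vec3 result = Vec3.ZERO;
        for (Light light : visibleLights) {                    // inner loop over lights
            result = result.add(shade(light, data, eyePos));
        }
        write(pixel, result);
    }
}
```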
It's clear that this is much more efficient. Fine-grained per-pixel culling of lights isn't necessary, so culling them per tile gives much better performance. For the actual lighting, the inner loop is now over lights, allowing us to read and unpack the pixel data once and reuse it for all lights. The 16x16 tiles are also a very good fit for GPU hardware, as GPUs process pixels in groups. This technique was first used for Battlefield 3, and requires OGL4 compute shaders for an efficient GPU implementation.
EDIT: The overhead of computing the tile frustums is currently around 3 ms for a 1920x1080p screen, but it is currently doing 256x as much work as it has to. Hopefully I can get it down to around 0.5 ms. Despite the 3 ms overhead, I can create scenes with lots of lights where tiled deferred shading is much faster than light volumes: 12.5 ms vs. 17.0 ms.
Alright, I'm starting to understand… But what I don't really get is this saving data onto the graphics card stuff. I know you save the depth buffer, normals, specular intensity, and diffuse color into a G-buffer. I just don't understand what data is actually going into that buffer. (Like images? Is it pixel data? Just bits?)
Also, where the GPU tests whether a light falls inside a tile's frustum, is that done through a custom shader? How do you tell a GPU to compute that? If there's a tutorial (preferably a book or lecture), please let me know.
The data is stored in textures. I simply set up MRT rendering (multiple render targets) using a framebuffer object to render to 4 color textures plus a depth texture as depth buffer. The texture layout looks like this:
I then just read this data using texture fetches in the lighting shader. Note that the normal is packed using sphere mapping so that it only takes two values instead of three.
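As a rough LWJGL-style sketch (the texture formats and what goes into each attachment below are placeholders, not necessarily my exact layout):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

import org.lwjgl.BufferUtils;
import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL14;
import org.lwjgl.opengl.GL20;
import org.lwjgl.opengl.GL30;

/** Minimal MRT G-buffer setup sketch: 4 color textures plus a depth texture on one FBO. */
public class GBufferSetup {

    public static int createGBuffer(int width, int height) {
        int fbo = GL30.glGenFramebuffers();
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fbo);

        // 4 color attachments, e.g. diffuse, packed normal + roughness, specular, glow.
        int[] internalFormats = { GL11.GL_RGBA8, GL30.GL_RGBA16F, GL11.GL_RGBA8, GL11.GL_RGBA8 };
        IntBuffer drawBuffers = BufferUtils.createIntBuffer(4);
        for (int i = 0; i < internalFormats.length; i++) {
            int tex = createTexture(width, height, internalFormats[i],
                    GL11.GL_RGBA, GL11.GL_UNSIGNED_BYTE);
            GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_COLOR_ATTACHMENT0 + i,
                    GL11.GL_TEXTURE_2D, tex, 0);
            drawBuffers.put(GL30.GL_COLOR_ATTACHMENT0 + i);
        }
        drawBuffers.flip();
        GL20.glDrawBuffers(drawBuffers);

        // Depth texture, so the lighting pass can reconstruct the eye-space position from depth.
        int depthTex = createTexture(width, height, GL14.GL_DEPTH_COMPONENT24,
                GL11.GL_DEPTH_COMPONENT, GL11.GL_FLOAT);
        GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_DEPTH_ATTACHMENT,
                GL11.GL_TEXTURE_2D, depthTex, 0);

        if (GL30.glCheckFramebufferStatus(GL30.GL_FRAMEBUFFER) != GL30.GL_FRAMEBUFFER_COMPLETE) {
            throw new IllegalStateException("G-buffer FBO incomplete");
        }
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, 0);
        return fbo;
    }

    private static int createTexture(int width, int height, int internalFormat, int format, int type) {
        int tex = GL11.glGenTextures();
        GL11.glBindTexture(GL11.GL_TEXTURE_2D, tex);
        GL11.glTexParameteri(GL11.GL_TEXTURE_2D, GL11.GL_TEXTURE_MIN_FILTER, GL11.GL_NEAREST);
        GL11.glTexParameteri(GL11.GL_TEXTURE_2D, GL11.GL_TEXTURE_MAG_FILTER, GL11.GL_NEAREST);
        GL11.glTexImage2D(GL11.GL_TEXTURE_2D, 0, internalFormat, width, height, 0,
                format, type, (ByteBuffer) null);
        return tex;
    }
}
```

The geometry pass then renders into this FBO, and the lighting pass binds the five textures and samples them per pixel.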
Constructing a tile frustum is pretty similar to doing frustum culling on the CPU. You extract the 6 planes and check the light's signed distance to these planes. This should be implemented in a compute shader to be efficient.
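For a point light the per-tile test boils down to something like this (a CPU-style Java sketch for clarity; in practice it runs in the compute shader, with the planes built from the tile's min/max depth):

```java
/** Sketch of culling a point light against one tile frustum.
 *  Plane normals are assumed to point into the frustum. */
public class TileCulling {

    /** A plane stored as (a, b, c, d) with the normal (a, b, c) normalized. */
    public static class Plane {
        public final float a, b, c, d;
        public Plane(float a, float b, float c, float d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }
        /** Signed distance from a point to the plane. */
        public float distance(float x, float y, float z) {
            return a * x + b * y + c * z + d;
        }
    }

    /** Returns false if the light's sphere lies completely outside any of the 6 tile planes. */
    public static boolean lightAffectsTile(Plane[] tilePlanes,
                                           float lx, float ly, float lz, float radius) {
        for (Plane p : tilePlanes) {
            if (p.distance(lx, ly, lz) < -radius) {
                return false; // entirely on the negative side of this plane
            }
        }
        return true; // conservative: can still report some false positives
    }
}
```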
I was working on the 5th iteration (or rather: the 5th 'great code rewrite/refactor') of my Voxel-Engine today.
This is the one-month birthday of this iteration, and I have been working on it almost every second or third day, which is a new record of 'not giving up on it after a month'!
The current result:
14,426 lines of code.
Uses Gson, LWJGL, EventBUS, and KryoNET.
Has a basic GUI-System.
Automated Asset Manager.
'Infinite' Bitmap FontRenderer.
Both Fixed-Function and Programmable-Function 'pipelines' for OpenGL.
FINALLY got the hang of all the right patterns.
And some more stuff…
Screenshot of the bare-bones (yet) non-functional Main-Menu:
You sort of get pulled into a trance staring at them; the only thing that snaps you out of it is the transition between the end and the beginning.
Thought about making a live wallpaper with them?
I would if it weren't so computationally expensive; it takes almost all of a pretty strong CPU to render them at a decent frame rate, so mobile is out of the question.
Each frame you see there is 600K plotted points of the attractor.
Source is updated though, much cleaner (and faster); I'm actually sort of proud of how it runs. https://gist.github.com/BurntPizza/c6f4c7f18daa9950692c
(Adjust the sleep in RenderFrame.run according to your comp's power; I haven't got automatic throttling working yet.)
i did absolutely nothing
i feel so empty
i can't say it's bad cause i know how it feels to feel very bad
new life experience
but i can't say it's cool cause emptiness is not cool
Some developments in this area… Battledroid, it turns out, needs OpenGL 3-level hardware to get the performance we need without having to write some annoying extra code paths, and I think that covers >90% of the customer base, so OpenGL 3.0 or better it is for now. The benefit of glacial development speed is that old hardware gradually disappears. So I'm still slowly diddling away on Battledroid, and Chaz and Alli are working on a secret project… in Unity. (I was also working on Skies of Titan, another Java arcade game, but Chaz is never going to have the time to do both, so we decided on the Unity one as Battledroid needs progress.)