Parallel/concurrent shader program loading

TL;DR / SPOILER ALERT: You don’t seem to be able to do stutterless shader program “streaming” in OpenGL. ARB_parallel_shader_compile is completely broken. The solution with the least amount of stuttering was using cached binaries.

Today I investigated parallel shader loading. Basically, the idea is to speed up loading times by compiling multiple shaders in parallel, ideally while the game is running, without causing stuttering.

In OpenGL, compiling a Shader object is actually very cheap. This step pretty much just does some rough validation to make sure the GLSL code is valid, but doesn’t actually generate any runnable code. Even for very big shaders (shaders with massive loops to unroll, for example), this usually takes <0.5ms per shader, ~0.2ms on average. Even more amazingly, if you compile a shader on a separate thread with a shared context, it doesn’t seem to interrupt rendering at all. This is nice, because it means we can do it in parallel with the game’s rendering without causing any stuttering.
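
To give an idea of what this looks like, here’s a minimal sketch of a compile worker. The GL calls are the standard ones; the two context helpers are hypothetical placeholders, since creating a context that shares objects with the main one is platform-specific (wglShareLists, SDL_GL_SHARE_WITH_CURRENT_CONTEXT, etc.):

    /* Assumes a GL loader like GLEW is already initialized.
     * Shader type is hardcoded for brevity. */
    void compile_worker(const char **sources, GLuint *shaders, int count)
    {
        make_shared_context_current(); /* hypothetical, platform-specific */

        for (int i = 0; i < count; i++) {
            shaders[i] = glCreateShader(GL_FRAGMENT_SHADER);
            glShaderSource(shaders[i], 1, &sources[i], NULL);
            glCompileShader(shaders[i]); /* cheap (~0.2ms), doesn't stall the render thread */
        }

        glFlush(); /* push the commands so the objects become visible to sharing contexts */
        release_context(); /* hypothetical */
    }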

The reason why Shader objects are so cheap to compile is that the actual generation of binary code happens when linking a Program object. This allows the driver to, for example, optimize away outputs of the vertex shader that aren’t used by the fragment shader. This part is significantly more expensive, often costing more than 100ms per program to link, so it makes a lot of sense to attempt to do it concurrently while the game is running, showing loading screens and progress information in the meantime. This is, however, difficult. Each call to glLinkProgram() seems to freeze ALL OpenGL contexts, with the rendering context getting stuck for long durations when swapping buffers. This causes a significant amount of stuttering.
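
For reference, this is the step in question; the calls are the standard ones, with comments marking where the cost lands:

    /* Linking is where the driver actually generates GPU code, so this is
     * the expensive, blocking part (often 100ms+ per program). */
    GLuint link_program(GLuint vs, GLuint fs)
    {
        GLuint program = glCreateProgram();
        glAttachShader(program, vs);
        glAttachShader(program, fs);
        glLinkProgram(program); /* stalls every context, not just the current one */

        GLint status = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &status); /* waits for the result */
        return status ? program : 0;
    }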

An interesting thing I noticed was that the driver does some EXTREMELY aggressive caching behind the scenes. If the same shader code has been compiled before, or a set of shaders has been linked together before, then compiling/linking that same code again is much faster. This speed-up remains even when recreating the shader objects, and even after restarting the program! Compiling a cached shader takes less than 0.1ms, while linking a cached program takes around 2ms (50x faster!). This caching helps a lot with load times when the same shaders are loaded over and over, like a player would. It helps less for developers who modify their shaders a lot, but obviously that’s not that big of a problem. However, even when the caching kicks in, there is still significant stuttering caused by glLinkProgram().

There’s an extension called ARB_parallel_shader_compile which gives you a means of hinting to the driver that you want it to compile and link using multiple threads. Only Nvidia “supports” this extension so far (I’ll explain the quotes below), but it was worth trying out. It only adds two things: the ability to hint how many threads the driver should use to compile/link with, and a way of checking whether a shader/program has finished compiling/linking without actually waiting for the result. This should in theory allow us to compile much faster by using multiple threads, and possibly also remove the stuttering. However, my findings really sucked.

First of all, when supported, ARB_parallel_shader_compile should be “enabled” by default (the default is “implementation chooses the number of threads”), but Nvidia seems to require you to set the number of threads before it actually starts doing anything (even if you just set it to the default value). This was a bit annoying but easy to work around. Now, what this extension seems to do in the Nvidia implementation is make glLinkProgram() no longer block. Instead, the program will only block if you do something that requires the result of the link (checking the link status, getting the info log, getting uniform locations, binding the program, etc.) before it is ready. The idea is to fire off all your glLinkProgram()s without querying the results, and then wait (drawing loading screens or whatever) until the new completion query says they’re done (see the sketch after the list below). However, this is broken beyond belief:

  • It’s unreliable. Half the time the driver ignores the preferred number of threads and simply blocks in glLinkProgram() until linking is complete.
  • The driver doesn’t actually start compiling anything until you do something that needs the result of a program link. In other words, you can fire off all your glLinkProgram()s, sleep for 10 seconds and then check the link result, only to see the driver actually start linking at that point instead of when glLinkProgram() was called, giving you the same stuttering, just from glGetProgramiv() instead.
  • You can trick the driver into starting to link all “queued” programs by getting the link result of the first program you linked, causing the other programs to continue linking in the background. If you wait until the rest are done before querying any more programs, you won’t get any noticeable stuttering, apart from the first program you linked…
  • … except that’s impossible to do, because the new query for checking whether linking is done is broken in Nvidia’s implementation. It always returns false, so it’s impossible to determine whether a program has finished linking without getting the link result. And since the time it takes to link a program varies so much depending on caching, it’s also impossible to predict how long linking will take, making sleeping for a fixed amount of time infeasible as well.

In other words, it’s completely useless.
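
For reference, here’s roughly what the intended workflow looks like. The entry point and token are the real ones from the extension; the polling loop is a sketch of the usage the spec describes, which is exactly the part that doesn’t work on Nvidia:

    #ifndef GL_COMPLETION_STATUS_ARB
    #define GL_COMPLETION_STATUS_ARB 0x91B1
    #endif

    void link_all_async(GLuint *programs, int count)
    {
        /* Nvidia appears to ignore the extension until the thread count is
         * set explicitly, even to the default (0xFFFFFFFF = driver decides). */
        glMaxShaderCompilerThreadsARB(0xFFFFFFFF);

        /* Fire off all the links without touching any results... */
        for (int i = 0; i < count; i++)
            glLinkProgram(programs[i]); /* should no longer block */

        /* ...then poll for completion. In theory this never stalls; in
         * practice the query always returns GL_FALSE on Nvidia. */
        for (int i = 0; i < count; i++) {
            GLint done = GL_FALSE;
            while (!done) {
                glGetProgramiv(programs[i], GL_COMPLETION_STATUS_ARB, &done);
                /* draw a loading screen / yield here */
            }
        }
    }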

Another thing I tried was retrieving program binaries and loading those instead. This removes the need for compiling Shader objects and attaching them to Programs, allowing you to “link” a full program by just handing it a binary blob. When using this path, the ARB_parallel_shader_compile extension seemed to be completely ignored, with all the “linking” happening inside glProgramBinary(). However, you always get the same performance as if the driver had fully cached the program you’re linking (i.e. ~2ms per program) instead of 100ms+, without the risk of the program getting evicted from the cache or anything like that. There was still a considerable amount of stuttering in the rendering while loading this way, though. Still, at a cost of only ~2ms per program it’d be possible to load a couple of programs per frame without dropping frames at 60 FPS.
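
A sketch of the binary round-trip, in case it’s useful; the GL entry points are the standard ones (GL 4.1 / ARB_get_program_binary), with file handling kept minimal:

    #include <stdio.h>
    #include <stdlib.h>

    /* Save a linked program's driver-specific binary to disk. */
    void save_program_binary(GLuint program, const char *path)
    {
        GLint length = 0;
        glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

        void *binary = malloc(length);
        GLenum format = 0;
        glGetProgramBinary(program, length, NULL, &format, binary);

        FILE *f = fopen(path, "wb");
        fwrite(&format, sizeof(format), 1, f);
        fwrite(binary, 1, length, f);
        fclose(f);
        free(binary);
    }

    /* "Link" a program directly from a cached binary (~2ms instead of 100ms+). */
    GLuint load_program_binary(const void *binary, GLsizei length, GLenum format)
    {
        GLuint program = glCreateProgram();
        glProgramBinary(program, format, binary, length);

        GLint status = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &status); /* fails if the driver/GPU changed */
        return status ? program : 0;
    }

Note that program binaries are driver- and hardware-specific, so a failed glProgramBinary() load (GL_LINK_STATUS coming back false) needs a fallback to a full compile+link from source.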