GLSL tricks

Today’s GPUs are very powerful, but it’s important to understand the limitations of the hardware. For example, branching in GLSL can be very expensive because of the way the stream processors on GPUs work. In many cases, especially when neighboring pixels take different paths, both branches are executed and the correct result is picked afterwards.

A general tip when coding shaders is to use the built-in functions as much as possible. They are usually at least as fast as doing the calculations manually, and often faster.

  • Never manually normalize a vector by calculating its length with a square root and dividing the vector by it. Always use normalize().
  • Don’t use branching to clamp values. Use min(), max() and clamp() for that.
  • A very common function is linear blending and there’s a function called mix() for it.
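
As a rough illustration of the three tips above, the sketch below pairs each hand-written version with its built-in equivalent. The function names (normalizeManual, clampBranchy, blendManual and so on) are made up for this example and are not part of GLSL.

```glsl
// Hand-written versions, shown only for comparison.
vec3 normalizeManual(vec3 v) {
    return v / sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

float clampBranchy(float x, float minVal, float maxVal) {
    if (x < minVal) return minVal;
    if (x > maxVal) return maxVal;
    return x;
}

vec3 blendManual(vec3 a, vec3 b, float t) {
    return a * (1.0 - t) + b * t;
}

// Built-in equivalents: same results, no branches, less code.
vec3 normalizeBuiltin(vec3 v) {
    return normalize(v);
}

float clampBuiltin(float x, float minVal, float maxVal) {
    return clamp(x, minVal, maxVal);
}

vec3 blendBuiltin(vec3 a, vec3 b, float t) {
    return mix(a, b, t);
}
```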

Generating random numbers

Generating random numbers on a GPU in the traditional way is impossible since we can’t keep a global seed (well, we can in OGL4+ using atomic counters, but I wouldn’t count on good performance). Random numbers are still useful for introducing noise to counter banding in algorithms like HBAO (randomly rotate the sampling ray) or volumetric lighting (random offsets). This trades banding for noise, which is much harder to spot and looks better when blurred. This one-line function is a pretty simple noise function seeded with a 2D position (you can use the screen position or texture coordinates as the seed).

Generating random numbers on the GPU presents a couple of challenges. The first is that, from a practical standpoint, you start with some non-random data (say a texture coordinate) which needs to be hashed to give a “random” starting value. The second is that most GPUs currently in use are very slow at integer computations, which are invaluable in hashing and in PRNGs. The result is that you need to perform very hacky hashing and random number generation entirely in floating point until your low-end target has full-speed integer support.


float rand(vec2 co){
    return fract(sin(dot(co.xy ,vec2(12.9898,78.233))) * 43758.5453);
}
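
As one possible use of rand(), here is a minimal sketch of the “randomly rotate the sampling ray” idea mentioned above. The helper name randomRotate is hypothetical, and using gl_FragCoord.xy as the seed is just one of the options the text suggests.

```glsl
float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}

// Rotate a 2D sampling direction by a pseudo-random angle derived from 'seed'.
vec2 randomRotate(vec2 dir, vec2 seed) {
    float angle = rand(seed) * 6.2831853; // map [0, 1) to [0, 2*pi)
    float s = sin(angle);
    float c = cos(angle);
    // Standard 2D rotation of 'dir' by 'angle'.
    return vec2(c * dir.x - s * dir.y,
                s * dir.x + c * dir.y);
}
```

In a fragment shader this could be called as randomRotate(vec2(1.0, 0.0), gl_FragCoord.xy), so that every pixel gets a different base direction.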

Permutation polynomials. Examples of use: local value noise, gradient noise and simplex noise.


// shown for float and vec2; repeat for other types as needed
float mod289(float x)   { return x - floor(x * (1.0 / 289.0)) * 289.0; }
vec2  mod289(vec2 x)    { return x - floor(x * (1.0 / 289.0)) * 289.0; }
float permute(float x)  { return mod289(((x*34.0)+1.0)*x); }
vec2  permute(vec2 x)   { return mod289(((x*34.0)+1.0)*x); }
float rng(float x)      { return fract(x*1.0/41.0); }
vec2  rng(vec2 x)       { return fract(x*1.0/41.0); }

// minimal example: take a 2D coordinate and convert into a hash value, then
// generate multiple random numbers by rehashing the hash.
float bar(vec2 x)
{
  float h, r;
  vec2 m = mod289(x);             // values must be kept in [0,289) for precision

  h = permute(permute(m.x)+m.y);  // hash the coordinates together
  r = rng(h);                     // first random number
  ...
  h = permute(h);                 // hash the hash
  r = rng(h);                     // second random number

  return r;                       // return whichever result you need
}

Dot products

The dot() function calculates the dot product of two vectors, which is the same as multiplying the vectors component-wise and then adding the products together. For a 3D vector, that means that [icode]dot(v1, v2) = v1.x*v2.x + v1.y*v2.y + v1.z*v2.z[/icode]. This is a very useful function for doing many things. For example, calculating the distance between two points using Pythagoras’ theorem:


vec3 p1;
vec3 p2;

//...

vec3 delta = p1-p2;
float distSqrd = dot(delta, delta); //Distance^2, can be useful for lighting which saves you the square root
float dist = sqrt(distSqrd); 
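
To show where skipping the square root pays off, here is a small sketch of a range check that compares squared distances. The function name withinRadius and its parameters are invented for this example.

```glsl
// Returns true if 'point' lies within 'radius' of 'center'.
// Comparing squared values avoids the square root entirely.
bool withinRadius(vec3 point, vec3 center, float radius) {
    vec3 delta = point - center;
    return dot(delta, delta) <= radius * radius;
}

// When the actual distance is needed, the built-in does the same work
// as sqrt(dot(delta, delta)).
float distanceBuiltin(vec3 p1, vec3 p2) {
    return distance(p1, p2);
}
```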

Converting a color to grayscale:


vec3 color;

//...

float grayscale = dot(color, vec3(0.2126, 0.7152, 0.0722)); //Rec. 709 luma weights, which sum to 1.0
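
The same dot product can drive a simple desaturation effect when combined with mix(). This is only a sketch: the function name desaturate is made up, and the Rec. 709 luma weights used here are an assumption about which grayscale weighting you want.

```glsl
// Blend a color towards its grayscale value.
// amount = 0.0 keeps the color unchanged, amount = 1.0 gives full grayscale.
vec3 desaturate(vec3 color, float amount) {
    // Rec. 709 luma weights (assumed; other weightings are possible).
    float gray = dot(color, vec3(0.2126, 0.7152, 0.0722));
    return mix(color, vec3(gray), amount);
}
```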

Shadow mapping

Shadow mapping is basically a software depth test against a shadow map. The shadow map coordinates are interpolated as a vec4, so we need to do a w-divide per pixel, get the shadow map depth at that coordinate and compare it to the pixel’s depth. A simple implementation does this:


uniform sampler2D shadowMap;

in vec4 shadowCoord; //interpolated shadow map coordinates from the vertex shader

float shadow(){
    vec3 wDivShadowCoord = shadowCoord.xyz / shadowCoord.w; //w-divide

    float distanceFromLight = texture(shadowMap, wDivShadowCoord.xy).r; //depth is in the first channel
    
    return distanceFromLight < wDivShadowCoord.z ? 0.0 : 1.0;
}

This is not optimal. By using the function called step() we can eliminate the branch by just writing [icode]return step(wDivShadowCoord.z, distanceFromLight);[/icode] instead.
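
Written out in full, the step() version might look like the sketch below. It assumes GLSL 1.30 or later, that shadowCoord is an interpolated input from the vertex shader, and that the depth value is read from the first channel of the shadow map; adjust these to your own setup.

```glsl
uniform sampler2D shadowMap;

in vec4 shadowCoord; // assumed to come from the vertex shader

float shadow() {
    vec3 wDivShadowCoord = shadowCoord.xyz / shadowCoord.w; // w-divide

    // Depth stored in the shadow map at this position (first channel assumed).
    float distanceFromLight = texture(shadowMap, wDivShadowCoord.xy).r;

    // step(edge, x) returns 0.0 if x < edge and 1.0 otherwise, so this is
    // 0.0 (in shadow) when distanceFromLight < wDivShadowCoord.z.
    return step(wDivShadowCoord.z, distanceFromLight);
}
```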

Even better, the GPU can do the shadow test for us in hardware with some basic shadow filtering if we use a sampler2DShadow instead of a normal sampler2D. That way we just feed the w-divided xyz shadow coordinates into it. On the shadow map, set the following parameter to enable hardware shadow testing: [icode]GL11.glTexParameteri(GL11.GL_TEXTURE_2D, GL14.GL_TEXTURE_COMPARE_MODE, GL14.GL_COMPARE_R_TO_TEXTURE);[/icode] and change the sampler type to sampler2DShadow. It’s also possible to enable GL_LINEAR as the texture filter and get 4-tap bilinear PCF filtering.

There is one final optimization. Not only can the GPU do the shadow test in hardware with filtering, it can also do the w-divide in hardware using [icode]textureProj()[/icode]! It can’t get better than that!


uniform sampler2DShadow shadowMap;

float shadow(){
    return textureProj(shadowMap, shadowCoord);
}

We get better performance, better image quality thanks to the PCF filtering AND a simpler shader. However, the first shader is extremely fast anyway, so why optimize it this much? Shadow filtering. To get smoother shadow edges you do lots of shadow tests on nearby pixels in the shadow map, usually 8 to 16 of them. In that case we would’ve gotten 16 branches, not just one, so eliminating them means a lot here. Using hardware filtering also gives you 4 samples per texture lookup instead of just one, allowing you to sample a bigger area.
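
A filtered version with 16 hardware shadow tests could be sketched as follows. The function name shadowPCF, the 4x4 kernel and the one-texel spacing are choices made for this example, not something prescribed above. It assumes the shadow map has the compare mode enabled as described earlier.

```glsl
uniform sampler2DShadow shadowMap;

in vec4 shadowCoord; // assumed to come from the vertex shader

float shadowPCF() {
    vec3 coord = shadowCoord.xyz / shadowCoord.w; // w-divide once, up front

    // Size of one shadow map texel in texture coordinates.
    vec2 texelSize = 1.0 / vec2(textureSize(shadowMap, 0));

    float sum = 0.0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            // Offsets from -1.5 to +1.5 texels, centered on the lookup position.
            vec2 offset = (vec2(x, y) - 1.5) * texelSize;
            // Each lookup is a hardware depth comparison returning 0.0..1.0.
            sum += texture(shadowMap, vec3(coord.xy + offset, coord.z));
        }
    }
    return sum / 16.0; // average of the 16 shadow tests
}
```

With GL_LINEAR enabled on the shadow map, each of the 16 lookups is itself filtered, so the effective sampled area is larger than the loop alone suggests.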

GLSL Gotchas

  • Array declaration is broken on Mac Snow Leopard[1]

I’ll add some more RNGs, but in my opinion atomic counters are a massive red herring. The current problem is that integer performance on most folks’ cards is awful. Once integer performance is good enough you can do the same thing as on the CPU for procedural content…use a hash either for the whole local RNG, or hash to seed a standard generator.

OK, added a permutation polynomial RNG and some references (including a value noise in shared source)…I’ll additionally add a Weyl generator (probably on the PRNG page, as it’s interesting there as well). Also stuck some alternate verbiage on RNG (in italics) that I’ll let the original author do with as they wish.

Atomic counters are only supported by OpenGL 4.2 cards, so are you sure integer performance is bad for that kind of hardware? I’m pretty sure OGL4 hardware can at least get a seed for a random number generator rather quickly. I think the main problem is the atomic operation, not integer performance, but I haven’t tried it out yet (my main PC only has OGL 3).

I was the original poster, but I think you already knew that. =S Your point on random numbers and integer performance on low end hardware is good. However I’m a bit confused by your example code since it starts with


// bad example
float bar(vec2 x){
    ...

I assume you mean that that exact use case is a bad example, not the emulated hashing part?

I’m not clear at what point integer performance became reasonable…since it isn’t on any of my cards I haven’t worried about it. I’m assuming slow integer math is still pretty widespread, since FlipCode recently had a link to a blog from an engineer at either nvidia or amd about emulation of some integer ops in float. I’d have some concern about accessing an atomic counter…it seems like it must be a serializing instruction, which (again) seems like it must bring a bunch of processors to a halt and therefore should only be used sparingly. With full-performance integer operations, however, hashing isn’t too bad and neither are PRNGs of OK to excellent statistical quality. Purely in float, they pretty much all blow chunks, with the exception (to my knowledge) of nested-shifted-weyl generators, which are a bit expensive. (But let’s not place undue emphasis on statistical quality.)

I intended to say “this is a bad example…but hopefully good enough that you can use this.” It is a bad hashing function…but again, sufficient for a number of purposes.

Well, the simple random function I posted turned out to be good enough for me.

Here’s volumetric lighting done by ray-marching with 32 samples.

http://img811.imageshack.us/img811/5041/badvolumetriclighting.png

That banding is horrifying! Adding a per-pixel offset to the sampling depths trades the banding for much less noticeable noise. It works much like almost perfect dithering, making it much harder to see that only 32 samples are used for this.

http://img820.imageshack.us/img820/9216/goodvolumetriclighting.png
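
For anyone wanting to try it, the per-pixel offset could be sketched roughly like this. It is not my exact shader: the function name jitteredSampleDistance and its parameters are made up for illustration, and the seed is assumed to be something like gl_FragCoord.xy.

```glsl
float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}

// Distance along the ray for sample 'i' out of 'sampleCount' samples.
// Every pixel shifts all of its samples by the same random fraction of a
// step, which turns the banding into noise.
float jitteredSampleDistance(int i, int sampleCount, float rayLength, vec2 seed) {
    float stepLength = rayLength / float(sampleCount);
    float offset = rand(seed); // in [0, 1), constant for this pixel
    return (float(i) + offset) * stepLength;
}
```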

I plan on adding a blur to the effect too, which will completely hide the noise with a Gaussian blur with a radius of just 2 (3 optimized taps, two passes). The 15% performance hit of the good version is not due to the random function per se, but due to the lower cache coherency of the texture samples. Lowering the shadow map resolution reveals that the actual shader arithmetic performance difference is around 985 FPS vs 1000 FPS for the non-offset version at 720p with 32 samples, obviously well worth it considering the gains.

Yeah hash/PRNG quality is frequently a red herring. Speed is a different story.