Optimizing Performance

Recently I’ve been using deferred rendering and learning about a lot of post-processing effects and applying them. However, the game only runs at 60 fps (on a 5-year-old laptop) with a somewhat simple scene.

Right now I am using bloom, depth of field, fog, FXAA, HDR, and deferred lighting on a scene with skeletal animation, a skybox, and basic SAT collision detection. Is the current performance to be expected, or is it likely that the shaders need to be optimized?

What I’ve learned about deferred shading (and what theAgentD told me) is that deferred shading makes the engine run at a more constant framerate. So the low FPS areas are brought up, and the higher FPS areas are brought down. This is compared to forward rendering, of course.

Though, it won’t hurt to optimize your shaders :stuck_out_tongue:

Thanks, I just wanted to make sure that I wasn’t screwing anything up. I’m probably going to time the shader passes and see if any are taking up a noticeable amount of time per frame.

Seems that everyone is bugging theagentd with their problems xD

Whether your engine or your rendering runs efficiently depends on many things: for example the scene complexity, the G-buffer resolution, the instruction count of your post effects, etc. My (subjective) feeling is that if you are talking about a GPU of the 540M kind, rendering at a 720p resolution with animations and lots of post effects, 60 fps is okay. You did remember to turn off framerate limiters like vsync, didn’t you? Nonetheless, measuring is the only thing you can do; without measuring, you can’t optimize anything, as you probably know.

Your first step should be getting basic GPU profiling working. See this thread: http://www.java-gaming.org/index.php?topic=33135.0. Using that, you can get the exact time your different render passes and postprocessing effects take, which will allow you to see where you should focus your efforts.

Once you’ve figured out your bottleneck, you can start optimizing it. If you find anything that stands out, I can help you diagnose what’s making that particular part slow.

I ran the profiler and got the following results:

Frame 1063 : 13.555ms
Geometry : 6.223ms
Skybox : 0.648ms
Terrain : 5.439ms
Lighting : 2.362ms
Bloom and HDR : 1.113ms
FXAA : 1.957ms
Depth of Field : 0.491ms
Fog : 0.911ms
Final Render : 0.487ms

It seems the most noticeable slowdowns are for terrain (a 2000x2000 mesh broken into triangles every 10 units) and lighting (that 2ms is one point light and ambient light).

I know one optimization would be using light volumes as opposed to fullscreen passes, but are there any other obvious slowdowns or mistakes in my lighting shader? (The shader code is based on JMonkeyEngine lighting code)

#version 330

uniform sampler2D diffuseTexture;
uniform sampler2D normalTexture;
uniform sampler2D depthTexture;

uniform vec3 lightColor;
uniform vec4 lightPos;
uniform vec3 viewPos;
uniform vec4 lightDirPacked;
uniform float lightRadius;
uniform int directional;

uniform mat4 invProjectionMatrix;
uniform mat4 invViewMatrix;
uniform float near;
uniform float far;

in vec3 pass_Position;
in vec2 pass_TextureCoord;

out vec4 out_Color;

const float kPi = 3.14159265;
const float kShininess = 16.0;
const float kEnergyConservation = (8.0 + kShininess) / (8.0 * kPi);

float getAttenuation(float distance) {
	if (distance > lightRadius) {
		return 0;
	}
	float x = distance / lightRadius;
	return 1 / (1 + x * x);
}

vec3 reconstructPosition() {
	vec4 clipSpaceLocation;
	clipSpaceLocation.xy = pass_TextureCoord * 2.0 - 1.0;
	clipSpaceLocation.z = texture2D(depthTexture, pass_TextureCoord).r * 2.0 - 1.0;
	clipSpaceLocation.w = 1.0;
	vec4 homogenousLocation = invViewMatrix * invProjectionMatrix * clipSpaceLocation;
	return homogenousLocation.xyz / homogenousLocation.w;
}

float computeSpecular(vec3 normal, vec3 viewDir, vec3 lightDir, float shininess) {
	vec3 halfwayDir = (viewDir + lightDir) * vec3(0.5);
	return pow(max(dot(halfwayDir, normal), 0.0), shininess);
}

float computeDiffuse(vec3 normal, vec3 viewDir, vec3 lightDir) {
	return max(0.0, dot(normal, lightDir));
}

vec2 computeLighting(vec3 position, vec3 normal, vec3 viewDir, vec4 lightDir, float shininess) {
	float diffuseFactor = computeDiffuse(normal, viewDir, lightDir.xyz);
	float specularFactor = computeSpecular(normal, viewDir, lightDir.xyz, shininess);
	return vec2(diffuseFactor, specularFactor) * vec2(lightDir.w);
}

float computeSpotFalloff(vec4 lightDir, vec3 lightVec) {
	vec3 L = normalize(lightVec);
	vec3 spotDir = normalize(lightDir.xyz);
	float curAngleCos = dot(-L, spotDir);
	float innerAngleCos = floor(lightDir.w) * 0.0001;
	float outerAngleCos = fract(lightDir.w);
	float angle = (curAngleCos - outerAngleCos) / (innerAngleCos - outerAngleCos);
	float falloff = clamp(angle, step(lightDir.w, 0.001), 1.0);
	return pow(clamp(angle, 0.0, 1.0), 4.0);
}

vec4 lightComputeDir(vec3 worldPos, vec4 color, vec4 position, vec4 spotDir) {
	if (directional == 0) {
		return vec4(-position.xyz, 1.0);
	}
	vec3 lightVec = position.xyz - worldPos.xyz;
	vec4 lightDir = vec4(0.0);
	lightDir.xyz = lightVec;
	float dist = length(lightDir.xyz);
	lightDir.w = clamp(1.0 - position.w * dist, 0.0, 1.0);
	lightDir.xyz /= dist;
	if (directional == 2) {
		lightDir.w = computeSpotFalloff(spotDir, lightVec) * lightDir.w;
	}
	return lightDir;
}

float computeOcclusion(vec3 worldPos, vec3 lightPos, vec3 cameraPos) {
	//float distanceToLight = length(lightPos - cameraPos);
	//float distanceToFragment = length(worldPos - cameraPos);
	//return distanceToLight <= distanceToFragment ? 1.0 : 0.0;
	return 1.0;
}

void main(void) {
	vec4 diffuseColor = texture2D(diffuseTexture, pass_TextureCoord);
	if (diffuseColor.a == 0.0) {
		discard;
	}

	vec3 normal = normalize(texture2D(normalTexture, pass_TextureCoord).rgb * 2.0 - 1.0);
	vec3 position = reconstructPosition();

	vec3 viewDir = normalize(viewPos - position);
	vec4 lightDir = lightComputeDir(position, vec4(lightColor, 1.0), lightPos, lightDirPacked);
	vec2 light = computeLighting(position, normal, viewDir, lightDir, 32.0);
	vec4 color = vec4(light.x * diffuseColor.xyz + light.y * vec3(1.0), 1.0);
	out_Color = color * vec4(lightColor, 1.0) * computeOcclusion(position, lightPos.xyz, viewPos);
}

I strongly recommend that you first of all try to optimize the terrain rendering. It’s by far the slowest part, so optimizing it should be a priority. You should try to figure out what’s making it slow. Either you’re drawing too many triangles and/or your vertex shader is too expensive, so processing the geometry is the slow part, or the slow part is processing the pixels. You can diagnose this by changing the resolution you render at. If you halve the resolution (width/2, height/2), does the timing of the terrain rendering stay the same?

  • Performance is ~4x faster --> the slow part is either the fragment shader and/or the writing of the data to the G-buffer textures, so we should take a look at the fragment shader of the terrain to investigate further.
  • Performance roughly stays the same --> the slow part is the sheer number of vertices and/or the vertex shader. Take a look at the vertex shader and consider adding a LOD system to reduce the number of vertices of distant terrain, if you don’t already have that.

Concerning your lighting shader…

  • Be careful! Some compilers don’t accept automatic int-->float conversion. The integer literals in getAttenuation (return 0; and 1 / (1 + x * x)) seem to cause issues on at least some AMD hardware. Make sure those are float literals (0.0 and 1.0 / (1.0 + x * x)).

  • I’d recommend having different shaders for different types of lights. If you want, you can inject #defines into the source code to specialize the shader for different lights instead of relying on runtime branching on uniform variables. Although branching is not usually very expensive anymore (especially branching on uniforms as that means all shader invocations will take the same branch), it still forces the GPU to allocate enough registers for the worst case branch for all invocations, which can negatively impact texture read performance.
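As a sketch of that idea (the LIGHT_TYPE define here is a hypothetical name; your Java code would prepend it to the source right after the #version line before compiling one program per light type), the runtime branches on the directional uniform could become compile-time branches:

```glsl
// Hypothetical: the host code prepends "#define LIGHT_TYPE 0" (directional),
// "1" (point) or "2" (spot) before compiling, so each compiled program
// contains only the code path for its own light type.
#if LIGHT_TYPE == 0
	// directional: the light direction is constant, no per-pixel attenuation
	vec4 lightDir = vec4(-lightPos.xyz, 1.0);
#else
	// point/spot: keep the full per-pixel direction/attenuation computation
	vec4 lightDir = lightComputeDir(position, vec4(lightColor, 1.0), lightPos, lightDirPacked);
#endif
```

The dead branches are stripped at compile time, so each specialized shader only pays the register cost of its own path.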

  • Consider using a signed texture format for your normalTexture (GL_RGB8_SNORM for example). They’ll be more accurate and are automatically normalized to (-1, +1) instead of (0, 1), so you won’t have to do that conversion yourself.
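With a signed format the decode in the fragment shader also gets simpler; assuming the G-buffer normal attachment is created with GL_RGB8_SNORM, the lookup in your shader would become:

```glsl
// Samples from a GL_RGB8_SNORM attachment are already in [-1, +1],
// so the "* 2.0 - 1.0" decode disappears:
vec3 normal = normalize(texture2D(normalTexture, pass_TextureCoord).rgb);
```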

  • You seem to be doing lighting in world space instead of view space, which is more common. Doing it in view space has the advantage of placing the camera at (0, 0, 0), which simplifies some of the math you have.

  • Even if you choose to do the lighting in world space, precompute the inverse view-projection matrix and do a single matrix multiply. Currently, around half of the assembly instructions in your shader come from this single line:

vec4 homogenousLocation = invViewMatrix * invProjectionMatrix * clipSpaceLocation;

This line first calculates a mat4*mat4 operation, which you might recognize requires computing a 4D dot product for every single element in the matrix. This requires 4 operations per element, so that’s 64 operations right there. The resulting matrix is then used to do a mat4*vec4 operation, which is much cheaper; this only requires 4 dot products = 16 operations. That means that changing it to the following code is 60% faster as it avoids the mat4*mat4 operation:

vec4 homogenousLocation = invViewMatrix * (invProjectionMatrix * clipSpaceLocation);

but the fastest will always be

vec4 homogenousLocation = invViewProjectionMatrix * clipSpaceLocation;

which will make the entire shader around 80% faster in total. In other words, doing that instead gets rid of 44% of the instructions in your entire shader. As your lighting shader is ALU-bound (lots of math instructions), you’re likely to see a very significant performance increase from that optimization alone.
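Put together, the reconstruction function shrinks to something like this (a sketch, assuming a single invViewProjectionMatrix uniform computed once per frame on the CPU as inverse(projection * view)):

```glsl
uniform mat4 invViewProjectionMatrix; // inverse(projection * view), built once per frame on the CPU

vec3 reconstructPosition() {
	vec4 clipSpaceLocation;
	clipSpaceLocation.xy = pass_TextureCoord * 2.0 - 1.0;
	clipSpaceLocation.z = texture2D(depthTexture, pass_TextureCoord).r * 2.0 - 1.0;
	clipSpaceLocation.w = 1.0;
	// a single mat4 * vec4 (16 operations) instead of mat4 * mat4 * vec4 (80)
	vec4 homogenousLocation = invViewProjectionMatrix * clipSpaceLocation;
	return homogenousLocation.xyz / homogenousLocation.w;
}
```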

Together, all the optimizations above (signed normal texture + view space lighting + precomputed matrix) should theoretically yield an 87% increase in performance of the lighting.

In addition:

  • Doing a fullscreen pass for the ambient light is extremely inefficient. That requires you to do an entire fullscreen pass just to add the ambientLight*diffuseColor to the lighting computation. This requires your GPU to read in the entire diffuse texture and blend with the entire lighting buffer, which is going to involve gigabytes of memory moved around and millions of pixels filled just so you can do three math instructions per pixel. I can see that you’re doing bloom/HDR right after the lighting. See if you can pack the ambient light calculation into one of those shaders instead. Adding 3 math instructions to a different shader is always going to be faster than doing an entire fullscreen pass.
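For example, if the tonemapping shader already samples the lighting buffer, the ambient term can ride along for three extra instructions. A minimal sketch, where ambientColor, the sampler names, and the placeholder tonemap() are all assumptions standing in for whatever your HDR pass actually does:

```glsl
// Sketch: fold the ambient term into the existing HDR/tonemap pass
// instead of paying for a dedicated fullscreen blend pass.
#version 330

uniform sampler2D lightingTexture; // assumed: the HDR lighting result
uniform sampler2D diffuseTexture;  // the G-buffer albedo
uniform vec3 ambientColor;         // hypothetical ambient light uniform

in vec2 pass_TextureCoord;
out vec4 out_Color;

// placeholder Reinhard operator; your pass would use its own curve
vec3 tonemap(vec3 c) { return c / (c + vec3(1.0)); }

void main(void) {
	vec3 lit = texture2D(lightingTexture, pass_TextureCoord).rgb;
	vec3 albedo = texture2D(diffuseTexture, pass_TextureCoord).rgb;
	lit += ambientColor * albedo; // the three extra instructions
	out_Color = vec4(tonemap(lit), 1.0);
}
```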

  • I get the impression that fog could be merged into another shader as well to save the overhead of a fullscreen pass (like the DoF).

  • Your FXAA shader looks a bit expensive for some reason. Are you using a custom one?

  • You shouldn’t be doing fullscreen passes for local lights either. There are a number of different techniques for making sure you’re not rendering too many unnecessary pixels.

  • Render an actual sphere and only compute lighting for the pixels covered by the sphere (pretty simple to implement).
  • Use the scissor test and the depth bounds test to only process pixels in a rectangular area around the sphere that are within the depth bounds of the sphere (best and fastest; simple, but some complicated math to calculate everything).

Consider implementing one of those.
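A minimal sketch of the sphere-volume approach (the uniform and attribute names are assumptions): the vertex shader places a unit-sphere mesh at the light, scaled to its radius, so only covered pixels get shaded.

```glsl
// Vertex shader for a point-light volume: a unit-sphere mesh scaled
// to the light radius and translated to the light position.
#version 330

uniform mat4 viewProjectionMatrix;
uniform vec3 lightPos;
uniform float lightRadius;

in vec3 in_Position; // unit-sphere vertex

void main(void) {
	vec3 worldPos = in_Position * lightRadius + lightPos;
	gl_Position = viewProjectionMatrix * vec4(worldPos, 1.0);
}
```

In the fragment shader, vec2 uv = gl_FragCoord.xy / screenSize; replaces the fullscreen-quad texture coordinate, and rendering the back faces with glCullFace(GL_FRONT) keeps the volume working when the camera is inside the sphere.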

Thanks for the notes. I have already implemented some of them and will post some updated timings when I’m done implementing the rest. But so far the scene renders at 75fps :slight_smile:

Cool, let me know if you have any questions. I’d love to hear about your results when you’re ready, too. =P

I’ve been working on my terrain shader, and I’ve encountered an interesting situation.

This code runs at 80fps:

void main(void) {
	vec4 blendSample = texture2D(blendMap, pass_TextureCoord);
	vec2 terrainTextureCoord = pass_TextureCoord * 75.0;
	vec4 rSample = texture2D(rTexture, terrainTextureCoord) * blendSample.r;
	vec4 gSample = texture2D(gTexture, terrainTextureCoord) * blendSample.g;
	vec4 bSample = texture2D(bTexture, terrainTextureCoord) * blendSample.b;
	vec4 aSample = texture2D(aTexture, terrainTextureCoord) * (1.0 - (blendSample.r + blendSample.g + blendSample.b));
	out_Color = rSample + gSample + bSample + aSample;
}

While this code runs at 125fps:

void main(void) {
	vec4 blendSample = texture2D(blendMap, pass_TextureCoord);
	vec2 terrainTextureCoord = pass_TextureCoord;
	vec4 rSample = texture2D(rTexture, terrainTextureCoord) * blendSample.r;
	vec4 gSample = texture2D(gTexture, terrainTextureCoord) * blendSample.g;
	vec4 bSample = texture2D(bTexture, terrainTextureCoord) * blendSample.b;
	vec4 aSample = texture2D(aTexture, terrainTextureCoord) * (1.0 - (blendSample.r + blendSample.g + blendSample.b));
	out_Color = rSample + gSample + bSample + aSample;
}

Are you on mobile? I’ve heard about that being a major issue on mobile, but never on desktop. Mobile likes to prefetch texture data before the shader starts, which isn’t possible if you need to run the shader to figure out the texture coordinates.

No I’m testing this on a laptop. Although the laptop is somewhat old.

I tried moving the texture coord calculation to the vertex shader, but that didn’t affect the time the shader takes.

Although I’m also curious as to why the timings change for each chunk of terrain rendered. The first chunk takes 42x as long to render as the fourth chunk even though they are all the same size.

Frame 414 : 7.737ms
Geometry : 2.094ms
Skybox : 0.709ms
Terrain : 1.268ms
Terrain Chunk 0 : 0.845ms
Texture Pack : 0.001ms
Terrain Chunk 1 : 0.2ms
Terrain Chunk 2 : 0.194ms
Terrain Chunk 3 : 0.023ms
Lighting : 1.968ms
Bloom and HDR : 1.32ms
FXAA : 1.083ms
Environment : 0.782ms
Final Render : 0.48ms

Are they also the same size on the framebuffer?

They’re the same size, with the same distribution of vertices, and the same matrices applied to transform each point. I also don’t do any culling or other mesh optimizations.

I mean that you may be fragment shader bound, hence chunk 0 may be responsible for much more rendered fragments than the other chunks.

What happens to the stats when you look up to the sky, without any pixel showing terrain?

Yeah, I just checked and it is fragment shader bound. I rotated the camera and chunk 3 became the most expensive chunk.

I’ve been rewriting my lighting shaders, and in the new shader, the diffuse factor [icode]dot(N, L)[/icode] only produces diffuse light in a semicircle.

#version 330

uniform mat4 invProjectionMatrix;

uniform sampler2D diffuseTexture;
uniform sampler2D depthTexture;
uniform sampler2D normalTexture;

uniform vec3 lightColor;
uniform vec3 eyePosition;
uniform float radius;

in vec3 pass_LightPos;
in vec2 pass_TextureCoord;
in mat3 pass_NormalMatrix;

out vec4 out_Color;

void main(void) {
	vec4 clipPos = vec4(vec3(pass_TextureCoord, texture2D(depthTexture, pass_TextureCoord).r) * 2.0 - 1.0, 1.0);
	vec4 eyeSpace = invProjectionMatrix * clipPos;
	eyeSpace.xyz /= eyeSpace.w;
	vec3 distanceToLight = pass_LightPos - eyeSpace.xyz;
	float distance = length(distanceToLight);
	vec3 lightDir = normalize(distanceToLight);
	vec3 normal = texture2D(normalTexture, pass_TextureCoord).rgb * 2.0 - 1.0;
	vec3 albedo = texture2D(diffuseTexture, pass_TextureCoord).rgb;
	float attenuation = 1.0 - clamp(distance / radius, 0.0, 1.0);
	float diffuseFactor = max(dot(normal, lightDir), 0.0);
	if (diffuseFactor == 0) {
		discard;
	}
	vec3 diffuse = diffuseFactor * albedo * lightColor * attenuation;
	out_Color = vec4(diffuse, 1.0);
}

I’m sure it would be better to write these lines:

   float diffuseFactor = max(dot(normal, lightDir), 0.0);
   if (diffuseFactor == 0) {
       discard;
   }

as:

   float diffuseFactor = dot(normal, lightDir);
   if (diffuseFactor <= 0) {
       discard;
   }