Global Illumination via Voxel Cone Tracing in LWJGL

Oww, that sounds slow? Triangle count? Voxelization timings? Total frame timings? Are you using conservative rasterization for it? Voxel resolution? Memory usage?

It’s because even though the URL ends with .jpg, the http-response is actually HTML, which contains an img-tag to http://s14.postimg.org/azy70ypnz/vct_arealight.jpg

http://s14.postimg.org/azy70ypnz/vct_arealight.jpg

Awww, don’t leave us hanging! q.q I-if it’s medals you want, I’m sure we can agree on a price… :slight_smile:

postimg.org is quite insidious. Every direct link to an image eventually turns into an HTML page with a new direct link to the image… Please use imgur.com instead.

Sorry, haven’t uploaded something for a long time. I changed the latest image to use imgur. Thanks.

@theagentd: Of course it’s slow, it’s voxel cone tracing :smiley: Sorry, can’t give you timings right now, busy. I’m currently using a 256^3 grid, voxelizing Sponza, which has ~250k triangles, I think; the Ferrari has a dozen thousand k triangles… I’m doing some post-processing and using a custom deferred renderer. I’m getting around 35-40 frames per second on my GTX 770. No conservative rasterization of course. I don’t think one would use this on a GPU slower than a GTX 970, because it’s quite performance heavy.
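For anyone wondering about the memory side of a 256^3 grid, here is a rough back-of-the-envelope sketch in Java. The texel format is not stated in this thread, so the 4 bytes per texel (RGBA8) is purely an assumption, and the class and method names are made up for illustration. Under that assumption the base level alone is 64 MiB, and a full mip chain adds roughly one seventh on top.

```java
public class VoxelGridMemory {
    // Texel count of a single cubic mip level with the given edge length.
    static long texels(int edge) {
        return (long) edge * edge * edge;
    }

    // Total texel count of a full mip chain, halving the edge down to 1.
    static long mipChainTexels(int edge) {
        long total = 0;
        for (int e = edge; e >= 1; e /= 2) {
            total += texels(e);
        }
        return total;
    }

    public static void main(String[] args) {
        int edge = 256;          // grid resolution mentioned in the thread
        int bytesPerTexel = 4;   // ASSUMPTION: RGBA8, not confirmed by the author

        long baseBytes = texels(edge) * bytesPerTexel;
        long chainBytes = mipChainTexels(edge) * bytesPerTexel;

        System.out.printf("base level: %.1f MiB%n", baseBytes / (1024.0 * 1024.0));
        System.out.printf("with mips:  %.1f MiB%n", chainBytes / (1024.0 * 1024.0));
    }
}
```

A wider format (for example 16-bit floats per channel) would scale these numbers linearly, as would storing separate volumes for several attributes.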

If I find the time at some point, I’ll implement sparse voxelization with static/dynamic geometry. There’s a long way to go; the current implementation should just give the Java guys an impression of what can be done.

Finally had time to implement it more properly - can look pretty good

Looks absolutely fantastic!

Looks awesome. Let the medal slam begin :wink:

HHNNNNNNNGNGGGGGG

what about the framerate? ;D

sorry, can’t say much about it right now, because my gpu got busted and I haven’t had time yet to buy a new pc - I’m currently working on my notebook with a geforce 730m, which is so sad…needless to say my framerate is not above 20.

If you’re in need of other testing hardware, I think I’m not the only one willing to help you out with that. I can run it on a GeForce GTX 980M for you. :point:

Okay, for everyone interested in more info about my implementation, which in fact is still very naive, here are some performance facts:

Running on my good old GTX770, it takes:

0.4ms to reset all voxels with glClearTexImage (no distinction between static and dynamic voxels yet)
12.7ms to voxelize sponza completely (no atomic average used)
15.8ms for vct fullscreen post-process (1.9ms for diffuse cone trace only, 8ms for specular tracing only (straaaange), 4 diffuse cones, 1 specular)
3.34ms for mipmap generation with custom compute shader

so when revoxelization has to be done (object moves, light moves), my implementation doesn’t get the 30fps any more for sponza on my card. Currently working on a solution with distinction between static and dynamic objects…and a version with unlimited bounces of gi of course :slight_smile:
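To make the 30fps remark concrete, here is a small Java sketch that just adds up the four GPU timings listed above and compares them with a 30fps frame budget. Only the four numbers come from this thread; the class and method names are made up, and the cost of the rest of the deferred renderer is unknown, so it is left out.

```java
public class GiFrameBudget {
    // GPU timings quoted in the thread, in milliseconds (GTX 770, Sponza).
    static final double CLEAR_MS = 0.4;
    static final double VOXELIZE_MS = 12.7;
    static final double CONE_TRACE_MS = 15.8;
    static final double MIPMAP_MS = 3.34;

    // Frame budget in milliseconds for a target frame rate.
    static double budgetMs(double fps) {
        return 1000.0 / fps;
    }

    // Cost of the GI passes when the whole scene is revoxelized this frame.
    static double fullRevoxelizationMs() {
        return CLEAR_MS + VOXELIZE_MS + MIPMAP_MS + CONE_TRACE_MS;
    }

    public static void main(String[] args) {
        double gi = fullRevoxelizationMs();
        double budget = budgetMs(30.0);
        System.out.printf("GI passes: %.2f ms of a %.2f ms budget%n", gi, budget);
        System.out.printf("left for everything else: %.2f ms%n", budget - gi);
    }
}
```

The GI passes alone come to about 32.2 ms, which leaves roughly 1 ms of a 33.3 ms frame for G-buffer, lighting and post-processing. When nothing moves, only the 15.8 ms trace has to run, which is why separating static from dynamic geometry looks attractive.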

Not strange at all. A specular cone is thinner and therefore requires more iterations. Since it also reads from a larger mipmap, you’re probably completely thrashing the texture cache, further screwing up performance.
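The iteration-count argument can be sketched on the CPU. The Java snippet below assumes the common cone-marching scheme where each step advances by the current cone diameter and the mip level is chosen from that diameter; this is not the poster's shader, and the aperture angles, the step rule and all names are assumptions for illustration only.

```java
public class ConeStepCount {
    // Cone diameter at distance t, clamped to one voxel so the march always advances.
    static double diameterAt(double apertureDegrees, double t, double voxelSize) {
        double tanHalf = Math.tan(Math.toRadians(apertureDegrees) / 2.0);
        return Math.max(voxelSize, 2.0 * tanHalf * t);
    }

    // Mip level whose voxel size matches the cone diameter (0 = full resolution).
    static double mipLevel(double diameter, double voxelSize) {
        return Math.max(0.0, Math.log(diameter / voxelSize) / Math.log(2.0));
    }

    // Number of samples taken when marching from one voxel out to maxDistance.
    static int countSteps(double apertureDegrees, double maxDistance, double voxelSize) {
        double t = voxelSize; // start one voxel away to avoid sampling the surface itself
        int steps = 0;
        while (t < maxDistance) {
            t += diameterAt(apertureDegrees, t, voxelSize); // ASSUMPTION: step = diameter
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        double voxelSize = 1.0 / 256.0; // unit cube split into 256^3 voxels
        System.out.println("wide (60 deg) steps:   " + countSteps(60.0, 1.0, voxelSize));
        System.out.println("narrow (10 deg) steps: " + countSteps(10.0, 1.0, voxelSize));
    }
}
```

A narrow cone grows its diameter slowly, so it takes more steps over the same distance and spends more of them at low mip levels, where the texture is largest and cache misses are most likely.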

Gpu timers, right?

Primitive and mesh count?

That’s not what I found strange. What’s strange is that tracing diffuse and specular separately takes less time in total than tracing both together.

@elect: Of course. Triangle count is ~260k. I’m drawing 393 entities, meaning I use 393 vertex and index buffers to draw the scene, because there is no global buffer.

This can be explained with registers. Your shaders require a certain amount of registers for each shader invocation. GPUs rely on having multiple shader invocations in registers at the same time to quickly be able to switch to another invocation if one stalls due to a texture cache miss. Merging two shaders into one can cause the register usage to increase, reducing the number of invocations your GPU can keep in registers at the same time, hence reducing texture performance if you’re thrashing the cache, which you are. Keeping them separate and using blending to combine the result is probably a good idea in this case.
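A very simplified model of that trade-off, as a Java sketch: the register file size, warp size and warp limit below are placeholder values (assumptions, not measured on any particular card), the register counts are illustrative, and real hardware also rounds allocations and has other limits that are ignored here.

```java
public class RegisterOccupancy {
    // ASSUMPTIONS: illustrative per-multiprocessor limits, not taken from the thread.
    static final int REGISTER_FILE = 65536;  // 32-bit registers available
    static final int THREADS_PER_WARP = 32;
    static final int MAX_WARPS = 64;         // scheduler limit on resident warps

    // How many warps can be resident at once for a given per-invocation register count.
    static int residentWarps(int registersPerThread) {
        int limitedByRegisters = REGISTER_FILE / (registersPerThread * THREADS_PER_WARP);
        return Math.min(MAX_WARPS, limitedByRegisters);
    }

    public static void main(String[] args) {
        // Hypothetical register counts for two separate shaders vs. one merged shader.
        int[] counts = {32, 64, 128};
        for (int regs : counts) {
            System.out.println(regs + " registers -> " + residentWarps(regs) + " resident warps");
        }
    }
}
```

Fewer resident warps means fewer other invocations to switch to while one waits on a texture fetch, so a merged shader with higher register usage can hide less latency than two lighter shaders run back to back.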

thank you for your explanation, that’s what I’ve already guessed :slight_smile:

There’s a bug in the Nvidia shader compiler that sometimes makes shaders use far more registers than they need. They’re refusing to acknowledge the problem, so I’ve given up on reporting it. Basically, it sometimes decides to store all your texture samples in temporary registers first and only then sum them up.

Example shader to reproduce the bug:


#version 150


//We definitely want the loop unrolled or performance is horrible
#pragma optionNV(unroll all)


//Disabling inlining makes the register count constant (2 registers) but has lots of overhead
//#pragma optionNV(inline 0)


/*
Register usage scales linearly with samples if inlining is on. Examples:
 -  64 samples: 34 registers
 - 128 samples: 66 registers
 - 256 samples: 130 registers
THIS SHOULD NOT HAPPEN. The shader is easily executed unrolled and inlined with only
2-3 registers regardless of sample count, and the extra register usage makes the shader
take 100-1000x longer to run at higher sample counts.
 */
#define SAMPLES 256



uniform sampler2D tex1;
uniform sampler2D tex2;

out vec4 fragColor;

void sample(inout vec3 sum1, inout float sum2, vec2 sampleCoords){
	vec3 v1 = texture(tex1, sampleCoords).rgb;
	float v2 = texture(tex2, sampleCoords).r;
	
	sum1 += v1 * v2;
	sum2 += v2;
}

void main(){

	vec3 sum1 = vec3(0);
	float sum2 = 0;
	
	for(float i = 0; i < SAMPLES; i++){
		sample(sum1, sum2, gl_FragCoord.xy + float(i));
	}
	
	fragColor = vec4(sum1 + sum2, 1.0);
}

Quick question. In your void sample function, is that vec2 parameter in, out, or inout? :o

The default is in.