Multiple shader passes LWJGL

Hello,

I have a simple question, but I can hardly find any info about it.
How to use multiple shader passes on some rendered data?

Usually I can just bind one shader program (one fragment and one vertex shader) and do the drawing.
But how does this work when I want to use two vertex shaders and one fragment shader, for example for a Gaussian blur?
And what should I do if I want to do another pass to post-process the blurred data?

So i want to do this:

  • one pass horizontal blur
  • one pass vertical blur
  • one pass post-processing

Is it possible to do this all at once, or do I need to buffer the scene to an FBO each pass?
Thanks in advance :)

You need to use FBOs. See my tutorials here on the subject:


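The short version of what the tutorials do each frame, as a rough, untested LWJGL sketch (GL30 bindings). texA/texB, the shader binding and the drawScene()/drawFullscreenQuad() helpers are placeholders for your own code:

/* MultiPassBlur.java -- untested sketch of the pass structure. */
import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL30;

public class MultiPassBlur {

    /* One FBO per intermediate target, each with a color texture attached. */
    static int createFbo(int texture) {
        int fbo = GL30.glGenFramebuffers();
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fbo);
        GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_COLOR_ATTACHMENT0,
                                    GL11.GL_TEXTURE_2D, texture, 0);
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, 0);
        return fbo;
    }

    static void renderFrame(int fboA, int fboB, int texA, int texB) {
        // Pass 1: render the scene into fboA (the result lands in texA).
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fboA);
        // ... bind scene shader, drawScene() ...

        // Pass 2: horizontal blur -- read texA, write into fboB.
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fboB);
        GL11.glBindTexture(GL11.GL_TEXTURE_2D, texA);
        // ... bind horizontal blur shader, drawFullscreenQuad() ...

        // Pass 3: vertical blur -- read texB, write back into fboA.
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fboA);
        GL11.glBindTexture(GL11.GL_TEXTURE_2D, texB);
        // ... bind vertical blur shader, drawFullscreenQuad() ...

        // Pass 4: post-process -- read texA, write to the default framebuffer.
        GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, 0);
        GL11.glBindTexture(GL11.GL_TEXTURE_2D, texA);
        // ... bind post-process shader, drawFullscreenQuad() ...
    }
}
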
Nice tutorials, thanks.

It seems really slow to use multiple FBOs just for a little blur :(.

Slow in development time or rendering time? Never assume anything; measure.

My box2dLights uses a two-pass Gaussian blur, and it runs fine even on Android devices that are a couple of years old.

The reason to use a two-pass blur instead of a one-pass blur is that you need only 2N samples instead of N^2; for example, a 15x15 kernel costs 30 samples per pixel instead of 225. For bigger kernels this saving is a really big factor.

http://www.unrealengine.com/files/downloads/Smedberg_Niklas_Bringing_AAA_Graphics.pdf
In that paper they present a six-pass god-ray post-process effect, and it runs fine on an iPad 2.

I meant rendering time, and it does affect some devices.
My laptop with an HD 3000 or something needs 1 ms to render one FBO pass, but maybe that's an exception.

I understand why I need two passes; that's why I created this thread.
Also, your tutorial explains this pretty clearly =D

Thanks for all of your info, it really helps.
Now I only need to implement some liquid behaviour, and then I can show the result :).

Amazing presentation btw, this stuff is informative =D

I’ve never seen a graphics card that renders slower when rendering to an FBO instead of directly to the window. The slow part usually depends more on how many pixels you process and how expensive the fragment shader is. It’s mostly independent of what you render to as long as the render target has the same bit depth.

You can optimize it even further by exploiting bilinear filtering to get a correctly weighted average of two texels per texture sample. You can achieve a 9x9 Gaussian blur using only 5+5 texture samples, which means you'll go from 81 samples down to 10. Another trick is that you can do a 3x3 Gaussian blur using only 4 texture samples in a single pass.
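If you want to see where those numbers come from: the merged offset of two neighbouring taps is just their weight-weighted average position. A small untested Java sketch; the binomial kernel (1 8 28 56 70 56 28 8 1)/256 is only an example:

/* LinearSamplingWeights.java -- fold a discrete 9-tap Gaussian into
   1 center tap + 4 bilinear taps = 5 texture samples per pass. */
public class LinearSamplingWeights {
    public static void main(String[] args) {
        // One side of the kernel, from the center tap outwards.
        double[] w = { 70 / 256.0, 56 / 256.0, 28 / 256.0, 8 / 256.0, 1 / 256.0 };
        System.out.printf("center: offset 0.0, weight %.5f%n", w[0]);
        for (int i = 1; i < w.length; i += 2) {
            // Bilinear filtering returns a weighted average of two adjacent
            // texels, so one sample placed between texels i and i+1 yields
            // the correctly weighted sum of both in a single fetch.
            double weight = w[i] + w[i + 1];
            double offset = (i * w[i] + (i + 1) * w[i + 1]) / weight;
            System.out.printf("pair %d+%d: offset +/-%.4f texels, weight %.5f%n",
                              i, i + 1, offset, weight);
        }
    }
}
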

Already doing the linear sampling trick.
I'm also calculating texture coordinates in the vertex shader and passing them as varyings, to get rid of all "dependent" texture lookups on mobile hardware and of some fragment shader math. That last part gives over twice the performance compared to the traditional approach. Still, I don't think it would do any good on PC hardware.

What do you mean by calculating texcoords in the vertex shader?

Just use simple constants in the fragment shader.

Noo, calculating texture coordinates every pixel is expensive.
The best way is to precalculate these variables:

/* HBlurVertexShader.glsl */
attribute vec4 a_position;
attribute vec2 a_texCoord;
 
varying vec2 v_texCoord;
varying vec2 v_blurTexCoords[14];
 
void main()
{
    gl_Position = a_position;
    v_texCoord = a_texCoord;
    v_blurTexCoords[ 0] = v_texCoord + vec2(-0.028, 0.0);
    v_blurTexCoords[ 1] = v_texCoord + vec2(-0.024, 0.0);
    v_blurTexCoords[ 2] = v_texCoord + vec2(-0.020, 0.0);
    v_blurTexCoords[ 3] = v_texCoord + vec2(-0.016, 0.0);
    v_blurTexCoords[ 4] = v_texCoord + vec2(-0.012, 0.0);
    v_blurTexCoords[ 5] = v_texCoord + vec2(-0.008, 0.0);
    v_blurTexCoords[ 6] = v_texCoord + vec2(-0.004, 0.0);
    v_blurTexCoords[ 7] = v_texCoord + vec2( 0.004, 0.0);
    v_blurTexCoords[ 8] = v_texCoord + vec2( 0.008, 0.0);
    v_blurTexCoords[ 9] = v_texCoord + vec2( 0.012, 0.0);
    v_blurTexCoords[10] = v_texCoord + vec2( 0.016, 0.0);
    v_blurTexCoords[11] = v_texCoord + vec2( 0.020, 0.0);
    v_blurTexCoords[12] = v_texCoord + vec2( 0.024, 0.0);
    v_blurTexCoords[13] = v_texCoord + vec2( 0.028, 0.0);
}
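
For completeness, the matching fragment shader then only reads the precomputed varyings, so no texture coordinate is ever touched in the fragment stage. Sketch only; the weights below are example Gaussian weights (roughly normalized), not taken from a specific source:

/* HBlurFragmentShader.glsl */
precision mediump float;

uniform sampler2D s_texture;

varying vec2 v_texCoord;
varying vec2 v_blurTexCoords[14];

void main()
{
    gl_FragColor = vec4(0.0);
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 0]) * 0.0044;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 1]) * 0.0090;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 2]) * 0.0216;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 3]) * 0.0444;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 4]) * 0.0777;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 5]) * 0.1159;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 6]) * 0.1473;
    gl_FragColor += texture2D(s_texture, v_texCoord)          * 0.1596;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 7]) * 0.1473;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 8]) * 0.1159;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[ 9]) * 0.0777;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[10]) * 0.0444;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[11]) * 0.0216;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[12]) * 0.0090;
    gl_FragColor += texture2D(s_texture, v_blurTexCoords[13]) * 0.0044;
}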

This is false, at least for AMD hardware. The problem is that interpolating vertex attributes for each pixel is done by specialized hardware, and with that many vertex attributes that need to be interpolated you run into a bottleneck there instead. High-end hardware has no problem with this, but low-end hardware can hit a huge bottleneck here. I compared these two shaders against each other:

Interpolated texture coordinates: http://www.java-gaming.org/?action=pastebin&id=582
Calculate coordinates per pixel: http://www.java-gaming.org/?action=pastebin&id=583

Using AMD's ShaderAnalyzer I checked the (theoretical) performance of those two shaders. On newer high-end and mid-range cards the performance was the same, since they're bottlenecked by the texture fetches, but on all low-end and most older cards performance was much worse for the interpolated version.

Throughput(Bi), in MPixels/sec:

Name              Interpolated   Calculated
Radeon HD 2400             200          160
Radeon HD 2600             200          213
Radeon HD 2900             791          791
Radeon HD 3870             827          827
Radeon HD 4550             300          320
Radeon HD 4670             750          800
Radeon HD 4770            1500         1600
Radeon HD 4870            1500         2000
Radeon HD 4890            1700         2267
Radeon HD 5450             179          306
Radeon HD 5670            1033         1033
Radeon HD 5770            2267         2267
Radeon HD 5870            2267         2267
Radeon HD 6450             828         1412
Radeon HD 6670            2560         2560
Radeon HD 6870            1680         1680
Radeon HD 6970            2816         2816

With the exception of the HD 2400, calculating texture coordinates is always equally fast or faster.

Please remember that this was a performance trick in the context of mobile GPUs, as described in the link posted earlier in this thread:

Ouch, I missed that. I was actually going to note that I had read about a similar trick for blurring on mobile GPUs, but I decided not to in the end. xD

Ah, thanks for the info :).
Interesting to see the table; I'm curious how this works on other cards.
Guess I should test more stuff before assuming things are true.

Stuffing the texcoord math into the vertex shader is just a special case for some mobile chips, notably PowerVR. If you don't touch the texture coordinate in any way, the fragment shader unit may prefetch the texels before the shader even runs. Also, this is not just a theoretical performance gain but a battle-tested method.
Simple example: I switched from shadow2DProjEXT to shadow2DProj and gained 1.5 ms of render time on a fullscreen shadow-mapping pass with an iPhone 4S today. This was because the proj and bias variants always cause "dependent" texture reads.
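
The "dependent read" part in a nutshell, as an illustrative GLSL fragment shader (not from my shaders, just showing the two cases side by side):

/* DependentRead.glsl -- illustration only */
precision mediump float;

uniform sampler2D u_texture;
uniform vec2 u_offset;

varying vec2 v_texCoord;        // interpolated, passed through untouched
varying vec2 v_offsetTexCoord;  // v_texCoord + u_offset, computed in the vertex shader

void main()
{
    // Dependent read: the coordinate is modified here in the fragment
    // shader, so the texture unit cannot prefetch the texel.
    vec4 slow = texture2D(u_texture, v_texCoord + u_offset);

    // Non-dependent read: the coordinate arrives straight from a varying,
    // so a chip like PowerVR can fetch the texel before the shader runs.
    vec4 fast = texture2D(u_texture, v_offsetTexCoord);

    gl_FragColor = 0.5 * (slow + fast);
}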