OpenGL multiple shaders nonsense!

What I am asking is not related to how vertex and fragment shaders work, as I already understand them at a basic level. What I am asking is how to reuse them.

I recently had a problem where all of my objects were being drawn with the default orthographic projection. I understand now that I have to apply a projection matrix, and that it is recommended I do this via matrix multiplication in a vertex shader rather than through the fixed-function pipeline.

My problem is getting the projection working with multiple kinds of shader programs. As it stands, I know I have to write a vertex shader which applies vertex transformations and sends vertex attributes on to the fragment shader. While I could apply the same transformations by simply duplicating the transformation code in every vertex shader, it seems like it would be a better idea to make one vertex shader specifically designed for transforming, another vertex shader for passing UV coords in the event that I want to use a texture, and so on…

The tutorials that I’ve been looking at seem to hint at the possibility of attaching more than two shaders to a shader program.

                           Texture Shader Program
UVVertexShader    ->    ProjectionVertexShader    ->    FragmentShader    ->

Is this kind of concatenation possible? I guess my main problem is code modularity when using shaders. I can’t think of a way to divide the work among them, and I certainly don’t want to make a mega-shader program that can do everything but only has certain features enabled when a uniform value is toggled. I’m new to GLSL, so I understand if this is a newbie question.

This is actually a very good question to ask!
It has kept the OpenGL community busy for quite a long time.

Concatenating multiple vertex shaders is sadly not possible.
But there are solutions:

The first and worst possibility for modularity is probably transform feedback, where you let one shader transform your vertices and write them out into a buffer object.
A second shader then sources the already-transformed data from that buffer.
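Roughly, with LWJGL, that two-pass idea could look like the sketch below; the variable names, buffer size and the "tfPosition" output are placeholders I made up, not anything defined by GL:

// Sketch only: let one program transform the vertices and capture them (GL 3.0+).
// Assumes `transformProgram` has a vertex shader writing `out vec4 tfPosition`.
int tfBuffer = GL15.glGenBuffers();
GL15.glBindBuffer(GL30.GL_TRANSFORM_FEEDBACK_BUFFER, tfBuffer);
GL15.glBufferData(GL30.GL_TRANSFORM_FEEDBACK_BUFFER, vertexCount * 4 * 4L, GL15.GL_DYNAMIC_COPY);

// Tell the linker which output to capture, then (re)link the program.
GL30.glTransformFeedbackVaryings(transformProgram, new CharSequence[] { "tfPosition" }, GL30.GL_INTERLEAVED_ATTRIBS);
GL20.glLinkProgram(transformProgram);

// Pass 1: transform only, no rasterization.
GL20.glUseProgram(transformProgram);
GL30.glBindBufferBase(GL30.GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer);
GL11.glEnable(GL30.GL_RASTERIZER_DISCARD);
GL30.glBeginTransformFeedback(GL11.GL_POINTS);
GL11.glDrawArrays(GL11.GL_POINTS, 0, vertexCount);
GL30.glEndTransformFeedback();
GL11.glDisable(GL30.GL_RASTERIZER_DISCARD);

// Pass 2: bind tfBuffer as a normal GL_ARRAY_BUFFER and draw it with the second program.

As you can see, it costs you an extra pass and a buffer round-trip, which is why it is the worst of the options.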

The next best approach is ARB_separate_shader_objects.
It allows you to mix and match shader objects of different stages and programs through “pipelines.”
You can therefore have two vertex shaders, each compiled and linked into its own program, and one fragment shader compiled and linked into yet another program.
Then, whenever you need the first vertex shader, you “bind” its program to the vertex shader stage of a particular pipeline object together with your fragment shader’s program.
The advantage is that you do not have to have one big “uber” shader that contains everything, and you also do not have to build a full shader program for every combination of shader objects and stages you have.
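A rough LWJGL sketch of the mix-and-match, assuming you already have the GLSL sources as strings (the variable names are just placeholders):

// Sketch only: ARB_separate_shader_objects via the GL 4.1 entry points.
// Each source is compiled and linked into its own single-stage program.
int uvVert   = GL41.glCreateShaderProgramv(GL20.GL_VERTEX_SHADER, uvVertexSource);
int projVert = GL41.glCreateShaderProgramv(GL20.GL_VERTEX_SHADER, projectionVertexSource);
int frag     = GL41.glCreateShaderProgramv(GL20.GL_FRAGMENT_SHADER, fragmentSource);

int pipeline = GL41.glGenProgramPipelines();

// Pick one vertex program and one fragment program for the textured objects...
GL41.glUseProgramStages(pipeline, GL41.GL_VERTEX_SHADER_BIT, uvVert);
GL41.glUseProgramStages(pipeline, GL41.GL_FRAGMENT_SHADER_BIT, frag);
GL41.glBindProgramPipeline(pipeline);
// ... draw ...

// ...then swap only the vertex stage, without relinking anything.
GL41.glUseProgramStages(pipeline, GL41.GL_VERTEX_SHADER_BIT, projVert);
// ... draw ...

One thing to watch out for: the in/out interfaces of the stages you combine have to match, because the separately linked programs are only checked against each other when the pipeline is validated and used.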

The next best thing is ARB_shader_subroutine.
It is, however, only available on GL4-capable cards.
It gives you limited support for “function pointers” in GLSL, like the ones you know from C/C++.
You define GLSL functions to be “subroutines” of a given subroutine type.
As you wanted, you can then use uniform variables to select a specific function of that subroutine type at runtime, without rebuilding your program or rebinding your program pipelines.
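To make that concrete, here is a small fragment shader sketch (the function and uniform names are made up):

#version 400 core

uniform sampler2D diffuse;

in  vec2 vUV;
out vec4 fragColor;

// The subroutine "type": any function with this signature can implement it.
subroutine vec4 ColorFunc(vec2 uv);

subroutine(ColorFunc) vec4 colorFromTexture(vec2 uv) { return texture(diffuse, uv); }
subroutine(ColorFunc) vec4 colorPlainWhite(vec2 uv)  { return vec4(1.0); }

// The "function pointer" that the application selects at runtime.
subroutine uniform ColorFunc pickColor;

void main()
{
    fragColor = pickColor(vUV);
}

On the application side you query the implementation’s index once with glGetSubroutineIndex and select it per draw call with glUniformSubroutinesuiv; note that this selection is context state, not program state, so it has to be set again after every glUseProgram.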

Hope that helped!
Cheers,
Kai

The actual well-performing solution is to use uber shaders with #if and #define. Transform feedback is definitely not meant for this. Also, nobody is actually using subroutines either, for performance reasons.

The easiest way, in my opinion, is to just make one big vertex shader which has all the outputs that you need. Depending on which outputs your fragment shader actually uses, the compiler will optimize the vertex shader and remove any unused attributes, uniforms and varyings. For example, say you have a massive vertex shader which does skeleton animation of the vertex, outputs the normal-matrix-multiplied and skeleton-animated normal, tangent and bitangent for normal mapping, plus the view-space position, lighting and so on. If you combine it with a shader program whose fragment shader is empty and only used for depth rendering, the compiler will optimize away everything the fragment shader doesn’t use (= everything but the skeleton-animated gl_Position). The actual vertex shader after compilation will have no trace of normal inputs or outputs.
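A toy illustration of that elimination (the shader interface here is invented, and the skinning is left out for brevity):

// uber.vert (sketch): computes more than every pass needs
#version 330 core
layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec3 inNormal;

uniform mat4 mvp;
uniform mat3 normalMatrix;

out vec3 vNormal;

void main()
{
    vNormal     = normalMatrix * inNormal; // dead code for the depth-only pass
    gl_Position = mvp * vec4(inPosition, 1.0);
}

// depthOnly.frag (sketch): consumes nothing, so the linker can strip vNormal,
// inNormal and normalMatrix out of the linked depth-only program
#version 330 core
void main()
{
}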

On the other hand, if you use separate shader objects or shader subroutines, I don’t think the compiler has the opportunity to recompile the shaders based on which of their inputs and outputs the other shaders actually use. That’s actually the whole point of these extensions: to avoid having to link an entire shader program for each permutation of shaders. It’s true that the flexibility you gain can be worth the possible GPU performance penalty, especially if it lets you save some CPU time by switching shaders more efficiently, but to be honest I haven’t encountered a situation where this was a good trade-off. Since I haven’t tried it that much, I’m not sure whether some drivers could optimize combinations of shaders by relinking and caching the combinations you use with separate shader objects, but probably not. GPUs don’t actually have a stack, and all functions are always inlined (loops are handled using special instructions). That’s why subroutines are such a new feature: actually being able to have virtual/abstract functions is an OpenGL 4 hardware feature, so I’m 100% sure drivers don’t optimize shader subroutines in that way, as it would defeat their purpose in the first place.

There is another way, similar to what theagentd said before. We write our shader loader to handle an [icode]#include[/icode] directive. GLSL doesn’t have this built in, but it’s what I learned from TheBennyBox on YouTube. Once you have implemented it, you can write code like, for example…


#version 330 core

layout(location = 0) in vec3 position;
layout(location = 1) in vec4 color;
layout(location = 2) in vec2 texCoords;

out vec4 vColor;

uniform bool useTexture;

void main()
{
    vColor = color;

    if (useTexture)
    {
        // Pasted in by our loader before glShaderSource is called;
        // this file transforms the vertex and passes the UV coords along.
        #include uvVertexShader.vs
    }
    else
    {
        // This file only transforms the vertex (sets gl_Position).
        #include projectionVertexShader.vs
    }
}

You just separate the code into different files; in the end, this makes editing shaders easier.
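For reference, a minimal loader along those lines could look like this; the [icode]res/shaders[/icode] folder, the class name and the lack of include guards are all just part of the sketch:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ShaderPreprocessor
{
    private static final String INCLUDE_DIRECTIVE = "#include ";

    /** Loads a shader source file and recursively expands our custom #include lines. */
    public static String load(String fileName) throws IOException
    {
        StringBuilder source = new StringBuilder();
        for (String line : Files.readAllLines(Paths.get("res/shaders", fileName)))
        {
            if (line.trim().startsWith(INCLUDE_DIRECTIVE))
            {
                // Paste the referenced file in place of the directive.
                String included = line.trim().substring(INCLUDE_DIRECTIVE.length()).trim();
                source.append(load(included));
            }
            else
            {
                source.append(line).append('\n');
            }
        }
        return source.toString();
    }
}

glShaderSource then just receives the fully expanded string, so the driver never sees the #include lines at all.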

SHC’s solution is actually the easiest option and will be just as fast; that conditional will not be a performance issue, thanks to useTexture being a uniform bool/int (SHC… it should not be a float).

EDIT: Improved clarity.

I forgot that there is a bool type; I’ve modified the code. Thanks for pointing it out, Roi.

Hi pitbuller,

it’s good to hear from someone else who has also used shader subroutines and can say something about their usefulness, since I have very little experience with them.
Could you provide empirical data on whether, when and how much they actually made a performance difference for you?

My finding is that it is never an issue, because the shaders I saw were mostly bandwidth-limited, as they fetched a lot of data through buffers and textures/images for things such as deferred shading.

The next thing that made a huge difference in performance was diverging if/then/else branches within a given thread warp.
It can be - though I am not a hardware engineer or driver developer, so I cannot state it for sure - that shader subroutines are less of an issue because of coherence in the path chosen by a thread warp (i.e. they all take the same path = the same subroutine).

On the other hand, shader subroutines should then also be just as performant as an if/then/else on a uniform.
But I will be most happy to hear about your or everyone else’s findings.

Anyway, as a conclusion: Everyone should try it for themselves, as only that will give them reliable, empirical confirmation. :slight_smile:

Thank you!
Cheers,
Kai

from my experience with subroutines … no, flow-control over uniforms is way slower than using subroutines. optimisations only really kick in when you use preprocessor statements (#ifdef etc.).

practically, don’t worry about it. use as many if-else uniforms as you need for the task. later - if you need the extra oomph - you can still deal with the reflection API to get subroutines to work, which is pretty confusing.

the other interesting thing about subroutines is, when you cascade and chain them you can create super-flexible shader programs which do not look like the flying spaghetti monster. it’s a different way to write shaders altogether.

about the separate shader objects, one nice thing to know: before linking a shader program you can pass it a hint with [icode]GL41.glProgramParameteri(program_id, GL41.GL_PROGRAM_SEPARABLE, GL11.GL_TRUE/GL_FALSE);[/icode], which marks the program as usable with pipelines and enables/disables the optimisations that go with that. my guess is, a non-separable program can be optimised more aggressively.

to give some feedback to the OP: there is nothing like that (shader concatenation) defined by GL. we should not abuse feedback buffers, subroutines or separation to mimic shader orchestration; the best application for them is what they were designed for.

but there’s something else. i guess almost everybody does something similar to that: just before compiling a shader, processing the source code, injecting #define’s and macros, doing regex replacements, “importing” other files, etc. i think that’s a very handy tool to have. you can extend a system like that with the functionality you described; basically, remixing shader code on the client side. using separation or subroutines should be an implementation detail of this.
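as a sketch of the #define injection part (the helper name is made up) - #version has to stay the first line, so the defines get spliced in right after it, before the source goes to glShaderSource:

// sketch: specialise one uber-shader source on the client side before compiling it
public static String injectDefines(String source, String... defines)
{
    // keep everything up to and including the line that holds #version
    int afterVersion = source.indexOf('\n', source.indexOf("#version")) + 1;
    StringBuilder block = new StringBuilder();
    for (String define : defines)
    {
        block.append("#define ").append(define).append('\n');
    }
    return source.substring(0, afterVersion) + block + source.substring(afterVersion);
}

// usage: String depthVariant = injectDefines(uberSource, "DEPTH_ONLY", "MAX_BONES 64");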

o/

Thanks, basil, for your insights!
That goes along the same lines as what I observed.

[s]Divergent branching on uniforms is so heavily slow, but if they converge then it’s like a no-op…[/s]
Hm… actually, the struck-through line above from me is complete nonsense :), as “uniform” means “uniformity” and therefore “equal for all threads.” :slight_smile:
I actually meant branching in general, without uniforms.
Now I am curious as to why branching over uniforms actually is slower than subroutines…

But anyhow, regarding this, there is also one interesting extension:
ARB_shader_group_vote,
which can be used to avoid branch divergence by computing whether all threads of a warp want to take the one path or the other, and then deciding on a single path that all of them take.
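A tiny fragment shader sketch of how that could be used (the shading functions and the distance test are invented):

#version 400 core
#extension GL_ARB_shader_group_vote : require

in  float vDistanceToCamera;
out vec4  fragColor;

uniform float detailRange;

vec4 cheapShading()           { return vec4(0.5); }
vec4 expensiveDetailShading() { return vec4(1.0); } // stand-in for the costly path

void main()
{
    bool wantsDetail = vDistanceToCamera < detailRange;
    if (allInvocationsARB(wantsDetail)) {
        // The whole group agrees, so this branch stays coherent.
        fragColor = expensiveDetailShading();
    } else if (!anyInvocationARB(wantsDetail)) {
        fragColor = cheapShading();
    } else {
        // Mixed group: fall back to the normal, possibly divergent branch.
        fragColor = wantsDetail ? expensiveDetailShading() : cheapShading();
    }
}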

I’m a newbie with shaders since I’ve only just started to play with these, but I’m a little surprised by what I read here!!

I believed we could attach more than one shader of the same type to a program!!!

We can read this in the glAttachShader documentation:

All operations that can be performed on a shader object are valid whether or not the shader object is attached to a program object. It is permissible to attach a shader object to a program object before source code has been loaded into the shader object or before the shader object has been compiled. [b]It is permissible to attach multiple shader objects of the same type because each may contain a portion of the complete shader[/b]. It is also permissible to attach a shader object to more than one program object. If a shader object is deleted while it is attached to a program object, it will be flagged for deletion, and deletion will not occur until glDetachShader is called to detach it from all program objects to which it is attached.

After some research: we can have more than one shader of the same type, but only one “main” method? Is that right?

[quote]After some research: we can have more than one shader of the same type, but only one “main” method? Is that right?
[/quote]
Yes. That is correct.

But let me explain a little bit further. There are three things that you need to separate when talking about shaders.

First: The shader stages.

Currently, there are these stages:

  • Vertex
  • Tessellation Control
  • Tessellation Evaluation
  • Geometry
  • Fragment
  • Compute

These are like the “types” or “kinds” of shaders you can have and each has a different domain (i.e. “things on which they work”).
Shaders in the “vertex” stage work on vertices, as you know, and shaders in the “fragment” stage work on fragments.
So, the stages are the kinds of shaders you can have.

Second: Shader objects.

These are comparable to object files in native programming languages, like C, C++ or Objective-C.
They are compiled from GLSL code and contain the compiled binary code as well as a description of the interface of this particular shader object, which consists of the “symbols” defined, declared and referenced by that object (i.e. declared uniforms, input/output variables, functions).
In GLSL a single shader object can only belong to a single shader stage.
There can, however, be arbitrarily many shader objects for the same shader stage.

Third: Shader programs.

Shader programs are created by linking multiple shader objects together.
This is also comparable to the linking process of native languages. The linking process takes all shader objects, identifies the declared and referenced functions and variables in them, and links them together. That means things with the same name in two different shader objects are identified as being the same.

Here, you could have a shader object A of stage “vertex” declaring and defining a function ‘f’ but not using it. Then you can have another shader object B, also of stage “vertex”, which declares the existence (i.e. “prototype”) of that ‘f’ and calls it.
During linking, the linker detects that the ‘f’ in the first shader object A is actually the same as the ‘f’ within the second object B and links declarations and references to the single definition.

Multiple shader objects of the same shader stage make quite a lot of sense, as they let you modularize and factor out common functions that you need in many of your shaders.
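A small sketch of that (the file names and the transform function are invented): two vertex-stage objects, one defining a shared helper and one declaring its prototype and calling it.

// transformLib.vert (sketch): no main(), just a shared helper definition
#version 330 core
uniform mat4 projection;
uniform mat4 view;
uniform mat4 model;

vec4 transform(vec3 position)
{
    return projection * view * model * vec4(position, 1.0);
}

// main.vert (sketch): declares the prototype and calls it
#version 330 core
layout(location = 0) in vec3 position;

vec4 transform(vec3 position); // defined in the other vertex shader object

void main()
{
    gl_Position = transform(position);
}

Both objects are created with [icode]glCreateShader(GL_VERTEX_SHADER)[/icode], compiled separately, attached to the same program with [icode]glAttachShader[/icode] and then resolved against each other by [icode]glLinkProgram[/icode].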

Can you have more than one main()?

Regarding functions and other symbols, you can only have a single function with the same name throughout all shader objects of the same stage. Therefore, you can also only have a single “main()” function.

Cheers,
Kai

Uniform branching is not free. The conditional itself is usually not the problem, but the register pressure is. The compiler has to allocate registers for the worst-case scenario, and this reduces how many wavefronts can run in parallel. On current-gen consoles this is the main bottleneck. On Nvidia it isn’t that big an issue, though.

Wow, if register allocation is not implemented completely wrongly and suboptimally, then I would say no, as that register is likely to be reused by later (or earlier) code if the same register was also allocated for other variables.
But this is an assumption I am making about the LLVM register allocator!
So, unless the only value in your code that needs a register is your condition, then yes.
But then your simple shader will also not be a performance issue. :slight_smile:

Registers are reused, but the maximum amount matters. If you don’t have the knowledge, why do that kind of guessing? You can actually see the register usage (and a lot of other things) with this tool: http://developer.amd.com/tools-and-sdks/graphics-development/gpu-shaderanalyzer/

Just tested this with a skinning shader. A uniform branch over the skinning code still used all 27 GPRs, but without the skinning code it used only 3 GPRs. This is a real and documented issue.

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/05/GCNPerformanceTweets.pdf
http://bartwronski.files.wordpress.com/2014/03/ac4_gdc_notes.pdf (from page 59.)

Thanks for the link to this great tool! It will be valuable for AMD people.
Looks like for Nvidia folks there is also hope :slight_smile: