No, there’s no evidence that any driver does this specific optimization. For example:
gl_Position = projection * view * model * vec4(position, 1.0);
compiles to this: http://www.java-gaming.org/?action=pastebin&id=1448. That’s two complete 4x4 matrix-matrix multiplications followed by a matrix-vector multiplication, and it can be optimized massively into this (monstrous) one-liner:
gl_Position = projection * vec4((view * vec4((model * vec4(position, 1.0)).xyz, 1.0)).xyz, 1.0);
which compiles to just this: http://www.java-gaming.org/?action=pastebin&id=1449
Simply reordering the operations and assuming that the view and model matrices are affine (bottom row (0, 0, 0, 1), so the result’s w is always 1.0, which is what lets you take .xyz and re-append the 1.0) takes this shader from 52 ALU cycles down to 13. In practice that’s not that big a difference, as vertex shaders are often bottlenecked by the number of output parameters, but for simple vertex shaders, like shadow-mapping shaders that only output gl_Position, you should always optimize the shader as much as possible.
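To take that to its extreme for the shadow-mapping case: premultiply the light’s projection * view * model into a single matrix on the CPU (more on that below), and the whole shader collapses to one mat4 * vec4 per vertex. A minimal sketch with made-up names (lightMVP is not from the posts above), written as an LWJGL-style Java string:

String shadowVertexShader =
      "#version 330 core\n"
    + "uniform mat4 lightMVP;  // projection * view * model, premultiplied on the CPU\n"
    + "in vec3 position;\n"
    + "void main() {\n"
    + "    gl_Position = lightMVP * vec4(position, 1.0);\n"
    + "}\n";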
[quote]Yes it does, but it depends and is implementation-specific, i.e. it is up to the vendor. Assuming you’re using indexed primitives, it can cache vertex calculations, i.e. with properly optimized triangle soups you will very rarely calculate a vertex twice. But typically each and every vertex is calculated once, even if some calculations could be taken “out of the vertex loop”. But again it is meh, since cards these days just never seem to hit vertex limits anymore. Matrix calculations in fragment shaders, well, that is a different story.
[/quote]
What you’re talking about is indexed rendering. If you draw a quad using 4 vertices and 6 indices forming two triangles, the vertex shader will only run 4 times (once per vertex, not once per index). This is because the vertex shader’s outputs are stored in a small post-transform cache (meaning a vertex MAY need to be reshaded if it’s referenced again after being evicted from the cache, but not in a trivial case like this), which allows OpenGL to reuse the shaded vertex when assembling the two triangles. It has nothing to do with how the shader is compiled or how fast it runs.
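A sketch of that quad case with LWJGL-style calls (buffer creation and attribute setup omitted; assumes static imports from org.lwjgl.opengl.GL11):

float[] vertices = {
    -1f, -1f,   // 0: bottom-left
     1f, -1f,   // 1: bottom-right
     1f,  1f,   // 2: top-right
    -1f,  1f,   // 3: top-left
};
int[] indices = {
    0, 1, 2,    // first triangle
    2, 3, 0,    // second triangle (shares vertices 0 and 2)
};
// With the vertex and index buffers bound, the vertex shader runs roughly
// 4 times (once per unique vertex), not 6; the shared vertices are pulled
// from the post-transform cache when the second triangle is assembled.
glDrawElements(GL_TRIANGLES, indices.length, GL_UNSIGNED_INT, 0);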
EDIT:
To actually answer the question of whether it’s worth doing this on the CPU or the GPU: it depends.
In a 2D game, you’re usually CPU-limited by the overhead of many small draw calls, sprite sorting, game logic, etc. At the same time, you’re usually waaaaay underutilizing the GPU with super-simple shaders and small texture formats (RGBA8), so anything you can offload to the GPU is usually a win.
For 3D games, you’re usually heavily GPU-limited by the sheer number of pixels and triangles you have to process, heavy post-processing shaders, shadow maps, etc. In this case, precompute everything you can on the CPU. Premultiplying the view and projection matrices can save ~20 instructions per vertex × 1,000,000 vertices of GPU work. It’s all more complicated in reality, as the GPU has lots of individual hardware units that can each bottleneck you in different ways, so there’s no single straightforward answer. In general though, premultiply everything you possibly can on the CPU unless proven otherwise (and it almost never is).
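For reference, a minimal sketch of what that premultiplication can look like on the CPU with the JOML math library (the variable and uniform names are my own, not from the thread):

import org.joml.Matrix4f;

Matrix4f projection = new Matrix4f().perspective((float) Math.toRadians(70.0), 16f / 9f, 0.1f, 100f);
Matrix4f view       = new Matrix4f().lookAt(0f, 2f, 5f,   0f, 0f, 0f,   0f, 1f, 0f);
Matrix4f model      = new Matrix4f().translation(1f, 0f, -2f);

// Two mat4 multiplies per OBJECT on the CPU replace two mat4 multiplies
// per VERTEX on the GPU; the shader then just does mvp * vec4(position, 1.0).
Matrix4f mvp = new Matrix4f(projection).mul(view).mul(model);

// Upload once per object, e.g.
// glUniformMatrix4fv(mvpLocation, false, mvp.get(floatBuffer));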