GPUs are getting faster in a way that consumer CPUs can’t match until our day-to-day programs get proper multithreading. CPU performance isn’t improving the way it did 10 years ago. Back then we were still following Moore’s “law”, where the number of transistors per unit area (and usually performance along with it) doubled every two years, giving an exponential increase in performance. Then clock speeds hit somewhere around 2.5-3.0 GHz and the performance increase pretty much stopped. Sure, we’re still getting slightly higher clock speeds and better architectures with sophisticated branch prediction etc., but all in all we’re not seeing the same rate of improvement nowadays. Instead, we’ve switched to having more cores. The reason lies in heat: a CPU’s dynamic power scales roughly with clock frequency times voltage squared, so doubling the clock rate (and raising the voltage to keep the CPU stable) gets you between 4x and 8x as much heat. That’s where multicore solutions come in.
Given that, it’s clear that instead of one fast core, it’s much more efficient to have 4-8 cores running concurrently at half the clock speed. However, most programs and programming languages only use a single core unless you manually split the work up between threads. Games and other consumer programs have traditionally been pretty bad at scaling beyond one or two cores, so CPU makers have been forced to cram as much performance as possible out of each individual core, despite the fact that it’s inefficient to do so.
The first “graphics cards” were simply specialized additional single-core CPUs you could plug into your motherboard, which the (main) CPU could offload 2D graphics tasks to. They weren’t even that fast, usually using less power than the main CPU, but they could give a solid performance boost since the CPU was free to do other things. As graphics got more advanced and started to venture into proper 3D, the manufacturers realized that rasterizing and shading were extremely easy things to parallelize: two pixels can be calculated independently on two different cores. (This was before graphics cards also handled transforming vertices.) Then rendered resolutions increased to the point where the sheer number of pixels made multiple cores even more viable. 15 years ago Nvidia coined the term “GPU” when it released the GeForce 256, which featured a grand total of 4 pixel pipelines. It also featured hardware support for vertex transformations, but this hardware was slower than a decent CPU. After this, the number of pixel shaders and vertex shaders gradually increased, and for Nvidia the number of pixel shaders and vertex shaders eventually reached 24 and 8 respectively in the GeForce 7900 GTX in 2006 (which is a slightly faster version of the PS3’s GPU). At this point, GPUs were becoming so powerful, with so much memory, that new rendering techniques started appearing.
The flexibility of programmable shaders led to a new lighting technique called “deferred shading”. Deferred shading splits lighting up into two passes. In the first pass, the geometry pass, you store the data you’ll need for lighting (diffuse color, normals, shininess…) for each pixel into a huge buffer. In the second pass, the lighting is done by rendering the volume of each light, reading the stored lighting data for each pixel the volume intersects and computing the lighting. The key here is the unbalanced workload. In the first pass, the number of triangles processed was huge, but the pixel shader was essentially just a copy, limited by bandwidth rather than number crunching. In the second pass, the workload flipped: very few triangles, but millions of pixels to light. In essence, only half the GPU was working at any given time. The first pass put a heavy load on the vertex shaders and the second one on the pixel shaders. This led Nvidia and AMD to move to a unified architecture where GPUs had only one type of shader core, able to handle both vertex and pixel shading, instead of two. That allowed the GPU to load-balance between vertex and pixel processing and adapt to the uneven load. Now, I’m not saying that deferred shading was the only reason they made this move, but for games it was probably the biggest one.
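To make the two passes concrete, here’s a toy CPU-side sketch in plain Python. It’s nothing like a real GPU implementation (no rasterization, no light volumes, and the scene data, light direction, and color values are all made up for illustration), but it shows the split: the geometry pass only stores per-pixel inputs into a G-buffer, and the lighting pass reads them back to do the actual shading.

```python
def geometry_pass(scene):
    """Pass 1: store per-pixel lighting inputs (the "G-buffer"). No lighting yet."""
    gbuffer = {}
    for pixel, surface in scene.items():
        gbuffer[pixel] = {
            "albedo": surface["albedo"],  # diffuse color
            "normal": surface["normal"],  # surface normal
        }
    return gbuffer

def lighting_pass(gbuffer, light_dir, light_color):
    """Pass 2: read the G-buffer and compute simple diffuse (Lambert) lighting."""
    image = {}
    for pixel, data in gbuffer.items():
        n = data["normal"]
        # Lambert term: max(0, N . L)
        ndotl = max(0.0, sum(a * b for a, b in zip(n, light_dir)))
        image[pixel] = tuple(a * c * ndotl for a, c in zip(data["albedo"], light_color))
    return image

# A made-up 2-pixel "scene": one surface facing the light, one facing sideways.
scene = {
    (0, 0): {"albedo": (1.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0)},
    (1, 0): {"albedo": (0.0, 1.0, 0.0), "normal": (0.0, 1.0, 0.0)},
}
gbuf = geometry_pass(scene)
image = lighting_pass(gbuf, light_dir=(0.0, 0.0, 1.0), light_color=(1.0, 1.0, 1.0))
```

Notice how the first function does almost no math (it’s pure copying, i.e. bandwidth-bound) while the second does all the arithmetic — exactly the imbalance described above.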
And that’s pretty much where we are today. GPUs still contain a lot of fixed-function hardware, like rasterizers that “fill” the pixels covered by triangles, and raster output units which handle blending and the conversion and writing of the resulting pixel color, but the thousands of cores that GPUs now have are so flexible that they can be used for almost anything that can be parallelized. People are running physics engines, ray tracing, etc. on GPUs nowadays. A simple for-loop where each object is processed independently can easily be run on multiple cores. That’s what’s called an embarrassingly parallel problem: a problem that can be split up into a large number of independent tasks that can be processed in parallel. Like pixel shading.
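As a toy illustration of that kind of loop (in Python on CPU cores rather than on a GPU, with a made-up per-object function), each iteration is independent, so a worker pool can hand the objects out across cores in any order:

```python
from multiprocessing import Pool

def process(obj):
    # Each "object" is processed completely independently of the others,
    # so the iterations can run on different cores in any order.
    return obj * obj

if __name__ == "__main__":
    # The sequential version: results = [process(obj) for obj in range(8)]
    with Pool() as pool:  # defaults to one worker per CPU core
        results = pool.map(pool_func := process, range(8))
    print(results)  # same values as the sequential loop, computed in parallel
```

A GPU takes this idea to the extreme: instead of 4-8 workers, it runs the same small function over thousands of elements (pixels, vertices, particles) at once.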
Well, I hope someone found that interesting. I just enjoy writing this stuff, I guess.