Hey, guys. I know you’re getting tired of this but I just can’t stop it! ;D
I managed to get 3.7 million particles running at 60 FPS!
http://img820.imageshack.us/img820/3881/particlesa.jpg
This latest version is simply an improvement that fixes a problem with the old transform feedback particle engine: it didn’t work with SLI (multiple GPUs). The driver does not synchronize buffer memory between the GPUs after a transform feedback pass, so trying to render the particles that the other GPU wrote last frame simply did nothing. I solved this by giving each GPU its own particle buffer: since alternate frame rendering only hands each GPU every other frame, each GPU updates its buffer twice per frame it renders (for 2 GPUs, that is) but only renders it once. That way my two GPUs work exclusively with their own feedback buffers and need no driver synchronization. It’s not very efficient of course, since the updating now has to be done twice per GPU, but the rendering of the particles is still only done once per GPU, which pretty much doubles fill-rate. Even with just my smoothed pixel-sized point particles I got a performance increase from 3.0 million to 3.7 million particles, about a 23% increase. Note that this was with an Nvidia GTX 295, which came out in January 2009; high-end at the time but not very spectacular today.
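Here’s a minimal sketch of what the per-frame logic boils down to. Keep in mind this is a simplification of the idea rather than my actual engine code: NUM_GPUS, NUM_PARTICLES, updateProgram, deltaLocation and renderParticles() are placeholder names, buffer/shader creation is omitted, and the particle count is fixed.

```cpp
const int NUM_GPUS = 2;
GLuint particleVBO[NUM_GPUS][2];      // one ping-pong buffer pair per GPU
int     pingPong[NUM_GPUS] = {0, 0};  // current source buffer in each pair
long    frameIndex = 0;

void frame(float dt) {
    // With alternate frame rendering, everything issued this frame runs on
    // GPU (frameIndex % NUM_GPUS), so we only touch that GPU's buffer pair.
    int gpu = (int)(frameIndex % NUM_GPUS);

    // This GPU last ran NUM_GPUS frames ago, so step the simulation
    // NUM_GPUS times to catch up on the frames the other GPU(s) handled.
    glEnable(GL_RASTERIZER_DISCARD);  // update passes produce no fragments
    glUseProgram(updateProgram);
    glUniform1f(deltaLocation, dt);
    for (int i = 0; i < NUM_GPUS; i++) {
        int src = pingPong[gpu], dst = 1 - src;
        glBindBuffer(GL_ARRAY_BUFFER, particleVBO[gpu][src]);
        // ...set up vertex attribute pointers into the source buffer...
        glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, particleVBO[gpu][dst]);
        glBeginTransformFeedback(GL_POINTS);
        glDrawArrays(GL_POINTS, 0, NUM_PARTICLES);
        glEndTransformFeedback();
        pingPong[gpu] = dst;
    }
    glDisable(GL_RASTERIZER_DISCARD);

    // Render only once, from this GPU's freshly updated buffer.
    renderParticles(particleVBO[gpu][pingPong[gpu]]);

    frameIndex++;
}
```

Indexing the buffer pair by frameIndex % NUM_GPUS is what guarantees each pair is only ever read and written on frames owned by the same GPU, so the stale copies on the other GPU never matter.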
The main limitation at the moment is actually memory usage. My transform feedback code is extremely unoptimized when it comes to memory (I could reduce it by 25-30% with relative ease). The real problem, however, is that the driver isn’t smart enough to figure out that each buffer is only ever used by one GPU, so every buffer gets allocated on both GPUs. For 2 GPUs I need 4 full particle buffers, so that each GPU can ping-pong between two of them. 3 700 000 * 36 bytes * 2 * 2 ≈ 508 MB of data… Of course, you don’t need that many particles in a real game, so memory usage will be a much smaller problem there. The high efficiency of this technique, combined with the fact that I got it working at all on SLI/Crossfire systems, still makes it worth using even if you “only” have 100k particles or so.
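For reference, the back-of-envelope math (the 36 bytes reflects my current, unoptimized vertex layout):

```cpp
const long long bytesPerParticle = 36;       // current unoptimized layout
const long long particles        = 3700000;
const long long buffersPerGpu    = 2;        // ping-pong pair
const long long gpus             = 2;
long long total = particles * bytesPerParticle * buffersPerGpu * gpus;
// total = 532,800,000 bytes, i.e. ~508 MB -- and because the driver mirrors
// every buffer on every GPU, each GPU ends up holding that whole amount.
```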
As I wrote above, fill-rate is basically double what it was before, while the cost of updating particles is the same. With 3 million particles at a point size of 1 (4 pixels covered per particle due to smoothing) I “only” got a 23.3% increase in performance (60 --> 74 FPS), but with 100 000 particles at a point size of 43 (1 849 pixels per particle) the increase was 93.3% (60 --> 116 FPS). In other words, fill-rate scales linearly with the number of GPUs, while update performance does not scale at all. My program can handle any number of GPUs (well, up to 4, a limitation of SLI/Crossfire), but memory usage may become a problem on quad-SLI systems. =S
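You can model this split with a simple back-of-envelope formula (my own simplification, not something I measured directly):

```cpp
// Per displayed frame, each GPU pays the full update cost (NUM_GPUS update
// passes, but only on every NUM_GPUS-th frame) and 1/NUM_GPUS of the fill
// cost, so only the fill part scales with the GPU count.
double expectedFrameTimeMs(double updateMs, double fillMs, int gpus) {
    return updateMs + fillMs / gpus;
}
```

Plugging in the 3-million-particle case (16.7 ms --> 13.5 ms with 2 GPUs) suggests roughly 10.4 ms of update and 6.3 ms of fill per frame, which is exactly why the pixel-sized particles gain “only” 23% while the fill-bound 100k case almost doubles.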
It should be possible to optimize this further by doing the update just once on each GPU but with a twice as large delta. This may cause inconsistencies between the GPUs though, since floating point errors would build up over the lifetime of a particle. It might be worth investigating anyway, since particles generally have very short lifetimes. I’d estimate such an implementation could reach at least 5 million particles on my graphics card, since the update cost would then scale perfectly with any number of GPUs.
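In code, that variant would just collapse the catch-up loop from the sketch further up into a single pass with a scaled time step. Again purely hypothetical (runUpdatePass() is a placeholder for one transform feedback pass), and note it is only exactly equivalent if the update shader is linear in the delta:

```cpp
// One big step instead of NUM_GPUS small ones. Anything non-linear in dt
// makes the per-GPU simulations drift apart, on top of float error.
int gpu = (int)(frameIndex % NUM_GPUS);
glUseProgram(updateProgram);
glUniform1f(deltaLocation, dt * NUM_GPUS);         // scaled time step
runUpdatePass(gpu);                                // single feedback pass
renderParticles(particleVBO[gpu][pingPong[gpu]]);
```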