After finally getting some time off from studying for tests (went pretty well xD) I managed to finish my GPU accelerated particle test. It’s the same test as the one I got Ra4king to run. That version was basically optimized as far as possible for CPUs. Multithreading, mapped objects, object pooling and other optimizations I’ve added have made it pretty much as fast as it can get. It doesn’t scale very well across multiple cores though, due to the single-threaded copying of the particle data to the mapped OpenGL buffers. The copy isn’t very expensive, but it stays serial, so the more cores you throw at the updating, the worse the scaling gets. See Amdahl’s law.
Ra4king’s testing showed a 200-250% scaling and 2.7-2.8 million particles at 60 FPS on a hyperthreaded quad core CPU.
My own hyperthreaded dual core gave me about 160% scaling and ~1 million particles at 60 FPS.
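For anyone who hasn’t run into Amdahl’s law, here is a minimal sketch of the formula. The 0.9 parallel fraction in main is a made-up illustration value, not something measured from the test above, and the class and method names are hypothetical.

```java
// Amdahl's law: speedup = 1 / ((1 - p) + p / n)
// p = fraction of the frame that can run in parallel, n = number of workers.
final class AmdahlSketch {
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // 0.9 is an assumed parallel fraction, chosen only for illustration.
        for (int workers = 1; workers <= 8; workers *= 2) {
            System.out.printf("%d workers: %.2fx%n", workers, speedup(0.9, workers));
        }
    }
}
```

The serial part (here, the buffer copy) puts a hard ceiling on the speedup no matter how many cores are added.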
—BORING TECHNICAL EXPLANATION STARTS HERE—
Simply put: I need to avoid copying the data to the GPU each frame. The solution is to keep everything on the GPU and update it there. My implementation keeps 3 textures for this: an RGBA 32-bit float texture for xy position and xy velocity, an RGB 8-bit color texture for particle color, and an RG 16-bit life texture which stores the particle’s total life time and current remaining life. Particle alpha = current life / life time. The 2D textures are made big enough to fit the target number of particles. The theoretical maximum number of particles for a single set of textures is 16384x16384, more than 268 million particles, but performance obviously becomes the limit long before that.
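To make the sizing concrete, here is a rough sketch of picking a texture size for a target particle count. It assumes square textures and takes the 16384 side limit from the numbers above; the class and method names are hypothetical and not from the actual test.

```java
// Sketch: choose a square texture big enough for a target particle count.
final class ParticleTextureLayout {
    // Largest side length considered, matching the 16384x16384 figure in the post.
    static final int MAX_SIDE = 16384;

    // Smallest square side whose texel count is at least targetParticles.
    static int sideFor(long targetParticles) {
        if (targetParticles < 1) {
            throw new IllegalArgumentException("need at least one particle");
        }
        int side = (int) Math.ceil(Math.sqrt((double) targetParticles));
        while ((long) side * side < targetParticles) {
            side++; // guard against floating point rounding
        }
        if (side > MAX_SIDE) {
            throw new IllegalArgumentException("does not fit in one set of textures");
        }
        return side;
    }

    // One texel per particle, so capacity is simply side * side.
    static long capacity(int side) {
        return (long) side * side;
    }

    public static void main(String[] args) {
        System.out.println("side for 3 million particles: " + sideFor(3_000_000L));
        System.out.println("capacity at max size: " + capacity(MAX_SIDE));
    }
}
```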
I add particles by putting all the new particles for a frame into a ByteBuffer and drawing each particle as a point to all the textures using MRT. I simply write particles row by row, column by column, continuing where I left off the last frame. When I have filled the whole texture and have to start at the beginning again, I simply assume that the particle I am about to overwrite has already died. I basically overwrite the oldest, hopefully dead particle each time. This avoids having to find an empty place in the buffer each draw.
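The overwrite-the-oldest scheme above is basically a ring buffer laid over the texture. A minimal CPU-side sketch, assuming row-major order (fill a row left to right, then move to the next row); the class name and the int[]{x, y} return shape are made up for illustration:

```java
// Sketch: hand out texel coordinates for new particles, wrapping around
// to the start of the texture once every texel has been used.
final class ParticleRingCursor {
    private final int width;
    private final int height;
    private long next; // linear index of the next texel to overwrite

    ParticleRingCursor(int width, int height) {
        if (width < 1 || height < 1) {
            throw new IllegalArgumentException("texture must be at least 1x1");
        }
        this.width = width;
        this.height = height;
    }

    // Returns {x, y} for the next new particle and advances the cursor.
    // No search for a free slot: the texel being overwritten is the oldest
    // one and is assumed to hold a dead particle.
    int[] allocate() {
        int x = (int) (next % width);
        int y = (int) (next / width);
        next = (next + 1) % ((long) width * height);
        return new int[] { x, y };
    }

    public static void main(String[] args) {
        ParticleRingCursor cursor = new ParticleRingCursor(2, 2);
        for (int i = 0; i < 5; i++) {
            int[] p = cursor.allocate();
            System.out.println("particle " + i + " goes to texel " + p[0] + "," + p[1]);
        }
    }
}
```

The trade-off is that a long-lived particle can get overwritten early if the texture is too small for the spawn rate.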
The next step is updating the particles. They are updated through a simple shader which is applied to all the pixels (= particles) in the textures. I obviously only do this for particles whose life is over 0. It updates the velocity with gravity, approximated air resistance and screen edge collision bouncing, and then the position based on the velocity. I also reduce the current life by 1. I actually have two position/velocity and two life textures, so that I can ping-pong the updating between them each frame.
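Here is a CPU-side sketch of what one update of a single particle could look like. It is not the actual shader: the gravity and drag constants, the way the bounce is detected, and all names are assumptions for illustration. It also updates in place, whereas the GPU version reads from one texture and writes to the other.

```java
// Sketch of one per-particle update step, mirroring the order described above:
// velocity (gravity, drag, bounce), then position, then life.
final class ParticleUpdateSketch {
    static final float GRAVITY = 0.1f; // assumed value, pixels per frame^2
    static final float DRAG = 0.99f;   // assumed per-frame velocity multiplier

    // state = {x, y, vx, vy}; returns the new remaining life.
    static int step(float[] state, int life, float screenWidth, float screenHeight) {
        if (life <= 0) {
            return life; // dead particles are skipped entirely
        }
        float vx = state[2] * DRAG;
        float vy = (state[3] + GRAVITY) * DRAG;

        // Bounce: flip the velocity if the move would leave the screen.
        float nextX = state[0] + vx;
        if (nextX < 0f || nextX > screenWidth) {
            vx = -vx;
        }
        float nextY = state[1] + vy;
        if (nextY < 0f || nextY > screenHeight) {
            vy = -vy;
        }

        state[0] += vx;
        state[1] += vy;
        state[2] = vx;
        state[3] = vy;
        return life - 1;
    }

    public static void main(String[] args) {
        float[] particle = { 10f, 10f, 1f, 0f };
        int life = step(particle, 5, 100f, 100f);
        System.out.println("x=" + particle[0] + " y=" + particle[1] + " life=" + life);
    }
}
```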
Finally I draw them. I do this using a single glDrawArraysInstanced() call. I have a buffer with integers running from 0 to the particle texture height. I then draw (particle texture width) instances of this data. The final texture coordinate in the shader becomes this:
texPos = ivec2(gl_InstanceID, y); //y is the value from the buffer
In other words: I draw a point for each particle. This texture coordinate is passed to the geometry shader, which first samples the life texture and checks if the particle is alive. If it isn’t, the shader returns, so no more textures are sampled and nothing is drawn. If it is, it samples the position/velocity texture for the position and the color texture for the color, and calculates the alpha from the life time. The fragment shader just writes the particle color.
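The per-point logic can be sketched on the CPU like this. Flat arrays stand in for the three textures (4 floats, 3 color values and 2 life values per texel), which is an assumed layout for illustration only, and the names are hypothetical. The real thing runs in a geometry shader.

```java
// Sketch: what happens for one point, given its instance ID and buffer value y.
final class ParticleEmitSketch {
    // Returns null for a dead particle (nothing drawn),
    // else {x, y, r, g, b, alpha} with colors and alpha in 0..1.
    static float[] emit(int instanceId, int y, int width,
                        float[] posVel, int[] color, int[] life) {
        // Same mapping as texPos = ivec2(gl_InstanceID, y).
        int texel = y * width + instanceId;

        // Life is checked first so dead particles cost only one lookup.
        int lifeTime = life[texel * 2];
        int current = life[texel * 2 + 1];
        if (current <= 0 || lifeTime <= 0) {
            return null;
        }

        float alpha = current / (float) lifeTime; // alpha = current life / life time
        return new float[] {
            posVel[texel * 4], posVel[texel * 4 + 1],
            color[texel * 3] / 255f, color[texel * 3 + 1] / 255f, color[texel * 3 + 2] / 255f,
            alpha
        };
    }

    public static void main(String[] args) {
        // A 2x1 "texture": texel 0 is dead, texel 1 is a half-faded red particle.
        float[] posVel = { 0f, 0f, 0f, 0f, 5f, 6f, 0f, 0f };
        int[] color = { 0, 0, 0, 255, 0, 0 };
        int[] life = { 0, 0, 60, 30 };
        System.out.println(emit(0, 0, 2, posVel, color, life) == null ? "dead" : "alive");
        System.out.println("alpha of texel 1: " + emit(1, 0, 2, posVel, color, life)[5]);
    }
}
```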
This is obviously pretty slow even if I don’t have any alive particles, so it’s important to keep the texture size as low as possible. Simply updating and rendering the empty 1000x1000 particle textures takes about 5 ms.
—BORING TECHNICAL EXPLANATION ENDS HERE—
My performance? 3 million particles at 56 FPS. Yup. A three followed by six zeroes. The catch? I had to disable GL_POINT_SMOOTH. In my CPU based test, antialiasing was free because the test was so CPU limited that any GPU eye candy cost nothing. I don’t think the CPU is really the bottleneck here though. I seem to be fragment limited. Yeah, what the hell? Enabling GL_POINT_SMOOTH makes the points cover 4 pixels instead of just 1 pixel, plus the extra coverage computations. With antialiasing, I get about 2.1 million particles at 60 FPS. I think I’ve basically reached the limit of what my graphics card can possibly ever accomplish. I could optimize my draw calls and my texture layout, but I seriously doubt I can squeeze any more performance out of this. I might be able to cut the constant per-frame cost of updating and drawing all the dead/uncreated particles a lot by developing a better draw algorithm, but that wouldn’t make things any faster when that many particles are actually alive.
Now, if I only had a certain someone to test this on his 3 times as powerful GTX 570, we’d be able to see some real performance.
EDIT: I had some incorrect data, not all particles were alive. Not a big deal though. =P