I made yet another 2D particle engine!
My first one was a CPU-based particle engine which I worked on for quite some time. I tried most things that could be tried with it, and I now have 9 working versions of it. The oldest one can only handle a measly 360 000 particles. It’s plagued by garbage collection overhead, linked particle lists and ByteBuffer loading. The newest one features multithreading for any number of cores, MappedObject particles to avoid an extra data copy, and other particle handling optimizations. When my laptop broke I had it upgraded with another 4 GB of RAM. The old stick is still in it, but a new one with the same timings and speed has been added, meaning that it can now run in dual-channel mode. Heh, ever heard of dual channel making a difference? Well, now you have. This particle engine went from 1 100 000 particles to 1 600 000 particles since it was so memory bottlenecked. Memory bandwidth makes a HUGE difference here, it seems. The advantage of this one is the flexibility of the CPU: we can do collision detection against terrain or animate particles however we want.
The second engine I made moved everything from the CPU to the GPU. All particle data was stored in textures and one shader was used for updating particles and one for rendering them from the data stored in the textures. This was quite a bit faster, though the difference isn’t as big as before my RAM upgrade. It could handle 2 100 000 particles, but it was also the most complicated of the particle engines I’ve made, using 3 shaders, float textures and instancing. It also requires OGL3+ for the 32-bit float textures. It’s awfully complicated, but wins big in performance, especially since it also leaves the CPU completely free for whatever else you need to get done.
The third engine was also a GPU implementation, but used OpenCL for updating particles instead. This simplified lots of things (well, except for the fact that I had to learn OpenCL), since I could just use a basic VBO and update the data in it instead of having to use textures, and it had identical performance to the OpenGL one. That is very interesting in itself: despite the overhead of the shaders and textures, OpenGL had IDENTICAL performance to OpenCL, which is MADE for computing. OpenGL is optimized as hell! This one also requires an OGL3+ card since OpenCL requires it, but at least it is a bit less complicated (and less insane).
All three earlier engines had one big problem which limited their usability in a real game. They had excellent peak performance, meaning that they performed best when the number of alive particles was close to the number of particles allocated. They all had particles scattered all over a MappedObject, some textures or a VBO, meaning that it was difficult to find out which particles, and how many, were alive each frame. Every allocated particle had to be checked each frame, regardless of whether it was actually alive or not. Of course they exited early for dead particles, but the fact that every slot had to be checked was a serious performance problem. They also did not preserve the ordering of the particles, since they all had different tactics for avoiding having to find a dead particle to overwrite each time one was generated. These were really bad limitations, but I saw no way of solving them without hurting performance by a factor of 10 or so, until…
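To make the problem concrete, here is a minimal CPU-side sketch of a fixed pool, not code from any of the engines above. The class name, the 1D position/velocity layout and the ring-cursor spawn tactic are all assumptions chosen for illustration; the point is only that the update cost follows the pool capacity instead of the number of alive particles.

```java
/** Fixed-capacity particle pool where dead slots are skipped, not removed (illustrative). */
public class FixedPoolParticles {
    private final float[] x, vx; // hypothetical 1D position and velocity
    private final int[] life;    // life <= 0 marks a dead slot
    private int nextSlot;        // ring cursor: an assumed tactic to avoid searching for a dead slot
    public int slotsVisited;     // slots touched by the last update()

    public FixedPoolParticles(int capacity) {
        x = new float[capacity];
        vx = new float[capacity];
        life = new int[capacity];
    }

    /** Overwrites the next slot in ring order, so spawn order and memory order drift apart. */
    public void spawn(float px, float pvx, int plife) {
        x[nextSlot] = px;
        vx[nextSlot] = pvx;
        life[nextSlot] = plife;
        nextSlot = (nextSlot + 1) % life.length;
    }

    /** Advances all live particles and returns how many are still alive afterwards. */
    public int update() {
        int alive = 0;
        slotsVisited = 0;
        for (int i = 0; i < life.length; i++) {
            slotsVisited++;              // every slot is checked, alive or not
            if (life[i] <= 0) continue;  // early-out for dead particles
            x[i] += vx[i];
            if (--life[i] > 0) alive++;
        }
        return alive;
    }

    public static void main(String[] args) {
        FixedPoolParticles pool = new FixedPoolParticles(1000);
        pool.spawn(0f, 1f, 5);
        int alive = pool.update();
        System.out.println(alive + " alive, " + pool.slotsVisited + " slots visited");
    }
}
```

With only a handful of alive particles the loop still walks the whole pool, which is exactly the cost that does not go away with the early-out.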
Enter Transform Feedback! Transform feedback allows you to capture vertices before they’re rasterized, allowing you to “render” vertices to a VBO. Its main use is processing expensive vertices (skinning, tessellation, animation) once and then rendering them multiple times, for example to a number of shadow maps and then to the screen, but it has a VERY interesting feature thrown in: it captures after the geometry shader! What this allows you to do is to both generate new vertices and remove vertices in your geometry shader, and they will end up in your VBO in the same order that they were created! I have no idea what kind of black magic they’re using to get it working, but particle engines must have been exactly what they had in mind when they added this!
Transform feedback is ridiculously easy to use and requires only a few lines of setup. Just look at the line count!
- CPU: 420 lines (pretty messy) + a small multithreading library
- OpenGL: 466 lines + 7 shaders
- OpenCL: 350 lines + 1 OpenCL kernel + 3 shaders
- Transform feedback: 187 lines + 3 shaders
Magic! The particles are updated with a geometry shader, and if they die they are simply discarded. Thanks to transform feedback, the output buffer is completely consolidated. When all old surviving particles are done, we just draw the new ones and they’ll end up after the old ones. Transform feedback also has another godsend feature: glDrawTransformFeedback(). This function works like glDrawArrays(), but draws the number of vertices that transform feedback produced without forcing you to read the value back to the CPU (which would stall everything and kill performance). It can’t possibly be easier than this. Draw a new particle to transform feedback and the engine automatically handles it until it dies. I mean, this is it!
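For anyone who wants to see the idea without a GL context, here is a rough CPU analogue of one update pass. It is only a sketch of the behaviour described above, not the shader code: the class name, the 5-float particle layout and the "life minus one per frame" rule are assumptions. The source buffer plays the role of the input VBO, the destination buffer plays the transform feedback output, and the returned count is what glDrawTransformFeedback() would use without the CPU ever reading it back.

```java
/** CPU analogue of a transform feedback update pass with order-preserving compaction. */
public class CompactingParticles {
    public static final int FLOATS_PER_PARTICLE = 5; // x, y, vx, vy, life (hypothetical layout)

    /**
     * Reads count particles from src, advances each one, and writes only the
     * survivors to dst in their original order. Returns the survivor count.
     */
    public static int update(float[] src, int count, float[] dst) {
        int written = 0;
        for (int i = 0; i < count; i++) {
            int in = i * FLOATS_PER_PARTICLE;
            float life = src[in + 4] - 1f;
            if (life <= 0f) continue; // dead: emit nothing, like a geometry shader that outputs no vertex
            int out = written * FLOATS_PER_PARTICLE;
            dst[out]     = src[in]     + src[in + 2]; // x += vx
            dst[out + 1] = src[in + 1] + src[in + 3]; // y += vy
            dst[out + 2] = src[in + 2];
            dst[out + 3] = src[in + 3];
            dst[out + 4] = life;
            written++;
        }
        return written;
    }

    /** Appends a new particle after the survivors and returns the new count. */
    public static int spawn(float[] dst, int count, float x, float y, float vx, float vy, float life) {
        int out = count * FLOATS_PER_PARTICLE;
        dst[out] = x;
        dst[out + 1] = y;
        dst[out + 2] = vx;
        dst[out + 3] = vy;
        dst[out + 4] = life;
        return count + 1;
    }

    public static void main(String[] args) {
        float[] front = { 0, 0, 1, 0, 3,   5, 5, 0, 1, 1,   9, 9, -1, 0, 2 };
        float[] back = new float[front.length + FLOATS_PER_PARTICLE];
        int alive = update(front, 3, back);         // the middle particle dies here
        alive = spawn(back, alive, 7, 7, 0, 0, 10); // new particle lands after the survivors
        System.out.println(alive + " particles after update and spawn");
    }
}
```

Next frame the two buffers swap roles, so the work done per frame follows the number of alive particles and the oldest particle always stays first.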
Sadly performance dropped a bit. This one only handles 1 200 000 particles. That’s even less than the CPU implementation! Hopefully there’s a solution though. Transform feedback isn’t very flexible with its output types: only 4-byte ints and floats can be output from the shader, since those are the only two types that are supported. My earlier GPU implementations used 24 byte particles and stored color in 4 bytes and life+maxLife in 2 shorts. For my transform feedback test I simply made everything into floats, giving me 32 byte particles! That’s 33% more data! The only thing the shader does is pretty much position += velocity, so it was probably memory bottlenecked even before I made the particles bigger. I suspect that I can make it around 33.3% faster, to around 1 600 000 particles, by packing the data more efficiently.
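The packing I have in mind could look roughly like this on the Java side. Treat it as a sketch under assumptions: the class and method names are made up, the channel order is arbitrary, and the estimate only holds if the engine really is purely bandwidth bound. Each packed value is a plain 4-byte int, which transform feedback can output, and the shader would have to take it apart again with bitwise operations.

```java
/** Packs color and life data into 4-byte ints, and estimates the bandwidth-bound gain. */
public class ParticlePacking {
    /** Packs four 0..255 channels into one int (hypothetical R, G, B, A byte order). */
    public static int packColor(int r, int g, int b, int a) {
        return (r & 0xFF) | ((g & 0xFF) << 8) | ((b & 0xFF) << 16) | ((a & 0xFF) << 24);
    }

    /** Packs life and maxLife as two unsigned 16-bit values in one int. */
    public static int packLife(int life, int maxLife) {
        return (life & 0xFFFF) | ((maxLife & 0xFFFF) << 16);
    }

    public static int red(int packed)     { return packed & 0xFF; }
    public static int alpha(int packed)   { return packed >>> 24; }
    public static int life(int packed)    { return packed & 0xFFFF; }
    public static int maxLife(int packed) { return packed >>> 16; }

    /**
     * Particle count after changing the particle size, assuming throughput is
     * limited only by memory bandwidth (an assumption, not a measurement).
     */
    public static long estimateParticles(long currentCount, int currentBytes, int packedBytes) {
        return currentCount * currentBytes / packedBytes;
    }

    public static void main(String[] args) {
        int color = packColor(255, 128, 0, 200);
        int life = packLife(30, 120);
        System.out.println("alpha=" + alpha(color) + " life=" + life(life) + "/" + maxLife(life));
        System.out.println("estimate: " + estimateParticles(1_200_000L, 32, 24));
    }
}
```

Going from 32 to 24 bytes per particle gives 1 200 000 × 32 / 24 = 1 600 000, which is where the 33.3% figure comes from.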
Pros:
- Keeps particles ordered
- No alive particles? Virtually no cost then!
Cons:
- Requires OpenGL 3 (my implementation actually uses the OGL 4 version of transform feedback)
- Might be too awesome for some people (me)
Now I just need to pack my particle data better and implement a radix sort with transform feedback for sorted 3D particles. =D