Huge particle engine ramble!

I made yet another 2D particle engine!

My first one was a CPU-based particle engine which I worked on for quite some time. I tried most things that could be tried with it and I have 9 working versions of it at the moment. The oldest one can only handle a measly 360 000 particles; it’s plagued by garbage collection overhead, linked particle lists and ByteBuffer loading. The newest one features multithreading for any number of cores, MappedObject particles to avoid an extra data copy, and other particle handling optimizations. When my laptop broke I had it upgraded with another 4 GB of RAM. The exact same old stick is still in it, but a new one with the same timings and speeds has been added, meaning that it can now use dual channels. Heh, ever heard of dual channels making a difference? Well, now you have. This particle engine went from 1 100 000 particles to 1 600 000 particles since it was so memory bottlenecked. Memory bandwidth makes a HUGE difference here, it seems. The advantage of this one is the flexibility of the CPU: we can do collision detection against terrain or animate particles however we want.
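To see why dual-channel RAM mattered so much, here’s a rough back-of-envelope bandwidth estimate. The 24-byte particle size and the three memory touches per frame (update read, update write, render read) are my own illustrative assumptions, not measured numbers from the engine:

```java
// Rough memory bandwidth estimate for a bandwidth-bound CPU particle engine.
// Assumes 24-byte particles touched three times per frame (update read,
// update write, render read) at 60 FPS. Illustrative numbers only.
public class BandwidthEstimate {

    /** Bytes moved per second for the given particle count. */
    static long bytesPerSecond(long particles, int bytesPerParticle,
                               int touchesPerFrame, int fps) {
        return particles * bytesPerParticle * touchesPerFrame * fps;
    }

    public static void main(String[] args) {
        long before = bytesPerSecond(1_100_000L, 24, 3, 60); // single channel
        long after  = bytesPerSecond(1_600_000L, 24, 3, 60); // dual channel
        System.out.printf("single channel: %.1f GB/s%n", before / 1e9); // ~4.8 GB/s
        System.out.printf("dual channel:   %.1f GB/s%n", after / 1e9);  // ~6.9 GB/s
    }
}
```

Both figures are in the same ballpark as what a single DDR3-1333 channel can sustain in practice, which fits the observation that adding the second channel directly raised the particle count.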

The second engine I made moved everything from the CPU to the GPU. All particle data was stored in textures; one shader was used for updating particles and one for rendering them from the data stored in the textures. This was quite a bit faster, though the difference isn’t as big as before my RAM upgrade. It could handle 2 100 000 particles, but it was also the most complicated of the particle engines I’ve made, using 3 shaders, float textures and instancing. It also requires OGL3+ for the 32-bit float textures. It’s awfully complicated, but wins big in performance, especially since it leaves the CPU completely free for whatever else you need to get done.

The third engine was also a GPU implementation, but used OpenCL for updating particles instead. This simplified lots of things, since I could just use a basic VBO and update the data in it instead of having to use textures. It simplified a lot (well, except for the fact that I had to learn OpenCL) and had identical performance to the OpenGL one. This is very interesting in itself: despite the overhead of the shaders and textures, OpenGL had IDENTICAL performance to OpenCL, which is MADE for computing. OpenGL is optimized as hell! This one also requires an OGL3+ card since OpenCL requires it, but at least it’s a bit less complicated (and less insane).

All three earlier engines had one big problem which limited their usability in a real game. They had excellent peak performance, meaning that they performed best when the number of alive particles was close to the number of particles allocated. They all had particles stored all over a MappedObject, some textures or a VBO, meaning that it was difficult to find out which particles were alive each frame, and how many. All allocated particles had to be checked, updated and processed each frame, regardless of whether they were actually alive or not. Of course they earlied out on dead particles, but the fact that they had to be checked at all was a serious performance problem. They also did not preserve the ordering of the particles, since they all had different tactics for avoiding having to find a dead particle to overwrite each time one was generated. These were really bad limitations, but I saw no way of solving them without severely impacting performance, by a factor of 10 or so, until…

Enter Transform Feedback! Transform Feedback allows you to capture vertices before they’re rasterized, allowing you to “render” vertices to a VBO. Their main use is processing expensive vertices (skinning, tessellation, animation) once and then rendering them multiple times, for example to a number of shadow maps and then to the screen, but they have a VERY interesting feature thrown in: They capture after the geometry shader! What this allows you to do is to both generate new vertices and remove vertices in your geometry shader, and they will end up in your VBO in the same order that they were created! I have no idea what kind of black magic they’re using to get it working, but particle engines must have been exactly what they had in mind when they added this!

Transform feedback is ridiculously easy to use and requires only a few lines of setup. Just look at the line count!

  • CPU: 420 lines (pretty messy) + a small multithreading library
  • OpenGL: 466 lines + 7 shaders
  • OpenCL: 350 lines + 1 OpenCL kernel + 3 shaders
  • Transform feedback: 187 lines + 3 shaders

Magic! The particles are updated with a geometry shader, and if they die they are simply discarded; thanks to transform feedback the output buffer is completely consolidated. When all the old surviving particles are done, we just draw the new ones and they’ll end up after the old ones. Transform feedback also has another godsent feature: glDrawTransformFeedback(). This function works like glDrawArrays(), but draws the number of vertices that transform feedback produced without forcing you to read the value back to the CPU (which would stall everything and kill performance). It can’t possibly be easier than this. Draw a new particle to transform feedback and the engine automatically handles it until it dies. I mean, this is it!
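To make the idea concrete, here’s a plain-Java analog of what the geometry shader plus transform feedback are doing each frame. The Particle class and method names are made up for illustration; the real work happens in GLSL on the GPU:

```java
import java.util.ArrayList;
import java.util.List;

// CPU-side sketch of the geometry shader + transform feedback pass:
// survivors are emitted to the output buffer in their original order,
// dead particles are simply not emitted, and freshly spawned particles
// are drawn afterwards so they land after the old ones.
public class FeedbackSketch {

    static class Particle {
        float x, y, vx, vy;
        int life; // frames left to live
        Particle(float x, float y, float vx, float vy, int life) {
            this.x = x; this.y = y; this.vx = vx; this.vy = vy; this.life = life;
        }
    }

    static List<Particle> updateAndCompact(List<Particle> in, List<Particle> spawned) {
        List<Particle> out = new ArrayList<>();
        for (Particle p : in) {
            if (--p.life <= 0) continue; // "discard" in the geometry shader
            p.x += p.vx;                 // position += velocity
            p.y += p.vy;
            out.add(p);                  // captured in emission order
        }
        out.addAll(spawned);             // new particles end up after the old
        return out;
    }
}
```

The key property is that the output list is always densely packed and ordered, so the next frame never has to check dead slots, which is exactly what the earlier three engines couldn’t do.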

Sadly, performance dropped a bit. This one only handles 1 200 000 particles. That’s even less than the CPU implementation! Hopefully there’s a solution though. Transform feedback isn’t very flexible with its output types: only 4-byte ints and floats can be output from the shader, since those are the only two types supported. My earlier GPU implementations used 24-byte particles, storing color in 4 bytes and life+maxLife in 2 shorts. For my transform feedback test I simply made everything into floats, giving me 32-byte particles. That’s 33% more data! The only thing the shader does is pretty much position += velocity, so it was probably memory bottlenecked even before I made the particles bigger. I suspect I can get it around 33% faster, to around 1 600 000 particles, by packing the data more efficiently.
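Since transform feedback can also capture 4-byte ints, one way back down to 24-byte particles is to bit-pack the color and the two lifetime values into one int each (x, y, vx, vy as floats = 16 bytes, plus 2 packed ints = 24 bytes). A sketch of the packing side; the helper names are my own, and the GLSL would unpack with matching bit operations:

```java
// Bit-packing sketch to shrink 32-byte all-float particles back to 24 bytes:
// RGBA color in one int, current life + max life as two 16-bit halves of
// another int. The shader side would unpack these with the inverse shifts.
public class ParticlePacking {

    static int packColor(int r, int g, int b, int a) {
        return (a << 24) | (b << 16) | (g << 8) | r; // one byte per channel
    }

    static int packLife(int life, int maxLife) {
        return (maxLife << 16) | (life & 0xFFFF);    // two unsigned shorts
    }

    static int unpackLife(int packed)    { return packed & 0xFFFF; }
    static int unpackMaxLife(int packed) { return packed >>> 16; }
}
```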

Pros:

  • Keeps particles ordered
  • No alive particles? Virtually no cost then!

Cons:

  • Requires OpenGL 3 (my implementation actually uses the OGL 4 version of transform feedback)
  • Might be too awesome for some people (me)

Now I just need to pack my particle data better and implement a radix sort with transform feedback for sorted 3D particles. =D

Congratz to 99 Medals :stuck_out_tongue:
EDIT: better yet, congratz to 100 Medals :smiley:

Appreciation for advertising OpenGL 3+! Wooh!

Now I’m thinking about the tweet from Tiy (developer at Chucklefish, working on “Starbound”) saying their engine would be able to render 10k particles :smiley:

I can hear a looouuuudd whooshing sound going over my head. I think I need to re-read this a couple times. Nah that won’t work. me goes to keep reading the SuperBible

EDIT: SO…uh…you keep making engines and tools…any games in sight? ;D

Where is the proof?! ;D

Nothing wrong with making tools. Got source for this one? :slight_smile:

Well, for 3D I’d need at least 8 bytes more for the additional dimension, and I’d also need a proper geometry shader to expand the points to quads. In the end it’ll probably end up fragment limited. You’d also probably want texturing, smooth particles (fade them out as they get closer to the ground to prevent a sharp edge) and maybe even lighting, in which case it’ll get even more expensive. There are some tricks you can apply though, like rendering the particles at half or quarter resolution and then upscaling them with a smart filter. The blurriness is very hard to notice for smoke, fire, etc., and it reduces the fillrate needed by 4-16x. However, having 1 million particles in a game isn’t going to leave much CPU/GPU time left over for the actual game, is it? =S

Worth mentioning is that I used GL_POINT_SMOOTH to anti-alias the radius-1 points I have. This basically causes each particle to cover 4 pixels instead of one and also increases the cost per pixel for the coverage calculation. Blending was still left on though. The performance gain from disabling this differs widely between the 4 engines:

  • CPU: CPU limited, so no impact what so ever. Exact same performance.
  • OpenGL and OpenCL: Huge boost. 2 100 000 —> 2 800 000 particles.
  • Transform feedback: Small boost 1 200 000 —> 1 250 000 particles.

This is the cheapest possible particle we can draw, and doesn’t mean much. Each particle covers 1 pixel which it just fills with a single color, so it’s pretty much guaranteed not to be fragment limited at least. Let’s keep GL_POINT_SMOOTH disabled and bump up the point size to 5, meaning each particle covers a 5x5 pixel area:

  • CPU: 1 600 000 —> 1 450 000
  • GL and CL: 2 800 000 —> 1 750 000
  • Transform feedback: 1 250 000 —> 1 150 000

The GL/CL version takes the biggest hit. It had the best vertex throughput, but when the bottleneck shifts to fragments they all approach the same performance. In this case transform feedback wins, since it is the most flexible one: it preserves ordering and runs well with few particles too.
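The shift to a fragment bottleneck is easy to sanity-check with some quick arithmetic: at point size 5 every particle touches a 5x5 pixel block. A small illustration; treating every particle as fully on-screen and unclipped is my simplifying assumption:

```java
// Quick fillrate sanity check: pixels written per second when every
// particle is drawn as a pointSize x pointSize square at a given FPS.
// Assumes all particles are on-screen and none are clipped.
public class FillrateCheck {

    static long pixelsPerSecond(long particles, int pointSize, int fps) {
        return particles * pointSize * pointSize * fps;
    }

    public static void main(String[] args) {
        // 1 750 000 particles at point size 5 and 60 FPS:
        System.out.println(pixelsPerSecond(1_750_000L, 5, 60)); // 2625000000
    }
}
```

That’s over 2.6 billion blended pixels per second, a load in the same region for all four engines, which is why their numbers converge once fragments become the bottleneck.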

In the end I’d say that the CPU and the transform feedback versions are the most feasible in a real game, since they give you the most flexibility. The transform feedback version actually wins in performance since it runs solely on the GPU and leaves the CPU free for other things.

There’s one final problem for the CPU version though: I’m rendering points! A real game would want to render textured quads! For something that was so RAM bandwidth limited, quadrupling the amount of data isn’t a very good idea. Basically, we need a geometry shader (or instancing) to expand the points into quads on the GPU, and suddenly we lose the main advantage of the CPU engine since we need OpenGL 3! I think that might be what’s holding back most in-game particle engines since they just don’t want to lock themselves to OpenGL 3 hardware for some particles. Transform feedback doesn’t have this problem since it requires OpenGL 3 in the first place.

As usual I’m “working on it”. =S In my defence, implementing transform feedback took me only around 2 hours. Yeah, it was that easy.


http://img269.imageshack.us/img269/2053/transformfeedback.png

I could make a tutorial for transform feedback particles since it was so simple (assuming you know how to use shaders). Anyone interested?

(PS: My current avatar represents my reaction when I discovered transform feedback. =S)

EDIT: Oh, I forgot. Computer specs (a laptop!):

CPU: Intel Core i5 2410m dual-core with Hyperthreading boosting to 2.7GHz in the CPU version. Hyperthreading (4 threads instead of 2) gives me a 20% performance boost.
RAM: 2 sticks of 1333 MHz CAS9 RAM, dual channel.
GPU: NVidia GTX 460m

Also, when I say a particle engine “can handle x particles”, I mean that it runs at 60 FPS with that many particles.

Fun fact: A brand new desktop GTX 680 would have 6.3x the shader performance and 3.2x the VRAM bandwidth. 10 million particles might be possible. drooooool

Point is you can’t make a game which flat-out requires OpenGL 3. Not even the latest AAA games do that.
But we talked about this of course ^^

No use writing a book in a language almost nobody speaks.

[quote]51% of the Minecraft user base have computers with graphics cards capable of OpenGL 3.0+.
38.8% of the Minecraft user base have computers with graphics cards capable of OpenGL 3.2+.
34.2% of the Minecraft user base have computers with graphics cards capable of OpenGL 3.3+.
19.6% of the Minecraft user base have computers with graphics cards capable of OpenGL 4.0+.
[/quote]
Yeah I know, I know.
Just if you want to sell it as a game, OpenGL2 support would be better: you would sell MOAR

Well, the question is what the current trend looks like, coupled with a target date for something to be released… not how the picture looks today, last month or 6 months ago.

Okay… I have to correct myself now. I found the tweet from Tiy… he said their engine could handle 10k particles with 0 frame drops (he actually said “performance drop”, but I think he meant frames…). Also, their engine (I’m pretty sure) uses OGL2, or even OGL1.?, and they have textured quads, just like in Terraria.

Another particle engine I’ve heard about (better: seen) is the one in Unreal Engine 4, which is really cool. But since that’s DirectX, and I’m sure they use DX11, it’s harder to compare the two. They actually handle some millions of particles in a fire-and-smoke animation, which cast shadows and are affected by lighting. THAT is crazy:

For Particle stuff, see 2:04 :wink:

acR4n6lJEdQ

I wonder how well it would run on my setup, which has a GTX 580. Would also give a good comparison between the 580 and the 680.

Jesus Unreal is nuts.

Lighting particles isn’t much harder than lighting stuff without deferred shading. It’s pretty inefficient though, so you usually put a maximum of, say, 4 lights or so that can affect each particle system and just calculate lighting and sample the shadow map. Making particles cast shadows isn’t that hard either but you’ll have to sort them by Z for each light which isn’t exactly cheap. What’s reeeaaally tricky is getting particles to cast shadows on other particles. There are ways of doing this for individual particle systems like they do in the UE4 video (Fourier Opacity Mapping for example), but I don’t think that different particle systems can cast shadows on each other. I don’t know, somebody’s probably figured it out. =S

Anyway, particles casting shadows:

  1. Render shadow map of normal geometry.
  2. Render the particles sorted and depth tested against that geometry, with blending to determine how much light they block and store that in a separate texture.
  3. When sampling the shadow map, sample it as usual, but if the fragment passes the depth test, also read the blocking texture and modulate the light by that value.

Particles being shadowed by normal geometry:

  1. Render shadow maps as usual.
  2. Determine which lights affect the particle system and pick out the X ones affecting it the most.
  3. Render the particles by sampling the depth buffer and calculating lighting for each particle.

EDIT: Aghhh!! My free time!!! T___T

Nice little tease for those of us running < 3.1. :frowning:

Can you discuss a bit more about the texture-based particle system? I’m running OGL 2.1 and I’d like to try something like that myself thanks to GL_ARB_texture_float (float textures seem pretty widespread these days).

I suppose it’s a 2D system, and you are using a 4-component texture to store (x, y, vx, vy)?
How do you “write” new values to the texture? Shader + FBO?
How are you rendering the particles? Something like your shader-based tile renderer – a single quad across the screen?
What about blending and overlapping of particle images? And would a geometry shader be moot for rendering, considering the sheer number of particles?

I need to pick your brain.

I am ironman

EDIT: Never leave your phone near your friends…

You gonna get taken home!! :-*

Theoretically it should be possible to implement with OGL 2.1, since FBOs are supported through extensions (DX9 hardware has support for it, it just didn’t make it into OGL core). I did indeed use an RGBA 32-bit float texture. Some OpenGL 2 hardware does not support bilinear filtering of 32-bit float textures, but that’s not a problem since we don’t need it. I used an RGBA32F texture for position and velocity, an RGB8 texture for color and an RG16 texture for current life and max life (alpha = current/max, that’s why I needed both). I’ve heard that OGL2 hardware supports multiple render targets, but all the formats have to have the same number of bytes per pixel. I’d therefore recommend 2x RG32F for position and velocity and one RGBA16, with color packed in RG and life time in BA. All three would be 8 bytes per pixel, so it should work, assuming RG32F is supported by the extension.

For optimally uploading particles you’d need to keep track of which particle “slots” are empty in the textures, meaning that you’d have to update the particles on the CPU too, so I just hacked around this: I kept an index into the texture and simply uploaded new particles starting from that index. Assuming your particles do not have widely varying lifetimes, the chance of a living particle being overwritten is minimal. I just generate a random particle on the CPU, upload it to a VBO and render it to all textures in one pass with an FBO and a pass-through shader that routes everything to the right textures.
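The “keep an index and wrap around” hack can be sketched like this; the class and method names are mine, not from the actual source:

```java
// Sketch of the upload hack: a cursor that walks through the particle
// texture's slots and wraps around, overwriting whatever is there.
// With similar particle lifetimes, the slot being reused is almost
// always dead by the time the cursor comes back around.
public class UploadCursor {
    private final int capacity; // total allocated particle slots
    private int index;

    UploadCursor(int capacity) { this.capacity = capacity; }

    /** Claim the next slot, wrapping around at the end of the texture. */
    int next() {
        int slot = index;
        index = (index + 1) % capacity;
        return slot;
    }

    /** Texel coordinates of a slot in a texture of the given width. */
    static int[] texel(int slot, int textureWidth) {
        return new int[] { slot % textureWidth, slot / textureWidth };
    }
}
```

The trade-off is clear: no CPU-side bookkeeping at all, at the cost of a small chance of stomping a long-lived particle.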

Each particle is rendered as a GL_POINT by reading the particle data from the texture, so it’s very different from my tile renderer.

[quote]And would a geometry shader be moot for rendering, considering the sheer number of particles?
[/quote]
On the contrary, I used a geometry shader to (at least try to) improve performance. We don’t know which particles are alive, so we have to render a vertex for each allocated particle! That is the main drawback of this engine, but using a geometry shader we can at least discard (= not output) particles that are dead. Now that I think about it though, it might be faster to just ditch the geometry shader and cull dead particles by rendering them outside the screen. I was just interested in peak performance, so I never bothered to check.

To show you what I mean:

  • 2 100 000 alive particles, texture size 16384 x 129 = 2 113 536 allocated particles: 62 FPS
  • 0 alive particles, 2 113 536 allocated particles: 87 FPS

The performance depends on the number of allocated particles more than on the number of alive particles.

Blending is easy, I just enabled it. =S It’s worth noting that the particles are not kept in order, so blending which depends on the order of the objects will look very bad. Additive blending will work fine though.

If you want to render textured particles you could use point sprites, though you are still limited to a maximum point size of 64. I’ve also heard that they are buggy in some drivers, but I’ve never personally had a problem with them.

To summarize:

  • It’s possible with OGL 2.1.
  • The particle data textures MIGHT have to have the same total number of bytes per pixel.
  • Don’t use a geometry shader when rendering particles. Instead cull them by rendering them outside the screen.

In the end I strongly recommend against implementing it. It’s complicated to implement and there’s no real performance gain over CPU particles unless you constantly have millions of tiny tiny particles all the time. If your particles are fill-rate limited, this won’t improve performance over CPU particles at all.

Having said that, here’s a link to my test program (shaders included) and here’s the (horrible) source code. It’s missing some utility classes I’ve made, so just use it as a reference.

52 FPS with 3 000 000 blended particles, 67 FPS with unblended ones (3 mio).

I get 49, 77 and 110 (when minimized) for 3 mio on an AMD 5770.