A particle buffer implementation

A few months ago I experimented with CPU particles drawn with OpenGL and tried to squeeze as much performance out of them as possible. I added all sorts of features to speed things up, for example multi-threading support, keeping particle data in “structs” using Riven’s MappedObject, etc. After improving it little by little, I’ve now decided to tackle its biggest problem so far: the inflexibility of having a large buffer full of randomly dying particles.

Basically, the problem is that any particle can die at any moment when its lifetime runs out, leaving a gap in the particle buffer. If I want to draw it all with glDrawArrays, I’d either have to compact the buffer on the CPU or discard dead particles in a geometry shader. With the shader approach, though, I’d have to send a lot more data to OpenGL each frame, not to mention the insane fragmentation that builds up in the buffer. So I decided to keep the particle buffer compacted each frame using the CPU instead.

I’m going to simplify the example a little. Pretend I’m using a Particle[] to store my Particle instances. I keep track of its capacity (length) and its current size (used space), similar to a Buffer instance. If the buffer is full when I try to add a new Particle, I double the capacity, like an ArrayList does.
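In (simplified) code, that buffer looks something like this — the Particle class and its fields are toy stand-ins, not my real code:

```java
import java.util.Arrays;

class GrowableParticleBuffer {
    static class Particle { float x, y, life; } // placeholder fields

    Particle[] particles = new Particle[16]; // capacity = particles.length
    int size = 0;                            // used space

    // Append a particle, doubling the capacity when full, ArrayList-style.
    void add(Particle p) {
        if (size == particles.length)
            particles = Arrays.copyOf(particles, particles.length * 2);
        particles[size++] = p;
    }
}
```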

Now, the real difference is in how I update my particles. Particle updating happens partly in my createParticle() method (!!!). Why? Doing the updating there lets me locate a dying particle and return it instead of just adding a new one to the end. This obviously reduces fragmentation and reuses objects, but only to a point. In an optimal scenario where 100 particles die per frame and 100 particles are created per frame, the buffer stays completely compact, but that is pretty unrealistic for explosions, bursts, etc. Just having random lifetimes pretty much guarantees that the number of dying and created particles will differ each frame.
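A toy version of the idea: scan forward, updating particles, until one dies this frame, then hand its slot back to the caller; otherwise append. The names (life, update(), isDead(), updateIndex) are my made-up stand-ins:

```java
class ParticlePool {
    static class Particle {
        int life;
        Particle(int life) { this.life = life; }
        void update() { life--; }            // one frame of simulation
        boolean isDead() { return life <= 0; }
    }

    Particle[] particles;
    int size;
    int updateIndex = 0; // compact and already updated below this index

    ParticlePool(int[] lives) {
        particles = new Particle[Math.max(lives.length, 16)];
        for (int i = 0; i < lives.length; i++) particles[i] = new Particle(lives[i]);
        size = lives.length;
    }

    // Update particles while searching for one that dies this frame;
    // if one is found, reuse its slot (and object) for the new particle.
    Particle createParticle(int life) {
        while (updateIndex < size) {
            Particle p = particles[updateIndex++];
            p.update();
            if (p.isDead()) {
                p.life = life;  // reuse the dead slot, no gap created
                return p;
            }
        }
        // No dying particle found: append at the end, growing if needed.
        if (size == particles.length)
            particles = java.util.Arrays.copyOf(particles, size * 2);
        Particle p = new Particle(life);
        particles[size++] = p;
        updateIndex = size;     // the fresh particle needs no update this frame
        return p;
    }
}
```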

So after creating all new particles, we have a half-updated Particle array, which is compact up to the last updated Particle. A second method (finishUpdate()) updates the remaining particles. First it simply continues updating until it encounters a particle that dies. Then it keeps updating and counts how many particles die in a row (often just one, but it can be many). Then it keeps updating and counts how many particles in a row do NOT die. That run of still-alive particles is copied down to a lower index with System.arraycopy() to keep the buffer compact. These last two steps repeat until all Particles are updated. Haha, I guess that wasn’t very clear…
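Maybe it’s clearer as code. A self-contained toy of the run-based compaction (Particle and updateIndex are simplified stand-ins; in the real thing updateIndex is wherever createParticle() stopped):

```java
class ParticleCompactor {
    static class Particle {
        int life;
        Particle(int life) { this.life = life; }
        void update() { life--; }            // one frame of simulation
        boolean isDead() { return life <= 0; }
    }

    Particle[] particles;
    int size;
    int updateIndex = 0; // compact and updated up to here

    ParticleCompactor(int[] lives) {
        particles = new Particle[lives.length];
        for (int i = 0; i < lives.length; i++) particles[i] = new Particle(lives[i]);
        size = lives.length;
    }

    void finishUpdate() {
        int read = updateIndex;
        // Phase 1: buffer is still compact — just update until a particle dies.
        while (read < size) {
            Particle p = particles[read];
            p.update();
            if (p.isDead()) break;
            read++;
        }
        int write = read;            // first free (dead) slot
        if (read < size) read++;     // step past the dead particle
        // Phase 2: alternate runs of dying and surviving particles.
        while (read < size) {
            // Update and skip a run of dying particles.
            while (read < size) {
                Particle p = particles[read];
                p.update();
                if (!p.isDead()) break;
                read++;
            }
            // Update and measure the run of surviving particles.
            int runStart = read;
            if (read < size) read++; // particles[runStart] is alive and updated
            while (read < size) {
                Particle p = particles[read];
                p.update();
                if (p.isDead()) break;
                read++;
            }
            // Copy the whole surviving run down in one arraycopy.
            int runLength = read - runStart;
            if (runLength > 0) {
                System.arraycopy(particles, runStart, particles, write, runLength);
                write += runLength;
            }
            if (read < size) read++; // step past the dead particle ending the run
        }
        size = write; // buffer is compact again
    }
}
```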

TL;DR: I basically avoid the System.arraycopy() per removal that would’ve happened if I used ArrayList.remove(), copying each surviving Particle instance at most once (or not at all if it was kept compact by the newly created particles in the first step).

Obviously things are slightly more complicated than I described above. Since I’m using Riven’s MappedObject, I also have to keep track of a ByteBuffer and copy around the data needed by the GPU (each Particle instance is paired with a MappedObject containing only the GPU-relevant data, as that is much faster).

My original particle test simply created a new particle every time one died, which kept the number of particles constant and the buffer compacted. My current test, a firework simulator, creates particles for firework trails/tracers, plus lots of particles when a firework explodes. All particles have very random lifetimes, but I do shoot fireworks at regular intervals, so the amount of reuse is still quite high.

The performance is great, being only marginally slower than my original particle test. The firework test only runs single-threaded at the moment, so I’m comparing it to my old particle test using only one thread too.

  • My old test runs very stable at 72-73 FPS with 510,000 particles.
  • The firework test runs at 69-72 FPS, with particles varying between 500,000 and 525,000.

In short: the performance is excellent compared to my old static test. Oh, did I mention that the fireworks look awesome?

How would you handle a dynamic particle system? Is there a better way? Keeping everything on the graphics card and updating it with draw commands would obviously be a lot faster, but is there any other way of handling the data on the CPU? I thought about not keeping the array compacted and instead generating a list of indices of only the currently alive particles. That would, however, force me to loop over the whole array when updating, even if only a few Particles are alive, which would slow things down when rendering only a few indices. I’d also have to rebuild that index array every frame, which would be pretty slow with many particles. Finally, I’d have to send the whole data buffer to OpenGL each frame instead of only the used part. I’m pretty sure that would be slower…

EDIT: Paint skills!

http://img18.imageshack.us/img18/796/particlehandler.png

Notice how the right-most block isn’t moved twice, only once.

Use glDrawElements instead of glDrawArrays. Create a VBO with indices. You don’t have to move data around in the MappedObject’s ByteBuffer. You rebuild the entire index buffer every frame, which will be blazing fast.
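Building the index buffer is a single linear scan. A sketch of that part (the boolean[] standing in for “is this particle alive” is an assumption; you’d then upload the buffer and call glDrawElements with GL_UNSIGNED_INT indices):

```java
import java.nio.IntBuffer;

class IndexBufferBuilder {
    // Fills 'indices' with the indices of all alive particles and returns
    // the count; the buffer would then be uploaded to an element-array VBO
    // and drawn with glDrawElements(GL_POINTS, count, GL_UNSIGNED_INT, 0).
    static int build(boolean[] alive, IntBuffer indices) {
        indices.clear();
        int count = 0;
        for (int i = 0; i < alive.length; i++) {
            if (alive[i]) {
                indices.put(i); // dead particles are simply never referenced
                count++;
            }
        }
        indices.flip();
        return count;
    }
}
```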

Screenshot! :slight_smile:

Mike

I mentioned that in my first post, but it would force me to iterate over lots and lots of dead particles, and it’s basically already limited by memory/cache performance. I’ll try it out though, just not right now. xD


http://img600.imageshack.us/img600/2568/fireworksd.png

It looks awesome thanks to the HDR and my bloom filter. I also have a twinkling effect on the particles (which is quite slow though; it requires a sin() xD). The FPS is heavily limited by my dead slow bloom filter (around 90 FPS).


http://img41.imageshack.us/img41/9420/fireworks2.png

These fireworks spew out 5,000 particles each, and one is launched each update (at 60 FPS, that’s 60 per second). I’ve disabled the bloom filter (T_T), and as you can see, I have 450,000 particles running at 69 FPS.

Aww. I stopped reading after [quote]How would you handle a dynamic particle system? Is there a better way?
[/quote]
:emo:

Anyway, you have to iterate over them anyway to determine what to clean up/compact. Actually compacting requires a lot of I/O, especially if you have a dead slot at index 0 (and can’t fill it with a new particle). At the rate you’re producing/losing particles, it’s fair to assume you’re touching most of the data, as your particles have randomized lifespans. Have you considered moving the data at the end of the VBO to the beginning, instead of pushing everything back? It might save you a lot of I/O at the expense of more complex calculations.

Haha, don’t worry. My reasons for not using indices are just assumptions. I don’t have any real world data to prove it with. Yet. =D

Would you? Don’t you only need to iterate over currently active particles?

Correct, but I also insert new particles into the buffer before the compacting, so I’m trying to avoid this first-particle-dies case. I also copy them in chunks using System.arraycopy() and ByteBuffer.put(ByteBuffer), which apparently is really fast. Everything is also done in a single pass, so it seems to be quite fast. xD

I only have to iterate over the particles that survived the last update. Basically:

  • Number of updates = the number of alive particles
  • Data copied = between 0 and the number of surviving particles after updating (that’s the worst case; usually much better)

How would putting the data at the end of the VBO help?

Nice idea! Filling the holes with particles from the end of the list instead would indeed mean I only copy a number of particles proportional to how many died, not how many survived.
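For the record, this is the classic swap-remove trick. A sketch using an int[] as a stand-in for the particle data (ordering is lost, but that doesn’t matter for particles):

```java
class SwapRemove {
    // Removes the element at deadIndex by overwriting it with the last
    // element — one copy per death, no matter where the hole is.
    // Returns the new size.
    static int swapRemove(int[] data, int size, int deadIndex) {
        data[deadIndex] = data[size - 1]; // fill the hole from the end
        return size - 1;
    }
}
```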

Well, obviously, but these active particles are spread out over the whole buffer, so I thought I would have to iterate over the whole buffer to find the active ones. I now realize that I can just use the indices in the index buffer to find them. That still leaves me sending the whole data buffer to OpenGL each frame instead of only the active part, plus an index buffer, which would have to use 32-bit ints due to the number of particles…

I’ve been trying to profile my little firework test, but MappedObject’s fork() makes the whole game a black box! T_T I think I’ve found out why I get such good performance though: 98-100% of the particles are already compacted after inserting new particles. The majority of all updates require no copying at all, and when I do need to copy, it’s only about 6,000 out of 255,000 total particles. Basically, the more particles you create, the better the performance becomes. xD If I suddenly stop firing fireworks, the number of copies peaks at 25,000, which isn’t much at all in my opinion.

32-bit index buffers are dog slow. I’d advise not even bothering in that direction.

Well yeah, it ain’t pretty.

I could keep them in batches of 65536 to use short indices. I’ve also heard bad things about 32-bit indices…
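Batching would look roughly like this: each chunk of at most 65536 vertices gets its own short index buffer, with indices local to the chunk, and is drawn with its own glDrawElements call using GL_UNSIGNED_SHORT (the boolean[] alive-flag array is a stand-in again):

```java
import java.nio.ShortBuffer;

class ShortIndexBatcher {
    static final int BATCH = 65536; // max vertices addressable by a short index

    // Writes local (0..65535) indices for the alive particles of one batch,
    // returning the count; each batch would be drawn separately with
    // glDrawElements(..., count, GL_UNSIGNED_SHORT, 0) at its vertex offset.
    static int buildBatch(boolean[] alive, int batchStart, ShortBuffer out) {
        out.clear();
        int end = Math.min(batchStart + BATCH, alive.length);
        int count = 0;
        for (int i = batchStart; i < end; i++) {
            if (alive[i]) {
                out.put((short) (i - batchStart)); // index local to this batch
                count++;
            }
        }
        out.flip();
        return count;
    }
}
```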

In my profiling, batching buffers with short indices was quite a bit faster than using one 32-bit index buffer. It was something like 500k indices IIRC. But this wasn’t tested on new hardware; IIRC the 8800GT was the newest card tested, and no ATi.

Might start work on a new particle engine :slight_smile:

Yeah, I think I might. Hey, when can you post links?

http://i.istockimg.com/file_thumbview_approve/897579/2/stock-illustration-897579-chain-link.jpg

Did someone say links?