Hey, guys. I know you’re getting tired of this but I just can’t stop it! ;D
I managed to get 3.7 million particles running at 60 FPS!
http://img820.imageshack.us/img820/3881/particlesa.jpg
This latest version is simply an improvement that fixes a problem with the old transform feedback particle engine: it didn’t work with SLI (multiple GPUs). The driver does not synchronize buffer memory between the GPUs after a transform feedback pass, so trying to render the particles that the other GPU wrote last frame simply did nothing. I solved this by giving each GPU its own particle buffer: since alternate frame rendering only hands each GPU every other frame, each GPU updates its buffer twice per frame it renders (for 2 GPUs, that is) but only renders it once. That way my two GPUs work exclusively with their own feedback buffers and need no driver synchronization. It’s not very efficient of course, since the updating now has to be done twice per GPU, but the rendering of the particles is still only done once per GPU, which pretty much doubles fill-rate. Even with just my smoothed pixel-sized point particles I got a performance increase from 3.0 million to 3.7 million particles, about a 23% increase. Note that this was with an Nvidia GTX 295, which came out in January 2009; high-end at the time but not very spectacular today.
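Here’s a minimal sketch of what the per-frame logic boils down to. Keep in mind this is a simplification of the idea rather than my actual engine code: NUM_GPUS, NUM_PARTICLES, updateProgram, deltaLocation and renderParticles() are placeholder names, buffer/shader creation is omitted, and the particle count is fixed.

```cpp
const int NUM_GPUS = 2;
GLuint particleVBO[NUM_GPUS][2];      // one ping-pong buffer pair per GPU
int     pingPong[NUM_GPUS] = {0, 0};  // current source buffer in each pair
long    frameIndex = 0;

void frame(float dt) {
    // With alternate frame rendering, everything issued this frame runs on
    // GPU (frameIndex % NUM_GPUS), so we only touch that GPU's buffer pair.
    int gpu = (int)(frameIndex % NUM_GPUS);

    // This GPU last ran NUM_GPUS frames ago, so step the simulation
    // NUM_GPUS times to catch up on the frames the other GPU(s) handled.
    glEnable(GL_RASTERIZER_DISCARD);  // update passes produce no fragments
    glUseProgram(updateProgram);
    glUniform1f(deltaLocation, dt);
    for (int i = 0; i < NUM_GPUS; i++) {
        int src = pingPong[gpu], dst = 1 - src;
        glBindBuffer(GL_ARRAY_BUFFER, particleVBO[gpu][src]);
        // ...set up vertex attribute pointers into the source buffer...
        glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, particleVBO[gpu][dst]);
        glBeginTransformFeedback(GL_POINTS);
        glDrawArrays(GL_POINTS, 0, NUM_PARTICLES);
        glEndTransformFeedback();
        pingPong[gpu] = dst;
    }
    glDisable(GL_RASTERIZER_DISCARD);

    // Render only once, from this GPU's freshly updated buffer.
    renderParticles(particleVBO[gpu][pingPong[gpu]]);

    frameIndex++;
}
```

Indexing the buffer pair by frameIndex % NUM_GPUS is what guarantees each pair is only ever read and written on frames owned by the same GPU, so the stale copies on the other GPU never matter.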
The main limitation at the moment is actually memory usage. My transform feedback code is extremely unoptimized when it comes to memory (I could reduce it by 25-30% with relative ease). The real problem, however, is that the driver isn’t smart enough to figure out that each buffer is only ever used by one GPU, so every buffer gets allocated on both GPUs. For 2 GPUs I need 4 full particle buffers, so that each GPU can ping-pong between two of them. 3 700 000 * 36 bytes * 2 * 2 ≈ 508 MB of data… Of course, you don’t need that many particles in a real game, so memory usage will be a much smaller problem there. The high efficiency of this technique, combined with the fact that I got it working at all on SLI/Crossfire systems, still makes it worth using even if you “only” have 100k particles or so.
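For reference, the back-of-envelope math (the 36 bytes reflects my current, unoptimized vertex layout):

```cpp
const long long bytesPerParticle = 36;       // current unoptimized layout
const long long particles        = 3700000;
const long long buffersPerGpu    = 2;        // ping-pong pair
const long long gpus             = 2;
long long total = particles * bytesPerParticle * buffersPerGpu * gpus;
// total = 532,800,000 bytes, i.e. ~508 MB -- and because the driver mirrors
// every buffer on every GPU, each GPU ends up holding that whole amount.
```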
As I wrote above, fill-rate is basically double what it was before, while the cost of updating particles is the same. With 3 million particles at a point size of 1 (4 pixels covered per particle due to smoothing) I “only” got a 23.3% increase in performance (60 --> 74 FPS), but with 100 000 particles at a point size of 43 (1 849 pixels per particle) the increase was 93.3% (60 --> 116 FPS). In other words, fill-rate scales linearly with the number of GPUs, while update performance does not scale at all. My program can handle any number of GPUs (well, up to 4, a limitation of SLI/Crossfire), but memory usage may become a problem on quad-SLI systems. =S
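You can model this split with a simple back-of-envelope formula (my own simplification, not something I measured directly):

```cpp
// Per displayed frame, each GPU pays the full update cost (NUM_GPUS update
// passes, but only on every NUM_GPUS-th frame) and 1/NUM_GPUS of the fill
// cost, so only the fill part scales with the GPU count.
double expectedFrameTimeMs(double updateMs, double fillMs, int gpus) {
    return updateMs + fillMs / gpus;
}
```

Plugging in the 3-million-particle case (16.7 ms --> 13.5 ms with 2 GPUs) suggests roughly 10.4 ms of update and 6.3 ms of fill per frame, which is exactly why the pixel-sized particles gain “only” 23% while the fill-bound 100k case almost doubles.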
It should be possible to optimize this further by doing the update just once on each GPU but with a twice as large delta. This may cause inconsistencies between the GPUs though, since floating point errors would build up over the lifetime of a particle. It might be worth investigating anyway, since particles generally have very short lifetimes. I’d estimate such an implementation could reach at least 5 million particles on my graphics card, since the update cost would then scale perfectly with any number of GPUs.
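In code, that variant would just collapse the catch-up loop from the sketch further up into a single pass with a scaled time step. Again purely hypothetical (runUpdatePass() is a placeholder for one transform feedback pass), and note it is only exactly equivalent if the update shader is linear in the delta:

```cpp
// One big step instead of NUM_GPUS small ones. Anything non-linear in dt
// makes the per-GPU simulations drift apart, on top of float error.
int gpu = (int)(frameIndex % NUM_GPUS);
glUseProgram(updateProgram);
glUniform1f(deltaLocation, dt * NUM_GPUS);         // scaled time step
runUpdatePass(gpu);                                // single feedback pass
renderParticles(particleVBO[gpu][pingPong[gpu]]);
```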