HDR Fireworks + Particle Engine benchmark

Cero · November 9, 2011, 1:12pm

yeah no, my game is really fast paced and not a casual/mini game at all; and it might be true, a video card that only supports 1.1 may even be too slow to even run the game at all…
Well not sure, hard to find suitable hardware. The low performance machine I like to test our game on is an Acer Aspire, low RAM, slow CPU, crappy video card which doesn’t vsync and supports, I don’t know, 1.7 or so

on the other hand I don’t think I need anything higher; well I will see if going to 1.4 or 1.5 makes a difference when we introduce light sources
but oh well light sources in 2D are quite simple anyway…

ReBirth · November 9, 2011, 1:25pm

@cero @theagentd
thanks both of you.

because I love my lappy, I tried 1 - 15 - yes and got 32000+ particles on… 27 fps :3

theagentd · November 9, 2011, 1:52pm

After finally getting some time off from studying for tests (went pretty well xD) I managed to finish my GPU accelerated particle test. It’s the same test as the one I got Ra4king to test. That test was basically optimized as far as possible for CPUs. Multithreading, mapped objects, object pooling and other optimizations I’ve added have made it pretty much as fast as possible. It doesn’t scale very well on multiple cores though, due to the single-threaded copying of the particle data to the OpenGL mapped buffers. The copy is not very expensive, but since it stays single-threaded, the more CPUs you throw at the particle updating, the lower scaling you get. See Amdahl’s law.
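For anyone who hasn't run into Amdahl's law, here is a minimal sketch in Python. The 90/10 split between parallel update and single-threaded copy is a made-up number for illustration, not something measured in this test.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the frame time scales with `workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical split: 90% of the frame is the multithreaded particle update,
# 10% is the single-threaded copy into the mapped buffer.
for workers in (1, 2, 4, 8):
    print(workers, round(amdahl_speedup(0.90, workers), 2))
```

With any serial part at all, the speedup flattens out: in this hypothetical split it can never exceed 10x no matter how many threads are added.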

Ra4king’s testing showed a 200-250% scaling and 2.7-2.8 million particles at 60 FPS on a hyperthreaded quad core CPU
My own hyperthreaded dual core gave me about 160% scaling and ~1 million particles at 60 FPS.

—BORING TECHNICAL EXPLANATION STARTS HERE—

Simply put: I need to avoid copying the data to the GPU each frame. The solution is to keep everything on the GPU and update it there. My implementation keeps 3 textures for this: an RGBA 32-bit float texture for xy position and xy velocity, an RGB 8-bit color texture for particle color, and an RG 16-bit life texture which stores the particle’s life time and current remaining life. Particle alpha = current life / life time. The 2D textures are made big enough to fit the target number of particles. The theoretical maximum number of particles for a single set of textures is 16384x16384, more than 268 million particles, but performance obviously becomes the limit long before that.
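The sizing arithmetic can be sketched like this. It is plain Python rather than the real setup code, the function names are invented for illustration, and the square layout is an assumption (any width x height that fits the target count would do).

```python
import math

# Side length quoted above; the real limit is whatever GL_MAX_TEXTURE_SIZE reports.
MAX_TEXTURE_SIDE = 16384


def texture_side_for(particle_count: int) -> int:
    """Smallest square side whose texel count can hold `particle_count` particles."""
    side = math.isqrt(particle_count)
    if side * side < particle_count:
        side += 1
    if side > MAX_TEXTURE_SIDE:
        raise ValueError("too many particles for a single set of textures")
    return max(side, 1)


def texel_for(index: int, side: int) -> tuple:
    """Map a linear particle index to its (x, y) texel."""
    return (index % side, index // side)


def alpha(current_life: float, life_time: float) -> float:
    """Particle alpha = current life / life time (0 for an unused texel)."""
    return current_life / life_time if life_time > 0 else 0.0


print(MAX_TEXTURE_SIDE * MAX_TEXTURE_SIDE)  # theoretical maximum particle count
print(texture_side_for(1_000_000))          # side needed for one million particles
```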

I add particles by adding all the new particles for a frame into a ByteBuffer and drawing each particle as a point to all the textures using MRT (multiple render targets). I simply add particles row by row, column by column, continuing where I left off the last frame. When I have written particles to the whole texture and have to start at the beginning again, I simply assume that the particle already there has died. I basically overwrite the oldest, hopefully dead particle each time. This avoids the need to find an empty place in the buffer each draw.
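The slot bookkeeping described above is basically a ring buffer over the texels. A minimal sketch in Python, with a hypothetical `SlotCursor` class standing in for whatever the real code uses:

```python
class SlotCursor:
    """Hands out texel slots in order and wraps around, overwriting the oldest particle."""

    def __init__(self, width: int, height: int):
        self.width = width
        self.capacity = width * height
        self.next_index = 0  # where the previous frame left off

    def allocate(self, count: int) -> list:
        """Return (x, y) texel coordinates for `count` new particles."""
        slots = []
        for _ in range(count):
            i = self.next_index
            slots.append((i % self.width, i // self.width))
            # Wrap to the start once the whole texture has been written.
            self.next_index = (i + 1) % self.capacity
        return slots


cursor = SlotCursor(4, 2)
print(cursor.allocate(6))  # fills the first row, then part of the second
print(cursor.allocate(4))  # finishes the second row and wraps to the first
```

Nothing checks whether the overwritten particle is really dead, so the textures need enough room for the spawn rate times the longest life time.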

The next step is updating the particles. They are updated through a simple shader which is applied to all the pixels (= particles) in the textures. I obviously only do this to particles whose life is over 0. It updates the velocity with gravity, approximated air resistance and screen edge collision bouncing, and then position based on velocity. I also reduce the current life by 1. I actually have two position/velocity and life textures, so that I can ping-pong the updating between them each frame.
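A CPU-side sketch of that update step, written in Python instead of the real shader. The constants, the order of the gravity and drag steps, and the exact bounce behaviour are guesses for illustration only; two lists stand in for the two ping-ponged textures.

```python
from dataclasses import dataclass

# Made-up illustration values, not taken from the actual shader.
GRAVITY = 0.1                 # added to y velocity each frame (y grows downward here)
DRAG = 0.99                   # crude air resistance: velocity scale per frame
WIDTH, HEIGHT = 800.0, 600.0  # hypothetical screen size


@dataclass(frozen=True)
class Particle:
    x: float
    y: float
    vx: float
    vy: float
    life: int


def step(p: Particle) -> Particle:
    """Per-texel update: velocity, bounce, position, then life minus one."""
    if p.life <= 0:
        return p  # dead or unused texel: nothing to update
    vx = p.vx * DRAG
    vy = (p.vy + GRAVITY) * DRAG
    x = p.x + vx
    y = p.y + vy
    if x < 0.0:
        x, vx = -x, -vx
    elif x > WIDTH:
        x, vx = 2.0 * WIDTH - x, -vx
    if y < 0.0:
        y, vy = -y, -vy
    elif y > HEIGHT:
        y, vy = 2.0 * HEIGHT - y, -vy
    return Particle(x, y, vx, vy, p.life - 1)


def run(frames: int, particles: list) -> list:
    """Ping-pong: read from one buffer, write to the other, swap every frame."""
    front, back = list(particles), list(particles)
    for _ in range(frames):
        for i, p in enumerate(front):
            back[i] = step(p)
        front, back = back, front
    return front


print(run(3, [Particle(10.0, 10.0, 1.0, 0.0, 2)]))
```

The swap matters on the GPU because a shader cannot safely read the same texture it is rendering into.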

Finally I draw them. I do this using a single glDrawArraysInstanced() call. I have a buffer with integers running from 0 to the particle texture height. I then draw (particle texture width) instances of this data. The final texture coordinate in the shader becomes this:


texPos = ivec2(gl_InstanceID, y); //y is the value from the buffer

In other words: I draw a point for each particle. This texture coordinate is passed to the geometry shader, which first samples the life texture and checks if the particle is alive. If it isn’t, it returns and no more textures are sampled and nothing is drawn. If it is, it samples the position/velocity texture for the position, the color texture for the color and calculates the alpha from the life time. The fragment shader just writes the particle color.
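To make the instance/vertex mapping concrete, here is a small Python emulation of the draw. It is not shader code: the nested list stands in for the life texture, and the function name is invented for the example.

```python
def visible_particles(life, width, height):
    """Emulate the instanced draw: one instance per column, one vertex per row.

    `life[y][x]` holds (current_life, life_time) for the texel at (x, y).
    """
    emitted = []
    for instance_id in range(width):      # plays the role of gl_InstanceID
        for y in range(height):           # the integer read from the vertex buffer
            current, total = life[y][instance_id]
            if current <= 0:
                continue                  # dead: nothing else is sampled or drawn
            alpha = current / total
            emitted.append(((instance_id, y), alpha))
    return emitted


life_texture = [
    [(0, 10), (5, 10)],   # row y = 0
    [(10, 10), (0, 10)],  # row y = 1
]
print(visible_particles(life_texture, 2, 2))
```

Every texel is still visited once per frame, which is why an almost empty texture has a fixed cost.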

This is obviously pretty slow even if I don’t have any alive particles, so it’s important to keep the texture size as low as possible. Simply updating and rendering the empty 1000x1000 particle textures takes about 5 ms.

—BORING TECHNICAL EXPLANATION ENDS HERE—

My performance? 3 million particles at 56 FPS. Yup. A three followed by six zeroes. The catch? I had to disable GL_POINT_SMOOTH. For my CPU based test, all antialiasing was free due to the fact that it was so very CPU limited that any GPU eye candy was free. I don’t think that’s really the problem here though. I seem to be fragment limited. Yeah, what the hell? Enabling GL_POINT_SMOOTH makes the points cover 4 pixels instead of just 1 pixel, plus the extra coverage computations. With antialiasing, I get about 2.1 million particles at 60 FPS. I think I’ve basically reached the limit of what my graphics card can possibly ever accomplish. I could optimize my draw calls and my texture layout, but I seriously doubt I can crank any more performance out of this. I might be able to improve the constant FPS cost of simply updating and drawing all the dead/uncreated particles a lot by developing a better draw algorithm, but the performance wouldn’t be better when you actually have that many particles.
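For a rough feel of the throughput, the quoted figures can be turned into points and fragments per second. This is back-of-the-envelope Python that takes the "1 pixel versus 4 pixels per point" figure at face value and ignores the update pass and the coverage computations.

```python
def per_second(particles: int, fps: int, pixels_per_point: int) -> tuple:
    """Return (points drawn per second, fragments written per second)."""
    points = particles * fps
    return points, points * pixels_per_point


plain = per_second(3_000_000, 56, 1)      # GL_POINT_SMOOTH disabled
smoothed = per_second(2_100_000, 60, 4)   # GL_POINT_SMOOTH enabled
print("plain:   ", plain)
print("smoothed:", smoothed)
```

The numbers are only as good as the pixels-per-point assumption, so treat them as an order-of-magnitude estimate.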

Now, if I only had a certain someone to test this on his 3 times as powerful GTX 570, we’d be able to see some real performance.

EDIT: I had some incorrect data, not all particles were alive. Not a big deal though. =P

ra4king · November 9, 2011, 4:59pm

Ooooh me me! Pick me!!

theagentd · November 12, 2011, 5:50am

I guess this will be my final post in this thread unless anyone has any questions.

Ra4king tried the little bastard of a program out and got 5 million smoothed particles dancing over his screen. I could very easily port the GPU accelerated particle engine to this firework test, but it would be rather useless as the particle engine would compete with the bloom filter over the GPU.

To just briefly dive into the OpenGL 1.1 vs OpenGL 3 discussion again: This would be completely impossible to do without OpenGL 3.0. Framebuffer objects, float textures (for particle positions and velocities) and geometry shaders are all needed, but one of the most important things needed is instancing. Using instancing, every single one of those five million particles is drawn using a single draw call. This reduces the CPU cost of drawing everything to basically nothing. For creating new particles, updating and drawing, I’m doing far under 100 draw calls.

Well, if I ever want to learn OpenCL, I guess I know what to do…

EgonOlsen · November 15, 2011, 1:33pm

Doesn’t look right on a Radeon HD 5770 on Windows 7 64bit using the 11.9 drivers either after pressing the mouse button. It’s not 100% b/w but only the rockets look normal. The explosions are pretty much b/w pixels like in my screen shot except that they get some subtle coloring when fading out. But at least it’s possible to return to the normal rendering when pressing the mouse button again.

theagentd · November 15, 2011, 6:38pm

There’s not much I can do, actually. I’m stuck on a laptop with an Nvidia card. In Japan. It works on most cards, but not on some Radeon cards. It has to be a driver bug, and since I’m unable to test it myself, the only thing I can actually do is send you the source code for you to experiment with it for yourself. I mean, I’m interested in the results and so on, but I don’t really want to bug test a driver on someone else’s computer over a forum. It’s a little bit too slow… >_>