Optimizing a Particle System

Hey guys,

I’ve recently ported a simple particle system from C to LWJGL and I’m trying to optimize and extend it a bit.

In C, the author has each particle render itself as a single four-sided GL_POLYGON, using glBegin()/glEnd() and four vertices. This approach obviously results in a lot of JNI overhead in Java. I moved the glBegin()/glEnd() calls outside of the particle’s render method and changed from GL_POLYGON to GL_QUADS (as the red book suggests), but of course that only helps a tiny amount.

What I’m wondering (yes, I’m finally getting to the point :o) is if any of you have any tricks up your sleeve for optimizing sending large numbers of vertices to OpenGL. Since the positions, sizes and colors are dynamic, I know I can’t use display lists. I’ve used vertex arrays elsewhere in my app, but I don’t know if that would help here, since each pass through the code would mean reconstituting the arrays.

Any ideas?

You use, in order of preference:

ARB_vertex_buffer_object
ATI_vertex_buffer_object
NV_vertex_array_range2
NV_vertex_array_range
EXT_compiled_vertex_array

and you use them with vertex arrays, not individual glBegin()/glEnd() calls.

And you never draw GL_POLYGONS, only GL_QUADS, although GL_TRIANGLES are even better, being 3/4 of the work.

Filling your arrays with data every frame is entirely reasonable seeing as in every frame your particles are going to move and quite likely to change at least one aspect of their colour (alpha, I bet). Provided you write them contiguously in memory you’ll be using your cache effectively. And if you’re using AGP RAM courtesy of whichever GL extension you’re using you will be getting the absolute best performance there is.
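A minimal sketch of the per-frame refill Cas describes: one direct `FloatBuffer` written contiguously, interleaving position and colour per vertex. The `Particle` class, its field names, and the 2D layout are my assumptions for illustration, not something from the original code; the flipped buffer is what you’d hand to `glVertexPointer()`/`glColorPointer()`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class ParticleBuffer {
    // Hypothetical particle state; field names are assumptions for this sketch.
    static class Particle {
        float x, y, size;
        float r, g, b, a;
    }

    static final int FLOATS_PER_VERTEX = 6; // x, y, r, g, b, a
    static final int VERTS_PER_QUAD = 4;

    // Refill the interleaved array from scratch each frame, writing
    // contiguously so the CPU cache is used effectively.
    static void fill(Particle[] particles, FloatBuffer buf) {
        buf.clear();
        for (Particle p : particles) {
            // Four corners of an axis-aligned quad around the particle
            putVertex(buf, p.x - p.size, p.y - p.size, p);
            putVertex(buf, p.x + p.size, p.y - p.size, p);
            putVertex(buf, p.x + p.size, p.y + p.size, p);
            putVertex(buf, p.x - p.size, p.y + p.size, p);
        }
        buf.flip(); // ready to hand to the GL pointer calls
    }

    static void putVertex(FloatBuffer buf, float x, float y, Particle p) {
        buf.put(x).put(y).put(p.r).put(p.g).put(p.b).put(p.a);
    }

    public static void main(String[] args) {
        Particle p = new Particle();
        p.x = 10f; p.y = 20f; p.size = 2f;
        p.r = 1f; p.g = 0.5f; p.b = 0f; p.a = 0.8f;

        // Direct, native-order buffer, as LWJGL requires for driver pointers
        FloatBuffer buf = ByteBuffer
                .allocateDirect(FLOATS_PER_VERTEX * VERTS_PER_QUAD * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        fill(new Particle[] { p }, buf);

        System.out.println(buf.limit());  // 24 floats for one quad
        System.out.println(buf.get(0));   // first vertex x = 8.0
        System.out.println(buf.get(1));   // first vertex y = 18.0
    }
}
```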

Cas :slight_smile:

If I understand correctly, the array data gets sent to the card in one shot, whereupon the card may (or may not) store this data locally. Is there then a restriction (or perhaps best practice) on the amount of data I send via Vertex Arrays?

The reason for using something like a Quad in this system is that we are texturing the quad with different images to assist in the particle illusion. Or are you suggesting using two triangles instead of a Quad?

Sorry, I should clarify that only one texture per emitter is used. (my earlier post might have sounded like each particle could have a different texture.)

In general, graphics drivers are optimized for triangles. How triangles are processed and rendered will often be more efficient than the handling of quads. I’ve heard some drivers convert quads to 2 tris anyway, but I don’t know the validity of that.

All consumer cards to my knowledge convert quads into a pair of triangles to render anyway, but this may or may not be done on the server (card) - it doesn’t really matter most of the time.

The trick with triangles though is that you can have your image data stashed in the bottom left of a square texture and then use a triangle to render it. Empty pixels are of course very nearly totally free; you eventually end up billboarding only 3/4 of the coordinates, transferring 3/4 of the data over the bus and the card only has to transform and project 3/4 of the vertices. This is almost always an excellent performance win giving you typically a 25% speed improvement over using quads for particles assuming you aren’t ultimately limited by fillrate.
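A sketch of that triangle trick, under the assumption Cas states: the sprite image sits in the bottom-left quadrant of its texture (u,v in [0, 0.5]) with transparent padding elsewhere, so a single oversized triangle covers the quad footprint with only three vertices. The layout and names here are illustrative, not from the ported code.

```java
public class TriangleParticle {
    // One oversized triangle per particle instead of a quad. The image is
    // assumed to occupy the bottom-left quadrant of the texture; texcoords
    // run 0..1 over a triangle twice the quad's width and height, so the
    // quad footprint [x-s, x+s] maps to u,v in [0, 0.5] (the image region).
    // Layout per vertex: x, y, u, v.
    static float[] triangle(float x, float y, float s) {
        return new float[] {
            x - s,     y - s,     0f, 0f, // bottom-left corner of the quad footprint
            x + 3 * s, y - s,     1f, 0f, // overshoots right; u = 0.5 at the quad's right edge
            x - s,     y + 3 * s, 0f, 1f, // overshoots up; v = 0.5 at the quad's top edge
        };
    }

    public static void main(String[] args) {
        float[] v = triangle(10f, 20f, 2f);
        System.out.println(v.length);          // 12 floats: 3 vertices instead of 4
        System.out.println(v[4] + "," + v[5]); // second vertex at (16.0, 18.0)
    }
}
```

The pixels outside the image quadrant are transparent, so with blending on they cost almost nothing beyond a little fillrate, while you transform and transfer 3/4 of the vertex data.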

Also it’s worth remembering that the data is not transferred in one shot to the card. The call is made but once and the driver will typically set up some DMA and the card will then suck the data straight out of AGP RAM in the background while you do something else. The card is effectively a separate processor running in parallel to the CPU.

Sometimes though that won’t help, and a call to another GL function will stall waiting for whatever DMA operation is in progress to complete. The more sophisticated the card and drivers, however, the more stuff it can queue in the pipeline.

You can do anything you want whilst it’s off rendering things though. Typically you’re interleaving a rendering operation with a buffer calculation and fill operation (eg. rendering the first unlit pass of the background using the GPU whilst at the same time on the CPU you’re calculating lighting normals for lights and filling the next set of buffers up).
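One common shape for that interleaving is double-buffering the vertex data: while the card is (conceptually) still reading one buffer, the CPU fills the other, then they swap. This is only a sketch of the ping-pong bookkeeping; the actual GL calls and any synchronization the extension requires are elided, and all names are my own.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class DoubleBuffered {
    // Two direct buffers: while the card DMAs one, the CPU fills the other.
    final FloatBuffer[] buffers = new FloatBuffer[2];
    int current = 0;

    DoubleBuffered(int floatsPerBuffer) {
        for (int i = 0; i < 2; i++) {
            buffers[i] = ByteBuffer.allocateDirect(floatsPerBuffer * 4)
                    .order(ByteOrder.nativeOrder())
                    .asFloatBuffer();
        }
    }

    // Fill the idle buffer while the previous one is still being read
    // by the card, then swap for the next frame.
    FloatBuffer nextFrame(float[] vertexData) {
        FloatBuffer buf = buffers[current];
        buf.clear();
        buf.put(vertexData);
        buf.flip();
        current = 1 - current; // ping-pong
        return buf; // hand this to the vertex array / VBO draw call
    }

    public static void main(String[] args) {
        DoubleBuffered db = new DoubleBuffered(8);
        db.nextFrame(new float[] { 1f, 2f });
        db.nextFrame(new float[] { 3f, 4f });
        System.out.println(db.buffers[0].get(0)); // frame 1 went to buffer 0
        System.out.println(db.buffers[1].get(0)); // frame 2 went to buffer 1
    }
}
```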

Cas :slight_smile: