Particle Optimization

So I have been goofing around with particles for a while. I made a simple system in regular Java 2D and then redid the system in OpenGL using LWJGL.

It’s 2D, not 3D, as the original was only 2D.

I am using the fixed-function pipeline in immediate mode (glBegin/glEnd).

I can get about 15k particles with a steady 60fps. At 30k it drops to 30fps.

I have not tried VBOs because I can find very few detailed tutorials on them that use LWJGL and are not outdated.

I have tried using vertex arrays, and they would be faster if the particles did not move. But because the particles move, you have to rebuild the vertex array every frame
(at least from my understanding).
Even while doing this I get about the same performance as with glBegin/glEnd. If I do not rebuild every frame I get about 40-50k before the fps drops to 30 and lower.

Any tips/ideas for speeding up the system? Links to good tutorials on how to use VBOs or vertex arrays? ???

PS: It already looks great but I just want to speed it up enough that it could run on most computers. ;D

Vertex arrays should be far, far faster than glVertex calls. How are you using them? Can you describe your render loop?

Every frame I create an array of all the particles’ positions. The particles are textured triangle strips because, from what I have heard, they are faster than quads.

This means that each particle has 4 points and each point has 3 coordinates: 4 * 3 = 12 floats per particle.

Once that’s done, I create a float buffer, put the array into it, flip the buffer and call glDrawArrays.

I do not know if making an array of 360k values every frame is what is slowing it down. (30k particles * 12 = 360k)

If I don’t remake the array every frame I get close to double the performance. But, stationary particles are kinda lame. :smiley:
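One thing that may be worth ruling out is the per-frame allocation rather than the rebuild itself. Below is a minimal sketch, assuming plain java.nio and a hypothetical ParticleVertexBuffer class, that allocates one direct buffer up front and only overwrites its contents each frame. The returned buffer would then be handed to the vertex pointer and draw calls, which are not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical helper: one preallocated buffer, refilled every frame. */
final class ParticleVertexBuffer {
    static final int FLOATS_PER_PARTICLE = 4 * 3; // 4 corners, xyz each

    private final FloatBuffer vertices;

    ParticleVertexBuffer(int maxParticles) {
        // Direct, native-order buffer allocated once, not once per frame.
        vertices = ByteBuffer
                .allocateDirect(maxParticles * FLOATS_PER_PARTICLE * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Writes four strip-ordered corners per particle; xy holds centre x, y pairs. */
    FloatBuffer fill(float[] xy, int count, float halfSize) {
        vertices.clear(); // resets position and limit, keeps the memory
        for (int i = 0; i < count; i++) {
            float x = xy[2 * i];
            float y = xy[2 * i + 1];
            put(x - halfSize, y - halfSize);
            put(x + halfSize, y - halfSize);
            put(x - halfSize, y + halfSize);
            put(x + halfSize, y + halfSize);
        }
        vertices.flip(); // limit = floats written, position = 0
        return vertices;
    }

    private void put(float x, float y) {
        vertices.put(x).put(y).put(0f); // z is always 0 in a 2D system
    }
}
```

Whether this helps depends on where the time actually goes, so it is worth profiling before and after.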

You can use point sprites, which means you upload only a single vertex per particle.

You would do it either by using the point sprite extension of OpenGL, which expands each vertex into a screen-aligned quad for you, or by using a vertex/geometry shader to expand the quad/triangle yourself.

With libGDX, using VBOs is dead simple: just create a mesh and it’s done. You want a static VBO and to update the particles in the shader; a single time uniform should be enough. I get about 50k particles on Android with that setup, and it is fill-rate limited. Textured points are fastest, but support is very flaky. Quads are a lot simpler and better for big particles, while triangles scale better with small particles, but the UV coordinates are damn hard to get right and the texture needs some padding to avoid problems.

The vertex shader looks something like this:


    vec4 pos = a_position;
    v_texCoords = a_texCoord;
    // a_time.x = spawn time, a_time.y = 1 / lifetime
    float life = time - a_time.x;
    if (life > 0.0) {
        float lifeLine = 1.0 - (life * a_time.y); // fades from 1 to 0 over the lifetime
        v_color = a_color * lifeLine;
        pos += vec4(a_vel, 0.0, 0.0) * life;      // constant-velocity motion
        pos.y += life * life * -9.81 * 0.5;       // gravity: 0.5 * g * t^2
    }
    gl_Position = u_projectionViewMatrix * pos;
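The point of this shader is that nothing is integrated per frame: the position is a closed-form function of elapsed time, so the vertex data never has to change. Here is a CPU-side mirror of the same formula, as a sketch for checking values. The class and parameter names are hypothetical, and the fade value returned for particles that have not spawned yet is an assumption, since the shader leaves v_color unset in that case.

```java
/** CPU mirror of the shader's closed-form motion; names are hypothetical. */
final class ShaderMotion {
    static final float GRAVITY = -9.81f;

    /** Returns {x, y, fade} for one particle at global time `time`. */
    static float[] evaluate(float x0, float y0, float vx, float vy,
                            float startTime, float inverseLifetime, float time) {
        float life = time - startTime; // seconds since spawn
        if (life <= 0f) {
            // Not spawned yet. Assumption: treat it as fully faded out.
            return new float[] {x0, y0, 0f};
        }
        float fade = 1f - life * inverseLifetime; // 1 at spawn, 0 at end of life
        float x = x0 + vx * life;
        float y = y0 + vy * life + 0.5f * GRAVITY * life * life;
        return new float[] {x, y, fade};
    }
}
```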

Yes, I was going to try point sprites, but everything I have read about them says that they sometimes work and sometimes don’t.

Oh, and I was staying away from shaders because I thought that Android did not support them. I was also staying away from them because they are kinda confusing. ???

And will libGDX work well with LWJGL?

It seems, for me at least, that the problem I am having is a lack of tutorials on how to use VBOs, shaders, and whatnot with Java. I just don’t know enough to be able to convert C++ OpenGL code into LWJGL Java code. :cranky:

Point sprites are very iffy propositions, and you’re wise to steer clear of them. You may find they look fine on one machine, and another one renders them at a fraction of the size you wanted.

Android supports OpenGL ES 2.0, which more or less supports nothing but shaders. GDX also has an LWJGL backend. GDX is a medium-level API, however, and for sprite manipulation you don’t actually have to muck with the shaders and VBOs and whatnot yourself.


	public void render() {
		time += Math.min(0.1f, Gdx.graphics.getDeltaTime());
		if (time > 4) 
			time=0;
		
		Gdx.gl20.glClear(GL20.GL_COLOR_BUFFER_BIT);
		Gdx.gl20.glEnable(GL20.GL_TEXTURE_2D);
		Gdx.gl20.glEnable(GL20.GL_BLEND);
		Gdx.gl20.glBlendFunc(GL20.GL_SRC_ALPHA, GL20.GL_ONE);
		texture.bind(0);
		shader.begin();
			shader.setUniformMatrix("u_projectionViewMatrix", camera.combined);
			shader.setUniformf("time", time);
			particleMesh.render(shader, GL20.GL_TRIANGLES);
		shader.end();
	}

This Java code here is all that you need to render particles with a VBO + shader. The shader code is less than 10 lines. Initializing the particles into the mesh is about 10 lines too. It’s dead simple once you understand the pipeline. The fixed-function pipeline is a lot more confusing.

Just read this http://www.arcsynthesis.org/gltut/ and start full blown shader stuff with libgdx.

If I may tempt you with a screenshot:


http://img849.imageshack.us/img849/322/particleengine.png

1 000 000 particles, multi-threaded, 65 FPS on a laptop i5-2410 at 2.7GHz, and still CPU-limited.

You have a long way to go, young padawan!

Any decent GPU can process 1 000 000 triangles at 60 FPS. The real problem is fill-rate. I draw my particles as simple points. However, once I go over a point size of 4 it starts being GPU limited. This is the same pixel area as a 4x4 quad. Point smoothing further increases this to a 5x5 quad to be able to do its anti-aliasing and increases the cost of each pixel because of the coverage calculations. On top of this we also have blending which further increases the cost of rendering each pixel slightly.

All in all: 5x5 quads = 25 pixels per particle. 25 x 1 000 000 = 25 000 000 pixels to process each frame. For reference, a 1920x1080 monitor has about 2 million pixels, so I’m pretty much filling all the pixels of such a screen 12 times over.

If I turn off point smoothing with a point size of 4 the GPU load decreases a lot, and I can manage a point size of 6 at 68 FPS. 6x6 (square) points x 1 000 000 particles = 36 000 000 pixels per frame. GPUs are awesome! =D
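The arithmetic above is easy to replay for other point sizes. This is a minimal sketch with a hypothetical FillRate class; it treats every particle as a full square and ignores blending cost, so it only gives a rough figure.

```java
/** Back-of-the-envelope fill-rate numbers; the class name is hypothetical. */
final class FillRate {
    /** Pixels touched per frame by `count` square points of `side` pixels. */
    static long pixelsPerFrame(long count, int side) {
        return count * side * side;
    }

    /** How many times a width x height screen would be filled by that many pixels. */
    static double screenFills(long pixels, int width, int height) {
        return (double) pixels / ((long) width * height);
    }

    public static void main(String[] args) {
        long smoothed = pixelsPerFrame(1_000_000L, 5);   // size 4 plus smoothing border
        long unsmoothed = pixelsPerFrame(1_000_000L, 6); // size 6, no smoothing
        System.out.println(smoothed + " pixels, "
                + screenFills(smoothed, 1920, 1080) + " screens of 1080p");
        System.out.println(unsmoothed + " pixels");
    }
}
```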

Particle rendering benefits a lot from OpenGL 3.0 hardware or hardware supporting the extensions needed from OpenGL 3.0. More specifically you can render your particles as points and then expand your points to quads (triangle strips) in a geometry shader.

So far I’ve focused on particle count and how to increase it. Just remember that many optimization attempts become worthless if your particles simply have too large a pixel area. If you have 50x50 smoke particles on the screen, they cover 2 500 pixels each if rendered as a quad. Divide the earlier 36 000 000 pixels per frame by that and we get around 14 400 particles per frame. This is of course a very rough estimate, but no amount of geometry shaders or CPU multi-threading is going to increase performance in this case. The only real optimization that you can do is to use more vertices. Your smoke particle texture is most likely round, so that it does not look like a square texture. By approximating a circle using 16-32 vertices you can reduce the pixel area to something closer to the area of a circle (A = PI * r^2) instead of the area of a square (side^2). The same 50x50 smoke particle can be rendered as a circle with a radius of 25. 50x50 is still 2 500 pixels, but a circle with r = 25 has the area 3.1415 * 25^2, which is equal to ~1963 pixels. Suddenly we can have around 18 000 particles in a single frame! And yes, the number of vertices increased by 16-32x, but remember that we just pushed millions of vertices when we rendered smaller particles! 18 000 particles, each made of 16 triangles, is still only 288 000 triangles. Additionally, these circles can be rendered using instancing, which prevents the CPU/bandwidth nightmare of having to replicate all the particle data for each vertex.
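To make the square-versus-circle comparison concrete, here is a small sketch using the 36 000 000 pixels-per-frame budget measured earlier. The ParticleBudget class is hypothetical, and it uses the ideal circle area, so a 16-vertex polygon would actually cover slightly fewer pixels than this.

```java
/** How many large particles fit in a pixel budget; the class name is hypothetical. */
final class ParticleBudget {
    /** Particles per frame when each is drawn as a side x side quad. */
    static long squareParticles(long pixelBudget, int side) {
        return pixelBudget / ((long) side * side);
    }

    /** Particles per frame when each is drawn as a circle of the given radius. */
    static long circleParticles(long pixelBudget, double radius) {
        return (long) (pixelBudget / (Math.PI * radius * radius));
    }

    public static void main(String[] args) {
        long budget = 36_000_000L;
        System.out.println("quads:   " + squareParticles(budget, 50));
        System.out.println("circles: " + circleParticles(budget, 25.0));
    }
}
```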

TL;DR:

  • It’s possible to render millions of particles per frame at 60 FPS if they are small enough.
  • The modern replacement for point sprites is a geometry shader that expands a point to a quad (a triangle strip with 2 triangles).
  • Larger particles are severely fill-rate limited, so rendering them as approximated circles reduces the pixel area by about 20%.

Yes! Finally! These are the answers I have been looking for. And I know, I still have so much more to learn. 8)

I will probably revamp the whole thing and use shaders.

I understand the fill rate and why 1,000,000 particles is simply ridiculous, but I don’t really want to have 1,000,000; that is crazy. :o
I know that GPUs can do a whole lot more than particles, as in most games the particle system is just one small part of what they are rendering. My GPU only sits at about 10% load or less with my current particle system.

My current system uses textured triangle strips, with some of the textures being fairly large. I also have some physics involved
(not collisions between particles, but things like gravity wells and whatnot).
The physics drops the fps by 1-6 when in use. I posted what I have in the showcase, as it’s pretty fun to play with.

My goal is to make 10k run really smoothly on Android and 50-100k on the desktop. I have a long way to go. :cranky:

But really, 1,000,000? They’re not fancy particles, but still, damn.

Well, there’s more where that came from:

http://img714.imageshack.us/img714/4159/particleengine2.png

This is 2 000 000 blended and smoothed particles at 64 FPS running in only one thread on the exact same computer. I also had Ra4king test this version on his desktop computer and it managed 5 000 000 smoothed particles at 64 FPS, and around 9 000 000 unsmoothed particles at ~60 FPS.

Gasp! How did I do it? =D

Those results were from when I had the GTX 570. Running the GPUTest again, the GTX 580 gets me 9 million smoothed particles and 11 million unsmoothed particles at 60 FPS. ;D

I think my computer would die at like 500k.

Hahahahahahahahahahahahahahahahahaha

Oh yeah!!! I present to you 1,000,000,000 particles with ambient occlusion, Depth of Field, Motion Blur, HDR, AAx32 and UberSamplingx9000!!!

http://img193.imageshack.us/img193/4269/13777756.jpg

What now?!?! >:(
Oh, did you see the FPS and CPU usage? Yeah, I am a boss… :cranky:

Grrr… Your graphics card is about as big as my laptop. -_-’

As you might have suspected, the second version offloads everything except particle generation to the GPU. Particles are stored in a few textures and updated by ping-ponging between two sets of these with a shader. Rendering is done by generating a vertex for each possibly alive particle (= for all texels in the texture) and then checking whether each particle is alive in a geometry shader. If it is, the shader samples the needed data from the textures and passes it through to the fragment shader. It’s a pretty crude attempt at using OpenGL for computing instead of graphics and it has some serious drawbacks because of it, but it’s still ridiculously fast compared to updating the particles on the CPU and then copying them to the GPU each frame. For the 11 million unsmoothed particles Ra4king managed to achieve, that’s a lot of bandwidth saved. The data needed to render my minimal particles is a 2D position (two floats, 8 bytes) and an RGBA color (4 bytes), with alpha being equal to (lifeLeft / lifeTime) of the particle in question. For the CPU version, 11 million 12-byte particles equals 126 MB of data sent to the GPU each frame. At 60 FPS, that’s 7.37 GB per second, which obviously is ridiculous. For the GPU version I just store all the data on the GPU in the first place. I have position and velocity in an RGBA 32-bit float texture (RG = X Y, BA = VX VY), color in a standard RGB texture and the current life and total life time of each particle in an RG16 texture.
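The bandwidth figures above can be reproduced with a few lines. This is only a sketch with a hypothetical UploadBandwidth class; it uses binary units (1 MB = 2^20 bytes), which is what the 126 MB and 7.37 GB numbers correspond to.

```java
/** Upload cost of streaming particles from the CPU; the class name is hypothetical. */
final class UploadBandwidth {
    static long bytesPerFrame(long particles, int bytesPerParticle) {
        return particles * bytesPerParticle;
    }

    static double mebibytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    static double gibibytesPerSecond(long bytesPerFrame, int fps) {
        return bytesPerFrame * (double) fps / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 2 floats of position (8 bytes) + RGBA color (4 bytes) = 12 bytes
        long perFrame = bytesPerFrame(11_000_000L, 12);
        System.out.println(mebibytes(perFrame) + " MB per frame");
        System.out.println(gibibytesPerSecond(perFrame, 60) + " GB per second");
    }
}
```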

Again, this is a huge hack and I just did it for fun. The proper way of doing GPU particles would be to use OpenCL (that’s a C, not a G) to process the particles on the GPU. This would be faster, simpler and more flexible, but most importantly it would adapt a LOT better to a varying particle count. The GPU version has to generate a vertex for each texel in the particle textures, regardless of whether the particle is actually alive or not. With OpenCL I could keep the data on the GPU like I do now, but ping-pong between VBOs instead of textures. Throw in atomic counters and everything becomes so clean that it’s almost scary. There are two problems with all this though:

  • How many particles are actually alive? I could count them using atomic counters on the GPU, but how do I get this back to the CPU to tell the GPU how many particles to render without killing performance? And
  • I haven’t gotten around to actually trying out OpenCL. xD
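For readers who have not met ping-ponging before, here is a CPU-only analogue in Java. It is not OpenCL and not the implementation described above, just a sketch with hypothetical names: particles are read from one array, survivors are written compacted into the other, and the two arrays swap roles. The `out` index plays the role an atomic counter would play on the GPU.

```java
/** CPU analogue of ping-ponging with compaction; all names are hypothetical. */
final class PingPongParticles {
    private static final int STRIDE = 5; // x, y, vx, vy, lifeLeft

    private float[] read;
    private float[] write;
    private int alive;

    PingPongParticles(int capacity) {
        read = new float[capacity * STRIDE];
        write = new float[capacity * STRIDE];
    }

    /** Appends a particle; assumes the capacity is not exceeded. */
    void spawn(float x, float y, float vx, float vy, float life) {
        int o = alive * STRIDE;
        read[o] = x;
        read[o + 1] = y;
        read[o + 2] = vx;
        read[o + 3] = vy;
        read[o + 4] = life;
        alive++;
    }

    /** Advances every particle by dt and returns how many are still alive. */
    int step(float dt) {
        int out = 0; // next free slot in the write buffer
        for (int i = 0; i < alive; i++) {
            int in = i * STRIDE;
            float life = read[in + 4] - dt;
            if (life <= 0f) {
                continue; // dead particles are simply not copied
            }
            int o = out * STRIDE;
            write[o] = read[in] + read[in + 2] * dt;
            write[o + 1] = read[in + 1] + read[in + 3] * dt;
            write[o + 2] = read[in + 2];
            write[o + 3] = read[in + 3];
            write[o + 4] = life;
            out++;
        }
        float[] tmp = read; // swap roles: this frame's output is next frame's input
        read = write;
        write = tmp;
        alive = out;
        return alive;
    }

    float x(int index) {
        return read[index * STRIDE];
    }
}
```

On the CPU the live count is available for free after each step; the first problem in the list above is that on the GPU it is not.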

@StumpyStrust’s latest post
Lol. And sorry for having spent (still am spending) way too much time of my life on optimizing particle engines. xD Also, my version would have had better anti-aliasing since GL_POINT_SMOOTH produces all possible shades = 256xAA, but I forgot your uber-sampling. Also note that 1 billion particles would at a minimum take 16 bytes per particle (2 floats for position, 2 floats for velocity, all have the same color and last forever), equaling almost 15 GBs of data. xDDD

I hate you so much :confused: I get like 100.000 particles running around my window and you just do like 5.000.000… >:(
I should just make you my OpenGL guru I guess :smiley:

I just happen to like optimizing stuff. That’s why I haven’t finished a single game yet. OpenGL (sadly) isn’t going to get you a complete game, just some fancy colors. Wait, that sounds like drugs…

ROFLMAO ;D

So as I said before, in my system there is some physics involved, and when it’s active I get a drop of 4-5 fps. Now this is not much, but I want more complex physics, which would mean a greater drop in fps.

Example: all particles testing each other for collisions. A grid-based system would improve performance, but for things like water and whatnot, grid-based systems would not look very good.
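For reference, this is roughly what a grid-based broad phase could look like. It is a sketch with a hypothetical UniformGrid class, not something from my current system: particles are bucketed by cell, and each particle only tests against the 3x3 block of cells around it instead of against every other particle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Uniform grid for neighbour lookups; the class name and layout are hypothetical. */
final class UniformGrid {
    private final float cellSize;
    private final Map<Long, List<Integer>> cells = new HashMap<>();

    UniformGrid(float cellSize) {
        this.cellSize = cellSize;
    }

    private int cellOf(float v) {
        return (int) Math.floor(v / cellSize); // floor so negatives land in the right cell
    }

    private static long key(int cx, int cy) {
        return ((long) cx << 32) ^ (cy & 0xffffffffL); // pack both cell coords into one long
    }

    void clear() {
        cells.clear(); // call once per frame before re-inserting
    }

    void insert(int id, float x, float y) {
        cells.computeIfAbsent(key(cellOf(x), cellOf(y)), k -> new ArrayList<>()).add(id);
    }

    /** Ids in the 3x3 cells around (x, y). Candidates only: a distance test is still needed. */
    List<Integer> candidates(float x, float y) {
        List<Integer> result = new ArrayList<>();
        int cx = cellOf(x);
        int cy = cellOf(y);
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                List<Integer> cell = cells.get(key(cx + dx, cy + dy));
                if (cell != null) {
                    result.addAll(cell);
                }
            }
        }
        return result;
    }
}
```

The cell size would normally be set to at least the largest interaction radius, so that every possible collision partner is in one of the nine cells.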

One idea I have for improving the performance of the calculations is using bitwise operations. Now, from what I understand, division is the biggest offender when it comes to speed. So would it be a good idea to cast things to ints so you could use bitwise operations on them? You would lose some accuracy, but for some non-real-world physics simulations I don’t think that would be a big problem.

Also, does casting a double/float to an int take long?
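To show what I mean, here is a small sketch with a hypothetical ShiftVsDivide class. One thing it makes visible is that a shift is not a drop-in replacement for division: the two disagree for negative values, because `/` truncates toward zero while `>>` rounds toward negative infinity. I have not measured whether any of this is actually faster, and the JIT may well turn division by a constant power of two into shifts on its own.

```java
/** Integer division versus shifting, and float-to-int casts; names are hypothetical. */
final class ShiftVsDivide {
    static int divideBy16(int v) {
        return v / 16; // truncates toward zero
    }

    static int shiftBy4(int v) {
        return v >> 4; // arithmetic shift: rounds toward negative infinity
    }

    static int truncate(float v) {
        return (int) v; // drops the fractional part, toward zero
    }

    public static void main(String[] args) {
        System.out.println(divideBy16(33) + " " + shiftBy4(33));   // same for positives
        System.out.println(divideBy16(-17) + " " + shiftBy4(-17)); // differ for negatives
        System.out.println(Math.floorDiv(-17, 16));                // matches the shift
        System.out.println(truncate(3.9f) + " " + truncate(-3.9f));
    }
}
```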