Libgdx vs Stumpy: Sprites

So now that I have my sprite batcher working like a charm ;D I wanted to put it up against libgdx’s sprite batcher and see if it is worth anything.

My sprite batcher simply uses vertex arrays rather than VBOs, where I think libgdx uses VBOs. I am also not using libgdx’s vector class, which should give libgdx a performance boost, but I do not know how much. I may also not be using the best method for rendering masses of sprites in libgdx: just batch.begin(), render sprites, batch.end().
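
For reference, the libgdx side of my test is essentially the standard pattern below (a minimal sketch only; the class, field, and texture names are placeholders, not my actual test code):

    import com.badlogic.gdx.graphics.Texture;
    import com.badlogic.gdx.graphics.g2d.SpriteBatch;

    public class GdxParticleTest {
        private final SpriteBatch batch = new SpriteBatch();
        private Texture texture;      // shared particle texture
        private float[] xs, ys;       // particle positions, updated elsewhere
        private int count;

        public void render() {
            batch.begin();
            for (int i = 0; i < count; i++) {
                // one draw() per sprite; SpriteBatch buffers the geometry
                // and flushes it in large batches
                batch.draw(texture, xs[i], ys[i]);
            }
            batch.end(); // flush whatever is left
        }
    }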

I will admit I looked at libgdx’s sprite batcher to see how they did things and based mine loosely on theirs (mainly just the rotation and TextureRegion part ;)).

Sorry for the large files but they both come packed with all natives and everything so…

Controls:
G to generate a bunch of particles/sprites
K to kill them.

libgdx test case

http://www.mediafire.com/?ccomdyb7inpvocy

Notes: no delta timing, so it seems slower but looks better.

On my desktop the performance for libgdx is about 60fps at 50k sprites. Very good. Once you go past this, though, the performance drops considerably: 30fps at 100k. I think that ratio is very nice.

On my laptop with integrated graphics everything is much slower: 50k runs at 40fps or lower, and 100k makes me cry.

Stumpy’s test case

http://www.mediafire.com/?wl4w7lenk4f17i7

Notes: uses delta timing, so it seems faster but looks worse, since delta can throw things off at low frame rates. Also, don’t use the popup menu to generate particles/sprites, as they will be different colors than in the libgdx test case.

On my desktop performance is almost identical to libgdx, except at 100k it is 1-3fps faster.

On my laptop libgdx is much faster than my sprite batcher, by 5-8fps. I think this is because my laptop gets fill-rate limited almost instantly. In my sprite batcher I do absolutely nothing to lower the fill rate, and I have no idea exactly what libgdx does, but they are faster.

I would like you all to test the speeds to see what you get.

At 150,000 particles, both give me 20 FPS. At 5,000 particles, yours is at 50 FPS and LibGDX’s is at 40 FPS.

The tests don’t seem very consistent; for one, when the particles are clustered together the FPS spikes because my GPU seems to use some sort of culling optimization to reduce fill rate.

I’ve looked into LibGDX’s sprite batch, and there are a few areas that could be improved (not necessarily for LibGDX’s needs, but perhaps for a specialized game engine).
- Mapped VBOs could be used, although vertex arrays (the default implementation in LibGDX) may be just as fast if not faster
- A FloatBuffer could be used instead of putting it all into an array first
- Calling begin() and setShader(…) will send uniform data to custom shaders, making them not necessarily optimal
- There are a lot of redundant calls to glEnable, glBlendFunc, glUseProgram, glBindTexture, etc. (a small state-caching sketch follows this list)
- Advanced features like geometry shaders, multi-texturing, multiple render targets, texture arrays, etc. could be employed on desktop
- Polygons could be drawn in the same batch as regular sprites, since in the end it’s all just textured triangles
- A shader-based approach could be employed, like theagentd has done with his tile renderer
- There is no use of the z-buffer for depth, so naturally you may run into texture swapping with multiple sprite sheets
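
On the redundant-calls point, the usual fix is to cache the last state and skip the GL call when nothing has changed. A minimal sketch of the idea (the class and field names are purely illustrative, not LibGDX internals):

    import org.lwjgl.opengl.GL11;

    public class GLStateCache {
        private int lastTextureId = -1; // hypothetical cached state

        public void bindTexture(int textureId) {
            if (textureId != lastTextureId) {
                GL11.glBindTexture(GL11.GL_TEXTURE_2D, textureId);
                lastTextureId = textureId;
            }
            // otherwise the redundant glBindTexture call is skipped entirely
        }
    }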

I have talked with Mario about that.
We did some benchmarking over a year ago and found that, especially on modern NVidia cards, VAO are horrible in performance, and some internet posts suggest it’s because they’re deprecated with OpenGL 3 or something. We ended up using VBOs, although it’s entirely possible that we screwed it up back then.

I recall that it needs OpenGL 3.

Well yeah, since I have a big Desktop game using Libgdx now, I would welcome nice Desktop only features… however the guys have too much to do as it is.
@Nate Still waiting for particle system refactoring and fixing x3

5,000 particles? I don’t think they will let you generate that low.

Do both libgdx and Stumpy do the clustering FPS thing? If so I really don’t see an issue. If it is your GPU doing that I cannot help it.

Yes, it would be great to use modern OpenGL calls and strategies to speed things up, but that would stop a bunch of people from using the app (my laptop can only use OpenGL 2.0). Also, libgdx is meant for small devices that only support 2.0.

I tried a version using mapped VBOs and it was slower. The bottleneck is either fill rate or the CPU filling the arrays/updating particles. This seems to be true for both libgdx and mine.
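
For anyone curious, a rough LWJGL sketch of what a mapped-VBO upload looks like (just an illustration of the technique, not my actual batcher code; the names are made up):

    import java.nio.ByteBuffer;
    import org.lwjgl.opengl.GL15;

    public class MappedVboUpload {
        // Per-frame upload: orphan the buffer, map it, write the vertex data, unmap.
        public static void upload(int vboId, float[] vertexData, int sizeInBytes) {
            GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vboId);
            // Re-allocate the storage so the driver doesn't stall on the previous frame.
            GL15.glBufferData(GL15.GL_ARRAY_BUFFER, sizeInBytes, GL15.GL_STREAM_DRAW);
            ByteBuffer mapped = GL15.glMapBuffer(GL15.GL_ARRAY_BUFFER, GL15.GL_WRITE_ONLY, null);
            for (int i = 0; i < vertexData.length; i++) {
                mapped.putFloat(vertexData[i]); // write straight into the mapped memory
            }
            GL15.glUnmapBuffer(GL15.GL_ARRAY_BUFFER);
        }
    }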

My computer only supports OpenGL 2.1 but, as with most other computers these days, supports a variety of 3.0+ extensions (GL_EXT_texture_array, GL_EXT_framebuffer_object, GL_EXT_geometry_shader4, GL_ARB_texture_float). In other words, things like theagentd’s tile map shader are definitely possible even when 3.0+ is not present.
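
If it helps, checking for those extensions at runtime is straightforward in LWJGL (a minimal sketch, assuming a GL context is already current on the calling thread):

    import org.lwjgl.opengl.ContextCapabilities;
    import org.lwjgl.opengl.GLContext;

    public class GLFeatureCheck {
        // True if the extensions needed for the fancier render path are present.
        public static boolean supportsFancyPath() {
            ContextCapabilities caps = GLContext.getCapabilities();
            return caps.GL_EXT_texture_array
                && caps.GL_EXT_framebuffer_object
                && caps.GL_EXT_geometry_shader4;
        }
    }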

That doesn’t mean you need to rely on these extensions; but if they are present, you may as well make use of them since the vast majority of your audience will benefit from them. Those on old/shitty systems can turn down the quality settings (which might equate to capping your total particles or something).

Geometry shaders will reduce CPU load. Texture arrays will reduce texture binds. Passing circles instead of quads/triangles can reduce fill-rate.

[quote]I tried a VBO version using mapped VBO and it was slower.
[/quote]
This could be due to your code (i.e. not re-using buffers) or GPU (as Cero noted). Maybe you should post some of your code and implementation instead of saying “Here is what my sprite renderer looks like next to LibGDX, end of story.”

[quote]Do both libgdx and Stumpy do the clustering FPS thing? If so I really don’t see an issue. If it is your GPU doing that I cannot help it.
[/quote]
It’s not really a benchmark if it can be easily influenced by user input (and the results skewed).

Also, I meant 50,000.

[quote]This could be due to your code (i.e. not re-using buffers) or GPU (as Cero noted). Maybe you should post some of your code and implementation instead of saying “Here is what my sprite renderer looks like next to LibGDX, end of story.”
[/quote]
Well, this was what my current sprite renderer looks like next to libgdx’s. Not necessarily end of story, but I just wanted to see what other people’s results are, as I can only test on two computers.
I was reusing buffers, but the main bottleneck, as I said, was filling the arrays with the data. VBOs, I think, also have a little more overhead than vertex arrays, but I may be wrong. Also, my design of particles is not the most optimal, as it is basically just rendering sprites.

[quote][quote]Do both libgdx and Stumpy do the clustering FPS thing? If so I really don’t see an issue. If it is your GPU doing that I cannot help it.
[/quote]
It’s not really a benchmark if it can be easily influenced by user input (and the results skewed).
[/quote]
How is what your GPU does user input? I don’t really know how to stop your GPU from doing things that I do not tell it to do. I am not doing any culling, so…you may just have nice drivers. This was more of a “what do you get with both, given your current hardware” test.

The reason I am testing mine against libgdx is that libgdx is very fast and optimized for lower hardware specs (from what I know). I am sorry if I am somehow stepping on your toes, but I do not claim to be a pro at anything.

Cool stuff! Nice to see that we are still doing OK for the most part :slight_smile:

Couple of comments:

OpenGL ES 2.0 sadly has no VBO mapping. I guess that would probably be the fastest option. As it stands, VAs are tons faster on Android than VBOs. The reason seems to be a bit weird on Android: if you use a single VBO and render multiple batches with it, you’ll stall the GPU hard. I tried to fix this by using a pool of VBOs, but whatever I did, VAs always won by a large margin. For this reason we now use VAs exclusively.

Not on Android. On versions < 3, anything using a direct Buffer is totally fucked. JNI overhead kills you on Android as well. And Dalvik can optimize tight array access rather well. Long story short: using an array is a factor of 10-15 faster than using a direct Buffer on Android.

Jupp, I just couldn’t figure out how to let the user decide which uniforms should be sent to the shader. I guess I could disable all uniform setting via a boolean flag though and have the user set the camera matrices. Hrmm, good input, thanks!

Yes, that is mostly to guarantee that after begin()/end() we leave OpenGL ES in a clean state. There are probably a few places where we could trim that down. I’d love to get a pull request :smiley:

Agreed. However, my time budget is only so big, and I’m not sure there could be such an elaborate 2D game that the current method is insufficient on the desktop. Any desktop machine that supports geometry shaders, multiple render targets, and texture arrays is likely so beefy that the additional code paths are not worth the time, imo.

That is actually true. Polygon support was added by a third party; I didn’t look too hard into it. I think I remember there being a bit of an issue with the indices array, which SpriteBatch generates once on startup as it is fixed. Not sure though.

I actually had that at one point; it turned out to be a lot slower than what we currently have, again only on Android.

Conscious decision not to use the z-buffer with SpriteBatch. There’s DecalBatch for that.

Awesome feedback, thanks a bunch. I guess I’ll try to fix a few of those issues.

Requires OGL 3, spawn with the G key.
http://www.mediafire.com/?bayay2l6snydi7r

Yep, basically it comes down to this: LibGDX’s sprite batch is sufficient for the vast majority of cases, and the performance gain from geometry shaders (or another technique) will likely be negligible and not worth losing portability/flexibility/ease-of-use.

I would like to get some pull requests in, though, e.g. mouse cursors, closeRequested, and other desktop features.

How easy is it to render using the geometry shader?

Can you abstract it to

drawImage/renderImage(blah blah blah…location size…blah blah blah)

???

If so…sweet ;D. Still won’t run on my laptop though…actually everything you have posted won’t run on it. Stinkin’ integrated chip.

I know. :frowning: I’m neglecting libgdx and kryo a tiny bit lately to try and get some projects done that can make some money. Daddy gotta eat! I’ll be back, no worries.

If you switch to LwjglFrame you can use Swing to do a lot of window-related things that LWJGL’s window doesn’t yet support (getting/setting size/minimized state, cursors, close requested, etc). Likely a small performance hit, but I doubt it’d hurt most apps.

Easier in my opinion, but you have to make a shader that supports all the functions you want. This one just supports a position, a size (width and height) and a color. It then generates texture coordinates in the shader from 0 to 1. You could implement rotation and texture arrays (= as many different textures as you can store in VRAM, which is a LOT), for example. To “render” something, just throw the data needed for a sprite into a buffer and render all of them with glDrawArrays(). It’s a lot more efficient since you don’t have to upload 4 vertices per sprite.
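
A rough sketch of the CPU side of that (simplified, with made-up names, not my actual renderer code; vertex attribute setup and the geometry shader that expands each point into a quad are assumed to already be in place):

    import java.nio.ByteBuffer;
    import org.lwjgl.BufferUtils;
    import org.lwjgl.opengl.GL11;
    import org.lwjgl.opengl.GL15;

    public class PointSpriteUpload {
        // One "vertex" per sprite: x, y, width, height (floats) + packed RGBA color (int) = 20 bytes.
        public static void drawSprites(int vboId, float[] x, float[] y,
                                       float[] w, float[] h, int[] color, int count) {
            ByteBuffer data = BufferUtils.createByteBuffer(count * 20);
            for (int i = 0; i < count; i++) {
                data.putFloat(x[i]).putFloat(y[i])
                    .putFloat(w[i]).putFloat(h[i])
                    .putInt(color[i]);
            }
            data.flip();
            GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vboId);
            GL15.glBufferData(GL15.GL_ARRAY_BUFFER, data, GL15.GL_STREAM_DRAW);
            // The geometry shader expands each point into a textured quad on the GPU.
            GL11.glDrawArrays(GL11.GL_POINTS, 0, count);
        }
    }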

Performance is the biggest difference. Your two JARs can handle around 75-80k particles at 60 FPS on my comp, and around 60k when I hold down the mouse. Mine runs with 1100k particles at 60 FPS and is only very slightly affected by holding the mouse, but that’s because I simply use all cores to update and write the sprite data to the VBO. That makes it RAM limited since there’s so much processing power available. That’s cheating you say? Okay, with one thread 600k particles at 60 FPS, 425k when I hold down the mouse.

There’s no magic on the CPU side. These three functions in my Particle class are the only things I run each frame:


    public void pull(int mouseX, int mouseY) {
        float dx = mouseX - x, dy = mouseY - y;
        float distSqrd = dx * dx + dy * dy;

        // Attraction towards the mouse, falling off with the squared distance.
        //float force = 0.01f / (float) Math.sqrt(distSqrd);
        float force = 2f / distSqrd;
        vx += dx * force;
        vy += dy * force;
    }

    public void update() {
        // Slight damping, then integrate velocity into position.
        vx *= 0.999;
        vy *= 0.999;

        x += vx;
        y += vy;

        // Bounce off the screen edges.
        if ((x < 0 && vx < 0) || (x > WIDTH && vx > 0)) {
            vx = -vx;
        }

        if ((y < 0 && vy < 0) || (y > HEIGHT && vy > 0)) {
            vy = -vy;
        }
    }

    public void put(ByteBuffer data) {
        // 20 bytes per sprite: position, size, packed color.
        data.putFloat(x).putFloat(y)
            .putFloat(width).putFloat(height)
            .putInt(color);
    }

This one is pretty simple, but you can expand it as much as you want. Texture arrays allow you to use as many textures as you want with mipmapping; custom texture coordinates let you pick out parts of a texture, not just a whole layer; rotation can be done by the GPU per sprite; coloring, multitexturing, whatever you want. The only thing that you can’t do is custom non-rectangular geometry. Rotation is fine, but if you need coordinates per point, there’s pretty much no point in using a geometry shader to expand it. You might save a few bytes = gain a few FPS, but the win would be pretty minimal.

Compared to no shaders/OGL 2 shaders, memory usage is reduced a lot since we don’t have to duplicate data between vertices. For the above stuff you’d need four 2D coordinates (one per corner), and you’d need to duplicate the color data once per vertex. That’s 4 x 2 x 4 bytes for the positions + 4 x 4 bytes of color = 32 + 16 = 48 bytes per sprite. I just use 4 floats and 4 bytes = 20 bytes per sprite. Since we’re handling so much data, performance increases a lot by simply reducing it. Add to that that we don’t need to calculate the 4 corner positions on the CPU, since we just drop in the center position plus a width and a height. That’s 8 saved float additions per sprite. If you want rotation you’d have to do that on the CPU too, since you can’t manipulate matrices and such between each sprite. Since GPUs are so fast it won’t budge an inch from the extra load, but your CPU will take a huge hit if you do it there.

Hmm, my comp is about 10 fps slower than yours, but on your program it gets less than half of what you get. :clue:

Never said multi-threading was cheating.

When I use the per-element put methods on ByteBuffers I lose 50% performance vs. put(array).
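
In other words (my reading of the comparison, with made-up method names), the difference is between pushing one value at a time and handing over a whole array in one call:

    import java.nio.FloatBuffer;

    public class BufferFillComparison {
        // One put() call per float: the slower path described above.
        static void fillPerElement(FloatBuffer buffer, float[] vertices) {
            for (int i = 0; i < vertices.length; i++) {
                buffer.put(vertices[i]);
            }
        }

        // Fill a plain float[] first, then copy it across in a single bulk put.
        static void fillBulk(FloatBuffer buffer, float[] vertices) {
            buffer.put(vertices);
        }
    }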

Hehe, I can have particles that are not perfect squares, but I don’t think that is really a big deal. The biggest bottleneck, I think, is again filling arrays for the GPU. By using the geometry shader it seems that you can, as you have said, reduce the needed vertices from one at each corner to just one. I would really love to try to rewrite my sprite batcher using this method, but I could not run it on my laptop, which is what I code on mostly.

How would you handle things like blinking particles, growing/shrinking, max/min sizes, max/min fade, animation, etc.? Just have a particle with a bunch of variables? That is what I am doing now, but I am worried about memory usage.

LibGDX one - 20-30 fps @150000 particles.
Yours - 25-35 fps @150000 particles.

Uh, how fast was it? You’re not very clear… =S Also, what are your specs?

Blinking particles: Update the color on the CPU each frame, or send the time it’s been alive and generate the blinking effect there.
Size changing: Just change the width and height variables? It’s the most flexible way at least. If you have the time it’s been alive (from blinking particles) you could use that too.
Fading: Either update the color on the CPU or use the time it’s been alive.
Animated: Upload a time variable and pick out textures based on the time passed (veeery easy with texture arrays). You can loop animations too. (A sketch of the alive-time idea follows below.)
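
As a rough sketch of that alive-time idea (just an illustration, not code from my renderer): keep one extra float per sprite for how long it has been alive, upload it with the rest of the data, and let the shader derive blinking, fading, or the animation frame from it.

    import java.nio.ByteBuffer;

    public class AnimatedParticle {
        float x, y, width, height;
        int color;
        float aliveTime; // seconds since spawn, advanced on the CPU each frame

        // 24 bytes per sprite; the shader can compute fade, blink phase or
        // the animation frame (e.g. a texture array layer) from aliveTime alone.
        public void put(ByteBuffer data) {
            data.putFloat(x).putFloat(y)
                .putFloat(width).putFloat(height)
                .putInt(color)
                .putFloat(aliveTime);
        }
    }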

EDIT: I tried a second technique, instancing. The idea was to draw the particles as instances of a 4-vertex quad. The data can still be uploaded per instance = per particle. Sadly, this proved to be a really bad idea. First of all it’s less flexible, since I’m not using a geometry shader (and if I did there’d be no point), and secondly the performance was horrible. Instancing was obviously not made for meshes with only 4 vertices; more like 500 or more.

I can do all the stuff I said, I just don’t know if it is the most efficient way.

With my batcher I get 13-15fps less than you (I was off before), but with your app I drop below 60 fps at about 220k.
At 500k I get 30fps or lower.

I have a Q6700 quad core (2.66 GHz), 3 GB of RAM, and a BFG GeForce GTS 250 (overclocked, with 1 GB of VRAM). Nothing really nice, but I can play most games on high settings, so w/e.

Okay, I have a GTX 295 (though using only one GPU), so I guess you’re hitting a GPU bottleneck. That’s not really a bad thing though, since that means you have plenty of CPU time left for other stuff.

I think it is a CPU bottleneck combined with a fill-rate bottleneck, as my CPU is at 100% while the app runs…and probably the GPU too. I need a new computer; this one is 5 years old…

Another way to improve fill rate for circle-shaped particles is to not use quads (i.e. draw a circle made up of GL_TRIANGLE_FAN). I haven’t tested this in practice, so the tradeoff may not be worth it, but it would be better suited for a geometry shader implementation.

[quote]Performance is the biggest difference. Your two JARs can handle around 75-80k particles at 60 FPS on my comp, and around 60k when I hold down the mouse. Mine runs with 1100k particles at 60 FPS and is only very slightly affected by holding the mouse, but that’s because I simply use all cores to update and write the sprite data to the VBO. That makes it RAM limited since there’s so much processing power available. That’s cheating you say? Okay, with one thread 600k particles at 60 FPS, 425k when I hold down the mouse.
[/quote]
Impressive. Last time I tested geometry shaders I didn’t notice that much of an increase in sprite count; I’ll have to give it another go.

Can’t run your tests since they use GL 3.0. Would be interested to see a GL 2.0 version (through extensions).