Sprite Shootout Contest Thread

System: Nvidia GTX 285, latest drivers, Intel Core 2 Quad Q9550 @ 2.85 GHz, Java 1.6.0_22, 64-bit, client VM

slick(kappa): 19116 @ 60fps
slick(apple): 28300 @ 60fps
libgdx: 110000 @ 60fps

I still think this microbenchmark is flawed. We should do what Riven said, and also target a sprite count instead of an FPS count. I also believe that a specialized renderer for the task at hand could outperform libgdx, but that wouldn’t be a generic solution usable in a game, I’d say (and if it was, I’d like to integrate it into libgdx :))

edit: stupid, stupid me :smiley: Turns out I can’t beat SpriteBatch after all…

My GTX275 looks the same as badlogic’s 285:

http://screensnapr.com/e/EybrGg.png

Now we wait for Cas. Bring it! :smiley:

Thanks, yep, that’s a feather in the cap for LWJGL and Java.

One thing about that applet which is a bit weird, though, is that the sprites don’t appear to move smoothly. I’m testing on a computer at uni that I don’t have admin access to, so I can’t provide the system specs, sorry. I’ll test at home and see what it’s like.

Intel i7 1.7GHz (Quad), Nvidia GT 435M, 6 gigs ram.

libgdx - 17800 balls at 60fps
Artemis - 16400 balls at 60fps
Slick2D - 12700 balls at 60fps

On work computer:

Artemis: 6900 @ 60fps
Kappa: 6900 @ 60fps
libgdx: 6900 @ 60fps

Same performance in all!
Got Nvidia Quadro FX 570, dual-core CPU.

Everyone wins :slight_smile:

Obviously these demos are fillrate limited: the code of the sprite engine has barely any impact on the framerate.

It depends on the GPU; the mobile Nvidia crap has some very strange performance characteristics. Apple and Kappa also posted some of their results on IRC. Their GPUs were an 8800 GT and a 9600 GT, and they showed as big a gap as the tests by Nate and myself.

(12:26:42 AM) kappaOne: libgdx hits 45k balls before fps starts to drop
(12:26:45 AM) kappaOne: highly impressive
(12:27:13 AM) ***kappaOne closes all graphics intensive apps to allow gpu to cool down
(12:27:16 AM) badlogic: what’s the number of the other benchmarks?
(12:27:27 AM) badlogic: *for
(12:27:28 AM) badlogic: fme
(12:27:32 AM) kappaOne: 28k on kappa’s, 24k on appels

(12:25:05 AM) badlogic: libgdx: 8700 @60fps
(12:25:05 AM) badlogic: slick (kappa): 3016 @60fps
(12:25:05 AM) badlogic: slick (apple): 3300 @60fps
(12:25:09 AM) badlogic: nvidia ion 2
(12:25:16 AM) badlogic: atom cpu
(12:25:22 AM) badlogic: single core, 32-bit vm, client

As I said earlier, the bench is pretty much useless in any case :slight_smile: Fun, nonetheless.

This laptop’s only got Intel embedded graphics: libgdx: 4000ish, slick: 3500ish
My old Mac PowerBook G4 was even slower, even though it has a proper GPU (can’t remember which one offhand)

Edit: I’ve been thinking about how I’d write a fast sprite library. Initial thoughts centred around vertex arrays, then moved to VBOs. However, if the demo moves every sprite a bit every frame, then I’d only use each VBO once, which seems pointless. Maybe I could reduce the amount of data to transfer with a custom shader that takes x,y as a parameter list. I’d still need to supply this per vertex, which would be an overhead if I used textured quads. Maybe an indexed parameter list. Maybe look at glDrawPixels instead, although most cards are optimised for 3D functions, so textured quads would probably be faster.

Maybe a custom shader that takes the velocity and the time of the last position update as a parameter list, and gets the current time from somewhere (another parameter list, or a sub-range of this one). The shader then calculates position(clip space) = position(object space) + parameter(velocity) * (parameter(current time) - parameter(vertex position timestamp)). Then I’d only need to send the time each frame, and update sub-parts of the VBO and parameter list where a sprite had changed velocity. In an ideal world I would send a delta time and write the updates directly back into the VBO from the shader, but I don’t think I can do that directly; might be able to cheat round it.
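
Something like this, maybe (completely untested; the names are made up just to sketch the idea, and a bounce would still need a small sub-update of the VBO to rewrite that sprite’s velocity and timestamp):

import org.lwjgl.opengl.GL20;

// Hypothetical GLSL for the velocity-extrapolation idea: the GPU recomputes each
// vertex position from the last uploaded state, so the VBO only needs partial
// updates when a sprite actually changes velocity.
static final String EXTRAPOLATE_VS =
      "attribute vec2 lastPosition;  // object-space position at 'timestamp'\n"
    + "attribute vec2 velocity;      // units per second\n"
    + "attribute float timestamp;    // when lastPosition/velocity were last set\n"
    + "uniform float currentTime;    // the only per-frame upload\n"
    + "void main() {\n"
    + "    vec2 p = lastPosition + velocity * (currentTime - timestamp);\n"
    + "    gl_Position = gl_ModelViewProjectionMatrix * vec4(p, 0.0, 1.0);\n"
    + "}\n";

// Per frame: bind the (already linked) program and upload just the current time.
static void updateTime(int program, float timeSeconds) {
    GL20.glUseProgram(program);
    GL20.glUniform1f(GL20.glGetUniformLocation(program, "currentTime"), timeSeconds);
}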

Trouble is, I’m not sure whether the bottleneck will be fill rate or CPU-GPU vertex transfer (which will be a lot with a silly number of sprites). I’ve only occasionally used OpenGL, and then only in immediate mode, so I don’t really know what would be best performance-wise.

My entries:

Plain OpenGL implementation: JNLP - Source
OpenCL + OpenGL implementation: JNLP - Source

You may use the following controls in both demos:

+/-: Increment/decrement ball count by 100. Hold CTRL to multiply by 10, ALT by 100, SHIFT by 1000.
A: Toggles animation.
C: Toggles color mask (removes fill rate limit).
R: Toggles rendering (no render calls, but VBOs are updated if animation is on).
S: Toggles smooth animation. By default animation is fixed at 60Hz; when smooth animation is on, it’s done every frame (see the sketch after this list).
T: Toggles between the default texture (42x42) and a smaller version of the same texture (16x16), useful when fill rate limited.
V: Toggles v-sync.
0-9: Update ball count (1 << key pressed), not very useful.
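
The 60Hz default versus smooth animation is essentially the difference between the two branches in this simplified sketch (illustrative only, not the actual demo code):

// Illustrative sketch of the two animation modes behind the S toggle:
// fixed 60Hz steps by default, one step per rendered frame when "smooth" is on.
static final float STEP = 1.0f / 60.0f;
static float accumulator;

static void animate(float frameDelta, boolean smooth) {
    if ( smooth ) {
        moveBalls(frameDelta);     // animate every frame with the real delta
    } else {
        accumulator += frameDelta; // run however many fixed 60Hz steps have elapsed
        while ( accumulator >= STEP ) {
            moveBalls(STEP);
            accumulator -= STEP;
        }
    }
}

static void moveBalls(float delta) { /* integrate positions, bounce off the edges */ }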

Results on my machine - Intel Q6600 (2.4GHz) + Radeon 5870:

Plain: 119k @ 60fps - screenshot
CL: 440k @ 60fps - screenshot
CL, 16x16: 1.5m @ 60fps - screenshot

I also tried a pseudo-instancing implementation; it hits 60 fps at only ~20k balls (2 OpenGL calls per ball).
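
(In case anyone wonders, pseudo-instancing boils down to feeding the per-ball position through a constant vertex attribute and issuing a draw call per ball; a hypothetical sketch, with the attribute setup invented for illustration:)

import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL20;

// Hypothetical pseudo-instancing loop: the quad geometry lives in a small shared
// VBO/vertex array that stays bound, and each ball costs exactly two GL calls.
static void drawBallsPseudoInstanced(float[] x, float[] y, int count, int positionAttrib) {
    for ( int i = 0; i < count; i++ ) {
        GL20.glVertexAttrib2f(positionAttrib, x[i], y[i]); // per-ball "instance" data
        GL11.glDrawArrays(GL11.GL_QUADS, 0, 4);            // draw the shared quad
    }
}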

Note for those with working CL drivers: You may have to kill the javaw process manually after exiting the CL demo, not sure why, might be an LWJGL bug.

Wow, nice work Spasi, fastest implementation yet :slight_smile:

Kappa - 18000 balls at 60fps
Appel - 22000 balls at 60fps
LibGDX - 48500 balls at 60fps
Spasi’s Plain OpenGL - 50000 balls at 60fps

Sadly I don’t have OpenCL drivers installed, but it looks like it’s the fastest way to go.

Kinda impressed how well LibGDX holds up.

edit: Spasi, your OpenCL version would make a great tutorial/example, as there are very few examples of how to use it. It would be nice to have it on the LWJGL wiki so many people can benefit from it.

+1 for Spasi, neat implementation. Interesting to see that our simple VA-based approach works nearly as well as your VBO/shader-based approach. Could you run the other benchmarks on your machine for comparison?

On my netbook the plain OpenGL version performs at ~7600 balls @ 60fps, so that’s in the same ballpark as libgdx.

I get ~101,000 sprites with Spasi’s plain OpenGL (was 110,000 with libgdx, rechecked just now):

http://screensnapr.com/e/nkfYsn.png

Strangely, after maybe 20 seconds the keystrokes to change the ball count stop working.

I don’t know anything about OpenCL, but I get this (repeatedly):

[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_INVALID_COMMAND_QUEUE error executing clFinish on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 275 (Device 0).
[CONTEXT MESSAGE] CL_INVALID_COMMAND_QUEUE error executing clFinish on GeForce GTX 275 (Device 0).

Do I need to install something? The results sound fantastic, I’d love to see it!

How do you open a console for a JNLP?

[quote]How do you open a console for a JNLP?
[/quote]
Java Control Panel -> Advanced tab -> Java console.

Can confirm that I also experienced this.

I’ve updated both demos. The CPU implementation should be a bit faster now (getting ~127k @ 60 fps on my machine) and I fixed the problem with the CL implementation hanging on exit.

[quote=“Nate,post:34,topic:36444”]
Strangely, after maybe 20 seconds the keystrokes to change the ball count stop working.
[/quote]
I’ve been seeing this in every LWJGL demo I’ve written lately (keyboard stops working randomly). I must be doing something stupid with my input handling code, but I just can’t see it; it can happen after 5 seconds or after several minutes. Could someone take a look and let me know if you see something weird?

That’s nice actually; I’ve never seen output on the context callback handler from my AMD implementation. I’ve added some extra error checking now, could you please try again and let me know if you see any new messages/exceptions?

Other results on my machine:

kappa: 2500 @ 60fps
appel: 5400 @ 60fps
Nate: 87000 @ 60 fps

I don’t know why kappa’s and appel’s are so low compared to the others. I guess my CPU is too shitty. Also, I’d like to see Nate’s demo without vsync enabled; it should get higher than 87k. edit: vsync might be more costly on AMD compared to NV.

I tried a multi-threaded implementation today (using LWJGL’s SharedDrawable), but it was much slower. I’ll try geometry shaders next; in theory they will be faster than the CL implementation, because right now I’m using clFinish to synchronize animation and rendering, since there’s no support for ARB_cl_event yet. This might take a while: I haven’t used GS before and I’m not sure if it will be easy to implement the animation data update. In the OpenCL kernel you can simply read and write your buffers as needed; it’s actually quite impressive how simple and natural it is:

...
global float *balls   // per-ball data: balls[b + 0] = x position, balls[b + 1] = x velocity
...
// advance along x and bounce off the [0, bound] range
float dx = balls[b + 1];
float x = balls[b + 0] + dx * delta;
if ( x < 0.0f ) {
	x = 0.0f;
	balls[b + 1] = -dx;
} else if ( x > bound ) {
	x = bound;
	balls[b + 1] = -dx;
}
balls[b + 0] = x;

edit: Btw kappa, both demos are implemented in LWJGL’s test package (check the source), I’ll commit them when I’m done.

Great update spasi, looking forward to the GS implementation!

Updated again. The CPU implementation will use one of 3 methods to render now:

a) Simple BufferData update + DrawArrays
b) MapBufferRange update + DrawArrays (requires OpenGL 3.0)
c) MapBufferRange update + DrawArraysInstanced (requires OpenGL 3.3)

You should get a message in the console telling you which one is used; there’s a rough sketch of the simpler update paths after the change list below. Other changes:

  • ALPHA_TEST is enabled, it saves quite a bit of framebuffer bandwidth.
  • QUADS are used instead of TRIANGLES.
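
In rough outline, (a) and (b) only differ in how the animated vertex data gets into the VBO before drawing. A simplified sketch, with the buffer handling trimmed down (not the exact demo code):

import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL30;

// (a) Simple BufferData update: re-specify the whole VBO every frame.
static void renderBufferData(int vbo, FloatBuffer vertexData, int vertexCount) {
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vbo);
    GL15.glBufferData(GL15.GL_ARRAY_BUFFER, vertexData, GL15.GL_STREAM_DRAW);
    GL11.glDrawArrays(GL11.GL_QUADS, 0, vertexCount);
}

// (b) MapBufferRange update: write straight into the buffer's storage (GL 3.0+).
static void renderMapBufferRange(int vbo, FloatBuffer vertexData, int vertexCount) {
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vbo);
    ByteBuffer mapped = GL30.glMapBufferRange(
        GL15.GL_ARRAY_BUFFER, 0, vertexData.remaining() << 2,
        GL30.GL_MAP_WRITE_BIT | GL30.GL_MAP_INVALIDATE_BUFFER_BIT, null);
    mapped.asFloatBuffer().put(vertexData); // vertexData is refilled each frame
    GL15.glUnmapBuffer(GL15.GL_ARRAY_BUFFER);
    GL11.glDrawArrays(GL11.GL_QUADS, 0, vertexCount);
}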

MapBufferRange is only a tiny bit faster than simple buffer updates, but DrawArraysInstanced made a big difference. There’s only a tiny VBO holding the geometry for a single quad, then there’s another that holds the sprite locations. The vertex shader then transforms each instance to the right location and size, like so:

uniform float ballSize; // can have sprites of different size (this could also be an attribute, so size changes per instance)
attribute vec2 position; // the instance data
...
gl_Position = gl_ModelViewProjectionMatrix * vec4(position + (gl_Vertex.xy * ballSize), 0.0, 1.0);

Instancing in OpenGL 3.3 is done via ARB_instanced_arrays. I could also test ARB_draw_instanced (core in GL 3.1) for old GPUs, but didn’t have time today. That one requires instance data to be passed as uniform arrays in the shader, but coupled with ARB_uniform_buffer_object (also in GL 3.1) it should be quite fast as well.
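
To make the instanced path (c) concrete, the setup looks roughly like this simplified sketch (names invented, texture coordinates omitted; the real code will be in the test package):

import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL20;
import org.lwjgl.opengl.GL31;
import org.lwjgl.opengl.GL33;

// One tiny VBO with a single quad feeding gl_Vertex, one VBO with a vec2 per ball
// feeding the 'position' attribute, stepped once per instance by the divisor.
static void drawInstanced(int quadVbo, int positionVbo, int positionAttrib, int ballCount) {
    // Shared quad geometry, advanced per vertex as usual.
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, quadVbo);
    GL11.glVertexPointer(2, GL11.GL_FLOAT, 0, 0);
    GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);

    // Per-ball positions, advanced once per instance.
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, positionVbo);
    GL20.glVertexAttribPointer(positionAttrib, 2, GL11.GL_FLOAT, false, 0, 0);
    GL20.glEnableVertexAttribArray(positionAttrib);
    GL33.glVertexAttribDivisor(positionAttrib, 1);

    // One call draws every ball; the vertex shader above places and scales each instance.
    GL31.glDrawArraysInstanced(GL11.GL_QUADS, 0, 4, ballCount);
}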

The OpenCL implementation was bugged of course. I was basically rendering 1/4th of the quads. When fixed, it hit the same fill rate bottleneck as the CPU implementation, but it scales much better (especially when coupled with DrawArraysInstanced). New numbers on my machine:

CPU: 194k @ 60fps
GPU: 215k @ 60fps
CPU 16x16: 610k @ 60 fps
GPU 16x16: 1.1m @ 60 fps