Java Control Panel -> Advanced tab -> Java console.
[/quote]
I had already set it to “hide” since I don’t want it to pop up every time I run an applet. Where is it hidden when I run a JNLP?
I get strangely low results from libgdx. Radeon 4350; Phenom II X2 550; Ubuntu 10.10; proprietary Catalyst drivers.
Slick2D: 4000
Artemis: 5000
libgdx: 2800
Spasi’s OpenGL: 6100
Spasi’s OpenGL 16x16: 65000
@60 fps
GTX275:
Spasi CPU: 127k
Spasi GPU: 124k
Impressive numbers! Cool, OpenCL worked this time. The GPU version doesn’t seem smooth though; the balls’ movement is a little choppy.
I updated the libgdx entry:
http://esotericsoftware.com/spriteshootout/run.jnlp
No more vsync, non-POT texture, and you can press Z to jump by 10k (space for 1k, up/down for 100). With this update I get 114k.
Can you try again now that vsync is off?
I’m getting 90k @ 60 fps with the updated libgdx demo. Looks like libgdx’s implementation isn’t AMD friendly and maybe mine isn’t NV friendly either.
GTX275 - Spasi 127k - Nate 114k
Rad 5870 - Spasi 194k - Nate 90k
The difference is too big in the Radeon case. Afaict from the libgdx code, it uses glBufferData to update all the sprites in one go and it also uses vertex colors (my demo doesn’t); is that correct? Or maybe it’s a CPU issue? What CPU does your machine have, Nate? Also, could you check whether my demo uses DrawArraysInstanced on your GTX275? (It should.)
[quote=“Nate,post:43,topic:36444”]
The GPU version doesn’t seem smooth though; the balls’ movement is a little choppy.
[/quote]
Hmm, could you try with smooth animation (press ‘S’)? Does that fix it or is it still choppy? It may need a glFinish() before running the CL kernel on the NV implementation.
That’s the fallout of going for the “fastest” result while using ~0% CPU.
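For what it’s worth, here’s a minimal sketch (not the shootout’s actual code) of where such a glFinish() could go when a VBO is shared with CL via KHR_gl_sharing. It’s written against LWJGL 2’s CL bindings; the queue, kernel, shared buffer and ball count are assumed to be set up elsewhere, and the kernel’s arguments are assumed to be already bound:

import org.lwjgl.BufferUtils;
import org.lwjgl.PointerBuffer;
import org.lwjgl.opencl.CL10;
import org.lwjgl.opencl.CL10GL;
import org.lwjgl.opencl.CLCommandQueue;
import org.lwjgl.opencl.CLKernel;
import org.lwjgl.opencl.CLMem;

import static org.lwjgl.opengl.GL11.glFinish;

// Hypothetical helper: one CL animation step on a VBO shared with GL.
final class AnimateWithCL {

    static void animate(final CLCommandQueue queue, final CLKernel kernel, final CLMem sharedVBO, final int ballCount) {
        // Make sure GL is done with the shared buffer before CL touches it; without
        // this, drivers that don't sync implicitly may animate against stale data,
        // which can show up as choppy movement.
        glFinish();

        CL10GL.clEnqueueAcquireGLObjects(queue, sharedVBO, null, null);

        final PointerBuffer globalWorkSize = BufferUtils.createPointerBuffer(1);
        globalWorkSize.put(0, ballCount);
        CL10.clEnqueueNDRangeKernel(queue, kernel, 1, null, globalWorkSize, null, null, null);

        CL10GL.clEnqueueReleaseGLObjects(queue, sharedVBO, null, null);

        // Block until the kernel has finished, so the glDrawArrays that follows
        // sees the updated positions.
        CL10.clFinish(queue);
    }
}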
[quote]I had already set it to “hide” since I don’t want it to pop up every time I run an applet. Where is it hidden when I run a JNLP?
[/quote]
Not sure; the only way I know to show it on Web Start is to have it pop up on each one.
I get the same results: 2800 to get 60 fps. This should be the new version (not cached) as it accepted ‘z’ as keyboard input.
I tweaked Spasi’s SpriteRendererPlain to use points instead of quads. It doesn’t make any difference on my low-end machine, but it does reduce the size of the buffers by a factor of 4. Perhaps on other machines this would make a difference.
private class SpriteRendererPoint extends SpriteRenderer {

    private final FloatBuffer geom;

    protected int[] animVBO;

    protected static final int BALLS_PER_BATCH = 10 * 1000;

    SpriteRendererPoint() {
        // Simple pass-through vertex shader; the point sprite hardware does the rest.
        vshID = glCreateShader(GL_VERTEX_SHADER);
        glShaderSource(vshID, "#version 110\n" +
                              "void main(void) {\n" +
                              "    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;\n" +
                              "    gl_TexCoord[0] = gl_MultiTexCoord0;\n" +
                              "}");
        glCompileShader(vshID);
        if ( glGetShader(vshID, GL_COMPILE_STATUS) == GL_FALSE ) {
            System.out.println(glGetShaderInfoLog(vshID, glGetShader(vshID, GL_INFO_LOG_LENGTH)));
            throw new RuntimeException("Failed to compile vertex shader.");
        }

        createProgram();
        Util.checkGLError();

        // Dummy texture coordinates; GL_COORD_REPLACE generates the real ones per point sprite.
        final FloatBuffer staticData = BufferUtils.createFloatBuffer(BALLS_PER_BATCH * 2);
        for ( int i = 0; i < BALLS_PER_BATCH; i++ )
            staticData.put(0.0f).put(0.0f);
        staticData.flip();

        staticVBO = glGenBuffers();
        glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
        glBufferData(GL_ARRAY_BUFFER, staticData, GL_STATIC_DRAW);

        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glTexCoordPointer(2, GL_FLOAT, 0, 0);

        glEnable(GL_POINT_SPRITE);
        glPointSize(42.0f);
        glTexEnvf(GL_POINT_SPRITE, GL_COORD_REPLACE, GL_TRUE);

        glEnableClientState(GL_VERTEX_ARRAY);

        System.out.println("Shootout Implementation: CPU animation + BufferData (Points)");

        geom = BufferUtils.createFloatBuffer(BALLS_PER_BATCH * 2);
    }

    @Override
    public void updateBallSize() {
        glPointSize(ballSize);
    }

    protected void putBall(final FloatBuffer geom, final float x, final float y) {
        // A single vertex per ball: the point sprite is centered on the ball.
        final float half = ballSize / 2;
        geom.put(x + half).put(y + half);
    }

    public void updateBalls(final int count) {
        super.updateBalls(count);

        // One VBO per batch of BALLS_PER_BATCH balls; grow or shrink the array as needed.
        final int batchCount = count / BALLS_PER_BATCH + (count % BALLS_PER_BATCH == 0 ? 0 : 1);
        if ( animVBO != null && batchCount == animVBO.length )
            return;

        final int[] newAnimVBO = new int[batchCount];
        if ( animVBO != null ) {
            System.arraycopy(animVBO, 0, newAnimVBO, 0, Math.min(animVBO.length, newAnimVBO.length));
            for ( int i = newAnimVBO.length; i < animVBO.length; i++ )
                glDeleteBuffers(animVBO[i]);
        }
        for ( int i = animVBO == null ? 0 : animVBO.length; i < newAnimVBO.length; i++ ) {
            newAnimVBO[i] = glGenBuffers();
            glBindBuffer(GL_ARRAY_BUFFER, newAnimVBO[i]);
        }
        animVBO = newAnimVBO;
    }

    public void render(final boolean render, final boolean animate, final int delta) {
        int batchSize = Math.min(ballCount, BALLS_PER_BATCH);
        int ballIndex = 0;
        int vboIndex = 0;
        while ( ballIndex < ballCount ) {
            glBindBuffer(GL_ARRAY_BUFFER, animVBO[vboIndex++]);

            if ( animate )
                animate(ballIndex, batchSize, delta);

            if ( render ) {
                glVertexPointer(2, GL_FLOAT, 0, 0);
                glDrawArrays(GL_POINTS, 0, batchSize);
            }

            ballIndex += batchSize;
            batchSize = Math.min(ballCount - ballIndex, BALLS_PER_BATCH);
        }
    }

    private void animate(final int ballIndex, final int batchSize, final int delta) {
        // Let the superclass compute the new positions into geom.
        animate(geom, ballIndex, batchSize, delta);

        // Orphaning the buffer like this throws an OUT_OF_MEMORY error on AMD:
        //glBufferData(GL_ARRAY_BUFFER, geom.capacity() * 4, GL_STREAM_DRAW);
        //glBufferSubData(GL_ARRAY_BUFFER, 0, geom);

        // So recreate the buffer object instead and upload the whole batch.
        final int i = ballIndex / BALLS_PER_BATCH;
        glDeleteBuffers(animVBO[i]);
        animVBO[i] = glGenBuffers();
        glBindBuffer(GL_ARRAY_BUFFER, animVBO[i]);
        glBufferData(GL_ARRAY_BUFFER, geom, GL_STREAM_DRAW);
    }
}
GL implementation: JNLP - Source (GL 2.0+ required)
CL + GL implementation: JNLP - Source (GL 2.0+, CL 1.0, KHR_gl_sharing required)
GL 2-pass implementation: JNLP - Source (GL 3.0+ or EXT_transform_feedback required)
Changes:
- Using GL_POINTS now. That was a good idea Jono, it simplified a lot of things.
- Fixed a bug that affected performance; the previous update shouldn’t have been so fast. I’m now hitting the exact same fill-rate limit on all implementations.
- “Vectorized” the OpenCL code a bit.
- Removed the instanced renderer, it’s pointless with GL_POINTS.
New stuff:
- I tried using a geometry shader before switching to GL_POINTS; it was quite slow (when not fill-rate-limited). That was the first time I used a geometry shader, so I’m not sure if it was supposed to be that slow. It was a really simple shader: GL_POINT in, GL_TRIANGLE_STRIP out (with 4 vertices).
- For the GL implementations there’s now a new renderer that uses transform feedback to do the animation on the vertex shader. It’s fast like the CL implementation, but without the clFinish().
- I added a new demo that renders the sprites in 2 passes, with depth testing enabled. In the first pass sprites are rendered front-to-back, opaque fragments only (alpha == 1.0). In the second pass sprites are rendered back-to-front, transparent fragments only (alpha > 0.0, fragments with alpha == 1.0 are early-depth-rejected). To avoid doing any sorting work and do this fast, I’m basically animating double the number of sprites that are rendered. Because transform feedback animation is so ridiculously fast, it’s not a problem.
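To make the two passes concrete, here’s a rough sketch of the state setup, not the demo’s actual code. It uses the fixed-function alpha test for brevity (a shader discard works the same way), and takes the two draw passes as callbacks since the sprite submission itself lives elsewhere:

import static org.lwjgl.opengl.GL11.*;

// Hypothetical sketch of the 2-pass sprite rendering described above.
final class TwoPassSketch {

    void render(final Runnable drawSpritesFrontToBack, final Runnable drawSpritesBackToFront) {
        glEnable(GL_DEPTH_TEST);
        glEnable(GL_ALPHA_TEST);

        // Pass 1: front-to-back, opaque fragments only (alpha == 1.0), depth writes on.
        glDepthMask(true);
        glDepthFunc(GL_LESS);
        glDisable(GL_BLEND);
        glAlphaFunc(GL_GEQUAL, 1.0f);
        drawSpritesFrontToBack.run();

        // Pass 2: back-to-front, translucent fragments only (alpha > 0.0), depth writes off.
        // Fragments with alpha == 1.0 pass the alpha test here too, but they were already
        // written in pass 1 at the same depth, so GL_LESS rejects them.
        glDepthMask(false);
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
        glAlphaFunc(GL_GREATER, 0.0f);
        drawSpritesBackToFront.run();

        // Restore state.
        glDepthMask(true);
        glDisable(GL_ALPHA_TEST);
        glDisable(GL_BLEND);
    }
}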
Numbers:
GL: 149k @ 60 fps
CL + GL: 146k @ 60 fps
GL 2P: 596k @ 60 fps
GL 16x16: 720k @ 60 fps
CL + GL 16x16: 720k @ 60 fps
GL 2P 16x16: 1.65m @ 60 fps
I implemented 2-pass rendering with CL as well; it was equally fast. With CPU animation, though, the CPU becomes a bottleneck much sooner than the fill rate (double the sprites need to be animated). Obviously this technique works so well here because the texture used has so many opaque pixels, so it’s not a general solution, but it’s interesting anyway.
So, what are the lessons from this little “experiment”?
I’m using Slick, and it gets bogged down at a few hundred rendered entities. This makes me wonder whether I should stop using Slick and write my own renderer.
Hmm, this would be the summary (in descending order of importance):
- No matter the rendering method, fill rate is the most limiting factor. Do whatever you can to reduce overdraw and framebuffer bandwidth requirements. Alpha testing and the 2-pass rendering method for sprites with lots of opaque texels help a lot. If you have lots of translucent sprites (e.g. smoke, explosions, etc.) you could render the sprites to an FBO at half (or lower) the display resolution, then composite the result on top of your main rendering pass. See this for an introduction to the technique; a rough sketch of the idea also follows this list.
- Off-load anything you can to the GPU. If you can do the animation in a shader or a CL kernel it’s a big, big win; it drops your CPU requirements to nothing (see the transform feedback sketch after this list). Edit: Actually, running any kind of animation code on the GPU should be trivial. The much tougher problem (in a 3D rendering scenario) is to depth-sort the particles after animating them. You can either use some Order Independent Transparency technique or do the actual sorting on the GPU, but then we’re talking DX10+ hardware. For a 2D game it’s simpler, and depending on the game you may be able to get away with not sorting.
- Use the smallest data structure possible: GL_POINTS if you can, instanced QUADs otherwise.
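Here’s the promised sketch of the half-resolution idea from the first point. It’s only an illustration, not code from the demos: the half-size color texture is assumed to be created elsewhere, the composite uses an immediate-mode fullscreen quad and assumes identity matrices, and the names are made up:

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL30.*;

// Hypothetical sketch: render translucent sprites at half resolution, then composite.
final class HalfResCompositeSketch {

    private int fbo;            // framebuffer with the half-resolution color texture attached
    private int halfResTexture; // color attachment, displayWidth/2 x displayHeight/2, RGBA
    private int displayWidth, displayHeight;

    void setup() {
        fbo = glGenFramebuffers();
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, halfResTexture, 0);
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }

    void renderFrame(final Runnable drawMainPass, final Runnable drawTranslucentSprites) {
        // 1) Main pass at full resolution.
        glViewport(0, 0, displayWidth, displayHeight);
        drawMainPass.run();

        // 2) Translucent sprites into the half-resolution FBO: a quarter of the fill-rate cost.
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, displayWidth / 2, displayHeight / 2);
        glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
        glClear(GL_COLOR_BUFFER_BIT);
        drawTranslucentSprites.run();
        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        // 3) Composite the result over the main pass as one alpha-blended, textured
        //    fullscreen quad; the upscale is just bilinear filtering.
        glViewport(0, 0, displayWidth, displayHeight);
        glEnable(GL_TEXTURE_2D);
        glBindTexture(GL_TEXTURE_2D, halfResTexture);
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
        drawFullscreenQuad();
        glDisable(GL_BLEND);
    }

    private void drawFullscreenQuad() {
        // Immediate mode, clip-space coordinates (assumes identity matrices), to keep the sketch short.
        glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
        glEnd();
    }
}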
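And a sketch of the “animate on the GPU” point, using transform feedback as in the transform feedback renderer mentioned above. Again just an illustration, not the demo’s code: the animation program, the two position VBOs and the "newPosition" varying name are assumptions, the program is assumed to be created elsewhere, GL_VERTEX_ARRAY is assumed to be enabled as in the renderers above, and the glTransformFeedbackVaryings overload shown is the CharSequence[] one (adjust to whatever your LWJGL version exposes):

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.*;
import static org.lwjgl.opengl.GL30.*;

// Hypothetical sketch: animate sprite positions in a vertex shader and capture the
// results with transform feedback, ping-ponging between two VBOs.
final class TransformFeedbackSketch {

    private int animProgram; // vertex shader writes the new position to varying "newPosition"
    private int sourceVBO;   // positions from the previous frame
    private int targetVBO;   // positions written this frame
    private int ballCount;

    // Call after attaching the shaders but before glLinkProgram(animProgram).
    void declareFeedbackVaryings() {
        glTransformFeedbackVaryings(animProgram, new CharSequence[] { "newPosition" }, GL_INTERLEAVED_ATTRIBS);
    }

    void animate() {
        glUseProgram(animProgram);

        // Read the old positions as a plain vertex attribute.
        glBindBuffer(GL_ARRAY_BUFFER, sourceVBO);
        glVertexPointer(2, GL_FLOAT, 0, 0);

        // Capture the vertex shader output into the other VBO; no fragments are needed.
        glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, targetVBO);
        glEnable(GL_RASTERIZER_DISCARD);

        glBeginTransformFeedback(GL_POINTS);
        glDrawArrays(GL_POINTS, 0, ballCount);
        glEndTransformFeedback();

        glDisable(GL_RASTERIZER_DISCARD);
        glUseProgram(0);

        // Ping-pong: this frame's output is next frame's input, and it's also the
        // buffer the point-sprite renderer draws from.
        final int tmp = sourceVBO;
        sourceVBO = targetVBO;
        targetVBO = tmp;
    }
}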
[quote]So, what are the lessons from this little “experiment”?
I’m using Slick, and it gets bogged down at a few hundred rendered entities. This makes me wonder whether I should stop using Slick and write my own renderer.
[/quote]
Have you ever noticed any performance problems with Slick2D? I read the full thread again and it looks like everything is on par except the pure GL/CL implementations.
I just profiled my last Slick game and neither CPU nor GPU usage was even noticeable. It runs at a full 60 fps even on five-year-old crappy laptops. Just about anything that has some kind of graphics card can run it. Pure OpenGL is always faster, but there’s a reason why low-level stuff should be avoided when you don’t need the best performance.