Sprites!

princec · January 13, 2010, 1:00pm

Arrgh! Me sprite engine isn’t fast enough!

Revenge of the Titans continues apace, with all sorts of extreme niceness being packed in there. Unfortunately this niceness has come at a bit of a cost; we’re now rendering around 1,000 sprites per frame (in maybe 100 draw calls) and we’re getting performance issues (that is, rendering at less than 60fps). Investigation reveals that the biggest bit of compiled code that takes the most time is the writeSpriteToBuffer call in the DefaultSpriteRenderer.

I think that buffer bounds checks are probably accounting for a not inconsiderable time waste, and probably also some general Java inefficiency at dealing with floating point mults and adds but without looking at the equivalent assembly produced by C++ and the JVM I can’t really speculate. And anyway, there’s pretty much bugger all I can do about the contents of that method: I have to do everything in there.

Also, the number of draw calls cannot really be optimised any more than it already is: the sprites are already sorted into the best possible (and only possible) rendering order by virtue of their Y coordinates, layers, sublayers, texture IDs, and rendering state.

So, the only other thing I can think of that will optimise things is using some modernfangled OpenGL cleverness like VBOs or somesuch. I used to use NVidia’s fence stuff and AGP RAM but no longer as it only worked on half the machines out there.

Will VBOs make things any faster for me? The actual amount of vertex data is pretty piddly - we’re only talking 1,000 sprites a frame here, that’s 4,000 vertices, or 125kb (yes, kilobytes!) of data being sent to OpenGL, and that’s being split up into roughly 100 1kb chunks anyway by virtue of the required rendering order.
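As a quick sanity check on those figures, here is a small sketch of the arithmetic. The 32 bytes per vertex is an assumption inferred from the numbers quoted (125kb spread over 4,000 vertices), not a confirmed description of the engine's vertex format, and the class name is made up for illustration.

```java
final class VertexBudget {
    // Assumed for illustration: each sprite is a quad with 4 vertices.
    static final int VERTICES_PER_SPRITE = 4;

    /** Bytes of vertex data per frame for a given sprite count and vertex size. */
    static int bytesPerFrame(int sprites, int bytesPerVertex) {
        return sprites * VERTICES_PER_SPRITE * bytesPerVertex;
    }

    public static void main(String[] args) {
        // 32 bytes per vertex is inferred from the quoted totals, not confirmed.
        int bytes = bytesPerFrame(1000, 32);
        System.out.println(bytes + " bytes = " + (bytes / 1024) + " KB per frame");
        System.out.println("Average per draw call (100 calls): " + (bytes / 100) + " bytes");
    }
}
```

Under that assumption the total comes to 128,000 bytes per frame, so each of the roughly 100 draw calls carries a little over 1kb on average.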

I’m otherwise rather dismayed at the atrocious performance I seem to be getting these days

Cas

Spasi · January 13, 2010, 1:42pm

You can try pseudo-instancing or ARB_draw_instanced on modern GPUs. You could also use texture atlases or a unified fragment shader that does multiple kinds of rendering, to reduce the number of draw calls.
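To make the texture atlas idea concrete, here is a minimal sketch of looking up texture coordinates for a sprite stored in one shared texture. It assumes the simplest possible layout, a fixed grid of equally sized cells; `AtlasGrid` and its method are hypothetical names, and a real atlas would more likely store a rectangle per sprite.

```java
final class AtlasGrid {
    private final int columns;
    private final int rows;

    AtlasGrid(int columns, int rows) {
        this.columns = columns;
        this.rows = rows;
    }

    /** Returns {u0, v0, u1, v1} for a cell, counted row-major from the top-left cell. */
    float[] uv(int cellIndex) {
        if (cellIndex < 0 || cellIndex >= columns * rows) {
            throw new IndexOutOfBoundsException("No such atlas cell: " + cellIndex);
        }
        int col = cellIndex % columns;
        int row = cellIndex / columns;
        float cellWidth = 1.0f / columns;   // texture coordinates run from 0 to 1
        float cellHeight = 1.0f / rows;
        float u0 = col * cellWidth;
        float v0 = row * cellHeight;
        return new float[] { u0, v0, u0 + cellWidth, v0 + cellHeight };
    }
}
```

Because every sprite then samples the same texture, sprites that used to be split by texture ID can share a draw call, as long as the rest of their rendering state matches.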

It all depends on the specifics of your rendering code of course. There are lots of tricks you can use if you’re willing to exploit shaders, but I’m fairly sure you’re going to want a solution that works on older hardware. Unfortunately, I don’t think simply switching to VBOs would make a significant difference, for sprite rendering that is. It could provide a slight performance boost, but I find that VBOs are too platform/vendor sensitive. I wouldn’t depend on them on pre-shader hardware anyway.

pjt33 · January 13, 2010, 2:16pm

Are you sure that you’re limited by the Java code rather than by the fill rate of the graphics card? (And if so, can you tell me how to test this? I’d love to know).

kappa · January 13, 2010, 2:17pm

From a brief look at your code, I see you’re still using Java’s default Math library (cos, sin, tan, atan and atan2 are all very slow on it). You should really be using Riven’s FastMath library; it’s up to 8x faster than Java’s default Math library. From what I’ve seen it really does have a big impact in speeding up code, especially for action-packed 2D games. Another nice thing about it is that it uses float math throughout, so there’s no need to cast from double or even use double. (Riven’s site seems to be down atm but might be up again soon.)
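For readers who cannot reach that library, a common way to get fast approximate trig is a precomputed lookup table. The sketch below is a generic illustration of that technique, not Riven’s code; the class name, table size and resulting accuracy are all assumptions chosen for the example.

```java
final class TableTrig {
    private static final int BITS = 12;
    private static final int SIZE = 1 << BITS;   // 4096 table entries
    private static final int MASK = SIZE - 1;    // wraps an index into the table
    private static final float RAD_TO_INDEX = (float) (SIZE / (2.0 * Math.PI));
    private static final float[] SIN = new float[SIZE];

    static {
        // Fill the table once, at class load time, using the slow but exact call.
        for (int i = 0; i < SIZE; i++) {
            SIN[i] = (float) Math.sin(i * 2.0 * Math.PI / SIZE);
        }
    }

    /** Approximate sine; the error is bounded by the table step of 2*pi/4096 radians. */
    static float sin(float radians) {
        return SIN[(int) (radians * RAD_TO_INDEX) & MASK];
    }

    /** Approximate cosine, read from the same table a quarter turn ahead. */
    static float cos(float radians) {
        return SIN[((int) (radians * RAD_TO_INDEX) + SIZE / 4) & MASK];
    }
}
```

The trade-off is precision for speed: with 4096 entries the result is only good to roughly three decimal places, which is usually fine for rotating sprites but not for anything that accumulates error.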

Also use the following method as an alternative to Math.random() http://www.java-gaming.org/index.php/topic,18426.msg155445.html#msg155445
Again, massively faster than Math.random().

Also what type of Lists are you using? (if any)

princec · January 13, 2010, 2:58pm

@Spasi - looks like it might be fraught with pitfalls and complexity, I have a feeling also that it won’t help too much.

@pjt33 - profiling shows I’m spending, er what was it, 7.5% of the time writing sprites and 35% of the time calling gl commands (25% glDrawArrays or so). That’s on my uber-rig. I wondered if by using some sort of magic VBO memory the glDrawArrays commands would either be quicker or return asynchronously or something clever like that, letting me get on with doing something else with the CPU other than waiting for the GPU to finish.

@kappa - 99% of the sprites aren’t rotated, so the vast majority of sprites drawn never go near sin or cos, so there’s little point in trying to optimise those calls away. About half the sprites in the frame are scaled, which makes for 4 floating point mults in addition to the floating point adds. I suspect that’s probably far less time than I worry about. I would like to know if there was some way to disable buffer bounds checking and see if that’s slowing things down much but I kinda doubt it - after all the actual drawing is taking 4x as long as the data collection phase.

I barely use random numbers, and I barely use Lists. The GC rarely ever fires.

Cas

elias4444 · January 13, 2010, 4:41pm

I’m gonna say it…

Try implementing VBOs!

My MvR engine was mostly DisplayLists and the like. With my new engine, I switched over to VBOs (which I admit, took some restructuring). The speed increase was VERY noticeable. I even switched my font rendering class over (which is basically like your sprite rendering) and noticed a big leap there as well. I have to wonder if the drivers these days are just geared that way.

The other big leap I saw was when I moved from individual textures for sprites to a single, larger texture with all the sprite graphics in it. It didn’t help much when I was using DisplayLists, but with VBOs it was a different story.

NOW, my engine isn’t just for sprites, so your mileage may vary.

Here, this will help you on your way (it’s the least I can do for the help you’ve given me):


package com.tommyengine.utils;

import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import org.lwjgl.BufferUtils;
import org.lwjgl.Sys;
import org.lwjgl.opengl.ARBVertexBufferObject;
import org.lwjgl.opengl.GLContext;

public class VBOUtil {

	/*
	 * To create a VBO ID
	 */
	public static int createVBOID() {
		if (GLContext.getCapabilities().GL_ARB_vertex_buffer_object) {
			IntBuffer buffer = BufferUtils.createIntBuffer(1);
			ARBVertexBufferObject.glGenBuffersARB(buffer);
			return buffer.get(0);
		} else {
			Sys.alert("ERROR", "VBOs not supported on this hardware");
			System.exit(0);
			return 0;
		}
	}

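	public static void destroyVBOID(int id) {
		// glUnmapBufferARB expects a buffer target, not an ID; deleting the buffer is what releases it
		IntBuffer buffer = BufferUtils.createIntBuffer(1);
		buffer.put(0, id);
		ARBVertexBufferObject.glDeleteBuffersARB(buffer);
	}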

	public static void bufferDynamicData(int id, FloatBuffer buffer) {
		ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, id);
		ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, buffer, ARBVertexBufferObject.GL_DYNAMIC_DRAW_ARB);
	}

	public static void bufferStaticData(int id, FloatBuffer buffer) {
		ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, id);
		ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, buffer, ARBVertexBufferObject.GL_STATIC_DRAW_ARB);
	}

	public static void updateBufferData(int id, FloatBuffer buffer) {
		ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, id);
		ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, buffer, ARBVertexBufferObject.GL_DYNAMIC_DRAW_ARB);
	}

	public static void bind(int vboID) {
		ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, vboID );
	}

	
	//////////////////////////////////////
	//// For Index Buffers (IBOs) ////////
	//////////////////////////////////////

	public static void bindIndex(int vboID) {
		ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ELEMENT_ARRAY_BUFFER_ARB, vboID );
	}

	public static void bufferIndexData(int id, IntBuffer buffer) {
		ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ELEMENT_ARRAY_BUFFER_ARB, id);
		ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ELEMENT_ARRAY_BUFFER_ARB, buffer, ARBVertexBufferObject.GL_STATIC_DRAW_ARB);
	}


}

A sample VBO creation using that code:


// Geometry = FloatBuffer()  -- but I'm guessing you knew that

// .put() all of your vertex information into geometry here

coordsVBOID = VBOUtil.createVBOID();
geometry.flip();
VBOUtil.bufferDynamicData(coordsVBOID, geometry);

// .put() all of your texture coordinate information into textureCoords here

textureVBOID = VBOUtil.createVBOID();
textureCoords.flip();
VBOUtil.bufferDynamicData(textureVBOID, textureCoords);

A sample draw routine:


GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
VBOUtil.bind(coordsVBOID);
GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0);

GL11.glEnableClientState(GL11.GL_TEXTURE_COORD_ARRAY);
VBOUtil.bind(textureVBOID);
GL11.glTexCoordPointer(2, GL11.GL_FLOAT, 0, 0);

GL11.glDrawArrays(GL11.GL_TRIANGLES, 0, vertexCount);

GL11.glDisableClientState(GL11.GL_TEXTURE_COORD_ARRAY);
GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);

Good luck!

princec · January 13, 2010, 4:50pm

Actually that’s rather helpful just having it there to stare at. I’ve been skirting around VBOs for years because they used to be a bit rubbish and I never really needed the performance before. Let’s hope that it gets me that much-needed speed increase! (I’ve managed to get another 10% boost by using the server VM too)

Cas

elias4444 · January 13, 2010, 4:52pm

DOH! And I just remembered… I super-boosted my particle engine performance by moving all of the QUAD sprites for it into a SINGLE VBO!!! Yes! A SINGLE VBO!!! It drastically reduced my number of GL calls.

My ENTIRE list of viewable particles (which are all QUAD sprites) is called by this single function:


	public void draw() {

		int vertexCount = (numVisible)*4;
		if (vertexCount > 0) {

			GL11.glPushMatrix();

			texture.bind();

			GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
			GL11.glEnableClientState(GL11.GL_COLOR_ARRAY);
			GL11.glEnableClientState(GL11.GL_TEXTURE_COORD_ARRAY);

			VBOUtil.bind(geomVBOID);
			GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0);

			VBOUtil.bind(colorVBOID);
			GL11.glColorPointer(4, GL11.GL_FLOAT, 0, 0);

			VBOUtil.bind(texVBOID);
			GL11.glTexCoordPointer(2, GL11.GL_FLOAT, 0, 0);

			GL11.glDrawArrays(GL11.GL_QUADS, 0, vertexCount);

			GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);
			GL11.glDisableClientState(GL11.GL_COLOR_ARRAY);
			GL11.glDisableClientState(GL11.GL_TEXTURE_COORD_ARRAY);
			
			Texture.unbind();

			GL11.glPopMatrix();

		}
	}

That’s the power of VBOs. When I update the particles, I literally just clear the FloatBuffers, recalculate each particle (I have a particle array for ongoing data), throw them back into the FloatBuffers via .put(), and then call:


particleGeometry.flip();
VBOUtil.updateBufferData(geomVBOID, particleGeometry);
particleColoring.flip();
VBOUtil.updateBufferData(colorVBOID, particleColoring);

princec · January 13, 2010, 4:59pm

My sprite engine’s a bit more generic than that; it’s already pretty much doing that wherever it can. One of the real killers is Y-sort. I wonder if there’s any easy way I can optimise that part. I only need to Y-sort if the sprites actually overlap each other. What I need is an algorithm to band them together, even if roughly.
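One rough way to band sprites, sketched under stated assumptions: sort by top edge, then sweep once, starting a new band whenever a sprite begins below everything seen so far. Sprites in different bands cannot overlap, so only sprites within the same band would still need a strict Y order. `YBands` and `Box` are hypothetical names, the sketch only looks at vertical extent (ignoring X, layers and textures), and it assumes screen Y grows downward.

```java
import java.util.Arrays;
import java.util.Comparator;

final class YBands {
    /** Hypothetical minimal sprite: only the vertical extent matters here. */
    static final class Box {
        final float top;
        final float bottom;   // assumed: top <= bottom
        int band = -1;        // filled in by assignBands

        Box(float top, float bottom) {
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** Sorts boxes by top edge, assigns each a band index, and returns the band count. */
    static int assignBands(Box[] boxes) {
        Arrays.sort(boxes, new Comparator<Box>() {
            public int compare(Box a, Box b) {
                return Float.compare(a.top, b.top);
            }
        });
        int band = -1;
        float bandBottom = Float.NEGATIVE_INFINITY;
        for (Box box : boxes) {
            if (box.top > bandBottom) {
                band++;                       // vertical gap: nothing above can overlap this box
                bandBottom = box.bottom;
            } else {
                bandBottom = Math.max(bandBottom, box.bottom);
            }
            box.band = band;
        }
        return band + 1;
    }
}
```

Whether this buys anything depends on how densely the sprites are packed: if most of the screen is covered, everything collapses into one band and the full Y-sort is still needed.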

Cas

Riven · January 13, 2010, 5:47pm

In-place array sort (better than Arrays.sort(), which creates an auxiliary Object[])
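As an illustration of sorting without an auxiliary array, here is a plain insertion sort over the first `count` elements of an array. This is a stand-in sketch, not the code linked above. Insertion sort is quadratic in the worst case but close to linear on nearly sorted input, which is an assumption that tends to hold when a sprite list changes little from frame to frame.

```java
import java.util.Comparator;

final class InPlaceSort {
    /** Stable insertion sort over items[0..count), using no auxiliary array. */
    static <T> void sort(T[] items, int count, Comparator<? super T> order) {
        for (int i = 1; i < count; i++) {
            T item = items[i];
            int j = i - 1;
            // Shift larger elements one slot right until item's position is found.
            while (j >= 0 && order.compare(items[j], item) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = item;
        }
    }
}
```

Passing `count` separately lets the caller keep one oversized array and reuse it every frame instead of allocating a right-sized one.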

princec · January 13, 2010, 6:22pm

Take a peek at the sprite engine sort

Cas

elias4444 · January 13, 2010, 6:23pm

Your problems with Y-sort seem a little hefty considering it’s a sprite-based game. Can you give me an example of the sprites that need to be sorted, and why?

Riven · January 13, 2010, 6:45pm

I’m pretty sure I found the cause of your dogslow writeSpriteToBuffer() method

You are writing data with FloatBuffer.put(value).

Relative puts are SLOW. It’s not like floatArray[p++] = value

Perform management of the ‘offset’ yourself and see a massive performance increase.

Old:


		i += VERTEX_SIZE;

		buffer.floats.position(i >> 2);
		buffer.floats.put(x10);
		buffer.floats.put(y10);
		buffer.floats.put(z);
		buffer.floats.put(s.isMirrored() ? tx0 : tx1);
		buffer.floats.put(s.isFlipped() ? ty0 : ty1);
		if (useTexture1) {
			buffer.floats.put(s.getTx10());
			buffer.floats.put(s.getTy10());
		}

New:


		i += VERTEX_SIZE;

		p = i >> 2;
		buffer.floats.put(p++, x10);
		buffer.floats.put(p++, y10);
		buffer.floats.put(p++, z);
		buffer.floats.put(p++, s.isMirrored() ? tx0 : tx1);
		buffer.floats.put(p++, s.isFlipped() ? ty0 : ty1);
		if (useTexture1) {
			buffer.floats.put(p++, s.getTx10());
			buffer.floats.put(p++, s.getTy10());
		}

Due to the non-deterministic behaviour of HotSpot regarding Buffer performance, it might even be better to dump your data into float[]s and byte[]s and put() them into your FloatBuffer/ByteBuffer
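A minimal sketch of that last suggestion, assuming a five-float vertex (x, y, z, u, v) to mirror the five unconditional puts above. `VertexStaging` and its method names are hypothetical; the only buffer calls used are the standard java.nio bulk `put(float[], int, int)`, `clear()` and `flip()`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class VertexStaging {
    private final float[] staging;     // reused every frame, never reallocated
    private final FloatBuffer target;  // direct buffer that would be handed to OpenGL
    private int count;                 // number of floats currently staged

    VertexStaging(int maxFloats) {
        staging = new float[maxFloats];
        target = ByteBuffer.allocateDirect(maxFloats * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Starts a new frame by discarding whatever was staged before. */
    void begin() {
        count = 0;
    }

    /** Stages one vertex with plain array writes instead of per-float buffer puts. */
    void vertex(float x, float y, float z, float u, float v) {
        staging[count++] = x;
        staging[count++] = y;
        staging[count++] = z;
        staging[count++] = u;
        staging[count++] = v;
    }

    /** Copies everything staged into the buffer with a single bulk put. */
    FloatBuffer end() {
        target.clear();
        target.put(staging, 0, count);
        target.flip();
        return target;
    }
}
```

Usage would be `begin()`, one `vertex(...)` call per corner of each sprite, then `end()` to get a flipped buffer ready for the vertex pointer or VBO upload.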

princec · January 13, 2010, 6:49pm

Ooh! That sounds like a useful tip. I will give it a whirl.

You’ll soon see why we need to do Y sorting.

Cas

EgonOlsen · January 13, 2010, 8:32pm

How large are these sprites actually? I did a quick test with a pretty stupid sprite blitter that even uses Collections.sort() each frame to do the y-sort (http://www.jpct.net/download/hacks/SpriteTest.zip - SPACE to add 1000 sprites, s to toggle size) and checked how many alpha blended and scaled sprites I could blit onto a 1024*768 screen before falling below 60 fps. The results (large sprites/small sprites) ranged from ~12000/~20000 on a Core2 Quad @ 3.2GHz/Radeon 5870 down to ~100/~2100 on a P4 @ 2.2GHz/Geforce 2 Go, with an older midrange system (Athlon X2 @ 2200MHz/ATI 3650 AGP) somewhere in the middle at ~1000/~12000. What kind of system are you actually running your tests on?

princec · January 13, 2010, 8:41pm

5000 small sprites runs at ~60fps using the server VM. Are those sprites alpha blended?

It’s an AMD Turion 64, dual core 1.6GHz. The chipset is a 6150 Go.

Cas

EgonOlsen · January 13, 2010, 8:46pm

Yes, all of them. But the alpha channel isn’t pretty, because it’s calculated at load time based on the black parts of the texture.

elias4444 · January 13, 2010, 8:56pm

Riven: I just tried your method of using an index for the buffer puts… it cut my performance by more than half. I get much better performance clearing the buffer and then refilling it with the updated information. I’m therefore guessing you were specifically talking about instances where you’re only updating a small portion of the data in the buffer?

thalador · January 13, 2010, 9:04pm

I also got a huuuge performance boost when switching to VBOs.

Riven mentioned the Buffer’s put() operation: for me single put()s are pretty slow, so I try to put() everything into the Buffer with a single put(). To do that I keep arrays for all the data (vertex, color, etc.), collect the game’s data in there every frame, and then put it in the buffer with one call. I don’t even create the arrays on every render call, to keep the garbage low. Instead a simple int records how much of the array is in use. Works like a charm for me.

[EDIT] Ahh, that’s exactly what Riven suggests in his latest post. Didn’t see that.

Riven · January 13, 2010, 10:35pm

You saw:

?

FloatBuffer performance is so erratic that it is simply best avoided. Often absolute puts are about 2-4 times as fast as relative puts. Sometimes they are even slower, especially if you have created a heap FloatBuffer somewhere else.
It sometimes even has completely different performance characteristics in two identical VM launches. float[] is fast, always, independent of the day of the week.