Crabs!

It’s the same speed as before, just with the correct fps counter.

Cas :slight_smile:

HTC Sensation goes down to 30fps with 600 crabs. At 300 it’s down to about 50fps, and at 400 it hovers around 40fps.

Endolf

Hm, so some more extensive testing:

With 1 pixel opaque sprites - the bare minimum you’ll agree, and not enough to be troubled by fillrate, I eventually drop down to 15fps with 2300 or so sprites. That’s about 35 sprites per millisecond, which just strikes me as really poor :frowning: And this is purely the “rendering” part as well - this is just GL commands being issued, no sorting etc. going on here, apart from the tiny overhead to draw the FPS counter stuff. Targeting a phone that’s not quite as powerful as the Galaxy S II - say, something around 800MHz versus its 1.2GHz, perhaps - we’d probably be looking at 2/3rds of that - maybe 22 sprites per millisecond.

And that’s not even thinking about the fact that we’re not actually counting the game logic or sprite engine overhead here :frowning: On a single-core phone maybe we’d only get half the performance, like 11 sprites per millisecond! To get a half decent framerate (30fps being precisely half-decent) on a single core 800MHz phone means we’d be looking at using a paltry 350-odd sprites per frame! You might not think that’s such a big deal but again, that’s with no game logic at all to speak of. Even our simple little space invaders game Titan Attacks uses between 500-1000 sprites per frame and that doesn’t include non-sprite rendering like special effects and score displays. Droid Assault uses between 1500-5000 sprites per frame!
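The back-of-envelope budget above can be written out as a small calculation. This is only a sketch of the arithmetic in the post: the class and method names are made up for illustration, and the 2/3 and 1/2 scaling factors are the post's own guesses rather than measurements.

```java
// Hypothetical helper that reproduces the rough arithmetic from the post.
final class SpriteBudget {
    /** Sprites drawn per millisecond, given sprites per frame and frames per second. */
    static double spritesPerMs(int spritesPerFrame, double fps) {
        return spritesPerFrame * fps / 1000.0;
    }

    /** Whole sprites that fit in one frame at the target frame rate. */
    static int spritesPerFrame(double spritesPerMs, double targetFps) {
        return (int) (spritesPerMs * 1000.0 / targetFps);
    }

    public static void main(String[] args) {
        double measured = spritesPerMs(2300, 15);    // Galaxy S II figure from the post
        double slowerClock = measured * 2.0 / 3.0;   // guess: ~800MHz instead of 1.2GHz
        double singleCore = slowerClock / 2.0;       // guess: one core instead of two
        System.out.println("sprites/ms measured:    " + measured);
        System.out.println("sprites/ms single core: " + singleCore);
        System.out.println("sprites/frame @ 30fps:  " + spritesPerFrame(singleCore, 30));
    }
}
```

Without the intermediate rounding used in the post the result comes out slightly higher than 350, but it is in the same ballpark.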

I tried upping the size of the sprites to see what difference fill rate made to the Galaxy and remarkably even increasing the sprites to 32x32 pixels had absolutely no effect on framerate, leading me to think that the hardware rasteriser is proper fast.

So: I’ve uploaded a new .apk* with no crabs, just tiny pixels. At what point do you reach a pretty consistent 15fps?

Cas :slight_smile:

* sorry about the size - in the middle of working around an Android asset problem

[quote]At what point do you reach a pretty consistent 15fps
[/quote]
370

phone has a 600 MHz single core cpu

[quote]370

phone has a 600 MHz single core cpu
[/quote]
Oh dear, that’s lame :frowning:

Cas :slight_smile:

I did buy this phone new together with a plan, this year.
So my guess is, it’s safe to assume that many people have phones this slow

you don’t even notice it normally - everything is fast, except for a few games from the market

try to make the best out of it =D

Well, I am currently very suspicious anyway. According to the ARM website, the GPU inside the Galaxy should manage 30m triangles / second (in admittedly what is probably the most contrived test case possible of a single triangle fan with 30m triangles in it). If every single triangle was discrete we might reasonably assume that to end up being 10m triangles/second, or 30m vertices/second. The sprite engine draws each sprite as a pair of triangles with two shared vertices so we should be using up approximately 4 vertices per sprite, and so the theoretical simple throughput should be about 7.5m sprites per second, or 7500 sprites per millisecond. And yet I calculate I’m getting just 35 sprites per millisecond.

Clearly something is completely amiss, for the specs to tell me I should be managing 200x more sprites.

To clarify further: this is literally a single call to glDrawElements with one single batch of GL_TRIANGLES, where each triangle is a single pixel, drawn opaquely. Nothing else. And yet I’m 200x slower than I was expecting. Or have I got something seriously wrong with my maths?
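For anyone checking the maths, here is the same estimate as a few lines of Java. It is a sketch only: the 30m vertices/second figure and the 4 vertices per sprite are the assumptions stated in the post, and the class name is hypothetical.

```java
// Hypothetical helper: compares the spec-sheet estimate with the measured rate.
final class TheoreticalThroughput {
    /** Sprites per millisecond if vertex rate were the only limit. */
    static double spritesPerMs(double verticesPerSecond, int verticesPerSprite) {
        return verticesPerSecond / verticesPerSprite / 1000.0;
    }

    public static void main(String[] args) {
        double theoretical = spritesPerMs(30e6, 4);  // assumed 30m vertices/s, 4 per sprite
        double measured = 35.0;                      // sprites/ms observed in the benchmark
        System.out.println("theoretical sprites/ms: " + theoretical);
        System.out.println("shortfall factor:       " + theoretical / measured);
    }
}
```

The shortfall factor comes out a little over 200, so the arithmetic in the post holds up under its own assumptions.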

Cas :slight_smile:

Let me start by stating that I haven’t run this benchmark, so I have no idea what I’m talking about.

Are you sure fillrate isn’t your bottleneck?

Galaxy S II

30 FPS until 1500 crabs.

Definitely. These are 1 pixel sprites. When I increase each sprite to 32x32 the FPS is unchanged.

Cas :slight_smile:

Samsung Galaxy S
15 fps @ ~1200

Aha! Bad news! For me, anyway. Just after I went to bed last night (bah, typical) I realised of course that the answer was staring me in the face. It couldn’t possibly be the single call to glDrawElements that was slow. And indeed it wasn’t, when I commented it out. In fact, the call to glDrawElements is so fast it doesn’t even register. Unfortunately it’s my sprite engine that’s slowing things down. Now I’m going to get to grips with Android profiling and find out what bit’s slowest. This is going to be no mean feat - I think I need a factor of 10 speedup :S

Cas :slight_smile:

My suspicions are confirmed. Turns out sorting takes almost no time at all. Remove the buffer write though, and I get 11,000 sprites before it drops to 15fps. If I remove the sprite transform/scale/rotate part I get 12,500 sprites, but there’s not a lot I can really do about that code so it’s not going to be optimisable any further. So: merely writing the vertex data is giving me a 5x slowdown, which is very troublesome and suspicious. That is, after all, only 300kb of data in a frame for 2,300 sprites.

Cas :slight_smile:

Why not share it, so we can poke at it. Maybe turn it into a competition, for a small prize :slight_smile:

That way you can work on your next game, instead of on silly performance problems.

My MappedObject library will probably help you out, as you can completely skip that step.

I fear that I don’t have access to Unsafe in Android. And I am resolute that I will not use native code either :slight_smile:

Here’s the offending method:


		void add(Sprite s, Style newStyle) {
			SpriteImage image = s.image;
			GLBaseTexture newTexture0 = image.texture;
			if (currentRun == null || newStyle != currentStyle || newTexture0 != currentTexture) {
				// Changed state. Start new state.
				currentRun = stateRun[numRuns ++];
				currentRun.style = newStyle;
				currentRun.texture = newTexture0;
				currentRun.startIndex = indexCursor;
				currentRun.endIndex = indexCursor;
				currentStyle = newStyle;
				currentTexture = newTexture0;
			}

			final float tx0 = image.tx0;
			final float tx1 = image.tx1;
			final float ty0 = image.ty0;
			final float ty1 = image.ty1;
			final float xscale = s.xscale;
			final float yscale = s.yscale;
			final float x = s.x + s.ox;
			final float y = s.y + s.oy;
			final float alpha = s.alpha * ALPHA_DIV;

			// First scale then rotate coordinates
			float scaledx0 = -image.hotspotx * xscale;
			float scaledy0 = -image.hotspoty * yscale;
			float scaledx1 = (image.w - image.hotspotx) * xscale;
			float scaledy1 = (image.h - image.hotspoty) * yscale;

			float scaledx00, scaledx10, scaledx11, scaledx01, scaledy00, scaledy10, scaledy11, scaledy01;

			// Then rotate
			final double angle = s.angle;
			if (angle != 0) {
				double angle2 = toRadians(angle);
				double cos = cos(angle2);
				double sin = sin(angle2);

				scaledx00 = (float) (cos * scaledx0 - sin * scaledy0);
				scaledx10 = (float) (cos * scaledx1 - sin * scaledy0);
				scaledx11 = (float) (cos * scaledx1 - sin * scaledy1);
				scaledx01 = (float) (cos * scaledx0 - sin * scaledy1);
				scaledy00 = (float) (sin * scaledx0 + cos * scaledy0);
				scaledy10 = (float) (sin * scaledx1 + cos * scaledy0);
				scaledy11 = (float) (sin * scaledx1 + cos * scaledy1);
				scaledy01 = (float) (sin * scaledx0 + cos * scaledy1);
			} else {
				scaledx00 = scaledx0;
				scaledx10 = scaledx1;
				scaledx11 = scaledx1;
				scaledx01 = scaledx0;
				scaledy00 = scaledy0;
				scaledy10 = scaledy0;
				scaledy11 = scaledy1;
				scaledy01 = scaledy1;
			}

			// Then translate them
			final float x00 = scaledx00 + x;
			final float x01 = scaledx01 + x;
			final float x11 = scaledx11 + x;
			final float x10 = scaledx10 + x;
			final float y00 = scaledy00 + y;
			final float y01 = scaledy01 + y;
			final float y11 = scaledy11 + y;
			final float y10 = scaledy10 + y;
			final ReadableColor[] colors = s.color;

			FloatBuffer floats = vertices;
			IntBuffer ints = intColors;
			int vertex = vertexCursor * VERTEX_SIZE_IN_FLOATS;
			int ivertex = vertex;
			floats.position(vertex);
			ints.position(ivertex);
			floats.put(x00);
			floats.put(y00);
			floats.put(0.0f);
			floats.put(s.mirrored ? tx1 : tx0);
			floats.put(s.flipped ? ty0 : ty1);
			ReadableColor color = colors[0];
			ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
					| (int) (color.getAlpha() * alpha) << 24);

			vertex += VERTEX_SIZE_IN_FLOATS;
			ivertex += VERTEX_SIZE_IN_FLOATS;
			floats.position(vertex);
			ints.position(ivertex);
			floats.put(x10);
			floats.put(y10);
			floats.put(0.0f);
			floats.put(s.mirrored ? tx0 : tx1);
			floats.put(s.flipped ? ty0 : ty1);
			color = colors[1];
			ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
					| (int) (color.getAlpha() * alpha) << 24);

			vertex += VERTEX_SIZE_IN_FLOATS;
			ivertex += VERTEX_SIZE_IN_FLOATS;
			floats.position(vertex);
			ints.position(ivertex);
			floats.put(x11);
			floats.put(y11);
			floats.put(0.0f);
			floats.put(s.mirrored ? tx0 : tx1);
			floats.put(s.flipped ? ty1 : ty0);
			color = colors[2];
			ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
					| (int) (color.getAlpha() * alpha) << 24);

			vertex += VERTEX_SIZE_IN_FLOATS;
			ivertex += VERTEX_SIZE_IN_FLOATS;
			floats.position(vertex);
			ints.position(ivertex);
			floats.put(x01);
			floats.put(y01);
			floats.put(0.0f);
			floats.put(s.mirrored ? tx1 : tx0);
			floats.put(s.flipped ? ty1 : ty0);
			color = colors[3];
			ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
					| (int) (color.getAlpha() * alpha) << 24);


			// Write indices: need 6, for two triangles
			indices.position(indexCursor);
			indices.put((short) (vertexCursor + 0));
			indices.put((short) (vertexCursor + 1));
			indices.put((short) (vertexCursor + 2));
			indices.put((short) (vertexCursor + 0));
			indices.put((short) (vertexCursor + 2));
			indices.put((short) (vertexCursor + 3));


			indexCursor += 6;
			vertexCursor += 4;
			currentRun.endIndex += 6;
		}

(You’ve seen it before, and I know it’s maybe not as efficient as it could be - just surprised at how inefficient it is.) First plan: write everything to an int[] using Float.floatToIntBits, then blat the entire int[] to the bytebuffer. Suspect that might be the best way for Android. Have to do int[] because of a colossal Android performance snafu when using FloatBuffer bulk puts.

Cas :slight_smile:

  1. See how my FastMath.sin/cos speeds up that rotation part, although you’re probably not rotating right now. Remove the double. You can reduce the memory footprint of the lookuptable with the static SIN_BITS variable.

  2. Only use indexed put/get on buffers. Treat a buffer like an array: do your own managing of indices.


    // Instead of relative puts:
    // Write indices: need 6, for two triangles
    indices.position(indexCursor);
    indices.put(...);
    indices.put(...);


    // use absolute (indexed) puts:
    // Write indices: need 6, for two triangles
    indices.put(indexCursor+0, ...);
    indices.put(indexCursor+1, ...);
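The lookup-table idea from point 1 can be sketched roughly as follows. This is not the actual FastMath source: the class name, the table layout, and the `DEG_TO_INDEX` constant are assumptions made for illustration, and only `SIN_BITS`, `sinDeg`, and `cosDeg` are names mentioned in the thread. It trades a little accuracy for avoiding `double` maths on every sprite.

```java
// Illustrative lookup-table trig; an assumption-laden sketch, not the real library.
final class TableTrig {
    private static final int SIN_BITS = 12;              // table has 2^SIN_BITS entries
    private static final int SIN_COUNT = 1 << SIN_BITS;
    private static final int SIN_MASK = SIN_COUNT - 1;
    private static final float DEG_TO_INDEX = SIN_COUNT / 360.0f;
    private static final float[] SIN = new float[SIN_COUNT];

    static {
        // Fill one full period of sine; entry i covers angle i * 360 / SIN_COUNT degrees.
        for (int i = 0; i < SIN_COUNT; i++) {
            SIN[i] = (float) Math.sin(i * 2.0 * Math.PI / SIN_COUNT);
        }
    }

    static float sinDeg(float degrees) {
        // Truncate to a table slot; the mask wraps angles outside [0, 360), negatives included.
        return SIN[(int) (degrees * DEG_TO_INDEX) & SIN_MASK];
    }

    static float cosDeg(float degrees) {
        // cos(a) = sin(a + 90 degrees), i.e. a quarter of the table further on.
        return SIN[((int) (degrees * DEG_TO_INDEX) + SIN_COUNT / 4) & SIN_MASK];
    }

    public static void main(String[] args) {
        System.out.println(sinDeg(30f) + " vs " + Math.sin(Math.toRadians(30)));
        System.out.println(cosDeg(60f) + " vs " + Math.cos(Math.toRadians(60)));
    }
}
```

Lowering `SIN_BITS` shrinks the table at the cost of coarser angle steps; with 12 bits each step is a little under 0.1 degrees.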

Ok, by using an int[] array to build the vertex data, putting all the floats into it using Float.floatToRawIntBits(), then copying all the vertex data in one go to the direct buffer, I managed to double performance: got it up to 4600 sprites @ 15fps. With your FastMath.sinDeg/cosDeg methods, I managed to eke a further 10% out and got it up to 5200 sprites @ 15fps.

So that’s about 2x faster basically. It would help now to get it about 5x faster. I’m just going to try absolute puts into a direct IntBuffer and avoid the int[] array copy and see how that improves things…
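A minimal sketch of the int[] staging approach described above might look like this. The class name, the six-ints-per-vertex layout, and the method names are assumptions for illustration; the real engine's vertex stride and buffers are not shown in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical staging writer: build vertices in a heap int[], then bulk-copy once.
final class StagedVertexWriter {
    private final int[] staging;     // plain heap array, written with ordinary array stores
    private final IntBuffer direct;  // int view of the direct buffer that would go to GL
    private int cursor;

    StagedVertexWriter(int capacityInInts) {
        staging = new int[capacityInInts];
        direct = ByteBuffer.allocateDirect(capacityInInts * 4)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
    }

    /** Appends one vertex: x, y, z = 0, texture coordinates, packed colour (assumed layout). */
    void putVertex(float x, float y, float tx, float ty, int packedColor) {
        final int[] a = staging;
        int i = cursor;
        a[i++] = Float.floatToRawIntBits(x);
        a[i++] = Float.floatToRawIntBits(y);
        a[i++] = Float.floatToRawIntBits(0.0f);
        a[i++] = Float.floatToRawIntBits(tx);
        a[i++] = Float.floatToRawIntBits(ty);
        a[i++] = packedColor;
        cursor = i;
    }

    /** Copies everything staged so far into the direct buffer with a single bulk put. */
    IntBuffer flush() {
        direct.clear();
        direct.put(staging, 0, cursor);
        direct.flip();
        cursor = 0;
        return direct;
    }

    public static void main(String[] args) {
        StagedVertexWriter writer = new StagedVertexWriter(4 * 6); // room for one sprite
        writer.putVertex(10f, 20f, 0f, 1f, 0xFFFFFFFF);
        IntBuffer out = writer.flush();
        System.out.println("ints written: " + out.remaining());
        System.out.println("x read back:  " + Float.intBitsToFloat(out.get(0)));
    }
}
```

The point of the design is that the per-vertex work touches only a Java array, and the buffer is touched once per frame.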

Cas :slight_smile:

I think that’s too much bytecode for a single method. Your implementation must look much different now, but here’s how I’d start optimizing:

void add(Sprite s, Style newStyle) {
	SpriteImage image = s.image;
	GLBaseTexture newTexture0 = image.texture;
	checkState(newStyle, newTexture0);

	// First scale then rotate coordinates
	float scaledx0 = -image.hotspotx * s.xscale;
	float scaledx1 = (image.w - image.hotspotx) * s.xscale;
	float scaledy0 = -image.hotspoty * s.yscale;
	float scaledy1 = (image.h - image.hotspoty) * s.yscale;

	final float x = s.x + s.ox;
	final float y = s.y + s.oy;

	final float x00, x01, x11, x10;
	final float y00, y01, y11, y10;
	
	// Then rotate & translate
	final double angle = s.angle;
	if ( angle != 0 ) {
		double angle2 = toRadians(angle);
		float cos = (float)cos(angle2);
		float sin = (float)sin(angle2);

		x00 = cos * scaledx0 - sin * scaledy0 + x;
		x01 = cos * scaledx0 - sin * scaledy1 + x;
		x11 = cos * scaledx1 - sin * scaledy1 + x;
		x10 = cos * scaledx1 - sin * scaledy0 + x;

		y00 = sin * scaledx0 + cos * scaledy0 + y;
		y01 = sin * scaledx0 + cos * scaledy1 + y;
		y11 = sin * scaledx1 + cos * scaledy1 + y;
		y10 = sin * scaledx1 + cos * scaledy0 + y;
	} else {
		x00 = scaledx0 + x;
		x01 = scaledx0 + x;
		x11 = scaledx1 + x;
		x10 = scaledx1 + x;

		y00 = scaledy0 + y;
		y01 = scaledy1 + y;
		y11 = scaledy1 + y;
		y10 = scaledy0 + y;
	}

	FloatBuffer floats = vertices;
	IntBuffer ints = intColors;

	final float tx0 = s.mirrored ? image.tx1 : image.tx0;
	final float tx1 = s.mirrored ? image.tx0 : image.tx1;
	final float ty0 = s.flipped ? image.ty1 : image.ty0;
	final float ty1 = s.flipped ? image.ty0 : image.ty1;
	final float alpha = s.alpha * ALPHA_DIV;
	final ReadableColor[] colors = s.color;

	int vertex = vertexCursor * VERTEX_SIZE_IN_FLOATS;
	int ivertex = vertex;
	floats.position(vertex);
	ints.position(ivertex);
	putVertex(floats, ints, x00, y00, tx0, ty1, colors[0], alpha);

	vertex += VERTEX_SIZE_IN_FLOATS;
	ivertex += VERTEX_SIZE_IN_FLOATS;
	floats.position(vertex);
	ints.position(ivertex);
	putVertex(floats, ints, x10, y10, tx1, ty1, colors[1], alpha);

	vertex += VERTEX_SIZE_IN_FLOATS;
	ivertex += VERTEX_SIZE_IN_FLOATS;
	floats.position(vertex);
	ints.position(ivertex);
	putVertex(floats, ints, x11, y11, tx1, ty0, colors[2], alpha);

	vertex += VERTEX_SIZE_IN_FLOATS;
	ivertex += VERTEX_SIZE_IN_FLOATS;
	floats.position(vertex);
	ints.position(ivertex);
	putVertex(floats, ints, x01, y01, tx0, ty0, colors[3], alpha);

	putIndices(indices, indexCursor, vertexCursor);

	indexCursor += 6;
	vertexCursor += 4;
	currentRun.endIndex += 6;
}

private void checkState(Style newStyle, GLBaseTexture newTexture0) {
	if ( currentRun == null || newStyle != currentStyle || newTexture0 != currentTexture ) {
		// Changed state. Start new state.
		currentRun = stateRun[numRuns++];
		currentRun.style = newStyle;
		currentRun.texture = newTexture0;
		currentRun.startIndex = indexCursor;
		currentRun.endIndex = indexCursor;
		currentStyle = newStyle;
		currentTexture = newTexture0;
	}
}

private static void putVertex(FloatBuffer floats, IntBuffer ints, float x, float y, float tx, float ty, ReadableColor color, float alpha) {
	floats.put(x);
	floats.put(y);
	floats.put(0.0f);
	floats.put(tx);
	floats.put(ty);
	
	ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16) | (int)(color.getAlpha() * alpha) << 24);
}

private static void putIndices(ShortBuffer indices, int indexCursor, int vertexCursor) {
	// Write indices: need 6, for two triangles

	indices.put(indexCursor + 0, (short)(vertexCursor + 0));
	indices.put(indexCursor + 1, (short)(vertexCursor + 1));
	indices.put(indexCursor + 2, (short)(vertexCursor + 2));
	indices.put(indexCursor + 3, (short)(vertexCursor + 0));
	indices.put(indexCursor + 4, (short)(vertexCursor + 2));
	indices.put(indexCursor + 5, (short)(vertexCursor + 3));
}

I’ve extracted a few methods and reorganized the code for better stack locality. Looks like vertex == ivertex always, so that could be cleaned up further, but I wasn’t sure.

It is indeed a big method but unfortunately Dalvik doesn’t do any useful inlining, so method calls are to be avoided.

Just tried directly writing to the IntBuffer using absolute put - much slower than writing to int[] array and copying it all at the end. So there we have it: got about 5000 sprites @ 15fps, or, 1250 or so at a glassy smooth 60fps on the Galaxy 2. I suppose that’s livable with if I curb my expectations a little and make sure Chaz doesn’t go overboard with the particle effects. I think in reality I really need another 2x speedup and as I say, probably not much chance of that happening without proper inlining, more buffer access “intrinsification”, bounds check hoisting, peephole optimisation, etc. in Dalvik, for which I won’t be holding my breath.

Anyway: latest version, once again with actual crabs, is here. One tap makes 100 crabs. Lifecycle still buggered, turned the music off though.

Cas :slight_smile: