It’s the same speed as before, just with the correct fps counter.
Cas
HTC Sensation goes down to 30fps with 600 crabs. At 300 it’s down to about 50 fps, 400 it hovers around 40fps.
Endolf
Hm, so some more extensive testing:
With 1 pixel opaque sprites - the bare minimum you’ll agree, and not enough to be troubled by fillrate - I eventually drop down to 15fps with 2300 or so sprites. That’s about 35 sprites per millisecond, which just strikes me as really poor. And this is purely the “rendering” part as well - this is just GL commands being issued, no sorting etc. going on here, apart from the tiny overhead to draw the FPS counter stuff. Targeting a phone that’s not quite as powerful as the Galaxy S II - say, something around 800MHz versus its 1.2GHz, perhaps - we’d probably be looking at 2/3rds of that - maybe 22 sprites per millisecond.
And that’s not even thinking about the fact that we’re not actually counting the game logic or sprite engine overhead here. On a single-core phone maybe we’d only get half the performance, like 11 sprites per millisecond! To get a half decent framerate (30fps being precisely half-decent) on a single core 800MHz phone means we’d be looking at using a paltry 350-odd sprites per frame! You might not think that’s such a big deal but again, that’s with no game logic at all to speak of. Even our simple little space invaders game Titan Attacks uses between 500-1000 sprites per frame, and that doesn’t include non-sprite rendering like special effects and score displays. Droid Assault uses between 1500-5000 sprites per frame!
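The budget arithmetic above can be sketched as a couple of one-line conversions. This is only an illustration of the reasoning in the post: the class and method names are made up, and the 2/3 clock-speed factor and the halving for a single core are the post's own guesses, not measurements.

```java
// Back-of-envelope sprite budget, using the figures quoted in the post.
// SpriteBudget and its method names are hypothetical.
class SpriteBudget {
    /** Sprites submitted per millisecond at a given sprite count and frame rate. */
    static double spritesPerMs(int spritesPerFrame, double fps) {
        return spritesPerFrame * fps / 1000.0;
    }

    /** Sprites that fit into one frame at a given throughput and target frame rate. */
    static double spritesPerFrame(double spritesPerMs, double targetFps) {
        return spritesPerMs * 1000.0 / targetFps;
    }

    public static void main(String[] args) {
        double measured = spritesPerMs(2300, 15);   // Galaxy S II figure from the post
        double slowerClock = measured * 2.0 / 3.0;  // assumed scaling for ~800MHz
        double singleCore = slowerClock / 2.0;      // assumed single-core penalty
        System.out.printf("measured:     %.1f sprites/ms%n", measured);
        System.out.printf("800MHz guess: %.1f sprites/ms%n", slowerClock);
        System.out.printf("single core:  %.1f sprites/ms%n", singleCore);
        System.out.printf("budget at 30fps: %.0f sprites/frame%n",
                spritesPerFrame(singleCore, 30));
    }
}
```

Working from the unrounded 34.5 sprites per millisecond gives a slightly higher budget than the rounded 35, 22 and 11 used in the post, but it lands in the same few-hundred-sprites range.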
I tried upping the size of the sprites to see what difference fill rate made to the Galaxy and remarkably even increasing the sprites to 32x32 pixels had absolutely no effect on framerate, leading me to think that the hardware rasteriser is proper fast.
So: I’ve uploaded a new .apk* with no crabs, just tiny pixels. At what point do you reach a pretty consistent 15fps?
Cas
[quote]At what point do you reach a pretty consistent 15fps
[/quote]
370
phone has a 600 MHz single core cpu
Oh dear, that’s lame
Cas
I bought this phone new, together with a plan, this year.
So my guess is, it’s safe to assume that many people have phones this slow.
You don’t even notice it normally - everything is fast, except for a few games from the market.
Try to make the best out of it =D
Well, I am currently very suspicious anyway. According to the ARM website, the GPU inside the Galaxy should manage 30m triangles / second (in admittedly what is probably the most contrived test case possible of a single triangle fan with 30m triangles in it). If every single triangle was discrete we might reasonably assume that to end up being 10m triangles/second, or 30m vertices/second. The sprite engine draws each sprite as a pair of triangles with two shared vertices so we should be using up approximately 4 vertices per sprite, and so the theoretical simple throughput should be about 7.5m sprites per second, or 7500 sprites per millisecond. And yet I calculate I’m getting just 35 sprites per millisecond.
Clearly something is completely amiss, for the specs to tell me I should be managing 200x more sprites.
To clarify further: this is literally a single call to glDrawElements with one single batch of GL_TRIANGLES, where each triangle is a single pixel, drawn opaquely. Nothing else. And yet I’m 200x slower than I was expecting. Or have I got something seriously wrong with my maths?
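The expected-versus-measured comparison can be written out as a short calculation. It is only a sketch of the arithmetic in the post: the class and method names are invented, and the 30m vertices/second figure and the 4 vertices per sprite are the post's own assumptions.

```java
// Theoretical sprite throughput derived from a vertex rate, compared with the measured rate.
// TheoreticalThroughput is a hypothetical name used for illustration only.
class TheoreticalThroughput {
    /** Sprites per second if every sprite costs a fixed number of vertices. */
    static double spritesPerSecond(double verticesPerSecond, int verticesPerSprite) {
        return verticesPerSecond / verticesPerSprite;
    }

    public static void main(String[] args) {
        double theoreticalPerSecond = spritesPerSecond(30_000_000.0, 4); // figure assumed in the post
        double theoreticalPerMs = theoreticalPerSecond / 1000.0;
        double measuredPerMs = 35.0;                                     // measured figure from the post
        System.out.printf("theoretical: %.0f sprites/ms%n", theoreticalPerMs);
        System.out.printf("measured:    %.0f sprites/ms%n", measuredPerMs);
        System.out.printf("shortfall:   about %.0fx%n", theoreticalPerMs / measuredPerMs);
    }
}
```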
Cas
Let me start by stating that I haven’t run this benchmark, so I have no idea what I’m talking about.
Are you sure fillrate isn’t your bottleneck?
Galaxy S II
30 FPS until 1500 crabs.
Definitely. These are 1 pixel sprites. When I increase each sprite to 32x32 the FPS is unchanged.
Cas
Samsung Galaxy S
15 fps @ ~1200
Aha! Bad news! For me, anyway. Just after I went to bed last night (bah, typical) I realised of course that the answer was staring me in the face. It couldn’t possibly be the single call to glDrawElements that was slow. And indeed it wasn’t, when I commented it out. In fact, the call to glDrawElements is so fast it doesn’t even register. Unfortunately it’s my sprite engine that’s slowing things down. Now I’m going to get to grips with Android profiling and find out what bit’s slowest. This is going to be no mean feat - I think I need a factor of 10 speedup :S
Cas
My suspicions are confirmed. Turns out sorting takes almost no time at all. Remove the buffer write though, and I get 11,000 sprites before it drops to 15fps. If I remove the sprite transform/scale/rotate part I get 12,500 sprites, but there’s not a lot I can really do about that code so it’s not going to be optimisable any further. So: merely writing the vertex data is giving me a 5x slowdown, which is very troublesome and suspicious. That is, after all, only 300kb of data in a frame for 2,300 sprites.
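For a rough sanity check on that data volume, here is a sketch of the per-frame byte count. The vertex stride is not stated in the thread, so the 32 bytes per vertex used below is purely an assumption (VERTEX_SIZE_IN_FLOATS could well be something else); the 4 vertices and 6 short indices per sprite do come from the sprite layout described in the thread.

```java
// Rough per-frame data volume for a batch of quads.
// FrameDataSize is a hypothetical name; the 32-byte stride is an assumption.
class FrameDataSize {
    /** Bytes of vertex data: 4 vertices per sprite at the given stride. */
    static long vertexBytes(int sprites, int bytesPerVertex) {
        return (long) sprites * 4 * bytesPerVertex;
    }

    /** Bytes of index data: 6 indices per sprite, 2 bytes per short index. */
    static long indexBytes(int sprites) {
        return (long) sprites * 6 * 2;
    }

    public static void main(String[] args) {
        int sprites = 2300;
        long vertices = vertexBytes(sprites, 32); // assumed stride of 32 bytes
        long indices = indexBytes(sprites);
        System.out.printf("vertex data: %.1f KiB%n", vertices / 1024.0);
        System.out.printf("index data:  %.1f KiB%n", indices / 1024.0);
        System.out.printf("total:       %.1f KiB%n", (vertices + indices) / 1024.0);
    }
}
```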
Cas
Why not share it, so we can poke at it? Maybe turn it into a competition, for a small prize.
That way you can work on your next game, instead of on silly performance problems.
My MappedObject library will probably help you out, as you can completely skip that step.
I fear that I don’t have access to Unsafe in Android. And I am resolute that I will not use native code either.
Here’s the offending method:
void add(Sprite s, Style newStyle) {
    SpriteImage image = s.image;
    GLBaseTexture newTexture0 = image.texture;
    if (currentRun == null || newStyle != currentStyle || newTexture0 != currentTexture) {
        // Changed state. Start new state.
        currentRun = stateRun[numRuns ++];
        currentRun.style = newStyle;
        currentRun.texture = newTexture0;
        currentRun.startIndex = indexCursor;
        currentRun.endIndex = indexCursor;
        currentStyle = newStyle;
        currentTexture = newTexture0;
    }
    final float tx0 = image.tx0;
    final float tx1 = image.tx1;
    final float ty0 = image.ty0;
    final float ty1 = image.ty1;
    final float xscale = s.xscale;
    final float yscale = s.yscale;
    final float x = s.x + s.ox;
    final float y = s.y + s.oy;
    final float alpha = s.alpha * ALPHA_DIV;
    // First scale then rotate coordinates
    float scaledx0 = -image.hotspotx * xscale;
    float scaledy0 = -image.hotspoty * yscale;
    float scaledx1 = (image.w - image.hotspotx) * xscale;
    float scaledy1 = (image.h - image.hotspoty) * yscale;
    float scaledx00, scaledx10, scaledx11, scaledx01, scaledy00, scaledy10, scaledy11, scaledy01;
    // Then rotate
    final double angle = s.angle;
    if (angle != 0) {
        double angle2 = toRadians(angle);
        double cos = cos(angle2);
        double sin = sin(angle2);
        scaledx00 = (float) (cos * scaledx0 - sin * scaledy0);
        scaledx10 = (float) (cos * scaledx1 - sin * scaledy0);
        scaledx11 = (float) (cos * scaledx1 - sin * scaledy1);
        scaledx01 = (float) (cos * scaledx0 - sin * scaledy1);
        scaledy00 = (float) (sin * scaledx0 + cos * scaledy0);
        scaledy10 = (float) (sin * scaledx1 + cos * scaledy0);
        scaledy11 = (float) (sin * scaledx1 + cos * scaledy1);
        scaledy01 = (float) (sin * scaledx0 + cos * scaledy1);
    } else {
        scaledx00 = scaledx0;
        scaledx10 = scaledx1;
        scaledx11 = scaledx1;
        scaledx01 = scaledx0;
        scaledy00 = scaledy0;
        scaledy10 = scaledy0;
        scaledy11 = scaledy1;
        scaledy01 = scaledy1;
    }
    // Then translate them
    final float x00 = scaledx00 + x;
    final float x01 = scaledx01 + x;
    final float x11 = scaledx11 + x;
    final float x10 = scaledx10 + x;
    final float y00 = scaledy00 + y;
    final float y01 = scaledy01 + y;
    final float y11 = scaledy11 + y;
    final float y10 = scaledy10 + y;
    final ReadableColor[] colors = s.color;
    FloatBuffer floats = vertices;
    IntBuffer ints = intColors;
    int vertex = vertexCursor * VERTEX_SIZE_IN_FLOATS;
    int ivertex = vertex;
    floats.position(vertex);
    ints.position(ivertex);
    floats.put(x00);
    floats.put(y00);
    floats.put(0.0f);
    floats.put(s.mirrored ? tx1 : tx0);
    floats.put(s.flipped ? ty0 : ty1);
    ReadableColor color = colors[0];
    ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
            | (int) (color.getAlpha() * alpha) << 24);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    floats.put(x10);
    floats.put(y10);
    floats.put(0.0f);
    floats.put(s.mirrored ? tx0 : tx1);
    floats.put(s.flipped ? ty0 : ty1);
    color = colors[1];
    ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
            | (int) (color.getAlpha() * alpha) << 24);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    floats.put(x11);
    floats.put(y11);
    floats.put(0.0f);
    floats.put(s.mirrored ? tx0 : tx1);
    floats.put(s.flipped ? ty1 : ty0);
    color = colors[2];
    ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
            | (int) (color.getAlpha() * alpha) << 24);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    floats.put(x01);
    floats.put(y01);
    floats.put(0.0f);
    floats.put(s.mirrored ? tx1 : tx0);
    floats.put(s.flipped ? ty1 : ty0);
    color = colors[3];
    ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16)
            | (int) (color.getAlpha() * alpha) << 24);
    // Write indices: need 6, for two triangles
    indices.position(indexCursor);
    indices.put((short) (vertexCursor + 0));
    indices.put((short) (vertexCursor + 1));
    indices.put((short) (vertexCursor + 2));
    indices.put((short) (vertexCursor + 0));
    indices.put((short) (vertexCursor + 2));
    indices.put((short) (vertexCursor + 3));
    indexCursor += 6;
    vertexCursor += 4;
    currentRun.endIndex += 6;
}
(You’ve seen it before, and I know it’s maybe not as efficient as it could be - I’m just surprised at how inefficient it is.) First plan: write everything to an int[] using Float.floatToIntBits, then blat the entire int[] to the byte buffer. I suspect that might be the best way for Android. It has to be an int[] because of a colossal Android performance snafu when using FloatBuffer bulk puts.
Cas
See how my FastMath.sin/cos speeds up that rotation part, although you’re probably not rotating right now. Remove the double. You can reduce the memory footprint of the lookup table with the static SIN_BITS variable.
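For readers who haven't seen the trick, here is a minimal sketch of a table-driven sine/cosine in float. It is not the FastMath class mentioned above: the class name SinTable, the choice of 12 bits and the truncating lookup are all assumptions made for illustration. The SIN_BITS constant here plays the role the post describes, trading table size against precision.

```java
// Sketch only: SinTable is a hypothetical stand-in, not the FastMath class from the thread.
final class SinTable {
    // 2^SIN_BITS entries cover one full turn. Fewer bits: smaller table, coarser angles.
    private static final int SIN_BITS = 12;
    private static final int SIN_COUNT = 1 << SIN_BITS;
    private static final int SIN_MASK = SIN_COUNT - 1;
    private static final float DEG_TO_INDEX = SIN_COUNT / 360.0f;
    private static final float[] SIN = new float[SIN_COUNT];

    static {
        for (int i = 0; i < SIN_COUNT; i++) {
            SIN[i] = (float) Math.sin(i * 2.0 * Math.PI / SIN_COUNT);
        }
    }

    /** Approximate sine of an angle in degrees; masking wraps any angle into the table. */
    static float sinDeg(float degrees) {
        return SIN[(int) (degrees * DEG_TO_INDEX) & SIN_MASK];
    }

    /** Approximate cosine: cos(x) = sin(x + 90 degrees), a quarter of the table further on. */
    static float cosDeg(float degrees) {
        return SIN[((int) (degrees * DEG_TO_INDEX) + SIN_COUNT / 4) & SIN_MASK];
    }

    public static void main(String[] args) {
        for (float d = 0; d <= 90; d += 15) {
            System.out.println(d + " deg: table=" + sinDeg(d)
                    + " Math.sin=" + Math.sin(Math.toRadians(d)));
        }
    }
}
```

With 12 bits the table holds 4096 floats (16KB) and the angle is quantised to steps of just under 0.09 degrees, which should be fine for rotating sprites but is worth checking against your own tolerance.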
Only use indexed put/get on buffers. Treat a buffer like an array: do your own managing of indices. So instead of this:
// Write indices: need 6, for two triangles
indices.position(indexCursor);
indices.put(...);
indices.put(...);
do this:
// Write indices: need 6, for two triangles
indices.put(indexCursor+0, ...);
indices.put(indexCursor+1, ...);
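To make the difference concrete, here is a small self-contained sketch that writes the six quad indices both ways into a java.nio.ShortBuffer. The class and method names are hypothetical. Both variants produce the same contents; the relative version moves the buffer's position as a side effect, the absolute version never touches it. Which one is faster on a given VM is something to measure, not assume.

```java
import java.nio.ShortBuffer;

// Sketch only: QuadIndexWriter and its method names are hypothetical.
class QuadIndexWriter {
    /** Relative puts: set the position once, then let each put advance it. */
    static void putRelative(ShortBuffer indices, int indexCursor, int vertexCursor) {
        indices.position(indexCursor);
        indices.put((short) (vertexCursor + 0));
        indices.put((short) (vertexCursor + 1));
        indices.put((short) (vertexCursor + 2));
        indices.put((short) (vertexCursor + 0));
        indices.put((short) (vertexCursor + 2));
        indices.put((short) (vertexCursor + 3));
    }

    /** Absolute puts: the caller supplies every index, as with an array. */
    static void putAbsolute(ShortBuffer indices, int indexCursor, int vertexCursor) {
        indices.put(indexCursor + 0, (short) (vertexCursor + 0));
        indices.put(indexCursor + 1, (short) (vertexCursor + 1));
        indices.put(indexCursor + 2, (short) (vertexCursor + 2));
        indices.put(indexCursor + 3, (short) (vertexCursor + 0));
        indices.put(indexCursor + 4, (short) (vertexCursor + 2));
        indices.put(indexCursor + 5, (short) (vertexCursor + 3));
    }

    public static void main(String[] args) {
        ShortBuffer a = ShortBuffer.allocate(12);
        ShortBuffer b = ShortBuffer.allocate(12);
        putRelative(a, 6, 4); // second quad: indices start at 6, vertices at 4
        putAbsolute(b, 6, 4);
        for (int i = 6; i < 12; i++) {
            System.out.println(i + ": " + a.get(i) + " / " + b.get(i));
        }
        System.out.println("positions: " + a.position() + " / " + b.position());
    }
}
```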
Ok, by using an int[] array to build vertex data, putting all the floats into it using Float.floatToRawIntBits(), then copying all the vertex data in one go to the direct buffer, I managed to double performance: got it up to 4600 sprites @ 15fps. With your FastMath.sinDeg/cosDeg methods, I managed to eke a further 10% out and got it up to 5200 sprites @ 15fps.
So that’s about 2x faster basically. It would help now to get it about 5x faster. I’m just going to try absolute puts into a direct IntBuffer and avoid the int[] array copy and see how that improves things…
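The int[] staging approach described here can be sketched roughly as follows. This is not the actual sprite engine code: the class name, field names and capacity handling are invented for illustration. The idea is simply that floats and packed colours are written into a plain Java array, and the direct buffer is only touched once per frame with a single bulk put.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Sketch only: StagedVertexWriter is a hypothetical name, not the engine's class.
class StagedVertexWriter {
    private final int[] staging;    // cheap array writes happen here
    private final IntBuffer target; // direct buffer handed to GL
    private int cursor;

    StagedVertexWriter(int capacityInInts) {
        staging = new int[capacityInInts];
        target = ByteBuffer.allocateDirect(capacityInInts * 4)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
    }

    /** Stores a float as its raw bit pattern so it can share the int[] with colours. */
    void putFloat(float value) {
        staging[cursor++] = Float.floatToRawIntBits(value);
    }

    /** Stores an already-packed int, for example a colour. */
    void putInt(int value) {
        staging[cursor++] = value;
    }

    /** Copies everything written so far into the direct buffer in one bulk put. */
    IntBuffer flush() {
        target.clear();
        target.put(staging, 0, cursor);
        target.flip();
        cursor = 0;
        return target;
    }

    public static void main(String[] args) {
        StagedVertexWriter writer = new StagedVertexWriter(16);
        writer.putFloat(10.0f);     // x
        writer.putFloat(20.0f);     // y
        writer.putInt(0xFF0000FF);  // packed colour
        IntBuffer data = writer.flush();
        System.out.println("ints to upload: " + data.remaining());
        System.out.println("x read back: " + Float.intBitsToFloat(data.get(0)));
    }
}
```

Because the direct buffer uses native byte order, the bit patterns written through the IntBuffer view are what a float vertex pointer over the same memory would read.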
Cas
I think that’s too much bytecode for a single method. Your implementation must look much different now, but here’s how I’d start optimizing:
void add(Sprite s, Style newStyle) {
    SpriteImage image = s.image;
    GLBaseTexture newTexture0 = image.texture;
    checkState(newStyle, newTexture0);
    // First scale then rotate coordinates
    float scaledx0 = -image.hotspotx * s.xscale;
    float scaledx1 = (image.w - image.hotspotx) * s.xscale;
    float scaledy0 = -image.hotspoty * s.yscale;
    float scaledy1 = (image.h - image.hotspoty) * s.yscale;
    final float x = s.x + s.ox;
    final float y = s.y + s.oy;
    final float x00, x01, x11, x10;
    final float y00, y01, y11, y10;
    // Then rotate & translate
    final double angle = s.angle;
    if ( angle != 0 ) {
        double angle2 = toRadians(angle);
        float cos = (float)cos(angle2);
        float sin = (float)sin(angle2);
        x00 = cos * scaledx0 - sin * scaledy0 + x;
        x01 = cos * scaledx0 - sin * scaledy1 + x;
        x11 = cos * scaledx1 - sin * scaledy1 + x;
        x10 = cos * scaledx1 - sin * scaledy0 + x;
        y00 = sin * scaledx0 + cos * scaledy0 + y;
        y01 = sin * scaledx0 + cos * scaledy1 + y;
        y11 = sin * scaledx1 + cos * scaledy1 + y;
        y10 = sin * scaledx1 + cos * scaledy0 + y;
    } else {
        x00 = scaledx0 + x;
        x01 = scaledx0 + x;
        x11 = scaledx1 + x;
        x10 = scaledx1 + x;
        y00 = scaledy0 + y;
        y01 = scaledy1 + y;
        y11 = scaledy1 + y;
        y10 = scaledy0 + y;
    }
    FloatBuffer floats = vertices;
    IntBuffer ints = intColors;
    final float tx0 = s.mirrored ? image.tx1 : image.tx0;
    final float tx1 = s.mirrored ? image.tx0 : image.tx1;
    final float ty0 = s.flipped ? image.ty1 : image.ty0;
    final float ty1 = s.flipped ? image.ty0 : image.ty1;
    final float alpha = s.alpha * ALPHA_DIV;
    final ReadableColor[] colors = s.color;
    int vertex = vertexCursor * VERTEX_SIZE_IN_FLOATS;
    int ivertex = vertex;
    floats.position(vertex);
    ints.position(ivertex);
    putVertex(floats, ints, x00, y00, tx0, ty1, colors[0], alpha);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    putVertex(floats, ints, x10, y10, tx1, ty1, colors[1], alpha);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    putVertex(floats, ints, x11, y11, tx1, ty0, colors[2], alpha);
    vertex += VERTEX_SIZE_IN_FLOATS;
    ivertex += VERTEX_SIZE_IN_FLOATS;
    floats.position(vertex);
    ints.position(ivertex);
    putVertex(floats, ints, x01, y01, tx0, ty0, colors[3], alpha);
    putIndices(indices, indexCursor, vertexCursor);
    indexCursor += 6;
    vertexCursor += 4;
    currentRun.endIndex += 6;
}

private void checkState(Style newStyle, GLBaseTexture newTexture0) {
    if ( currentRun == null || newStyle != currentStyle || newTexture0 != currentTexture ) {
        // Changed state. Start new state.
        currentRun = stateRun[numRuns++];
        currentRun.style = newStyle;
        currentRun.texture = newTexture0;
        currentRun.startIndex = indexCursor;
        currentRun.endIndex = indexCursor;
        currentStyle = newStyle;
        currentTexture = newTexture0;
    }
}

private static void putVertex(FloatBuffer floats, IntBuffer ints, float x, float y, float tx, float ty, ReadableColor color, float alpha) {
    floats.put(x);
    floats.put(y);
    floats.put(0.0f);
    floats.put(tx);
    floats.put(ty);
    ints.put((color.getRed() << 0) | (color.getGreen() << 8) | (color.getBlue() << 16) | (int)(color.getAlpha() * alpha) << 24);
}

private static void putIndices(ShortBuffer indices, int indexCursor, int vertexCursor) {
    // Write indices: need 6, for two triangles
    indices.put(indexCursor + 0, (short)(vertexCursor + 0));
    indices.put(indexCursor + 1, (short)(vertexCursor + 1));
    indices.put(indexCursor + 2, (short)(vertexCursor + 2));
    indices.put(indexCursor + 3, (short)(vertexCursor + 0));
    indices.put(indexCursor + 4, (short)(vertexCursor + 2));
    indices.put(indexCursor + 5, (short)(vertexCursor + 3));
}
I’ve extracted a few methods and reorganized the code for better stack locality. Looks like vertex == ivertex always, so that could be cleaned-up further, but I wasn’t sure.
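The colour packing in putVertex is the one expression in there that is easy to misread, so here it is pulled out as a standalone sketch. The class and method names are hypothetical, and treating the colour components as 0-255 ints is an assumption about ReadableColor. The cast binds tighter than the shift, so the scaled alpha is truncated to an int first and then moved into the top byte.

```java
// Sketch only: ColorPacker is a hypothetical helper mirroring the expression in putVertex.
class ColorPacker {
    /**
     * Packs 0-255 components into one int with red in the lowest byte and alpha in the highest.
     * alphaScale multiplies the alpha component before it is truncated to an int.
     */
    static int pack(int red, int green, int blue, int alpha, float alphaScale) {
        return red | (green << 8) | (blue << 16) | ((int) (alpha * alphaScale) << 24);
    }

    public static void main(String[] args) {
        // Opaque red, then a half-transparent colour.
        System.out.println(Integer.toHexString(pack(255, 0, 0, 255, 1.0f)));
        System.out.println(Integer.toHexString(pack(0x11, 0x22, 0x33, 255, 0.5f)));
    }
}
```

On a little-endian device an int packed this way sits in memory as the bytes R, G, B, A, which is the order a four-component unsigned-byte colour pointer expects.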
It is indeed a big method but unfortunately Dalvik doesn’t do any useful inlining, so method calls are to be avoided.
Just tried directly writing to the IntBuffer using absolute put - much slower than writing to int[] array and copying it all at the end. So there we have it: got about 5000 sprites @ 15fps, or, 1250 or so at a glassy smooth 60fps on the Galaxy 2. I suppose that’s livable with if I curb my expectations a little and make sure Chaz doesn’t go overboard with the particle effects. I think in reality I really need another 2x speedup and as I say, probably not much chance of that happening without proper inlining, more buffer access “intrinsification”, bounds check hoisting, peephole optimisation, etc. in Dalvik, for which I won’t be holding my breath.
Anyway: latest version, once again with actual crabs, is here. One tap makes 100 crabs. Lifecycle still buggered, turned the music off though.
Cas