Hm, might I assume also then that blatting the data into float arrays and then writing the completed float arrays to buffers is going to be fastest…
Cas
Riven, Cas: Yes. I think using the array would be best.
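For anyone following along, here is a minimal sketch of the "fill a float array, then write the whole array to the buffer" approach. The class name, the x/y/u/v vertex layout and the quad ordering are assumptions made up for illustration, not anything from Cas's engine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical sketch: names and vertex layout are assumptions, not the thread's code.
final class SpriteArrayWriter {
    static final int FLOATS_PER_VERTEX = 4;   // x, y, u, v
    static final int VERTICES_PER_SPRITE = 4; // one quad per sprite
    static final int FLOATS_PER_SPRITE = FLOATS_PER_VERTEX * VERTICES_PER_SPRITE;

    private final float[] scratch;     // plain Java array, cheap to write element by element
    private final FloatBuffer buffer;  // direct buffer that would be handed to OpenGL
    private int count;

    SpriteArrayWriter(int maxSprites) {
        scratch = new float[maxSprites * FLOATS_PER_SPRITE];
        buffer = ByteBuffer.allocateDirect(scratch.length * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Writes one quad into the scratch array. No capacity check in this sketch.
    void add(float x, float y, float w, float h) {
        int i = count * FLOATS_PER_SPRITE;
        scratch[i++] = x;     scratch[i++] = y;     scratch[i++] = 0f; scratch[i++] = 0f;
        scratch[i++] = x + w; scratch[i++] = y;     scratch[i++] = 1f; scratch[i++] = 0f;
        scratch[i++] = x + w; scratch[i++] = y + h; scratch[i++] = 1f; scratch[i++] = 1f;
        scratch[i++] = x;     scratch[i++] = y + h; scratch[i++] = 0f; scratch[i]   = 1f;
        count++;
    }

    // Copies everything written so far into the buffer with a single bulk put.
    FloatBuffer flush() {
        buffer.clear();
        buffer.put(scratch, 0, count * FLOATS_PER_SPRITE);
        buffer.flip(); // position 0, limit = number of floats written
        count = 0;
        return buffer;
    }
}
```

The point is that the per-float work happens on an ordinary array and the buffer only sees one bulk `put` per flush. Whether that actually beats writing to the buffer directly is something to measure on your own setup.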
Sounds like a case for shell sort.
My experience with VBOs (setting up a buffer per triangle group, populated at load-time) was that they produced no measurable performance difference on my box, even though model rendering was a bottleneck. GeForce 8400GS.
I have a hard time believing that populating a 1000 sprite buffer is even remotely a bottleneck. Or a 5000 sprite buffer. Or a 5000 sprite buffer that’s 4x slower to fill than it should be.
This is very peculiar, I have an 8500 that sees a 3-4x performance boost when using VBOs compared to normal vertex arrays.
I think the main problem is the number of state changes I have to do as a result of too-fine Y sorting. I absolutely have to do Y sort because my sprites have to appear in front of each other when they overlap - but there’s the key: they don’t have to if they don’t overlap. So I am going to adjust the Y sort to a “band sort”, where I take into account the Y hotspot and height of the image to be displayed. Unfortunately that’s probably going to break my neat radix sort. But it may cut down on state changes by an order of magnitude. One issue I have noticed is texture state change thrashing - drawing a sprite using one texture then switching to another texture to draw another sprite, over and over, one sprite at a time, for the irritating case where the images are placed in different texture atlases. I need some way of optimising this, either by getting statistics from a run and applying them to the sprite packer, or by some dynamic reorganisation of the atlases at runtime.
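One way the band sort idea could be sketched, as an illustration only: give every sprite a layer number that is one higher than any earlier sprite (in Y order) whose Y range it overlaps, then sort by layer and texture. Sprites in the same layer never overlap in Y, so they can be grouped by texture freely. The class, the field names and the brute-force O(n²) overlap scan are all assumptions; the scan is the part an interval tree would replace.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a "band sort"; not the engine's actual implementation.
final class BandSortSketch {
    static final class Sprite {
        final int sortY;   // hotspot Y used for the normal Y sort
        final int top;     // first screen row covered (inclusive)
        final int bottom;  // last screen row covered (exclusive)
        final int texture; // texture/atlas id, assumed non-negative
        int layer;

        Sprite(int sortY, int top, int bottom, int texture) {
            this.sortY = sortY;
            this.top = top;
            this.bottom = bottom;
            this.texture = texture;
        }
    }

    // Conservative test: only Y ranges are compared, X is ignored.
    static boolean overlapsInY(Sprite a, Sprite b) {
        return a.top < b.bottom && b.top < a.bottom;
    }

    static List<Sprite> bandSort(List<Sprite> sprites) {
        List<Sprite> order = new ArrayList<>(sprites);
        order.sort(Comparator.comparingInt((Sprite s) -> s.sortY));
        // Layer = 1 + highest layer among earlier sprites that overlap this one.
        for (int i = 0; i < order.size(); i++) {
            Sprite s = order.get(i);
            s.layer = 0;
            for (int j = 0; j < i; j++) {
                Sprite earlier = order.get(j);
                if (overlapsInY(earlier, s)) {
                    s.layer = Math.max(s.layer, earlier.layer + 1);
                }
            }
        }
        // Stable sort: overlapping sprites keep their Y order because the
        // later one always lands in a higher layer.
        order.sort(Comparator.comparingInt((Sprite s) -> s.layer)
                .thenComparingInt(s -> s.texture));
        return order;
    }

    // Counts how many times the bound texture changes over a draw order.
    static int textureSwitches(List<Sprite> drawOrder) {
        int switches = 0;
        int current = -1;
        for (Sprite s : drawOrder) {
            if (s.texture != current) {
                switches++;
                current = s.texture;
            }
        }
        return switches;
    }
}
```

With six non-overlapping sprites that alternate between two textures, plain Y order needs six texture binds while the layered order needs two. Overlapping sprites still come out back to front.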
I’ve now interleaved the writes and draws per render state, so that I write a few sprites out at a time, set up the render state, call glDrawArrays, and then start immediately on the next set of sprites.
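As a rough sketch of that batching, the sorted sprite list can be cut into runs of identical render state, each run becoming one state setup plus one glDrawArrays(mode, first, count) call. The class, the int-per-sprite state id and the four-vertices-per-sprite layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits an already sorted sprite list into draw runs.
final class DrawRunSketch {
    static final int VERTICES_PER_SPRITE = 4; // assumes one quad per sprite

    static final class Run {
        final int state;       // render state id shared by every sprite in the run
        final int firstVertex; // would be the "first" argument of glDrawArrays
        final int vertexCount; // would be the "count" argument of glDrawArrays

        Run(int state, int firstVertex, int vertexCount) {
            this.state = state;
            this.firstVertex = firstVertex;
            this.vertexCount = vertexCount;
        }
    }

    // stateOfSprite[i] is the render state id of the i-th sprite in draw order.
    static List<Run> buildRuns(int[] stateOfSprite) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= stateOfSprite.length; i++) {
            boolean endOfRun = i == stateOfSprite.length
                    || stateOfSprite[i] != stateOfSprite[start];
            if (endOfRun) {
                runs.add(new Run(stateOfSprite[start],
                        start * VERTICES_PER_SPRITE,
                        (i - start) * VERTICES_PER_SPRITE));
                start = i;
            }
        }
        return runs;
    }
}
```

The number of runs is the number of state changes and draw calls for the frame, so it doubles as a cheap statistic to log while tuning the sort.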
I’m going to implement VBOs and see if they improve the asynchronicity of glDrawArrays. BTW is glDrawArrays the best way to do this or would glDrawRangeElements be better (given the totally sequential nature of my vertices I suspect not but it’s worth asking…?)
Cas
Off the top of my head, I know I’ve been through my own battle over glDrawArrays, glDrawElements and glDrawRangeElements. glDrawRangeElements is definitely the way to go over glDrawElements; the performance difference isn’t significant, but it never hurts. For me, on an 8500MT card, I found that glDrawArrays was slower than glDrawRangeElements with sequential indices. That test was done a long time ago and was fairly informal, so your best bet is to try it out.
Optimisation continues:
With Y sort on, my “typical scene” of about 1300 sprites takes 150 state changes to draw.
With Y sort completely turned off, it takes 50 state changes to draw.
With band sorting, where I use an interval tree to determine if sprites are overlapping, I get a pretty similar 60 or so state changes to draw the same scene. This is a massive reduction in state changes and calls to glDrawArrays - however, the interval tree code I found (in OpenJDK) is so slow that my frame rate plummets to 10fps.
I’m now working on getting a much much faster interval tree of my own implementation. If that achieves any speedup I’ll let you know…
Cas
What was the performance difference between 50 and 150 state changes?
If you have a lot of sprites that potentially aren’t moving at once, you can try drawing them once into an FBO and then draw the FBO instead. Only update the rects in the FBO that have objects which have moved. I’m not sure if this will be useful for you, but it has been really big for my own optimizations because I have a lot of objects that don’t move all the time.
Nah, dirty rects are almost completely useless for me - we have a lot of movement and a lot of overdraw.
The speedup difference between 50 and 150 draw calls was actually fairly negligible - maybe a 10% boost. However… I finally put VBOs into the sprite engine. Woah! What a difference. 60fps nearly all the time, just like that. Because of how VBOs work I couldn’t interleave writes and draws on a state-change basis any more; it’s back to writing all the sprites in one go and then rendering everything in one go. But even so - much faster.
One thing in particular may be helping here, which is that I use GL_STREAM_DRAW_ARB and GL_WRITE_ONLY_ARB and map the VBO. This means the data that I write to the buffers is written straight to the card, probably even bypassing AGP RAM, and especially importantly, it completely bypasses any RAM caches on the way.
So: VBOs FTW! I can still implement band sorting but I’ll have to write my own much faster interval tree class.
And then I’ll have a look at optimising the sprite atlases depending on adjacent sprite image usage and feeding back a data file into the sprite packer.
After that it’ll very likely be back to being fill-rate limited like it was 7 years ago when I first wrote the damned thing! But this time I can fill 30x as many pixels.
Cas
JWS or it is all a lie.
Cas
That’s awesome! I’ll have to look into VBOs in the future.
Do it now! Any other method is totally obsolete by the looks of things.
Cas
Do you plan to keep that as the default, or have some non-VBO fallback too?
I’m supposed to be releasing on Monday! Changing the whole drawing system is probably not worth it at this point. ;D Maybe for updates, though. The game stays around 30 FPS but I’ve already got a few optimizations in mind that I think could definitely get it to around 60.
We will not support cards without fully functional VBO. No fallback will be provided.
Anything less than 60fps is rubbish. So you go ahead and release, in shame!
Cas
It’s for the iPhone. You try getting 60 FPS when you have no dedicated graphics card and OpenGL 1.1.
On the 3GS it never goes below 60 FPS.
Does the iPhone have VBOs?
Cas