FloatBuffer and Batching

After a few weeks, I have finally created my basic sprite batcher!
It uses a VBO (DYNAMIC usage hint) and an IBO (STATIC usage hint), where the IBO is pre-filled with the needed index data.
Whenever I need to draw a sprite, I have a method that places the vertex and texture data into a giant float array.
Then, when I reach my sprite limit or need to change textures, I shoot everything to the GPU!
Basically I call a single put() on a FloatBuffer, with my giant float array as the data source, and draw everything needed.



mySuperFloatBuffer.put(myGiantFloatArray);
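
For context, the IBO gets prefilled once with the usual two-triangles-per-quad index pattern, roughly like this (just a sketch to show the idea; MAX_SPRITES and iboHandle are placeholder names, not my exact code):


import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;
import android.opengl.GLES20;

// 6 indices per quad: two triangles sharing 4 unique vertices
short[] indices = new short[MAX_SPRITES * 6];
for (int i = 0, v = 0; i < indices.length; i += 6, v += 4) {
    indices[i]     = (short)  v;
    indices[i + 1] = (short) (v + 1);
    indices[i + 2] = (short) (v + 2);
    indices[i + 3] = (short) (v + 2);
    indices[i + 4] = (short) (v + 3);
    indices[i + 5] = (short)  v;
}

ShortBuffer indexBuffer = ByteBuffer.allocateDirect(indices.length * 2)
        .order(ByteOrder.nativeOrder())
        .asShortBuffer();
indexBuffer.put(indices).position(0);

// Uploaded once with a STATIC hint; never touched again after this
GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, iboHandle);
GLES20.glBufferData(GLES20.GL_ELEMENT_ARRAY_BUFFER, indices.length * 2,
        indexBuffer, GLES20.GL_STATIC_DRAW);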



So what’s left? The only thing left… performance tests… and unfortunately I don’t like the numbers too much :frowning:

I’m testing my batcher against its STATIC counterpart, meaning the VBO is set to STATIC, the vertex data (X Y Z) is pregenerated, and the FloatBuffer used by the VBO is prefilled.
The other way, I’m overwriting the FloatBuffer every frame with random vertex data from my giant float array using a single put() call.

Now here are the numbers.
This is using 20,000 textured quads (sized 32x32) over an 800x480 screen space:

Dynamic (Normal Way): Delta Time = 0.048 seconds or 22.72 FPS
Static (Testing Purpose way): Delta Time = 0.020 seconds or 50 FPS

So as you can see, that is a pretty large difference, so what gives?
I know I won’t be able to get them matching completely, since one way I’m changing the data every frame and the other way it’s prefilled and never changes.
BUT I feel like my numbers should be a bit better.

My vertex and fragment shaders are extremely simple. The vertex shader just takes pretransformed vertex positions, and the fragment shader just samples the texture, with no manipulation of any kind.
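
For reference, the shaders look roughly like this (paraphrased, not copied verbatim; the attribute and uniform names are just illustrative):


private static final String VERTEX_SHADER =
        "attribute vec4 a_position;\n" +
        "attribute vec2 a_texCoord;\n" +
        "varying vec2 v_texCoord;\n" +
        "void main() {\n" +
        "    v_texCoord = a_texCoord;\n" +
        "    gl_Position = a_position; // already transformed on the CPU\n" +
        "}\n";

private static final String FRAGMENT_SHADER =
        "precision mediump float;\n" +
        "varying vec2 v_texCoord;\n" +
        "uniform sampler2D u_texture;\n" +
        "void main() {\n" +
        "    gl_FragColor = texture2D(u_texture, v_texCoord);\n" +
        "}\n";
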
I believe my problem lies on the CPU side, specifically in how I treat the vertex data. I mean, I am doing this:

Raw vertex data goes into the giant float array, then one big copy via put() places it into the FloatBuffer, before finally sending it to the GPU using glBufferSubData.



//In my draw method
myGiantFloatArray[currentSize] = 0.0f;
myGiantFloatArray[currentSize + 1] = 0.0f;
myGiantFloatArray[currentSize + 2] = 0.0f;
myGiantFloatArray[currentSize + 3] = 1.0f;
/* other vertex data */

//------------------------------------------------------

//In my Render (flush batch) method
mySuperFloatBuffer.put(myGiantFloatArray);
mySuperFloatBuffer.position(0);

/*Other openGL stuff  */
glBufferSubData(/*Params*/);

/*Any remaining stuff and index draw call */


The real heavy hitter in the code above is the put(), taking up 0.010 seconds or so of my precious time!
So finally, is there a better way to handle things on the CPU side? Any help would be greatly appreciated.
Let me know if I need to add more info on anything.

Thanks!

By the way, I’m using OpenGL ES 2.0, which does not have any glMapBuffer calls :frowning:

The fundamental mistake you are making is writing your data out to a float array first. One of the whole points of VBOs is that you do not place your vertex data in an intermediary buffer. There are three reasons for this:

  1. Firstly, you’ve got to allocate twice the RAM for no reason. However, this isn’t as important as…
  2. … writing to that intermediary buffer just trashes your data caches. The idea of write-only VBOs is that the caches are bypassed entirely, leaving in cache the data you need to carry on computing vertex data instead of constantly flushing it out.
  3. … and in any case you’re still doing twice the memory movement: writing it out once, and then writing it out again.
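
In other words, on the desktop you map the VBO write-only and write your vertex data straight into it, roughly like this (an LWJGL-flavoured sketch, not drop-in code; Sprite, sprites and vboHandle are placeholders):


import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import static org.lwjgl.opengl.GL15.*;

glBindBuffer(GL_ARRAY_BUFFER, vboHandle);

// Map the buffer store write-only and write straight into it:
// no intermediary float[], no second copy.
FloatBuffer vertices = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY, null)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer();
for (Sprite s : sprites) {
    vertices.put(s.x).put(s.y).put(s.u).put(s.v);
    // ... remaining corners of the quad ...
}
glUnmapBuffer(GL_ARRAY_BUFFER);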

Cas :slight_smile:

I know I’m doing some bad things here, but I’m not sure how else to do it :frowning:

The core reason I place everything into a giant float array first, and only then into the FloatBuffer, is that the single put() call has been the fastest.
The other way (using many put() calls, e.g. an indexed put, a basic put, etc.) has limited me to 5,000 or fewer textured quads while maintaining a measly 15 - 20 FPS.



//------[+] This is the fastest way I have found so far!

//In my draw method
myGiantFloatArray[currentSize] = 0.0f;
myGiantFloatArray[currentSize + 1] = 0.0f;
myGiantFloatArray[currentSize + 2] = 0.0f;
myGiantFloatArray[currentSize + 3] = 1.0f;
/* other vertex data */

//------------------------------------------------------

//In my Render (flush batch) method
mySuperFloatBuffer.put(myGiantFloatArray);
mySuperFloatBuffer.position(0);

/*Other openGL stuff  */
glBufferSubData(GL_ARRAY_BUFFER, 0, currentDataSize * BYTES_PER_FLOAT, mySuperFloatBuffer);
		

//=======================================================================

//---------[+] This is way slower (many basic put() calls) :(

//In my draw method
mySuperFloatBuffer.put(0.0f);
mySuperFloatBuffer.put(0.0f);
mySuperFloatBuffer.put(0.0f);
mySuperFloatBuffer.put(1.0f);
/* other vertex data */

//------------------------------------------------------

//In my Render (flush batch) method
mySuperFloatBuffer.position(0);

/*Other openGL stuff  */
glBufferSubData(GL_ARRAY_BUFFER, 0, currentDataSize * BYTES_PER_FLOAT, mySuperFloatBuffer);



Also, just as an ‘in case’, this is how it all comes together:



//In my main render method

batcher.begin(); //Set up the batch basics (shader program, resets, etc.)

//Call the draw method once per sprite
for(int i = 0; i < 20000; i++)
{
    batcher.draw((float)rand.nextInt(480), (float)rand.nextInt(800), texture2);
}

batcher.end(); //Calls the flush method and any other ending items


Could you explain more along the lines of what you are thinking? I’m not sure how to efficiently place data into the VBO. As I said before, unfortunately I cannot map directly into it :frowning:

There were some gotchas about the fastest way to call put() on a FloatBuffer, but I can’t remember exactly what they were. Personally I use put() just fine and I’m managing an order of magnitude more sprites than you are - maybe the problem lies elsewhere? (Have you profiled it? Try -Xprof on the command line.)

Cas :slight_smile:

I had exactly the same issue as ryeTech when doing my batcher for Daedalus, and came up with the same solution: using an array next to the FloatBuffer.
I came to this after doing some profiling on the game (not on an isolated test case). I noticed that the put method was very costly CPU-wise, and it was significantly faster to keep a separate array and make only one call to put(float[]). After switching to an array next to the FloatBuffer, I gained in CPU and FPS (’cause I was CPU limited).

If there is any other way to populate the FloatBuffer without degrading performance (for me, memory usage was not an issue), I’m all ears :slight_smile:

Just a note: I’m talking about cases where you need to refresh most or all of the VBO data (meaning something like 10,000 put calls on the FloatBuffer in my case); otherwise I’m pretty sure that a handful of put() calls should be at least as fast as a put(float[]).

The HotSpot VM is supposed to intrinsify the put() call such that it should be identical in performance to a float[] access… maybe this isn’t happening for some reason?

Cas :slight_smile:

SpriteBatch in libgdx caches in a float[] and then flushes to a vertex array (VA). This is the fastest way to do it. See the Sprite shootout thread on JGO. This is for geometry that changes each frame. If yours doesn’t, see SpriteCache, which writes to a VBO.
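
The flush in that approach boils down to something like this on ES 2.0 (a sketch of the idea, not the actual libgdx code; the attribute locations, STRIDE_BYTES and the buffer names are illustrative):


// draw() has been filling the float[] vertices array and advancing idx;
// flush() copies it once into the direct FloatBuffer...
vertexBuffer.clear();
vertexBuffer.put(vertices, 0, idx);

// ...then points the attributes straight at that buffer (client-side
// vertex array: no VBO bound, no glBufferSubData).
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);
vertexBuffer.position(0);
GLES20.glVertexAttribPointer(positionLoc, 3, GLES20.GL_FLOAT, false,
        STRIDE_BYTES, vertexBuffer);
vertexBuffer.position(3); // texture coords follow x, y, z in the interleaved data
GLES20.glVertexAttribPointer(texCoordLoc, 2, GLES20.GL_FLOAT, false,
        STRIDE_BYTES, vertexBuffer);

GLES20.glDrawElements(GLES20.GL_TRIANGLES, quadCount * 6,
        GLES20.GL_UNSIGNED_SHORT, indexBuffer);

idx = 0;
quadCount = 0;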

Even on a direct FloatBuffer? I have no idea how it works internally, but I imagine that accessing a direct buffer is a bit different from a non-direct one.

Theoretically yes. But then Java performance theory has always been a bit of a vague and slippery concept which seems to differ from practice rather a lot.

Cas :slight_smile:

Last time I did this it was indeed roughly identical… on desktop VMs. The Android VM was much, much slower, and combining everything into a big float[] was way faster. Probably why libgdx is doing it this way too.

I’ve been exploring VBOs recently, but on the desktop. java.nio.FloatBuffers for vertices and texture coordinates plus ByteBuffers for colors give pretty good performance: glBufferData/glBufferSubData is giving a 1.5 ms render time for 160,000 quads. Note that I’m not put’ing every vertex every frame, though.

I have no idea about ES and your platform, but have you confirmed that desktop performance is adequate?

I’m back… with numbers!

So I have been running some numbers again. I decided to level the playing field between my test cases.

Before, I was trying to benchmark a static, prefilled X Y Z VBO against a dynamic, non-prefilled one (new X Y every frame).
It felt like I was benchmarking how well the VM could generate random numbers for me. So I ran this test instead! Remember, this is using a single put() with a giant float array.

My Test Case:
X number of quads
32x32 size
Textured
Tinted blue
Pregenerated random X & Y location
VBO and IBO (IBO is static, prefilled)

System Info:
Android 4.1.2 (Jelly Bean)
CPU - Dual-core 972 MHz
GPU - Qualcomm Adreno 305
Resolution - 480x800

Number of Quads | VBO Hint | Delta Time (seconds) | FPS
----------------|----------|----------------------|---------------
10000           | Dynamic  | 0.017 - 0.016        | 58.82 - 62.50
20000           | Dynamic  | 0.033 - 0.032        | 30.30 - 31.25
30000           | Dynamic  | 0.050 - 0.049        | 20.00 - 20.40
40000           | Dynamic  | 0.065 - 0.064        | 15.38 - 15.62
50000           | Dynamic  | 0.082 - 0.081        | 12.19 - 12.34
10000           | Stream   | 0.017 - 0.016        | 58.82 - 62.50
20000           | Stream   | 0.031 - 0.032        | 31.25 - 32.26
30000           | Stream   | 0.048                | 20.83
40000           | Stream   | 0.065 - 0.064        | 15.38 - 15.62
50000           | Stream   | 0.081 - 0.080        | 12.34 - 12.50
10000           | Static   | 0.019                | 52.63
20000           | Static   | 0.037 - 0.036        | 27.02 - 27.77
30000           | Static   | 0.056 - 0.055        | 17.85 - 18.18
40000           | Static   | 0.073 - 0.072        | 13.69 - 13.88
50000           | Static   | 0.092 - 0.090        | 10.86 - 11.11

As for setting the quads to a new location every frame (new random values, 2 randoms per draw call): I’m still bottoming out at 20,000 quads with a 0.043-second delta time, or roughly 23 FPS.

If anyone has any better way to increase my numbers, that would be awesome! I want to make sure I am truly at my limit :stuck_out_tongue:

What are your metrics like without iterating over every quad and calling Random?

Even on my system, if I iterate over every one of the 160,000 quads and change the vertices using Random, it’s quite slow, i.e. +15 ms per render frame, where without the iteration and Random it’s 1.5 ms/frame using a VBO + glBufferSubData. This is not something I would even consider, as there shouldn’t be any need to visit every single quad every single frame in this manner for a normal game.

Once you take that iteration and the calls to Random out of the picture, you should be able to get better metrics of the memory throughput to the video card. Have you tried modifying just a small subset of the quads, just to verify that your render loop is picking up the changes?

Sorry, I hadn’t realised you were on Android and doing this stuff, in which case do ignore any advice I’ve given you and listen to Nate instead. On Android:

  1. VBOs are not actually “accelerated” in any way at all, they’re just system RAM and slow as hell and
  2. Buffers aren’t intrinsified at all and
  3. Dalvik doesn’t have any of that clever stuff with inlining and so on

Cas :slight_smile:

When I rig it to use 20,000 quads and only update the first 10,000 with a new position every frame,
I get a delta time of 0.036 - 0.026 seconds (27.77 to 38.46 FPS).

So that further solidifies that generating new random values each frame is the real killer, just like my metrics pointed out.
I just wish it could be faster, you know :slight_smile:

Are you saying that VBOs are really doing nothing for me? Should I just use a plain old vertex array then?
Can I use an IBO then too? I can’t remember if the IBO requires a VBO (I don’t think it does).

Also, can you explain 2 and 3? I’m not really sure what ‘intrinsified’ buffers means, nor do I know what you mean by inlining.

Are you sure you are measuring VBO upload performance and that you are not capped by your GPU fillrate? Make sure every sprite is 1px or less (for example by having all the geometry outside the viewport).

Additionally, you can create a PRNG with reasonable quality output that’s way faster than Random.next() / Math.random()
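
For example, a bare-bones xorshift generator along these lines (a sketch; the shift triple is one of the standard ones, and the modulo in nextInt is slightly biased but fine for sprite positions):


// Marsaglia-style 64-bit xorshift: a few shifts and XORs per number,
// with no synchronization, so typically much cheaper than java.util.Random.
public final class FastRandom {
    private long seed = System.nanoTime(); // must be non-zero

    public long nextLong() {
        seed ^= (seed << 21);
        seed ^= (seed >>> 35);
        seed ^= (seed << 4);
        return seed;
    }

    // Random int in [0, n)
    public int nextInt(int n) {
        return (int) ((nextLong() >>> 1) % n);
    }
}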

You’ll be needing to look up terms like this in the near future if you’re going to embark upon a career as a programmer, but I’ll start the ball rolling:

“Intrinsification”, which may not actually be a real word outside of compiler design circles, is where the compiler detects some fairly complex high-level code, e.g. Math.sqrt() or FloatBuffer.put(), and replaces the function call with a single machine code instruction (or maybe several), thus making that code as fast as it is possible to be. The desktop JVMs do a lot of this, but the Dalvik VM isn’t quite so clever and doesn’t manage to do so much of it.

“Inlining” is where a small method, e.g. public int getX() { return x; }, is simply copied verbatim into the callsite - rather like cut ’n’ paste on the fly. Instead of pushing a bunch of arguments on the stack, jumping to a subroutine in a totally different area of memory, executing the code, and then popping the return value off the stack, the code is just executed in place, saving all those shenanigans and providing a huge speedup. Inlining can be recursive, that is, inlined functions may themselves have inlined functions. It’s tuneable with some command-line args on the desktop VMs. Again, though, the Dalvik VM doesn’t appear to do much in the way of inlining, though the latest versions of Android might have improved it a bit.
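
(For example, on a desktop HotSpot VM you can watch and tweak it with flags roughly like these - mygame.jar is just a placeholder, and the exact thresholds vary by JVM version:)


java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
     -XX:MaxInlineSize=35 -XX:FreqInlineSize=325 \
     -jar mygame.jar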

Intrinsification and inlining are two of the reasons why Java has made such leaps and bounds in speed versus C++ over the last decade. There are a bunch more things that also help a lot such as bounds check elimination, escape analysis, monomorphic callsite detection, loop unrolling, lock elision, and huge advances in garbage collection and allocation strategy… again, none of which made it in to Dalvik. (You can search these very boards for discussions about all those things, and Google will provide further information).

And finally…

Yes, VBOs gain you absolutely diddly squat on any current ARM devices. The same goes for iOS as for Android. There is no discrete GPU memory, no separate DMA bus, and usually what bus there is, is a crappy 16-bit or maybe 32-bit wide one anyway. The only reasons for VBOs on ARM chipsets are that a) one day they might have these things, though this is a tenuous reason at best, and b) it makes it rather easier to port the same code between desktop and ARM devices. See libgdx. Yes, you can use an index array with plain vertex arrays.
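
(Concretely, that just means keeping the prefilled index data in a client-side ShortBuffer and handing it straight to glDrawElements - a sketch, with indexBuffer and quadCount as in the snippets above:)


// No IBO (or VBO) bound; the indices live in a direct ShortBuffer
GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, 0);
indexBuffer.position(0);
GLES20.glDrawElements(GLES20.GL_TRIANGLES, quadCount * 6,
        GLES20.GL_UNSIGNED_SHORT, indexBuffer);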

Cas :slight_smile:

I went and tried this (using the same test case I had, except the quads were 1x1). I’m getting a delta time of 0.082 - 0.084 at 85,000 quads. That is about the same as the metric for my 32x32 quad test, where I’m only pulling 50,000 quads.

So it looks like I’m GPU fillrate bound, right, since my quad count increased when they were only a pixel big?

That makes me sad… :frowning:

Yes, fillrate capped.

Riven, my friend, that makes me sad a little bit.

I’ll need to go back over my stuff and see where I can gain more numbers, although I think I’m at my limit.
I see that I get a better delta time of 0.077 seconds with 50,000 quads (same test case) when I don’t use VBOs.

Which makes sense, since I no longer need the glBufferSubData call, and princec says VBOs aren’t doing anything for me anyway.