VBO

:frowning: So I am really mad right now. Have been working for 2 days on getting a dynamic vbo to work…basically a sprite batcher of sorts and can’t get anything right. I have read probably every tutorial that you can find by googling VBO, dynamic VBO, interleaved VBO…I even have looked through libgdx’s source for enlightenment.

Right now I have stuff actually showing up on screen without crashes or black screens. I am trying to use an interleaved VBO with Vertex, Color, Texture. Vertex is 2 floats, Color is 4 floats, and Texture is 2 floats.

I have no idea if I am doing anything right so here is what I have. Tell me what is horribly wrong without being too rude.

vctBuffer = BufferUtils.createFloatBuffer(vert.size());
vctBuffer.put(vert.shrink());
vctBuffer.flip();

int stride = (2 + 4 + 2) * 4;
bufferData(vctPointer, vctBuffer);
Util.checkGLError();

int offset = 0 * 4;
GL11.glVertexPointer(2, GL11.GL_FLOAT, stride, offset);
Util.checkGLError();
offset = 4 * 4;
GL11.glColorPointer(4, GL11.GL_FLOAT, stride, offset);
Util.checkGLError();
offset = (4 + 2) * 4;
GL11.glTexCoordPointer(2, GL11.GL_FLOAT, stride, offset);
Util.checkGLError();

GL11.glDrawArrays(GL11.GL_QUADS, 0, draws * 4);

GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);
GL11.glDisableClientState(GL11.GL_TEXTURE_COORD_ARRAY);
GL11.glDisableClientState(GL11.GL_COLOR_ARRAY);
draws = 0;
index = 0;
vctBuffer.clear();
GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, 0);

I got a version working using vertex arrays that could draw 40k sprites before the fps dropped, but I want closer to 50k stable, with the fps only dropping around 150k. I know about fill-rate limits and I am testing with very small sprites, 4 pixels.

The vert.shrink() stuff is from my custom array class that grows if it is too small. size() returns the number of elements used and shrink() returns a float[] of just those elements.

If anyone could also explain how to use the drawElements command I would be very happy. I know what it does and why it can be faster but just don’t get how to create the indices. I am really put out at how ridiculously hard this is, as there are very few tutorials on this stuff that go further than drawing/rendering a static cube/triangle.

Maybe this helps?

Sure, this tutorial handles only drawing a triangle, in various ways, but I expect you can move on from there pretty quickly, as you’re working from a base that works.

If your code works (your first paragraph implied that it wasn’t working), you’ve got basic VBOs working fine. I can give you some performance tips though:

  • Using floats for your color data is waaay more precision than you need. Just go with 4 bytes for RGBA. Even if you don’t have alpha, include it anyway to keep the data a little better aligned.
  • You could also use (unsigned) shorts for texture coordinates. Shorts have enough precision by far (up to 65536x65536 textures, which aren’t even supported by hardware), so they’ll be fine with tiled images too (there’s a small layout sketch after this list).
  • Don’t use glBufferData(). Instead, use glMapBuffer(), which gets you a ByteBuffer. Either fill this ByteBuffer directly with your sprite data, or just copy the data you have in vctBuffer to it and then unmap it. The first is of course the fastest but a little less flexible, especially if you need to dynamically expand the buffer, so go for the second one since it’s easily implemented (I can post code).
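
To make the first two points concrete, here is a minimal sketch of what the reduced 16-byte layout could look like on the pointer side, assuming LWJGL’s fixed-function GL11 calls (uses org.lwjgl.opengl.GL11). The texture-matrix scaling and the textureWidth/textureHeight names are my own assumptions, not something from your code:

	int stride = 8 + 4 + 4; // 2 float position + 4 byte color + 2 short texcoords = 16 bytes

	GL11.glVertexPointer(2, GL11.GL_FLOAT, stride, 0);        // bytes 0-7
	GL11.glColorPointer(4, GL11.GL_UNSIGNED_BYTE, stride, 8); // bytes 8-11, normalized to 0..1 by GL
	GL11.glTexCoordPointer(2, GL11.GL_SHORT, stride, 12);     // bytes 12-15

	// GL_SHORT texture coordinates are not normalized by the fixed-function pipeline,
	// so one option (an assumption on my part) is to store texel coordinates and
	// rescale them with the texture matrix:
	GL11.glMatrixMode(GL11.GL_TEXTURE);
	GL11.glLoadIdentity();
	GL11.glScalef(1f / textureWidth, 1f / textureHeight, 1f); // hypothetical texture size
	GL11.glMatrixMode(GL11.GL_MODELVIEW);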

Simply minimizing the data will reduce it to 16 bytes per vertex, down from 32 bytes. You’re currently transferring (32 bytes * 4 vertices * 50 000 sprites * 60 FPS) = 366 MBs of data per second to your graphics card. This is probably your main bottleneck, so focus on that. I can help you with glMapBuffer() after you get that working, since removing glBufferData() avoids an extra copy operation of the data to the driver.

To use glDrawElements() you need (at least) two VBOs, one (or more) with vertex data and one with indices. Using the indices, glDrawElements() creates primitives from the vertices supplied. This is very useful for 3D models, since the same vertices can be reused for 4+ triangles, and the vertex matrix transformations etc are only performed once. Sprites are just a quad made of four vertices, and nothing is shared between sprites. The only time that sprites can benefit from indices is if you can’t use GL_QUADS (= on Android or on OpenGL 3+ since it was deprecated). In that case you can create the four vertices just like you do now for GL_QUADS, then create indices to create two triangles to form a quad from 4 vertices and 3*2=6 indices. Without indices, you’d need 6 vertices to create those two triangles, since you’d have to duplicate the 2 shared vertices. Assuming the above 16 bytes per vertex:

With indices: (16 bytes * 4 vertices) + (6 short indices * 2 bytes) = 64 + 12 = 76 bytes, plus 33.3% less vertex processing (4 vertices instead of 6).
Without indices: 16 bytes * 6 vertices = 96 bytes.
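
For reference, here is a minimal sketch of how such an index buffer could be built for a batch of quads (uses org.lwjgl.BufferUtils, java.nio.ShortBuffer and org.lwjgl.opengl.GL11/GL15; maxSprites and spriteCount are hypothetical names, not from your batcher):

	// Pattern per quad: 0,1,2 then 2,3,0 (two triangles sharing two corners).
	ShortBuffer indices = BufferUtils.createShortBuffer(maxSprites * 6);
	for (int i = 0; i < maxSprites; i++) {
	    int v = i * 4; // first vertex of quad i
	    indices.put((short) v).put((short) (v + 1)).put((short) (v + 2));
	    indices.put((short) (v + 2)).put((short) (v + 3)).put((short) v);
	}
	indices.flip();

	int ibo = GL15.glGenBuffers();
	GL15.glBindBuffer(GL15.GL_ELEMENT_ARRAY_BUFFER, ibo);
	GL15.glBufferData(GL15.GL_ELEMENT_ARRAY_BUFFER, indices, GL15.GL_STATIC_DRAW);

	// Later, with the vertex VBO bound and the pointers set up:
	GL11.glDrawElements(GL11.GL_TRIANGLES, spriteCount * 6, GL11.GL_UNSIGNED_SHORT, 0);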

EDIT: I forgot another limitation: 32-bit indices are very slow, so in practice you’re limited to 16-bit indices. That gives you a maximum of 65 536 possible indices, meaning that you’re limited to drawing 16 384 sprites per batch. In practice that’s rarely a problem for sprites since you usually need to change some OpenGL state for different sprites, in which case you can’t batch many of them together anyway.

TL;DR: I don’t see how using indices will improve anything for you.

It wasn’t working, but I got it working except I think the offsets are all wrong because it looks all funky. I read everywhere that interleaving data can give a performance boost, as does drawElements. If I use different data types I will need to split things up, right?

So to use drawElements I need another VBO full of the indices that will be used. What are the indices? 0 1 2 2 3 0, like you would with glBegin and glEnd? I may not use it, I just want to know how.

Revin, I have read that over and over and the problem with it is that everything is very basic. That is probably the one that made the most sense to me and helped a lot.

I know that inevitably I will be fillrate limited so any tips would also help.

You do not have to split things even if you use different data types. The easiest way is to just use a ByteBuffer instead of a FloatBuffer and use the put***() methods (putFloat(), putShort(), put() <— last one is for bytes). You can then do everything as usual, but the stride and offsets will be different.
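
For example, filling one vertex of the 16-byte layout could look like this (a sketch only; x, y are floats, r, g, b, a are bytes, u, v are shorts and maxVertices is a hypothetical capacity, just to show the put calls):

	ByteBuffer vctBuffer = BufferUtils.createByteBuffer(maxVertices * 16); // direct, native byte order
	vctBuffer.putFloat(x).putFloat(y);     // position, bytes 0-7
	vctBuffer.put(r).put(g).put(b).put(a); // color as 4 unsigned bytes, bytes 8-11
	vctBuffer.putShort(u).putShort(v);     // texture coordinates as shorts, bytes 12-15
	// ...repeat for the remaining vertices, then flip() and upload (or map) as before.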

Your current stride and offsets are looking good. Could you post more code or a screenshot if you still have problems?

You almost certainly won’t be fill-rate limited unless your sprites are very big. Your main bottleneck will most likely be your CPU and your RAM speed. I have a particle test which draws colored GL_POINTS particles using VBOs, glMapBuffer() and even MappedObject to minimize the amount of data that has to be handled and uploaded by the CPU, yet I am still easily bottlenecked by my CPU. Furthermore my points only take up 12 bytes since I don’t have any texture coordinates for them.

This little particle engine can handle 700 000 particles at 60 FPS using only one CPU core with a point size of 1 (= exactly one pixel covered per particle). That’s only 700 000 pixels filled per frame. However, since I am CPU limited, I can increase the point size without any FPS drop up to 13 and still stay fixed at 60 FPS. This is a fillrate of 700 000 * 13 * 13 = 118 300 000 pixels per frame. Yes, that’s 118 MILLION pixels per frame, or around 7.1 billion pixels per second (60 FPS). My GPU, equivalent to an NVidia GTX 275, has a theoretical pixel fillrate of 16.1 billion pixels per second. A brand new high-end card has twice that performance.

The thing is, I’m not rendering quads. I’m rendering points, so each particle comes from only a single vertex. You have 4 vertices per quad. My program is CPU limited, so having 4 times as much data to create quads that fill the same number of pixels is going to give 4 times as much CPU work while the fillrate requirements stay the same. Since I’d only be able to pump out 1/4th as many particles as before with quads, I could double the particle size (= 4x the pixel area) and still not be GPU limited. Add that I am using MappedObjects (around a 40% performance boost) and glMapBuffer() (around 40% more) and you quickly realize that with quads you’d be able to handle around 100 000 sprites, due to the increased data size and not using glMapBuffer(). MappedObjects can’t really be used well for sprites… And sure enough, rendering only 100 000 particles allows me to buff up the point size to 42.

Don’t get me wrong here! I’m not saying that your code or quads are bad! Quads are a lot more flexible (texturing, etc.). I’m just saying that simply handling the data takes a lot of CPU processing power and that GPUs are really fast. Rendering 100 000 sprites without any FPS drop is definitely attainable, but you won’t be fillrate limited unless your sprites are larger than around 40x40 pixels.

100 000 particles covering 42x42 pixels each (FPS drop due to print screen):

http://imageshack.us/a/img6/3101/particlesm.png

I got it working properly now as it turns out. So I get a ByteBuffer from glMapBuffer() and put everything in that and still use offsets, hmm… I will work on this and see what I get.

Right now I can get 50k with 28-30fps. Really sad I know but I will get there I hope.

I want some particles to be 256x256 max, maybe higher. I want it to be stable at 50k with no big performance hits, but then start to slow down as it gets to 100-150k.

I always thought I was fill rate limited hmmm…

Anyways ty for the help. Very much appreciated.

I keep getting null pointers with this

	public static ByteBuffer bufferData(int id, ByteBuffer buff) {
	    if (GLContext.getCapabilities().GL_ARB_vertex_buffer_object) {
	        Util.checkGLError();
	        ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, id);
	        Util.checkGLError();
	        return ARBVertexBufferObject.glMapBufferARB(GL15.GL_ARRAY_BUFFER, GL15.GL_DYNAMIC_DRAW, buff);
	    }
	    return null;
	}

The problem with all of this is that any googling gives non-LWJGL OpenGL, which has a different API than LWJGL, and LWJGL has no examples that I can find.

Where exactly do you get a NullPointerException?

I get it on the return statement

Weird, because the exact same code works in my first OpenGL particle thing, which is just fixed-function.

I think I need to scrap this whole thing and restart as I must be doing something horribly wrong.

null is returned because that extension isn’t supported then…?

No, that is what the if check does.

It works because, as I said, on a different project it gave me the ByteBuffer.

According to the docs, it will return null if it can’t find space for the buffer, I think.

Hehe, this is all after hundreds of unexplained crashes and VM crashes. You know, “send error report to Windows?” and “VM crash: an error log was created”.

I guess this really is not as trivial as all the tutorials make it out to be.

glMapBuffer() allows you to map parts of the buffer’s data to a ByteBuffer which it returns. However, to be able to map it you also need to create the VBO’s data buffer. Basically, you need a glBufferData(int target, long data_size, int usage) call to initialize the buffer’s data to the given capacity, and then map it. It returns null because there’s nothing to map yet.

There’s no problem with recreating the buffer’s data each frame. It’s actually preferred in most cases, since mapping the buffer may cause a stall if the old data is still needed somewhere. Recreating the buffer’s data each frame makes sure that can’t happen, since the driver can just store the old data until it’s no longer needed.
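
A rough sketch of that per-frame pattern with the GL15 bindings (vboId, bufferSizeInBytes and oldMapped are placeholder names of mine, not from your code):

	GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vboId);
	// (Re)allocate the buffer's storage first; without this, glMapBuffer has nothing to map.
	// Passing only a size also orphans the old contents, so the driver won't stall on them.
	GL15.glBufferData(GL15.GL_ARRAY_BUFFER, bufferSizeInBytes, GL15.GL_STREAM_DRAW);
	// Note: the second argument of glMapBuffer is an access flag (GL_WRITE_ONLY / GL_READ_ONLY),
	// not a usage hint. oldMapped may be null; passing the previous ByteBuffer lets LWJGL reuse it.
	ByteBuffer mapped = GL15.glMapBuffer(GL15.GL_ARRAY_BUFFER, GL15.GL_WRITE_ONLY, oldMapped);
	// ...write this frame's vertex data into 'mapped'...
	GL15.glUnmapBuffer(GL15.GL_ARRAY_BUFFER);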

EDIT: Here’s a test program to demonstrate VBOs with glMapBuffer(): http://www.java-gaming.org/?action=pastebin&id=257

Sadly, the test program just shows a really messed up screen of random colors. Is that what is supposed to happen?

And I switched to glMapBuffer() with no performance gain by following that code. ;/

You won’t see a perf. gain from glMapBuffer() until you’re doing rather a lot of rendering and state changing. FWIW it took me about 60 hours to get this all working right :slight_smile: You might want to take a look at the DefaultSpriteRenderer class in the Revenge of the Titans source code (and a bit of a mosey around in general in there). I think I’ve nailed the absolute optimum set-up there in so far as asynchronous rendering and Java computation goes. The only improvement I could make is to utilise geometry shaders and move most of that Java code into the driver but that excludes most systems in the wild.

Cas :slight_smile:

What is your current performance?

From 1k sprites to 10k, 25k, 50k, 100k, 500k.

I want to know if I am asking for too much.

Totally varies depending on the sprites you’re drawing, the fillrate of the card, and the power of your CPU.

I can draw just 20,000 sprites at 60fps on 1920x1200 if they’re all transparent and moderately large (say 64x64 ish). I say draw, but I’m also including the animation logic that’s running in my little sprite benchmark. Whereas if the sprites are all generally quite small, like in Revenge of the Titans, and some of them are opaque, I can draw a few thousand more. I’m on a 2.6GHz i7 with an Nvidia GTX 280 running Java 7 server with 2-tiered compilation.

Make no mistake, a couple of thousand sprites in a single video frame is a lot of sprites!

Sorry - got my numbers screwy - should be a factor of 10 more sprites!

Cas :slight_smile:

Hm?

Cas :slight_smile:

200k??? O.o I keep thinking I am doing something wrong.

Libgdx’s spritebatcher on my comp is about 10-13 fps faster than my spritebatcher.

Libgdx gets 42-44 fps on my comp at 40k alpha-blended 14-pixel sprites. With everything else the same, I get 30-31.

i5 2.5GHz with turbo boost to like 2.7-2.8, and well… my meh laptop only ever uses the integrated chip.

On my desktop, which is a quad core 2.6 with a GeForce 250, I get maybe 10k more sprites, so I think I need to use the GPU more.

I guess I should not be complaining about performance but still… >:(

Edit:
hehe, Java2D using opaque sprites and volatile images gets me 20-22 fps at 40k sprites.

Bah my edit is confusing :slight_smile: No I mean, 20k sprites. 20k is good for 60fps on good hardware.

Cas :slight_smile: