OpenGL lightning fast (managed) VBO mapping

I don’t want to raise unrealistic expectations. I more than doubled (nearly tripled) the speed of the sprite sorter in the SpriteEngine by packing the multiple ordering-arrays into one or two, and feeding those into the radix sort, after having unsuccessfully fiddled with the rendering code for hours. It has yet to be back-ported to SPGL1.
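The SpriteEngine sorter itself isn’t shown, but the key-packing idea is roughly this: instead of radix-sorting on several separate ordering arrays, combine the fields into a single int key so one sort handles them all. The field names and bit widths below are made up for illustration, not SpriteEngine’s actual layout:

```java
// Hypothetical sketch of the key-packing idea: merge several ordering
// fields into one int, most significant field first, so that a single
// radix sort (or plain comparison) respects all of them at once.
public class PackedSortKey {
	// assumed widths: layer = 6 bits, subLayer = 10 bits, y = 16 bits
	static int pack(int layer, int subLayer, int y) {
		return (layer << 26) | (subLayer << 16) | (y & 0xFFFF);
	}

	public static void main(String[] args) {
		int a = pack(1, 0, 500); // layer 1
		int b = pack(2, 0, 10);  // layer 2: sorts after everything in layer 1
		int c = pack(1, 3, 10);  // layer 1, higher sub-layer than 'a'
		// layer dominates, then sub-layer, then y
		System.out.println(a < c && c < b); // prints: true
	}
}
```

One radix sort over packed keys touches far less memory than three passes over parallel arrays, which is presumably where the speedup came from.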

It’s something I’ll probably dodge with Titan, but I had to remove all the function calls from the Droid Assault demo - saved about 1.5 milliseconds, IIRC (but I was only joking).

PS - nice work with the LOS shader too.

Hi

Maybe my question is silly… Why do you use glMapBuffer instead of glBufferSubData? Thank you for this code. I didn’t even know about glMapBufferRange before reading your source code. Is your example also useful for static VBOs?

The graphics driver also has to do verification for glBufferData(…) and glBufferSubData(…).
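For context, the three upload paths being compared look like this in LWJGL 2 (a sketch, not runnable on its own: it needs a live GL context and a bound buffer, and the GL_MAP_* flags come from GL30):

```java
// 1) Full re-specification: the driver validates size/usage and may
//    reallocate the buffer's storage every call.
glBufferData(GL_ARRAY_BUFFER, fb, GL_STREAM_DRAW);

// 2) Partial update: the driver must check the range, and may block
//    if the GPU is still reading from this buffer.
glBufferSubData(GL_ARRAY_BUFFER, 0, fb);

// 3) Unsynchronized mapping: we promise not to touch data the GPU is
//    currently using, so the driver can hand back a pointer without
//    any synchronization at all.
ByteBuffer mapped = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
		GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT, null);
// ... fill 'mapped' ...
glUnmapBuffer(GL_ARRAY_BUFFER);
```

With unsynchronized mapping the burden of correctness shifts from the driver to you, which is exactly why it can be faster.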

I get ~780fps with glBufferSubData()
I get ~740fps with glBufferData()

Quite good, but nowhere near as good as the 1450fps I get with the mapped VBO.


import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.*;
import org.lwjgl.opengl.*;

public class MappedVertexBufferObjectTest {
	private static float packRGBA(int r, int g, int b, int a) {
		return Float.intBitsToFloat((r << 0) | (g << 8) | (b << 16) | (a << 24));
	}

	public static void main(String[] main) throws LWJGLException {
		Display.setDisplayMode(new DisplayMode(800, 600));
		Display.create();

		{
			glMatrixMode(GL_PROJECTION);
			glLoadIdentity();
			glOrtho(0, 800, 600, 0, -1, +1);

			glMatrixMode(GL_MODELVIEW);
			glLoadIdentity();
		}

		final int VA = 1;
		final int VBO_DATA = 2;
		final int VBO_SUBDATA = 3;
		final int VBO_MAPPED = 4;

		int renderStrategy = VBO_SUBDATA;

		boolean isOrphaning = false;
		boolean isUnsynchronized = true;
		MappedVertexBufferObjectProvider provider;
		provider = new MappedVertexBufferObjectProvider(GL_ARRAY_BUFFER, GL_STREAM_DRAW, isUnsynchronized);

		glEnableClientState(GL_VERTEX_ARRAY);
		glEnableClientState(GL_COLOR_ARRAY);

		int stride = (2 + 1) << 2;
		{
			// round up to multiple of 16 (for SIMD)
			stride += 16 - 1;
			stride /= 16;
			stride *= 16;
		}

		int strideFloat = stride >> 2;

		int drawCalls = 64;
		int trisPerDrawCall = 4;

		long lastSecond = System.nanoTime();
		int frameCount = 0;

		ByteBuffer bb = BufferUtils.createByteBuffer(trisPerDrawCall * 3 * stride);
		FloatBuffer fb = bb.asFloatBuffer();
		bb.position(2 << 2);

		while (!Display.isCloseRequested()) {
			provider.nextFrame();

			glClearColor(0, 0, 0, 1);
			glClear(GL_COLOR_BUFFER_BIT);

			for (int x = 0; x < drawCalls; x++) {
				MappedVertexBufferObject vbo = null;
				if (renderStrategy == VBO_DATA || renderStrategy == VBO_SUBDATA || renderStrategy == VBO_MAPPED) {
					vbo = provider.nextVBO();
					vbo.ensureSize(trisPerDrawCall * 3 * stride);

					glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
					glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

					if (renderStrategy == VBO_MAPPED) {
						fb = vbo.map().asFloatBuffer();
					}
				}
				if (renderStrategy != VBO_MAPPED) {
					fb.clear();
				}

				for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 16).put(y * 3 + 16);
					fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 32).put(y * 3 + 16);
					fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 16).put(y * 3 + 32);
					fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
				}

				if (renderStrategy == VBO_MAPPED) {
					vbo.unmap();
				} else {
					fb.flip();

					if (renderStrategy == VBO_DATA) {
						glBufferData(GL_ARRAY_BUFFER, fb, GL_STREAM_DRAW);
					}
					if (renderStrategy == VBO_SUBDATA) {
						glBufferSubData(GL_ARRAY_BUFFER, 0, fb);
					}
					if (renderStrategy == VA) {
						glVertexPointer(2, stride, fb);
						glColorPointer(4, true, stride, bb);
					}
				}

				//	

				glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);

				if (renderStrategy == VBO_DATA || renderStrategy == VBO_SUBDATA || renderStrategy == VBO_MAPPED) {
					if (isOrphaning) {
						vbo.orphan();
					}
				}
			}
			//

			Display.update();

			frameCount++;
			if (System.nanoTime() > lastSecond + 1_000_000_000L) {
				lastSecond += 1_000_000_000L;
				Display.setTitle(frameCount + "fps / " + (1000.0f / frameCount) + "ms");
				frameCount = 0;
			}
		}

		Display.destroy();
	}
}
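A note on the packRGBA() trick in the code above: it stuffs four color bytes into the bit pattern of a single float, so one fb.put() writes exactly what glColorPointer(…, GL_UNSIGNED_BYTE, …) reads back as R,G,B,A on a little-endian buffer. The round trip can be checked without a GL context (just avoid values whose bit pattern is a float NaN, e.g. alpha 0xFF with high color bits, since NaN bits are not guaranteed to survive the float conversion):

```java
public class PackRGBACheck {
	static float packRGBA(int r, int g, int b, int a) {
		return Float.intBitsToFloat((r << 0) | (g << 8) | (b << 16) | (a << 24));
	}

	public static void main(String[] args) {
		float f = packRGBA(0x11, 0x22, 0x33, 0x44);
		// little-endian memory order is R,G,B,A - exactly what
		// GL_UNSIGNED_BYTE color data expects
		System.out.printf("0x%08X%n", Float.floatToRawIntBits(f)); // prints: 0x44332211
	}
}
```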


This is cool, thanks for sharing.
I notice that it uses GL1.5 and 1.1. Is that because you are trying to maintain backwards compatibility with older video cards? Would you use these methods with more recent video cards?

glEnableClientState and glVertexPointer are indeed deprecated, but to use glVertexAttribPointer, I’m forced to use shaders. Not that there’s anything wrong with that, but it would make this code dump rather verbose.
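For completeness, the shader-based path would look roughly like this - a sketch only: the attribute names and the buildProgram() helper are hypothetical, and it assumes a compiled shader program, LWJGL 2 GL20 static imports, and a live GL context:

```java
// Sketch of the glVertexAttribPointer equivalent of the two
// client-state pointers above. "in_pos"/"in_col" are made-up names.
int program = buildProgram(); // compile + link a trivial shader pair (not shown)
int posLoc = glGetAttribLocation(program, "in_pos");
int colLoc = glGetAttribLocation(program, "in_col");

glUseProgram(program);
glEnableVertexAttribArray(posLoc);
glEnableVertexAttribArray(colLoc);

// same stride and byte offsets as the fixed-function version;
// 'true' normalizes the unsigned bytes to 0.0..1.0 in the shader
glVertexAttribPointer(posLoc, 2, GL_FLOAT, false, stride, 0 << 2);
glVertexAttribPointer(colLoc, 4, GL_UNSIGNED_BYTE, true, stride, 2 << 2);
```

So the buffer-upload strategies compared in this thread carry over unchanged; only the attribute setup and the shader boilerplate differ.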

  1. Putting data into one VBO and sending a batch of draw-calls later:

VSYNC                                            | <-- GPU starts late in frame VSYNC
| [               generate data                 ][        draw data          ]    |

  2. Putting data into different VBOs, sending draw-calls immediately:

VSYNC         | <-- GPU        | <-- GPU        | <-- GPU                       VSYNC
| [ gen.data ]     [ gen.data ]     [ gen.data ]                                  |
              [ draw data ]    [ draw data ]    [ draw data ]

So: reduce the odds of dropping a frame by not handing the GPU its draw calls (too) late in the vsync time slot.

By default, the Nvidia driver will buffer up to 3 frames. The whole point of pipelining rendering is to avoid dropping a frame. I seriously doubt that using a single VBO will affect the FPS or cause a frame drop. The actual frame won’t be ready until you do Display.update(). If you were to do an extreme number of OpenGL commands, the command buffer might get full and the command blocks until there’s space in the command buffer. If these commands are very simple, the command buffer will be drained extremely quickly and the GPU will stall. That’s called a CPU bottleneck and is completely unrelated to the problem you’re trying to solve (glMapBuffer() performance).

I know some of the theory too, where’s the numbers? =S

Rendering 8K trees (5 tris each), about 50x50 fragments.

data                     | batches          | framerate
-------------------------+------------------+----------
glBufferData()           | 1x glDrawArrays  | 57fps
glBufferData()           | 16x glDrawArrays | 53fps
glMapBuffer()            | 1x glDrawArrays  | 52fps
glMapBuffer()            | 16x glDrawArrays | 40fps
glMapBufferRange(UNSYNC) | 1x glDrawArrays  | 61fps
glMapBufferRange(UNSYNC) | 16x glDrawArrays | 64fps

Thank you. I very much approve of what you’re doing then. Please proceed immediately so Cas can make more awesome games! =D Of course you deserve a medal!