OpenGL lightning fast (managed) VBO mapping

For those not interested in an elaborate background story, I’ll sum up the functionality of the code below:

  • It uses unsynchronized mapped VBOs
  • To guarantee this is safe, it cycles through 6 VBOs (the worst case - see below)
  • For every frame, it picks the next VBO (round robin)
  • By the time a VBO is reused, it is 'old' enough (6 frames) that it is guaranteed to no longer be in use by the GPU

Here is a straightforward code dump:


import java.nio.ByteBuffer;

import org.lwjgl.opengl.ARBMapBufferRange;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL30;
import org.lwjgl.opengl.GLContext;

import static org.lwjgl.opengl.GL15.*;

public class Unsync {
	// triple buffering in stereo mode is rather rare though..
	private static final int MAX_FRAMEBUFFER_COUNT = 2 * 3;

	private final int glTarget, glUsage;
	private final int[] bufferHandles, requestedSizes, allocatedSizes;
	private int currentBufferIndex;

	public Unsync(int glTarget, int glUsage) {
		this.glTarget = glTarget; // GL_ARRAY_BUFFER, GL_ELEMENT_ARRAY_BUFFER
		this.glUsage = glUsage; // GL_STATIC_DRAW, GL_STREAM_DRAW

		requestedSizes = new int[MAX_FRAMEBUFFER_COUNT];
		allocatedSizes = new int[MAX_FRAMEBUFFER_COUNT];

		bufferHandles = new int[MAX_FRAMEBUFFER_COUNT];
		for (int i = 0; i < this.bufferHandles.length; i++) {
			bufferHandles[i] = glGenBuffers();
		}

		currentBufferIndex = -1; // nextFrame() must be called before the first use
	}

	public void nextFrame() {
		currentBufferIndex = (currentBufferIndex + 1) % MAX_FRAMEBUFFER_COUNT;
	}

	public void bind() {
		glBindBuffer(glTarget, currentBufferHandle());
	}

	public int currentBufferHandle() {
		return bufferHandles[currentBufferIndex];
	}

	public void ensureSize(int size) {
		assert size > 0;

		requestedSizes[currentBufferIndex] = size;
		if (size > allocatedSizes[currentBufferIndex]) {
			glBufferData(glTarget, size, glUsage);
			allocatedSizes[currentBufferIndex] = size;
		}
	}

	public void trimToSize() {
		if (requestedSizes[currentBufferIndex] != allocatedSizes[currentBufferIndex]) {
			glBufferData(glTarget, requestedSizes[currentBufferIndex], glUsage);
			allocatedSizes[currentBufferIndex] = requestedSizes[currentBufferIndex];
		}
	}

	public ByteBuffer map() {
		long offset = 0;
		long length = requestedSizes[currentBufferIndex];

		if (GLContext.getCapabilities().OpenGL30) {
			int flags = GL30.GL_MAP_WRITE_BIT | GL30.GL_MAP_UNSYNCHRONIZED_BIT;
			return GL30.glMapBufferRange(glTarget, offset, length, flags, null);
		}

		if (GLContext.getCapabilities().GL_ARB_map_buffer_range) {
			int flags = ARBMapBufferRange.GL_MAP_WRITE_BIT | ARBMapBufferRange.GL_MAP_UNSYNCHRONIZED_BIT;
			return ARBMapBufferRange.glMapBufferRange(glTarget, offset, length, flags, null);
		}

		return GL15.glMapBuffer(glTarget, GL15.GL_WRITE_ONLY, null);
	}

	public void unmap() {
		glUnmapBuffer(glTarget);
	}

	public void deleteAll() {
		for (int i = 0; i < this.bufferHandles.length; i++) {
			glDeleteBuffers(this.bufferHandles[i]);
			this.bufferHandles[i] = -1;
		}
	}
}


Unsync vbo = new Unsync(...);
while (true) { // render loop
	vbo.nextFrame(); // mandatory: selects the next VBO in the round-robin

	vbo.bind();
	vbo.ensureSize(bytesInVBO);
	ByteBuffer mapped = vbo.map();
	// fill it
	vbo.unmap();

	// render things

	// swap buffers
}




OpenGL VBO performance is incredibly hard to optimize. Once you've managed to fill the VBO data as fast as possible, you're pretty much relying on the performance of glMapBuffer(…) and glUnmapBuffer() to pump the data over to the graphics card.

It turns out that these calls have significant overhead, because the driver has to verify that the memory block it is about to return is not currently in use by the GPU. Especially when doing many of these calls per frame for small batches of geometry, the framerate quickly drops under 60Hz: on my particular (low-end) graphics card, I render 64 draw calls of 4 tiny triangles each per frame and end up with an abysmal 45fps.


@@      // what not to do...
      while (!Display.isCloseRequested()) {
         glClearColor(0, 0, 0, 1);
         glClear(GL_COLOR_BUFFER_BIT);

         for (int x = 0; x < drawCalls; x++) {
            
            glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
            glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

@@            FloatBuffer fb = glMapBuffer(...).asFloatBuffer();
            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * floatStride);
               fb.put(x * 3 + 16).put(y * 3 + 16);
               fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

               fb.position((i++) * floatStride);
               fb.put(x * 3 + 32).put(y * 3 + 16);
               fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

               fb.position((i++) * floatStride);
               fb.put(x * 3 + 16).put(y * 3 + 32);
               fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
            }
@@            glUnmapBuffer();

            glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);
         }
         //

         Display.update();
      }

Taking full advantage of the GPU's performance means we have to keep the driver from doing these costly verifications. At first I thought that using a pool of VBOs to guarantee the buffers are no longer used in rendering would be enough, but the driver still has to perform a lot of checks to prove what the application already knows. We somehow have to make the driver trust our input and skip the checks altogether.

Fortunately, we have [icode]glMapBufferRange( … | GL_MAP_UNSYNCHRONIZED_BIT)[/icode] to do exactly that! But it leaves us with a problem: now we have to ensure that every VBO we map is backed by memory guaranteed not to be in use by the GPU. As the GPU is fully asynchronous, that's no easy feat.

But let's first check the performance of simply using 1 VBO with glMapBufferRange(…) instead of glMapBuffer(…). I got ~1450fps, an improvement of more than a factor of 32! Awesome! The downside is that the results are, as the spec says, 'undefined', and indeed I see a lot of garbled rendering in the framebuffer.


@@         // what not to do either...
            glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
            glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

-           FloatBuffer fb = glMapBuffer(..., GL_WRITE_ONLY, ...).asFloatBuffer();
+           FloatBuffer fb = glMapBufferRange(..., GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT, ...).asFloatBuffer();
            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * floatStride);
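
Spelled out with the actual LWJGL parameters, the unsynchronized mapping call looks roughly like this (a minimal sketch; sizeInBytes is a placeholder for the number of bytes you are about to write):

// map the currently bound VBO for writing and tell the driver to skip synchronization;
// the application itself must guarantee the GPU is no longer reading this memory
int flags = GL30.GL_MAP_WRITE_BIT | GL30.GL_MAP_UNSYNCHRONIZED_BIT;
ByteBuffer mapped = GL30.glMapBufferRange(GL_ARRAY_BUFFER, 0, sizeInBytes, flags, null);
FloatBuffer fb = mapped.asFloatBuffer();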

After a bit of messing around, I found a way to guarantee that the VBOs we map are no longer in use by the GPU. On a (common) double buffered framebuffer, 1 frame is being rendered into while the other frame is displayed. On a triple buffered setup, 2 frames are being rendered into and the last frame is displayed. On a stereo triple buffered setup, 4 frames are being rendered into and the last pair of frames is displayed. This means that up to 6 frames can be in flight in any game!

So if we create an array (of length 6) of lists of VBOs, then each frame we can pick the list that was last used 6 frames ago and is therefore guaranteed not to be used in any rendering. Every frame we reuse and/or allocate as many VBOs as we need, trusting that we will encounter these VBOs again 6 frames later.

It wouldn’t be a ‘shared code’ post if I didn’t dump the full code, so you can take advantage of the performance boost of doing the driver’s verification work yourself:


import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.*;
import org.lwjgl.opengl.*;

public class MappedVBOTest {
	// pack four color bytes into the bit pattern of a float, so the color can be written through a FloatBuffer
	private static float packRGBA(int r, int g, int b, int a) {
		return Float.intBitsToFloat((r << 0) | (g << 8) | (b << 16) | (a << 24));
	}

	public static void main(String[] main) throws LWJGLException {
		Display.setDisplayMode(new DisplayMode(800, 600));
		Display.create();

		{
			glMatrixMode(GL_PROJECTION);
			glLoadIdentity();
			glOrtho(0, 800, 600, 0, -1, +1);

			glMatrixMode(GL_MODELVIEW);
			glLoadIdentity();
		}

@@		boolean isUnsynchronized = true;
		MappedVertexBufferObjectProvider provider;
		provider = new MappedVertexBufferObjectProvider(GL_ARRAY_BUFFER, GL_STATIC_DRAW, isUnsynchronized);

		glEnableClientState(GL_VERTEX_ARRAY);
		glEnableClientState(GL_COLOR_ARRAY);

		int stride = (2 + 1) << 2;
		{
			// round up to multiple of 16 (for SIMD)
			stride += 16 - 1;
			stride /= 16;
			stride *= 16;
		}

		int strideFloat = stride >> 2;

		int drawCalls = 64;
		int trisPerDrawCall = 4;

		long lastSecond = System.nanoTime();
		int frameCount = 0;

		while (!Display.isCloseRequested()) {
@@			provider.nextFrame();

			glClearColor(0, 0, 0, 1);
			glClear(GL_COLOR_BUFFER_BIT);

			for (int x = 0; x < drawCalls; x++) {
@@				MappedVertexBufferObject vbo = provider.nextVBO();
				vbo.ensureSize(trisPerDrawCall * 3 * stride);

				glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
				glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

				FloatBuffer fb = vbo.map().asFloatBuffer();
				for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 16).put(y * 3 + 16);
					fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 32).put(y * 3 + 16);
					fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

					fb.position((i++) * strideFloat);
					fb.put(x * 3 + 16).put(y * 3 + 32);
					fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
				}
				vbo.unmap();

				glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);
			}
			//

			Display.update();

			frameCount++;
			if (System.nanoTime() > lastSecond + 1_000_000_000L) {
				lastSecond += 1_000_000_000L;
				Display.setTitle(frameCount + "fps / " + (1000.0f / frameCount) + "ms");
				frameCount = 0;
			}
		}

		Display.destroy();
	}
}

Set [icode]isUnsynchronized = false[/icode], and you’ll see the framerate drop by anything in the realm of a factor of 30 to 60 (!).

AMD Radeon 5500: 1450fps vs 45fps
AMD Radeon 5870: 5250fps vs 88fps


import java.util.*;

public class MappedVertexBufferObjectProvider {
	// triple buffering in stereo mode is rather rare though..
	private static final int MAX_WINDOW_BUFFER_COUNT = 2 * 3;

	private final int glTarget;
	private final int glUsage;
	private final boolean unsync;

	@SuppressWarnings("unchecked")
	public MappedVertexBufferObjectProvider(int glTarget, int glUsage, boolean unsync) {
		this.glTarget = glTarget; // GL_ARRAY_BUFFER, GL_ELEMENT_ARRAY_BUFFER
		this.glUsage = glUsage; // GL_STATIC_DRAW, GL_STREAM_DRAW
		this.unsync = unsync;

		frameToBufferObjects = new ArrayList[MAX_WINDOW_BUFFER_COUNT];
		for (int i = 0; i < frameToBufferObjects.length; i++) {
			frameToBufferObjects[i] = new ArrayList<>();
		}
	}

	final List<MappedVertexBufferObject>[] frameToBufferObjects;

	private int frameIndex = -1;
	private int vboIndex = -1;

	public void nextFrame() {
		frameIndex += 1;
		frameIndex %= frameToBufferObjects.length;

		vboIndex = -1;
	}

	public MappedVertexBufferObject nextVBO() {
		if (frameIndex == -1) {
			throw new IllegalStateException("not in a frame");
		}
		vboIndex += 1;

		List<MappedVertexBufferObject> vbos = frameToBufferObjects[frameIndex];
		if (vboIndex == vbos.size()) {
			vbos.add(new MappedVertexBufferObject(glTarget, glUsage, unsync));
		}

		MappedVertexBufferObject object = vbos.get(vboIndex);
		object.bind();
		return object;
	}

	public void orphanAll() {
		for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
			for (MappedVertexBufferObject object : vbos) {
				object.orphan();
			}
		}
	}

	public void trimAllToSize() {
		for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
			for (MappedVertexBufferObject object : vbos) {
				object.trimToSize();
			}
		}
	}

	public void delete() {
		for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
			for (MappedVertexBufferObject object : vbos) {
				object.delete();
			}
		}
	}

	@Override
	public String toString() {
		int[] vboCounts = new int[frameToBufferObjects.length];
		for (int i = 0; i < vboCounts.length; i++) {
			vboCounts[i] = frameToBufferObjects[i].size();
		}
		return this.getClass().getSimpleName() + "[" + Arrays.toString(vboCounts) + "]";
	}
}


import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.opengl.*;

public class MappedVertexBufferObject {
	private final int glTarget, glUsage;
	private final int handle;
	private int requestedSize, allocatedSize;
	private boolean isMapped;
	private final boolean unsync;

	private static MappedVertexBufferObject bound;

	public MappedVertexBufferObject(int glTarget, int glUsage, boolean unsync) {
		this.glTarget = glTarget;
		this.glUsage = glUsage;
		this.unsync = unsync;
		this.handle = glGenBuffers();
	}

	public void bind() {
		if (bound == this) {
			throw new IllegalStateException("already bound");
		}
		bound = this;

		glBindBuffer(glTarget, this.handle);
	}

	public void ensureSize(int size) {
		assert size > 0;
		if (bound != this) {
			throw new IllegalStateException("not bound");
		}

		requestedSize = size;
		if (size > allocatedSize) {
			glBufferData(glTarget, size, glUsage);
			allocatedSize = size;
		}
	}

	public void trimToSize() {
		if (bound != this) {
			throw new IllegalStateException("not bound");
		}

		if (requestedSize != allocatedSize) {
			glBufferData(glTarget, requestedSize, glUsage);
			allocatedSize = requestedSize;
		}
	}

	public void orphan() {
		if (bound != this) {
			throw new IllegalStateException("not bound");
		}

		glBufferData(glTarget, 0, glUsage);
		allocatedSize = requestedSize = 0;
	}

	public ByteBuffer map() {
		if (bound != this) {
			throw new IllegalStateException("not bound");
		}
		if (requestedSize == 0) {
			throw new IllegalStateException("no data");
		}
		if (isMapped) {
			throw new IllegalStateException("already mapped");
		}
		isMapped = true;

		long offset = 0;
		long length = requestedSize;

		if (GLContext.getCapabilities().OpenGL30) {
			int access = GL30.GL_MAP_WRITE_BIT;
			if (unsync) {
				access |= GL30.GL_MAP_UNSYNCHRONIZED_BIT;
			}
			return GL30.glMapBufferRange(glTarget, offset, length, access, null);
		}

		if (GLContext.getCapabilities().GL_ARB_map_buffer_range) {
			int access = ARBMapBufferRange.GL_MAP_WRITE_BIT;
			if (unsync) {
				access |= ARBMapBufferRange.GL_MAP_UNSYNCHRONIZED_BIT;
			}
			return ARBMapBufferRange.glMapBufferRange(glTarget, offset, length, access, null);
		}

		int access = GL_WRITE_ONLY;
		return glMapBuffer(glTarget, access, null);
	}

	public void unmap() {
		if (bound != this) {
			throw new IllegalStateException("not bound");
		}
		if (!isMapped) {
			throw new IllegalStateException("not mapped");
		}
		isMapped = false;

		glUnmapBuffer(glTarget);
	}

	public void delete() {
		if (bound == this) {
			throw new IllegalStateException("still bound");
		}
		if (isMapped) {
			throw new IllegalStateException("still mapped");
		}

		glDeleteBuffers(handle);
	}

	public static void unbind(int glTarget) {
		if (bound == null) {
			throw new IllegalStateException("none bound");
		}

		glBindBuffer(glTarget, 0);
		bound = null;
	}
}

I wouldn’t recommend that. The memory usage will skyrocket for heavy geometry. What’s wrong with just dumping all the data into a single large VBO instead?

Sure, it trades verification overhead for VRAM. I thought that was rather obvious…?

Another problem with streaming data through 1 large VBO is that the CPU spends long periods generating the data, during which the GPU receives no instructions. Eventually the GPU ends up idling, waiting for the glUnmapBuffer() or the next glDrawElements(…) call. It’s best to chunk the data you send to the GPU, keeping both the CPU and the GPU busy at all times (see the sketch below); the code I provided helps tremendously with that. Keep in mind that the usual glMapBuffer(…) call has such a big overhead that even with only 64 calls per frame you are likely to get under 100fps on the latest hardware, even with VBO toggling / pooling.
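
To illustrate, interleaving uploads and draw calls with the provider class from this post could look roughly like this (a minimal sketch; chunkCount, bytesPerChunk, vertsPerChunk, stride and fillChunk(…) are placeholders for whatever your renderer uses):

// provider.nextFrame() has already been called once at the start of this frame
for (int chunk = 0; chunk < chunkCount; chunk++) {
	MappedVertexBufferObject vbo = provider.nextVBO(); // binds a VBO that was last used 6 frames ago
	vbo.ensureSize(bytesPerChunk);

	ByteBuffer mapped = vbo.map(); // unsynchronized: no driver-side verification
	fillChunk(mapped, chunk);      // placeholder: write this chunk's vertex data
	vbo.unmap();

	glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
	glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);
	glDrawArrays(GL_TRIANGLES, 0, vertsPerChunk); // the GPU can start on this chunk right away
}

This way the GPU can already be rendering chunk N while the CPU is still filling chunk N+1, instead of sitting idle until one huge buffer is unmapped.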

I’d be interested to see how it compares to vertex arrays. :)

Rendering 16 small triangles, 64x per frame

[table]
[tr][td]VA[/td][td]gl*Pointer(*Buffer)[/td][td]880fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBuffer(WRITE)[/td][td]45fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange(WRITE)[/td][td]45fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange(WRITE | INVALIDATE_BUFFER)[/td][td]86fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange(WRITE | INVALIDATE_BUFFER) + orphaning[/td][td]135fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange(WRITE | UNSYNCHRONIZED)[/td][td]1450fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange(WRITE | UNSYNCHRONIZED) + orphaning[/td][td]1310fps[/td][/tr]
[tr][td]VBO[/td][td]glMapBufferRange()[/td][td]780fps[/td][/tr]
[tr][td]VBO[/td][td]glBufferData()[/td][td]740fps[/td][/tr]
[/table]

I remember reading that you could effectively detach data from a VBO by calling glBufferData(target, null). The GPU would continue using the previous block of data to complete rendering, but any data manipulation commands would use the new block assigned with the glBufferData call.

Have you tried this at all?

Yup, it’s called orphaning, and it isn’t remotely as fast as UNSYNC. I’d have to benchmark it for absolute numbers though.

Update
Added orphaning performance to stats in previous post.
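
For reference, orphaning with LWJGL looks roughly like this (a minimal sketch; handle and sizeInBytes are placeholders):

// re-specify the buffer's data store before writing: the driver detaches the old
// storage (which the GPU may still be reading) and hands out a fresh block,
// so the subsequent map doesn't have to wait for the GPU
glBindBuffer(GL_ARRAY_BUFFER, handle);
glBufferData(GL_ARRAY_BUFFER, sizeInBytes, GL_STREAM_DRAW); // same size, no data: orphan the old store
ByteBuffer mapped = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY, null);
// ... fill 'mapped' ...
glUnmapBuffer(GL_ARRAY_BUFFER);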

Also, have you seen how Cas’ sprite renderer works? It uses multiple VBOs too.

Yeah, I know all about it; I’ve been optimizing the SpriteEngine.

I doubt that your GPU will starve unless you freeze up for more than a frame or have some kind of synchronization (clarification: reading back data from the GPU). Are you sure this is really a problem? You can check GPU usage with GPU-Z.

Whether it’s a problem depends on how you use OpenGL, and it’s very rarely not a problem. In tech demos it’s unlikely to be an issue: in, say, your particle engine code the bottleneck is clearly the GPU, so you just feed it some data (if any) and it will chug along. A real game is a lot more complex than that; a batch of geometry (between state changes) is much less likely to keep the GPU busy for long, so you avoid batching up your uploads, keeping the GPU busy while you’re calculating/generating more dynamic data.

o_O Are you really sure? I’m not questioning the use of multiple VBOs, since there’s clear evidence that mapping a buffer is costly, just your statement that mapping a buffer for a long time causes stalling. For example, in my CPU-based particle system I map almost 12MB of data each frame. I find it hard to believe that I would gain performance (it’s CPU-limited) by splitting up this upload and drawing it in batches.

If you do 1 glMapBuffer(…) per frame, then there is nothing (well, little*) to be gained. As I said earlier, your tech demo is simple: it doesn’t have state changes, you just shove geometry at the GPU and let it work on it. If you have N different types of particles that all have their own state, and you need depth-sorting at the same time… things become vastly more complex.

  • If I vastly simplify the numbers, I get 64 glMapBuffer(…) calls at 45fps, which means each glMapBuffer call has an overhead of roughly 0.38ms; this can be reduced by a factor of 30-60 by using glMapBufferRange(…, UNSYNC, …)

glMapBuffer(…) @45fps -> GPU-Z says 41% GPU load
glMapBufferRange(…) @1450fps -> GPU-Z says 99% GPU load

Clearly GPU-Z’s GPU Load statistic is not a good indicator of GPU efficiency (maybe it too is in a busy loop, waiting for data from the CPU), as with a load of 41% you’d expect to get ~600fps, not 45fps.

Yes, but what’s stopping me from just putting all those particles into a single VBO and drawing parts of that VBO with multiple draw calls, just like Cas’ sprite renderer is doing now? The actual uploading would be exactly the same as my particle test, but the draw calls would be different. Wouldn’t that make the performance of glMapBuffer() irrelevant since it’s only called once per frame? The draw calls should stay the same, but you’ll also have to switch VBOs between each draw call with your approach.

A GPU is way too complex to measure load accurately, but the relative load can be very informative. If GPU load isn’t close to 100%, you have a bottleneck somewhere else: it could be a VRAM bandwidth bottleneck, a CPU bottleneck, or any of a lot of other things. The only thing it tells you is that the stream processors/CUDA cores aren’t processing at 100% capacity.

EDIT: I’m also very interested in the actual results for the sprite rendering engine, since I once tried to optimize it as well.

In memory-limited situations, are vertex arrays the best choice then?

It depends. But for dynamic content, it’s a quick and dirty solution.

Very minor point, but my personal opinion would be to convert all the contract checks into asserts, as this isn’t targeting general-purpose black-box usage.

Can you optimise it for the dodgy jitter on the Xbox while you’re at it as well, please :)