Immediate mode rendering is dead

Long live VBOs!

So, having more or less doubled the speed of my sprite engine merely by switching to VBO-based rendering instead of traditional heap-based DirectByteBuffers, I now find that as soon as I do any immediate mode rendering in OpenGL, performance plummets right back down to terrible.

It would appear that VBOs are kind of an all-or-nothing approach; any immediate mode rendering pretty much buggers the whole advantage of VBO usage up. So now I have to port all my text rendering, background rendering, capacitor zap rendering, building under attack rendering, and powerup beam in effect rendering to use VBOs instead of immediate mode. Bah.

Cas :slight_smile:

Thanks for the tip. :slight_smile:

Does this stuff ever get publicly documented by driver writers?

Could you explain what you mean by a ‘heap based DirectByteBuffer’? A DirectByteBuffer has its pointer outside of the heap. Naturally, the DirectByteBuffer instance will be on the heap, but that’s also the case with the object returned from glMapBuffer().

glMapBuffer() will also stall the GPU until glUnmapBuffer() is called. So when you are filling your ByteBuffer, which might take some time, your GPU will be idling. No such problem with glBufferDataARB.

http://www.spec.org/gwpg/gpc.static/vbo_whitepaper.html

[quote=“Riven,post:4,topic:34805”]
I think he means a direct ByteBuffer that points to JVM-allocated memory, versus a direct ByteBuffer that points to driver-allocated memory (that’s returned from glMapBuffer), which may be used for faster GPU transfers.

[quote=“Riven,post:4,topic:34805”]
That’s implementation and usage dependent. It may stall or it may not stall. Cas may be lucky and his driver is clever enough to queue a copy instead of stalling the GPU. That’s why I suggested in his other topic that he should explore an algorithmic solution first (using shaders, instancing etc), before trying a simple switch to VBOs (which he should do anyway).

  1. Filling both has exactly the same performance.
  2. I wonder how the CPU=>GPU copy could be faster for the driver-allocated memory. It might be guaranteed to be page-aligned, but we can guarantee the same with a JVM-allocated ByteBuffer. The check for alignment should be negligible compared to the data copy.

Wouldn’t a copy (malloc) by the driver be much slower than providing your own ByteBuffer?

Any benchmarks to share?

Heap-based direct byte buffer is that which is returned by ByteBuffer.allocateDirect() - it’s still on the C heap (not the Java object heap). It’s possible to create direct bytebuffers outside of the C heap in native code but not pure Java code. This is how the old NV_fence and AGP RAM allocation used to work.
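To pin the terminology down, here is a minimal plain-Java sketch of the two kinds of buffer you can get without native code. The class and method names (`DirectVsHeap`, `direct`, `onHeap`) are made up for illustration; only the `java.nio` calls are real.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DirectVsHeap {
    /** Direct buffer: storage lives outside the Java object heap, in native byte order. */
    static ByteBuffer direct(int bytes) {
        return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }

    /** Non-direct buffer: storage is a byte[] on the Java object heap. */
    static ByteBuffer onHeap(int bytes) {
        return ByteBuffer.allocate(bytes);
    }

    public static void main(String[] args) {
        ByteBuffer d = direct(1024);
        ByteBuffer h = onHeap(1024);
        System.out.println("allocateDirect: isDirect=" + d.isDirect());
        System.out.println("allocate:       isDirect=" + h.isDirect() + ", hasArray=" + h.hasArray());
    }
}
```

As far as I know, LWJGL's `BufferUtils.createByteBuffer()` used later in this thread does the same thing as `direct()` here, but treat that as an assumption and check the source of your LWJGL version.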

glMapBuffer() will also return a pointer to an address in the process’s address space that is not on the heap; it’ll be some weirdy location provided by the driver. Now, if I were calling glMapBuffer() on a buffer that overlapped some bit of memory the driver was currently trying to render, then I’d cause a GPU stall most likely. However, because of the way I do rendering - I map, rapidly fill the buffer with data, unmap, and then start rendering from it - I’m unlikely to cause any GPU stalling. In fact if the driver’s worth its salt it’d be batching my state change calls and processing them asynchronously with the buffer DMA, effectively making nearly all the calls return immediately.

If I find that filling the byte buffer is taking too long I could double buffer my geometry data - that is, alternately swap between two VBOs. I might yet do this anyway.
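The bookkeeping for alternating between two VBOs is tiny. A rough sketch, assuming the buffer names were created elsewhere (the class `VboRing` and the ids 7 and 8 are placeholders, not real GL objects):

```java
class VboRing {
    private final int[] handles; // buffer object names, e.g. as returned by glGenBuffersARB
    private int next;

    VboRing(int... handles) {
        if (handles.length == 0) {
            throw new IllegalArgumentException("need at least one buffer");
        }
        this.handles = handles.clone();
    }

    /** Returns the buffer to fill this frame and advances to the next one. */
    int acquire() {
        int handle = handles[next];
        next = (next + 1) % handles.length;
        return handle;
    }

    public static void main(String[] args) {
        VboRing ring = new VboRing(7, 8); // placeholder ids for illustration only
        for (int frame = 0; frame < 4; frame++) {
            int vbo = ring.acquire();
            // In a real renderer: bind vbo, map, fill, unmap, then draw from it.
            System.out.println("frame " + frame + " fills buffer " + vbo);
        }
    }
}
```

Whether this actually buys anything depends on the driver; the idea is only that the buffer being filled is never the one the previous frame is still drawing from.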

Cas :slight_smile:

Not so; a driver-provided pointer to a strange address over the bus somewhere can completely bypass clientside memory caches and thus eliminate cache pollution, a major factor in slowdown when copying data up to the card. Even the piddly 125KB of vertex data I copy up per frame buggers most CPUs’ L1 caches.

At least that’s my understanding of it, and seeing as everything’s twice as fast since I made the change, I assume something good is happening :slight_smile:

Cas :slight_smile:

Again, this depends on the usage flags and how clever the driver is, but in my instance, I’m doing GL_STREAM_DRAW and GL_WRITE_ONLY, one of the optimally easy cases for allocation: the driver never needs to return the same pointer twice, or even more cleverly, it can provide the same clientside address space pointer, but pointing to a completely different serverside memory location, so that I can write to it unhindered.

Cas :slight_smile:

(I was writing this while Cas posted his 3 replies above, posting anyway)

[quote=“Riven,post:6,topic:34805”]
Yes. But as I said in the other topic, for the amount of data Cas is generating per-frame, filling the buffer shouldn’t be affecting the game’s performance in any significant way.

[quote=“Riven,post:6,topic:34805”]
I’m not sure, but I think the GPU cannot perform DMA transfers from arbitrary memory locations. For example, on AGP cards, you need to use memory that’s been reserved and allocated from the GART for the card to be able to use DMA. The JVM can’t do that for you, but glMapBuffer can. The GART memory linearization has been moved to the GPU/driver on PCIe cards, but I think the same issue remains.

[quote=“Riven,post:6,topic:34805”]
The GPU may have allocated a local copy of the VBO in GPU memory and do some kind of double buffering. Instead of stalling on glMapBuffer, it may continue rendering from the local copy, then update the VBO when rendering is done. It can’t do that with user-allocated memory, not without having 3 copies of the VBO data. For example:

  1. glBufferSubData is called, a user-allocated buffer is supplied.
  2. the driver copies the user-allocated buffer somewhere else in system memory (possibly DMAable).
  3. the system-memory => GPU-memory copy is performed to update the GPU-allocated VBO.

That’s 3 copies of the VBO data. The driver cannot avoid #2, because the user may modify the user-allocated buffer data before the GPU transfer is performed. In the glMapBuffer case, data modification is controlled, you can’t change anything outside a map/unmap pair or without a glBuffer(Sub)Data call. So, glMapBuffer gives you direct access to the #2 memory.

That’s all assuming the driver does the double buffer copy. If not, the GPU stalls and you have the same performance. Possibly slightly better with glMapBuffer in case what I wrote above about DMA isn’t bullshit. Or well, the data transfer shouldn’t be the bottleneck (very little data according to Cas, very high bandwidth available on modern cards), but stalling the GPU can be. glMapBuffer is supposed to help with avoiding such stalls (that’s why the WRITE_ONLY flag exists).
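To illustrate why copy #2 can't be skipped when the transfer is deferred, here is a toy model in plain Java. It is not how any real driver works; `DeferredUploader` and its methods are invented purely to show the aliasing problem.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy stand-in for a driver that performs uploads later, not at call time. */
class DeferredUploader {
    private final Queue<ByteBuffer> pending = new ArrayDeque<>();
    private final boolean snapshot;

    DeferredUploader(boolean snapshot) {
        this.snapshot = snapshot;
    }

    /** Queues an upload; with snapshot=true the bytes are copied right away (step 2). */
    void bufferSubData(ByteBuffer userData) {
        ByteBuffer view = userData.duplicate(); // shares the caller's bytes
        if (snapshot) {
            ByteBuffer copy = ByteBuffer.allocateDirect(view.remaining());
            copy.put(view).flip();
            pending.add(copy);
        } else {
            pending.add(view);
        }
    }

    /** Performs the deferred transfer (step 3) and returns the first byte that arrived. */
    byte flushAndPeek() {
        ByteBuffer src = pending.remove();
        return src.get(src.position());
    }

    public static void main(String[] args) {
        ByteBuffer user = ByteBuffer.allocateDirect(4);
        user.put(0, (byte) 1);

        DeferredUploader safe = new DeferredUploader(true);
        DeferredUploader unsafe = new DeferredUploader(false);
        safe.bufferSubData(user);
        unsafe.bufferSubData(user);

        user.put(0, (byte) 99); // caller reuses its buffer before the transfer runs

        System.out.println(safe.flushAndPeek());   // prints 1
        System.out.println(unsafe.flushAndPeek()); // prints 99
    }
}
```

Without the snapshot, the "GPU" sees whatever the caller scribbled into the buffer afterwards. With a map/unmap pair the caller can only write inside a controlled window, which is the point made above.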

[quote=“Riven,post:6,topic:34805”]
No, sorry. I may be talking out of my ass here, it’s all based on random stuff I’ve read around and how I think GPUs/drivers work. I’ve said this before on the LWJGL forums, I haven’t used glMapBuffer more than once, I even replaced it at some point with a better solution. My other warning also applies, VBOs are very GPU/vendor/driver sensitive, you cannot be sure that a rendering setup that performs great on your machine will be anywhere close to optimal for other machines too. On the other hand, I haven’t done any VBO tests lately, drivers with VBO support have matured and things may be much better now.

Anyway, I’d still be interested in a comparison of map/unmap with glBufferSubData in the context of a sprite engine. Cas should be able to test this very quickly.

2048 tiny triangles (to take fillrate out of the equation)

glBufferDataARB => 50ms


                  for (int i = 0; i < 256; i++)
                  {
                     javaSideBuffer.clear();
                     glBufferDataARB(GL_ARRAY_BUFFER_ARB, javaSideBuffer, GL_STREAM_DRAW_ARB);

                     glVertexPointer(3, GL_FLOAT, 0, 0);
                     glColorPointer(3, GL_FLOAT, 0, byteCount >> 1);
                     glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);
                  }

glBufferSubData() => 50ms


                  glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

                  for (int i = 0; i < 256; i++)
                  {
                     javaSideBuffer.clear();
                     glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, javaSideBuffer);

                     glVertexPointer(3, GL_FLOAT, 0, 0);
                     glColorPointer(3, GL_FLOAT, 0, byteCount >> 1);
                     glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);
                  }

glMapBuffer() => 32ms


                  glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

                  for (int i = 0; i < 256; i++)
                  {
                     ByteBuffer driverSideBuffer;

                     driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, null);
                     javaSideBuffer.clear();
                     driverSideBuffer.put(javaSideBuffer);
                     glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

                     glVertexPointer(3, GL_FLOAT, 0, 0);
                     glColorPointer(3, GL_FLOAT, 0, byteCount >> 1);
                     glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);
                  }
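A side note on the `clear()` calls in these loops, since they are easy to misread: `clear()` does not erase anything, it only resets position and limit so that the following bulk `put()` copies the whole buffer. A small sketch with plain direct buffers standing in for the mapped one (`BulkCopy` and `copyAll` are names made up for this example):

```java
import java.nio.ByteBuffer;

class BulkCopy {
    /** Copies all of src into dst, resetting both positions first. */
    static void copyAll(ByteBuffer src, ByteBuffer dst) {
        src.clear(); // position = 0, limit = capacity; contents untouched
        dst.clear();
        dst.put(src); // relative bulk put: advances both positions to the end
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(8);
        ByteBuffer dst = ByteBuffer.allocateDirect(8);
        src.putFloat(1.5f).putFloat(2.5f); // position is now 8, remaining() is 0

        copyAll(src, dst);
        System.out.println(dst.getFloat(0) + " " + dst.getFloat(4)); // prints 1.5 2.5
        System.out.println("src remaining after copy: " + src.remaining()); // 0
    }
}
```

Skip the `clear()` on the source and the second iteration copies nothing, because the first `put()` already moved its position to the end.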

And the boring code to fill the vertex array:


            float quadSize = 5.0f;
            int quadRepeat = 32;
            int quadCount = quadRepeat * quadRepeat;
            int byteCount = quadCount
                  * 2 /* triangles per quad */
                  * 3 /* vertices per triangle */
                  * 3 /* coordinates per vertex */
                  * 2 /* vertex+color */
                  * 4 /* float_sizeof */;
            ByteBuffer javaSideBuffer = BufferUtils.createByteBuffer(byteCount);

            // vertices
            {
               for (int x = 0; x < quadRepeat; x++)
               {
                  for (int y = 0; y < quadRepeat; y++)
                  {
                     float x0 = (x + 1) * quadSize;
                     float y0 = (y + 1) * quadSize;
                     float x1 = (x + 2) * quadSize;
                     float y1 = (y + 2) * quadSize;

                     javaSideBuffer.putFloat(x0).putFloat(y0).putFloat(0.0f);
                     javaSideBuffer.putFloat(x1).putFloat(y0).putFloat(0.0f);
                     javaSideBuffer.putFloat(x1).putFloat(y1).putFloat(0.0f);
                     javaSideBuffer.putFloat(x1).putFloat(y1).putFloat(0.0f);
                     javaSideBuffer.putFloat(x0).putFloat(y1).putFloat(0.0f);
                     javaSideBuffer.putFloat(x0).putFloat(y0).putFloat(0.0f);
                  }
               }
            }

            // colors
            {
               for (int x = 0; x < quadRepeat; x++)
               {
                  for (int y = 0; y < quadRepeat; y++)
                  {
                     float[] cornerA = new float[] { 1, 0, 0 }; // red
                     float[] cornerB = new float[] { 0, 1, 0 }; // green
                     float[] cornerC = new float[] { 0, 0, 1 }; // blue
                     float[] cornerD = new float[] { 1, 1, 0 }; // yellow

                     for (float v : cornerA)
                        javaSideBuffer.putFloat(v);
                     for (float v : cornerB)
                        javaSideBuffer.putFloat(v);
                     for (float v : cornerC)
                        javaSideBuffer.putFloat(v);

                     for (float v : cornerC)
                        javaSideBuffer.putFloat(v);
                     for (float v : cornerD)
                        javaSideBuffer.putFloat(v);
                     for (float v : cornerA)
                        javaSideBuffer.putFloat(v);
                  }
               }
            }
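For anyone checking the numbers, the sizes implied by the fill code work out as below. This is just the arithmetic from `byteCount` restated; the class `VertexLayout` and its method names are invented for the example.

```java
class VertexLayout {
    static final int FLOAT_BYTES = 4;

    /** Vertices drawn: 2 triangles per quad, 3 vertices per triangle. */
    static int vertexCount(int quadCount) {
        return quadCount * 2 * 3;
    }

    /** Bytes for one attribute block (positions or colors): 3 floats per vertex. */
    static int attributeBytes(int quadCount) {
        return vertexCount(quadCount) * 3 * FLOAT_BYTES;
    }

    /** Total bytes: the positions block followed by the colors block. */
    static int totalBytes(int quadCount) {
        return 2 * attributeBytes(quadCount);
    }

    public static void main(String[] args) {
        int quadCount = 32 * 32;
        System.out.println("triangles:    " + quadCount * 2);                 // 2048
        System.out.println("vertices:     " + vertexCount(quadCount));        // 6144
        System.out.println("total bytes:  " + totalBytes(quadCount));         // 147456
        System.out.println("color offset: " + (totalBytes(quadCount) >> 1));  // 73728
    }
}
```

So each of the 256 iterations uploads 144 KiB, and the `byteCount >> 1` passed to `glColorPointer` is simply the start of the colors block.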

Could you try reusing the driverSideBuffer (store the reference and pass as the last argument to glMapBuffer)? Also, could you download a fresh LWJGL build and try the new glMapBuffer API (with an explicit length argument)?

reusing glMapBuffer() => 32ms


                  glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

                  ByteBuffer driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, null);

                  for (int i = 0; i < 256; i++)
                  {
                     javaSideBuffer.clear();
                     driverSideBuffer.clear();
                     driverSideBuffer.put(javaSideBuffer);
                     glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

                     glVertexPointer(3, GL_FLOAT, 0, 0);
                     glColorPointer(3, GL_FLOAT, 0, byteCount >> 1);

                     glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);

                     driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, driverSideBuffer);
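                  }

                  // The loop ends with the buffer mapped again, so unmap it once more.
                  glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);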

See above.

I downloaded it, put it in the build path, and I couldn’t find a version of glMapBuffer with an explicit length parameter :persecutioncomplex:

Hrr, I hate the JVM… I had a version that used alternating buffers that took 27ms, and when I launched it again, it was back at 32ms :frowning:
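Swings like that between launches make single timings hard to trust. One common way to steady them is to run the loop a few times untimed and then take the median of several timed runs. A minimal sketch, where `MicroBench` and `medianNanos` are hypothetical names and the task in `main` is a dummy workload:

```java
import java.util.Arrays;

class MicroBench {
    /** Runs the task `warmup` times untimed, then returns the median of `runs` timings in ns. */
    static long medianNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Dummy workload; for the GL tests the task would be one 256-iteration loop,
        // probably followed by glFinish() so queued GPU work is included in the timing.
        long[] sink = new long[1];
        long ns = medianNanos(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                sink[0] += i;
            }
        }, 5, 11);
        System.out.println("median: " + ns / 1_000_000.0 + " ms");
    }
}
```

This won't remove driver-side variance, but it should make a 27ms versus 32ms difference easier to call real or noise.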

[quote=“Riven,post:15,topic:34805”]
There’s a glMapBufferARB(int target, int access, long length, ByteBuffer old_buffer) in ARBBufferObject, isn’t there?

Right, I had to restart Eclipse.

Slightly better.

glMapBuffer(…, length, …) => 30ms


               glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

               ByteBuffer driverSideBuffer = null;

               for (int i = 0; i < 256; i++)
               {
                  driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, byteCount, driverSideBuffer);

                  javaSideBuffer.clear();
                  driverSideBuffer.clear();
                  driverSideBuffer.put(javaSideBuffer);
                  glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

                  glVertexPointer(3, GL_FLOAT, 0, 0);
                  glColorPointer(3, GL_FLOAT, 0, byteCount >> 1);
                  glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);
               }


Cool, thanks.