Slow Vertex Arrays - NIO Evil??

In the constant pursuit of improving the performance of my engine, I discovered something very surprising - vertex arrays aren't actually any faster than immediate mode rendering.

For shits and giggles, I removed the call to the vertex arrays and called the immediate mode renderer method instead. My FPS actually increased by 1!

Here is the Array code:


        // Vertices
        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
        gl.glVertexPointer(3, //3 components per vertex (x,y,z)
            GL.GL_FLOAT, 0, vertex_buffer);

        // Color 3 (OPTIMIZATION ASSUMPTION: Only 1 color per object)
        if (color_buffer3 != null) {
          gl.glColor3f(color_buffer3.get(0),color_buffer3.get(1), color_buffer3.get(2));
        }

        // Color 4
        if (color_buffer4 != null) {
          gl.glEnableClientState(GL.GL_COLOR_ARRAY);
          gl.glColorPointer(4, //4 components per color (r,g,b,a)
              GL.GL_FLOAT, 0, color_buffer4);
        }

        // Texture coordinates 2D
        if (coords_buffer1 != null) {
          gl.glEnableClientState(GL.GL_TEXTURE_COORD_ARRAY);
          gl.glTexCoordPointer(2, //2 components coord
              GL.GL_FLOAT, 0, coords_buffer1);
        }

        // Normals
        if (normal_buffer != null) {
          gl.glEnableClientState(GL.GL_NORMAL_ARRAY);
          gl.glNormalPointer(GL.GL_FLOAT, 0, normal_buffer);
        }

        gl.glDrawArrays(GL.GL_TRIANGLES, 0, vertex_buffer.limit()/3);

        // Reset client state
        if (normal_buffer != null)
          gl.glDisableClientState(GL.GL_NORMAL_ARRAY);

        if (coords_buffer1 != null)
          gl.glDisableClientState(GL.GL_TEXTURE_COORD_ARRAY);

        if (color_buffer4 != null) // only the 4-component path enabled GL_COLOR_ARRAY above
          gl.glDisableClientState(GL.GL_COLOR_ARRAY);

        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);

This is the immediate mode renderer code:


        // Render
        gl.glBegin(GL.GL_TRIANGLES);

        // engine only uses one color per geo object, so all verts should
        // have the same color, UNLESSS multi_color is explicitly set
        if (color_buffer3 != null && !multi_color) {
          gl.glColor3f(color_buffer3.get(0), color_buffer3.get(1),
              color_buffer3.get(2));
        }
        if (color_buffer4 != null && !multi_color) {
          gl.glColor4f(color_buffer4.get(0), color_buffer4.get(1),
              color_buffer4.get(2), color_buffer4.get(3));
          colors4 += 4;
        }

        for (int i = 0; i < vertex_buffer.limit(); i += 3) {

          if (multi_color) {
            if (color_buffer3 != null) {
              gl.glColor3f(color_buffer3.get(i), color_buffer3.get(i + 1),
                  color_buffer3.get(i + 2));
            }
          }

          if (normal_buffer != null)
            gl.glNormal3f(normal_buffer.get(i), normal_buffer.get(i + 1), normal_buffer.get(i + 2));

          if (app != null && app.getTexture() != null)
            if (coords_buffer1 != null) {
              gl.glTexCoord2f(coords_buffer1.get(texInd), coords_buffer1
                  .get(texInd + 1));
              texInd += 2;
            }

          gl.glVertex3f(vertex_buffer.get(i), vertex_buffer.get(i + 1), vertex_buffer.get(i + 2));

        }
        gl.glEnd();

The effect of both is identical in terms of what they render (actually the immediate mode version can handle colors per vertex… but I don't use them anyway) and how they get their data, yet the immediate mode version runs 1 FPS faster!

Is this an evil of NIO? Is it that, under the hood, copying the data to the card from the NIO buffer takes just as long as calling the glXXXXX methods anyway?

I’m not sure what’s different about your code, but I notice a massive drop in framerate when I stop using glDrawArrays.

Hmm… Maybe it's something to do with my video card? GeForce FX 5500, latest drivers. I find that unlikely though, because nVidia is pretty hardcore about optimizing, and vertex arrays have been around since OpenGL 1.1. I would think they would have optimized them as much as possible long ago.

How many polygons are in your array, and how are you allocating it?

It's actually several arrays that compose the objects on screen. They range from about 100-1000 verts each, plus corresponding normals and texture coordinates, all sent as arrays. Each is allocated as a direct ByteBuffer, then transformed into a FloatBuffer (long before the rendering stage).

I tried using ByteBuffer.allocateDirect(size).asFloatBuffer() early on and it wasn't working for me… so I found the BufferUtils class in JOGL. Now I'm using BufferUtils.newFloatBuffer(size) instead.

I have no idea if it makes a difference or not.

That method does the same thing - allocates a direct ByteBuffer, then gets a FloatBuffer from that. If doing it yourself wasn't working, you were probably forgetting to set the ByteBuffer to native byte order.
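
For reference, here's a minimal sketch of the manual equivalent (the helper name just mirrors the BufferUtils method; the only extra work it hides is the 4-bytes-per-float math and the order() call):

        import java.nio.ByteBuffer;
        import java.nio.ByteOrder;
        import java.nio.FloatBuffer;

        // Manual equivalent of BufferUtils.newFloatBuffer(numFloats): a direct
        // ByteBuffer, switched to the platform's native byte order, viewed as a
        // FloatBuffer. Skipping the order() call is what makes it "not work".
        static FloatBuffer newFloatBuffer(int numFloats) {
          return ByteBuffer.allocateDirect(numFloats * 4)  // 4 bytes per float
              .order(ByteOrder.nativeOrder())              // native order, or GL reads garbage
              .asFloatBuffer();
        }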

Ok… so with buffer.order(ByteOrder.nativeOrder()); added into the code, it now works just as well as the BufferUtils method.

If your ByteBuffers are direct buffers, there shouldn't be any copying. That's kind of the point of these things :)
It's really strange that the frame rate actually increases when rendering without vertex arrays, since you're doing many extra JNI calls. The overhead of those must not be as big as I thought. Maybe you could use the TraceGL pipeline to check whether you're making unnecessary calls in the vertex array case, or DebugGL to check for errors?
Something else you could try is reducing the number of glEnable/DisableClientState calls you are making. Calling these methods probably triggers a state validation in the driver, which might slow things down. The enabling and disabling of the vertex array, for instance, seems pointless since you are always using it.
IIRC, going from glVertex3f to glDrawArrays made a big difference in my rendering code. Something that can give you a boost as well is using triangle strips and/or fans instead of plain triangles (I don't know if this is feasible in your specific case, of course).
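
Dropping those in is just a one-liner in init(). A rough sketch, assuming a JSR-231-style JOGL where the wrappers live in javax.media.opengl (older releases have the same classes under net.java.games.jogl):

        import javax.media.opengl.DebugGL;
        import javax.media.opengl.GLAutoDrawable;
        import javax.media.opengl.TraceGL;

        // DebugGL throws a GLException the moment glGetError() reports a problem;
        // TraceGL logs every GL call, which makes redundant per-frame
        // glEnable/DisableClientState calls easy to spot.
        public void init(GLAutoDrawable drawable) {
          drawable.setGL(new DebugGL(drawable.getGL()));
          // or, to log every call instead:
          // drawable.setGL(new TraceGL(drawable.getGL(), System.err));
        }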

Doing a pure glDrawArrays direct from system RAM is the worst way to use vertex arrays. It's probably doing a copy from system RAM into AGP RAM and thence onwards to the card, every frame; and what's more, if you're filling the system RAM buffers up with data every frame, you've shafted your cache unnecessarily. You need to be using, at the very least, glDrawRangeElements; EXT_compiled_vertex_array is a crap extension and shouldn't be used any more. NV_vertex_array_range2 is the next best solution on Nvidia cards. And the very best way to do it is Vertex Buffer Objects.
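
Roughly, the VBO path looks like this - a sketch only, assuming an OpenGL 1.5 driver and JSR-231-era JOGL entry points (exact method signatures and ARB suffixes vary between JOGL releases):

        // Setup (once): upload the vertex data into a buffer object so the
        // driver can keep it in AGP/video memory instead of copying the
        // direct buffer out of system RAM every frame.
        int[] ids = new int[1];
        gl.glGenBuffers(1, ids, 0);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, ids[0]);
        gl.glBufferData(GL.GL_ARRAY_BUFFER,
            vertex_buffer.limit() * 4,                 // size in bytes
            vertex_buffer, GL.GL_STATIC_DRAW);

        // Render (per frame): bind and point at a byte offset inside the VBO
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, ids[0]);
        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
        gl.glVertexPointer(3, GL.GL_FLOAT, 0, 0);      // byte offset, not a Buffer
        gl.glDrawArrays(GL.GL_TRIANGLES, 0, vertex_buffer.limit() / 3);
        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);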

What's more, this bit of code could quite easily just be a microbenchmark anyway, with all the silliness that entails. No mention of overheads, number of vertices, pixels drawn, fill rates, etc. It's a testament to Nvidia's driver writers that they've got immediate mode as fast as plain vertex arrays in your special case.

Cas :)

I'm curious as to what makes glDrawRangeElements inherently more efficient than glDrawArrays.

The way glDrawArrays works is that you are effectively saying, "all this data needs to be drawn, entirely, now" - but rarely do you actually have that situation. Normally you are drawing chunks of elements from a big load of vertex data, so typically you'll be using glDrawElements anyway. But underlying it all:

glDrawArrays can be implemented by the drivers in various ways, but one of the ways it'll be optimised is to copy the system RAM data out to AGP and then let the card suck it over in its own time. It's not that glDrawArrays is slow; it's the particular kind of RAM you're using that's slow. The use of glDrawArrays won't make any difference here - it's purely down to where those buffers came from.

Now then… glDrawRangeElements specifies minimum and maximum bounds for the indices normally used by glDrawElements. This is precisely what enables the drivers to determine the range of vertex memory pointed at and copy only what actually needs to be copied over to AGP RAM (along with a few other optimisations). If that data is already in AGP RAM by virtue of NV_vertex_array_range or VBOs, then the optimisations are not nearly as significant.
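
In JOGL that ends up looking something like this (a sketch; indexBuffer, indexCount and maxVertexIndex are made-up names):

        // Same indices you'd hand to glDrawElements, plus explicit start/end
        // bounds so the driver knows exactly which slice of the vertex arrays
        // it needs to copy.
        gl.glDrawRangeElements(GL.GL_TRIANGLES,
            0,                    // start: lowest vertex index referenced
            maxVertexIndex,       // end: highest vertex index referenced
            indexCount,           // number of indices to draw
            GL.GL_UNSIGNED_INT,
            indexBuffer);         // direct IntBuffer of indices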

I’m wired on coffee and starving so I might be rambling and not making much sense here.

Cas :)

Hmm, I can’t say I completely understand yet. Given the following methods:

void glDrawArrays(GLenum mode, GLint first, GLsizei count);
void glDrawElements(GLenum mode, GLsizei count, GLenum type, const GLvoid *indices);
void glDrawRangeElements(GLenum mode, GLuint start, GLuint end, GLsizei count, GLenum type, const GLvoid *indices);

The only difference between glDrawArrays and glDraw(Range)Elements is the use of indirection via the indices array. I don't see how one of these methods could be noticeably more efficient than the other. To me it seems that they just serve different purposes. If you don't need the indexing provided by glDraw(Range)Elements, why use it in the first place?
I have to admit I’ve never done any low level OpenGL profiling/debugging so I’m only looking at this at an API level…

Thought about this a bit more…
The only way I see glDraw(Range)Elements being inherently faster than glDrawArrays is if you reuse vertices multiple times. Using the draw(Range)Elements methods you just refer to them multiple times in the indices array, whereas with drawArrays you have to repeat the vertex/color/tex coord data each time, which means you’re pushing more data to the video card. Wild guess, but maybe the driver can also reuse calculations if it encounters the same index twice?
Does this make any sense?
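
As a concrete illustration (made-up data), a quad drawn as two triangles:

        // With glDrawArrays the two shared corners have to appear twice in the
        // vertex array (6 vertices); with glDraw(Range)Elements the 4 unique
        // vertices are stored once and reused through the index array.
        float[] quadVerts = {             // 4 unique vertices (x, y, z)
            0f, 0f, 0f,   1f, 0f, 0f,
            1f, 1f, 0f,   0f, 1f, 0f
        };
        int[] quadIndices = {             // 6 indices, vertices 0 and 2 reused
            0, 1, 2,    0, 2, 3
        };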

It's specifically to do with optimising what data the drivers send over to the graphics card. If you've already decided what format and where the data is going to be by using VBO or NVVAR, it's neither here nor there - probably (unless there's some interesting "paging" or "windowing" or something going on in the hardware AGP bus). The drivers may still copy your data into a more efficient format. But in the case of plain old direct buffers of vertex data, the driver has absolute discretion over how it's going to get the data to the card, and this usually involves copying it to AGP RAM. If you are able to be more specific about what data you're going to be needing, you should use glDrawRangeElements, because then the driver doesn't need to needlessly copy tons of data or parse the entire index array to work out a min/max vertex.

Cas :)

Well, if the entire array needs to be drawn, this is supposed to be the fastest way, because it involves the least indirection, as opposed to the other GL array functions, which are better at working with portions of arrays.

glDrawRangeElements is better if you only want to use a portion of the data, but I am drawing all vertices, normals and tex coords. Using glDrawRangeElements wouldn't improve it. VBOs would, but that's VBOs, not vertex arrays, and not OpenGL 1.1.

This is not a full-blown benchmark, but it's a pretty realistic one. It's the renderer in my game for the ships and cockpit. It's a very diverse set of data, ranging from small to fairly large quantities of vertices, with different combinations of normals/no normals and tex coords/no tex coords. The test is to remove the static objects (roids, planets, etc.) and leave nothing but moving objects, then restrict all movement of the ships but allow them to rotate.

I do agree this is a testament to nVidia’s drivers, but I really don’t think this is a special case.

I have that situation all the time. My vertex arrays are all self-contained in a scene graph structure. Frustum culling determines exactly which nodes need to be drawn, and then they draw themselves entirely. This isn't a sliced buffer; each has its own. glDrawArrays isn't supposed to be faster for data transfer (though you save some); the performance boost is supposed to come from requiring fewer OpenGL calls… a lot fewer calls. The same amount of data or less will go to the card either way. What I don't understand is why it's not doing what it says on the box: faster because it uses far fewer OpenGL calls, and because cards may be able to optimize thanks to the predictability of the vertices being rendered (they are all triangles for the next 1000 verts, for example).

Although that's all true, it still doesn't apply if you do actually want to render all the buffers pointed to entirely, which I do. For a bulk render, where all the data pointed to through glVertexPointer, glNormalPointer, etc. is needed, drawArrays should be faster than drawElements… and it should be much faster than using the immediate calls. According to the red book, the only reason it would be as slow as immediate calls is if I have geometry saturation on the card, which really means the card itself has become the bottleneck… which can't be the case. Considering JNI's overhead, in Java I would expect it to outperform the immediate calls by even more than in C.

From the OpenGL man pages:

[quote]Implementations denote recommended maximum amounts of vertex and index data, which may be queried by calling glGet with argument GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. If end - start + 1 is greater than the value of GL_MAX_ELEMENTS_VERTICES, or if count is greater than the value of GL_MAX_ELEMENTS_INDICES, then the call may operate at reduced performance.
[/quote]
This bit is only mentioned in the context of glDrawRangeElements; however, it might be equally valid for glDrawArrays. Maybe you can try limiting the amount of data you are passing by making multiple calls to glDrawArrays, taking that GL_MAX_ELEMENTS_VERTICES value into account. I don't have a clue about all the memory transfers involved, but it might be worth a try :)
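
Something like this, maybe (a sketch only; I have no idea whether the chunking actually helps in practice):

        // Query the driver's recommended batch size and split the draw into
        // chunks of at most that many vertices, rounded to triangle boundaries.
        int[] maxVerts = new int[1];
        gl.glGetIntegerv(GL.GL_MAX_ELEMENTS_VERTICES, maxVerts, 0);

        int total = vertex_buffer.limit() / 3;           // total vertex count
        int batch = Math.max(3, (maxVerts[0] / 3) * 3);  // multiple of 3 vertices
        for (int first = 0; first < total; first += batch) {
          gl.glDrawArrays(GL.GL_TRIANGLES, first, Math.min(batch, total - first));
        }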

Hang on a sec: it really is a rare case to call glDrawArrays unless your scenegraph has some very peculiar characteristics, notably that the entire scene has a single rendering state (same texture, same vertex format), and I can't think of many games where this is the case. Even my crappy 2D games have tons of different rendering states operating on the same vertex data.

Cas :)