Speed of vertex arrays

I recently embarked on my first major OpenGL project: converting Art of Illusion (www.artofillusion.org) to use Jogl for interactive rendering in place of the software renderer it was using before. Following the principle of “first make it work, then worry about speed,” I initially implemented it in a very straightforward way, doing all the rendering in immediate mode. The result was about twice as fast as the pure software renderer - not terrible, but I knew it could do a lot better.

I did some reading about OpenGL optimization, and decided that vertex arrays looked like a good first thing to try. So I converted it to use arrays for the vertices and normals, and that sped it up by about 30%. That’s something, but I was hoping for a lot more.

I profiled the code, and immediately saw the problem. By far the biggest chunk of time was taken up filling the FloatBuffers that hold the arrays. In fact, building the arrays took more than twice as much time as the calls to glDrawArrays() that actually rendered them! The problem is that calling put() on a direct buffer is a very expensive operation, involving a native method call. So the actual rendering is much faster, but preparing the arrays takes so much time that the overall speed improvement is very small.

There’s also a form of put() that takes an array, so I tried using that instead. The result was much worse. In fact, this was even slower than just rendering in immediate mode! Filling in the array takes negligible time, but passing it to put() is incredibly slow.
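To make the two variants concrete, here is roughly what they looked like (simplified; v is a vertex being copied in, and numVerts is a stand-in name for the vertex count):

// Per-element version - about 30% faster than immediate mode overall:
vertBuffer.put((float) v.x);
vertBuffer.put((float) v.y);
vertBuffer.put((float) v.z);

// Bulk version - slower than immediate mode: fill a scratch array
// first, then hand the whole thing to put() in one call.
float[] scratch = new float[numVerts * 3];
// ... fill scratch from the vertex list ...
vertBuffer.put(scratch);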

Is this just a case where Java is inherently slower than a lower level language like C? Or is there some better way of building the arrays that I’m missing?

This is using the latest Jogl build, running under Java 1.4.2 on Mac OS X 10.3.8.

Thanks!

Peter

The intrinsics which implement direct buffers’ put and get methods are optimized in both the HotSpot Client and Server VMs to generate machine instructions in place of what appear to be native method calls. Take a look at the VertexArrayRange and VertexBufferObject demos in the jogl-demos workspace. The VertexArrayRange demo (which unfortunately hasn’t been ported to Mac OS X – I haven’t looked into GL_APPLE_vertex_array_range) achieves nearly the speed of C with the HotSpot client compiler, and is generally faster than C with the HotSpot server compiler, because HotSpot can generate SSE instructions for the inner loop, which a C compiler won’t do unless you compile for a specific architecture (e.g. Pentium 4 or greater). You may want to try using the BufferUtils allocation methods if you aren’t already.
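If you’re not already using them, note that a BufferUtils-style allocator boils down to plain java.nio calls; the important details are allocateDirect() and the native byte order:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// A direct buffer in the platform's native byte order is what the GL
// driver can read without any further copying or conversion.
public static FloatBuffer newFloatBuffer(int numFloats) {
  return ByteBuffer.allocateDirect(numFloats * 4)  // 4 bytes per float
                   .order(ByteOrder.nativeOrder())
                   .asFloatBuffer();
}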

How are you profiling your Java code? In the 1.4.x series of VMs, running HotSpot’s internal profiler with the -Xprof command-line option should give the most accurate information. This is a flat profiler, not a hierarchical one, but it has very low overhead and usually provides good information. Look for the time spent in compiled code in the various threads.

You may also want to look into how you’re using JOGL. If you’re developing an interactive application, you probably don’t need to use JOGL’s Animator class to drive an event loop. The HWShadowmapsSimple and InfiniteShadowVolumes demos in the jogl-demos workspace were recently converted to eliminate the use of the Animator, to show how to make a low-CPU-usage application. If you’re using one or more Animators in your app, you may be swamping the CPU.
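As a minimal sketch (listener methods stubbed out; see those demos for the full pattern), a render-on-demand setup looks something like this:

import java.awt.Frame;
import net.java.games.jogl.*;

public class OnDemandCanvas {
  public static void main(String[] args) {
    GLCanvas canvas = GLDrawableFactory.getFactory()
        .createGLCanvas(new GLCapabilities());
    canvas.addGLEventListener(new GLEventListener() {
      public void init(GLDrawable drawable) {}
      public void display(GLDrawable drawable) {
        GL gl = drawable.getGL();
        gl.glClear(GL.GL_COLOR_BUFFER_BIT | GL.GL_DEPTH_BUFFER_BIT);
        // ... draw the scene ...
      }
      public void reshape(GLDrawable drawable, int x, int y, int w, int h) {}
      public void displayChanged(GLDrawable drawable, boolean modeChanged,
                                 boolean deviceChanged) {}
    });
    Frame frame = new Frame("Render on demand");
    frame.add(canvas);
    frame.setSize(512, 512);
    frame.setVisible(true);
    // No Animator: the canvas redraws only when repaint() is called,
    // i.e. when the scene actually changes.
    canvas.repaint();
  }
}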

[quote]The intrinsics which implement direct buffers’ put and get methods are optimized in both the HotSpot Client and Server VMs to generate machine instructions in place of what appear to be native method calls.
[/quote]
Where did you get this information from? It’s certainly different from what I’ve read about HotSpot, not to mention my own experience with JNI (which is that native methods have much more overhead than ordinary method calls). Then there’s the simple empirical fact that in my program, the calls to put() are clearly not getting optimized away, because they’re taking a huge amount of time.

Following up in more detail, I called getClass().getName() on my FloatBuffer to determine exactly what variety it is. It turns out to be a DirectFloatBufferU. Looking up the source code for that class, I find that put() is implemented as follows:

public FloatBuffer put(float x) {
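  // nextPutIndex() performs the bounds check and advances the position;
  // ix() converts that index into a raw memory offset for Unsafe.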
  unsafe.putFloat(ix(nextPutIndex()), ((x)));
  return this;
}

unsafe.putFloat() is simply declared as an ordinary native method - nothing magic there.

[quote]How are you profiling your Java code? In the 1.4.x series of VMs running HotSpot’s internal profiler with the -Xprof command line option should probably give the most accurate information.
[/quote]
I used both -Xprof and -Xrunhprof (using PerfAnal to view the results).

[quote]You may also want to look into how you’re using JOGL. If you’re developing an interactive application you probably don’t need to use JOGL’s Animator class to drive an event loop.
[/quote]
I’m not using Animator, but thanks for the suggestion. There are definitely other optimizations I can try. Next up is probably to convert all my meshes to triangle strips, since that will reduce the size of the arrays. But that’s going to be a lot of work, and I’ll still probably be spending more time building arrays than rendering geometry.
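Roughly, the strip-based draw path would look like this (vertBuffer and normBuffer being my direct FloatBuffers, and stripLength a stand-in for the vertex count of one strip):

// A strip of n triangles needs only n + 2 vertices instead of 3n, so the
// buffers that get refilled each render shrink by almost a factor of three.
gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
gl.glEnableClientState(GL.GL_NORMAL_ARRAY);
gl.glVertexPointer(3, GL.GL_FLOAT, 0, vertBuffer);
gl.glNormalPointer(GL.GL_FLOAT, 0, normBuffer);
gl.glDrawArrays(GL.GL_TRIANGLE_STRIP, 0, stripLength);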

Peter

[quote]Where did you get this information from?
[/quote]
I work on the Java HotSpot VM at Sun Microsystems and implemented some of the optimizations for direct buffers in the client compiler. Both the client and server compilers contain intrinsics for sun.misc.Unsafe which turn those native method calls into just a few machine instructions. Additionally, the client compiler has an intrinsic for the direct buffers’ range check, because otherwise the quality of its generated code suffered too much. The server compiler didn’t need this optimization.

[quote]I used both -Xprof and -Xrunhprof (using PerfAnal to view the results).
[/quote]
I’m still surprised that -Xprof shows a lot of time being spent in DirectFloatBufferU.put(). That method should be inlined into its caller, so it should disappear from the Compiled code section of the profile. If you try running e.g. the VertexBufferObject demo out of the jogl-demos workspace with -Xprof, you should see that only a tiny fraction of the ticks go into that method, because it quickly gets inlined into its caller.

Are you using both direct and non-direct FloatBuffers in your application? That could prevent HotSpot from inlining DirectFloatBufferU.put() and could be the cause of your slowdown. At the moment the only good workaround is to avoid instantiating any FloatBuffers through FloatBuffer.wrap().
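For illustration (this code isn’t from your app), here is how a single call site can end up seeing two FloatBuffer subclasses:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class MixedBuffers {
  static void fill(FloatBuffer buf) {
    buf.put(1.0f);  // inlines well only while every caller passes the same subclass
  }

  public static void main(String[] args) {
    FloatBuffer direct = ByteBuffer.allocateDirect(4)
        .order(ByteOrder.nativeOrder()).asFloatBuffer();  // DirectFloatBufferU
    FloatBuffer heap = FloatBuffer.wrap(new float[1]);    // HeapFloatBuffer

    fill(direct);  // so far the put() call site sees one receiver type
    fill(heap);    // now it sees two, and the fast path can be lost
  }
}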

Thanks for your patience! I may have been misinterpreting the profiler output. To be precise, hprof doesn’t show the CPU time being spent in put() itself, but rather in the lines that call put(). But since method inlining often causes it to lose the top level or two of the stack, I didn’t think much of that. The lines in question look like this:

vertBuffer.put((float) v.x);
vertBuffer.put((float) v.y);
vertBuffer.put((float) v.z);

and so on for all three vertices and all three normals that define each polygon. As you see, these lines only do three things:

  1. Look up the value of a field.
  2. Cast a double to a float.
  3. Call put().

Since I expected 1 and 2 to take negligible time, and a look at the JDK source code showed a lot happening inside put(), I concluded that was where the time was getting spent.

As a first experiment to confirm that hprof wasn’t confused about what lines were taking up most of the time, I commented out the calls to glVertexPointer(), glNormalPointer(), and glDrawArrays(). As expected, the drawing time didn’t decrease significantly, even though nothing was actually getting drawn.

I then replaced each of the lines above with

vertBuffer.put(1.0f);

Sure enough, the time decreased by about 1/3. So looking up the value and/or casting it to a float really was taking significant time. To determine which, I tried

double d = 1.0;

vertBuffer.put((float) d);
vertBuffer.put((float) d);

and surprisingly, that was even faster. (Though on reflection, I probably shouldn’t have been surprised.) So the slow operation seems to be looking up the field value, which implies that the real bottleneck is memory bandwidth.

That seems plausible, though it also points to a discouraging conclusion: I shouldn’t even bother trying any further OpenGL optimizations, because none of them will do any good. The bottleneck is simply transferring the scene geometry from memory to the processor, and there’s no way I can avoid doing that.

The one thing that might help would be copying some of the vertex coordinates into arrays that can be reused between renders. That way it would at least be more contiguous in memory. At the cost of increased total memory usage, of course…

Thanks a lot for your help!

Peter

Field accesses and memory traffic in general can become a bottleneck. If you are doing redundant field accesses in an inner loop, you should hoist those accesses into local variables outside the loop so the compiler can more easily optimize them. This is less of an issue with the server compiler, but that isn’t currently available on Mac OS X.

However, I think the main issue is the representation of geometry in your application. If you store models as e.g. a Point3d[] along with connectivity information, then you are probably not getting any cache coherence for the actual points. If this is the case, you may want to change your representation to a “Point3dCollection” class. Its internal representation can be a direct FloatBuffer, and you can have a get() operation which either returns a new Point3d every time or one out of a small internal pool (if performance testing indicates it is necessary). This way the rest of your app which needs to operate on Point3ds can continue to do so, but you can pass the direct FloatBuffer down to OpenGL unmodified.
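A sketch of what I mean (the class name and API are illustrative, and Point3d stands in for whatever point class your app actually uses):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import javax.vecmath.Point3d;  // assumption: any class with public x, y, z will do

public class Point3dCollection {
  private final FloatBuffer data;  // x, y, z triples, tightly packed

  public Point3dCollection(int numPoints) {
    data = ByteBuffer.allocateDirect(numPoints * 3 * 4)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer();
  }

  public void set(int i, Point3d p) {
    data.put(i * 3,     (float) p.x);
    data.put(i * 3 + 1, (float) p.y);
    data.put(i * 3 + 2, (float) p.z);
  }

  // Returns a fresh Point3d; a small internal pool could avoid the
  // allocation if profiling shows it matters.
  public Point3d get(int i) {
    return new Point3d(data.get(i * 3), data.get(i * 3 + 1), data.get(i * 3 + 2));
  }

  // The backing buffer can be handed to glVertexPointer() unmodified.
  public FloatBuffer buffer() {
    return data;
  }
}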

You may want to look at the VertexArrayRange demo in the jogl-demos workspace and the sources of the JCanyon demo. JCanyon generates all of its geometry on the fly each frame using a simple level-of-detail algorithm, and achieves a throughput of millions of triangles per second.