Another follow-up:
Current bottleneck is glGetError() checks, at 35% of native time. Obviously when performance testing (and even in a released game) I don't care to check for errors - but LWJGL inconveniently, and I'd argue wrongly, forces an error check on every Display.update(). Unfortunately this causes a pipeline flush for some reason (hence the unreasonably long time spent in this method). I hacked it out of LWJGL, so that the check now only occurs in LWJGL debug mode.
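For the record, the hack boils down to something like this - a sketch, not the exact LWJGL source, assuming the check sits in Display.update() and that LWJGLUtil.DEBUG reflects the org.lwjgl.util.Debug system property:

```java
// Inside Display.update() - only pay for the glGetError() round trip
// when LWJGL debug mode is explicitly enabled (-Dorg.lwjgl.util.Debug=true).
if (LWJGLUtil.DEBUG) {
    Util.checkGLError(); // throws OpenGLException if glGetError() != GL_NO_ERROR
}
```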
Next bottleneck - glMapBufferARB() is making a call to the driver to query the size of the buffer being mapped, again causing a pipeline flush/stall, and that's now taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the error-check hack) and used the new glMapBuffer() overload that takes an explicit size argument. Why the method doesn't just use the capacity() of the old buffer is a bit odd, since that's the only safe value to pass in at this point anyway (the limit() can change after the mapping is made), but there we go. That gives a small improvement in frame rate - good. I'm definitely on the right track here.
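In code the change looks roughly like this (a sketch - the exact overload in the nightly may differ, and vboID / VBO_SIZE_BYTES are my own names here):

```java
import static org.lwjgl.opengl.ARBBufferObject.*;
import static org.lwjgl.opengl.ARBVertexBufferObject.*;

import java.nio.ByteBuffer;

// The VBO was created with glGenBuffersARB() and sized with
// glBufferDataARB() elsewhere.
ByteBuffer map(int vboID, long VBO_SIZE_BYTES, ByteBuffer oldMapped) {
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vboID);

    // Old overload: LWJGL asks the driver for the buffer's size first,
    // causing the flush/stall:
    //   glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, oldMapped);

    // New overload: pass the size we already know, so no driver round trip.
    return glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB,
            VBO_SIZE_BYTES, oldMapped);
}
```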
Now glMapBufferARB() itself is the actual bottleneck. Hmm. Why should this be taking 20% of my native time? Ahh, of course - the buffer is probably still locked by the GPU, which is reading from it while I try to map it, so the map blocks. The solution is very simple: double buffer it. So I now use two identically sized VBOs and swap them each frame - the GPU reads from one while I write to the other.
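A sketch of the double-buffering scheme, assuming both VBOs were created and sized up front (writeVertices() and drawFrom() are hypothetical helpers standing in for my actual geometry code):

```java
// Two identically sized VBOs; alternate between them each frame so the
// one being mapped is never the one the GPU is still drawing from.
int[] vbos = new int[2];   // filled in by glGenBuffersARB + glBufferDataARB at init
int frame = 0;

void render() {
    int current = frame & 1; // flips 0/1 each frame

    // The GPU should still be busy with last frame's commands on
    // vbos[current ^ 1], so this map shouldn't block.
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbos[current]);
    ByteBuffer mapped = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB,
            VBO_SIZE_BYTES, null);
    writeVertices(mapped);   // hypothetical: write this frame's geometry
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

    drawFrom(vbos[current]); // hypothetical: issue draw calls from this VBO

    frame++;
}
```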
Suddenly I'm getting a 50% increase in frame rate. There may be a bit more to come if I try triple buffering the VBOs as well, but I'm not sure whether that will actually make any difference (even if my display is triple buffered).
Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it’s not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?
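(The usual suggestion I've seen is a cast plus a fix-up for negatives, valid only while values fit in an int - something like this:)

```java
// Fast floor: the (int) cast truncates toward zero, so step down one
// for negative non-integers. Only correct for values within int range.
public static int fastFloor(double x) {
    int i = (int) x;
    return (x < i) ? i - 1 : i;
}
```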
Cas