Immediate mode rendering is dead

Riven, you’re throwing away the mapped Buffer each time. Can you try your tests passing the previous buffer into the method so that it can be re-used if possible?

Also - each iteration of the test isn’t one frame, it’s one object. Without a swapbuffers in there, and a few other state changes, this really isn’t testing much of use.

Cas :slight_smile:

We are comparing VertexArrays <=> VBO, not glBegin/glEnd <=> VBO

(unless you mean VertexArray with ‘immediate mode’ but I’m fairly sure that is not correct)

I’m not throwing it away…


               glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

               ByteBuffer driverSideBuffer = null;

               for (int i = 0; i < 256; i++)
               {
                  driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, byteCount, driverSideBuffer);
                  javaSideBuffer.clear();
                  driverSideBuffer.clear();
                  driverSideBuffer.put(javaSideBuffer);
                  glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
...
               }

Isn’t this about throughput? If you really want to take everything into account, post a demo that can toggle among the 3 modes. Too much work? Well, I’m lazy too :slight_smile:

Meh, what do I know anyway :slight_smile: My sprite engine (well, it’s more of a 2d scenegraph now I suppose) is faster, that’s all I care about!

Cas :slight_smile:

Perhaps the valuable lesson for now is that VAs are still valuable for very dynamic geometry, since the VBO's rendering benefits don't outweigh its slower updates. From my personal experience VBOs offer a very consistent speed boost when they're not constantly being updated, but that is to be expected.

Also, for people who encourage the use of display lists: I've had troubling issues with them on Mac hardware. I've seen cases where rendering with them is significantly slower than VBOs or VAs, and others where it's much faster. They've also been the cause of (or coincided very suspiciously with) odd graphical glitches in the Mac windowing manager/compositor.

The internal format of data in DLs is also very very slightly different in some cases than arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me.

Cas :slight_smile:

I concur: I've seen a program of mine that makes quite heavy use of display lists run much, much worse on a PowerBook with (afaik) semi-decent graphics than on a pretty basic older Windows laptop with integrated graphics… it was OK on a PowerMac with, I think, an 8600GT (as one would hope).

I bet this only shows up when rendering identical VA/VBO and DL geometry, (maybe) causing z-fighting and slightly different edges in the rasterization step. Minecraft is built entirely using DLs, and from what I see it is 'good enough'. Maybe you should ask Markus Persson what mysterious bug reports he gets from his players.

If the whole thing’s DLs then that’s probably perfectly fine.

Cas :slight_smile:

Strangely, Minecraft runs way slower on my new computer than it did on my old one despite the graphics card being much better.

I’m going to try implementing a pure VBO rendering path… some day… soon, maybe… This thread is very informative.
I’ll post results.

That’s quite possibly because DLs are being “emulated” now rather than a first-class driver citizen. If you see what I mean.

Cas :slight_smile:

Another followup:

Current bottleneck is glCheckError() calls at 35% native time. Obviously when performance testing (and even in a released game) I don’t care to check for errors - but LWJGL inconveniently and definitely wrongly forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably lengthy time spent in this method). I hacked it out of LWJGL, so that it now only occurs when in LWJGL debug mode.

Next bottleneck - glMapBufferARB() is making a call to the driver to get the current size of the currently mapped buffer - again causing a pipeline flush/stall. Now that's taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check error hack) and used the new glMapBuffer() method that takes a size argument. Why the method doesn't just take the capacity() of the buffer is a bit odd, but there we go; capacity() is the only safe value to actually pass in at this point, as the limit() can change after the mapping is made. A small improvement in framerate results - good. I'm definitely on the right track here.

Now glMapBufferARB() itself is the actual bottleneck. Hmm. Why should this be taking 20% of my native time? Ahh of course - because it’s probably locked by the GPU. The solution is very simple - double buffer it. So I now use two identically sized VBOs, and swap them each frame. The GPU reads from one while I write to the other.

Suddenly I’m getting a 50% increase in frame rate. There may be a bit more to come if I try triple buffering the VBOs as well but I’m not quite sure if that’s actually going to make any difference (even if my display is triple buffered).
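Roughly, the swap looks like this (a minimal sketch; the class and method names and the VBO ids are made up for illustration, and the actual GL calls are only indicated in comments since they need a live context):

```java
// Sketch of the double-buffered ("ping-pong") VBO idea: two identically
// sized buffers swapped each frame, so the CPU fills one while the GPU
// is still reading the other.
class PingPongVBO {
    private final int[] names;  // the two VBO names, e.g. from glGenBuffersARB
    private int frame;

    PingPongVBO(int nameA, int nameB) {
        names = new int[] { nameA, nameB };
    }

    // Returns the VBO to write into this frame, alternating every call.
    int nextWriteBuffer() {
        int vbo = names[frame++ & 1];
        // glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        // glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);
        // ByteBuffer mapped = glMapBufferARB(GL_ARRAY_BUFFER_ARB,
        //         GL_WRITE_ONLY_ARB, byteCount, cachedBuffer);
        return vbo;
    }
}
```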

Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it’s not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?

Cas :slight_smile:

from Ken Perlin’s simplex noise:

```
// This method is a *lot* faster than using (int)Math.floor(x)
private static int fastfloor(double x) {
    return x > 0 ? (int) x : (int) x - 1;
}
```

I believe Matzon said that the plan was to have a lwjgl and a lwjgl-debug (where one is used for development and one for production).

Also, what app/add-on do you use to check the native time? I mostly debug my apps using VantageAnalyzer but that one only tells the method times inside the .class.

Mike

Good old Ken.

I’m using -Xprof - works well enough for my purposes (even though it does slow things down a little itself).

Cas :slight_smile:

Should be >=, not >, unless you want floor(0) == -1.

It's still not correct. Consider fastfloor(-1).
But it is fast. :slight_smile:
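For what it's worth, a variant that also gets negative whole numbers right (so fastfloor(-1.0) gives -1, not -2) only costs one extra comparison. This is a sketch, not from the thread; the wrapper class name is made up:

```java
// fastfloor that is also correct at negative whole numbers.
class FloorHack {
    static int fastfloor(double x) {
        int xi = (int) x;              // casting truncates toward zero
        return x < xi ? xi - 1 : xi;   // step down only for fractional negatives
    }
}
```

It still dodges the slow StrictMath.floor() path while matching its results for any input that fits in an int.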

Hey again Cas,

I went ahead and ran my benchmarker with XProf so I could compare it to your findings. I’m using an LWJGL nightly build from a few days ago (right after the ATI driver issue was fixed). I don’t even get glCheckError() as a blip on the radar. The big one for me is MacOSXContextImplementation.nSwapBuffers (which kind of makes sense) and then glDrawArrays. Am I just missing something?

[quote=“princec,post:52,topic:34805”]
That’s weird, you shouldn’t need any hack for that. Since 2.2.0 glCheckError() is only called during display update when org.lwjgl.util.Debug is set to true. See this post.
[/quote]

Hm, I’m almost absolutely certain I had to put an if (LWJGLUtil.DEBUG) {} check around the call last night to stop it from checking. I will report back later when I get back from work.

@4x4: if you’re blocked in swapBuffers, that just means that the GPU still has some rendering to do to finish the current frame. Triple buffering can help a bit here I think, but that’s buried in the drivers/OS and beyond LWJGL’s direct control.

Cas :slight_smile: