Throughput problem with VBOs

Hi everybody.

I’m a bit stumped. It seems as if my program is resending the contents of my VBO every frame.

Specs:
AsRock M3N78D, nForce 720D, AM3
AMD Phenom II X6 1100T Black Edition, 3.3GHz
Radeon HD 6950 2GB, PCI-E x16 2.0, DP

Software:
Updated drivers (CPU/Chipset as well as GPU)
JOGL 2.0 RC5
Eclipse (shouldn’t matter, though)

Program setup:
FPSAnimator is supposed to call display() as often as possible (up to 1000 FPS).
The game uses an Octree; Occlusion Culling and Frustum Culling are both implemented on it.
All to-be-rendered triangles are stored in a VBO

What I’d like to do:
Render as many Triangles as possible (just for starters). Right now I render Cubes.

Benchmarking:
Speed is directly influenced by the size of the VBO.
3 072 MB: I allocate using integer (32 Bit), so I get an overflow (=crash).
1 536 MB: 1 FPS (I get graphical errors, such as lines across the entire screen.)
768 MB: 2 FPS (From here on all looks nice)
384 MB: 4 FPS
192 MB: 8 FPS
96 MB: 16 FPS
48 MB: 32 FPS
24 MB: 60 FPS
12 MB: 118 FPS
6 MB: 200 FPS
3 MB: 350 FPS
1.5 MB: 510 FPS
750KB: 810 FPS

The VBO is constructed ONCE and then no longer updated (I disabled updating for now).

Clearly, this stuff should run faster. Could you have a look and maybe see a bug I’ve overlooked? Maybe something simple?
I mean, 24 MB = 360k Triangles surely isn’t anywhere near the limit of my Hardware :wink:

My rendering Code looks like this:


	public void display(GLAutoDrawable drawable) {

		Now = System.nanoTime();
		MSSinceLastFrame = (double) (Now - LastCall) / 1000000;
		LastCall = Now;
		System.out.println("MS since last call:	" + MSSinceLastFrame + "	FPS:	" + 1000/MSSinceLastFrame);

		do_look();	                               //Adjusts lookatx, lookaty, lookatz

		do_move(MSSinceLastFrame);        //Adjusts posx, posy, posz
		
		GL2 currGL = drawable.getGL().getGL2();
		
		currGL.glClear(GL.GL_COLOR_BUFFER_BIT | GL.GL_DEPTH_BUFFER_BIT);   //Reset
		currGL.glLoadIdentity();
		glu.gluLookAt(posx, posy, posz, lookatx, lookaty, lookatz, 0, 1, 0);          //Player position and direction

		currGL.glBindBuffer(GL.GL_ARRAY_BUFFER, vbo_handle);
		
		currGL.glEnableClientState(GL2.GL_VERTEX_ARRAY);
		currGL.glEnableClientState(GL2.GL_TEXTURE_COORD_ARRAY);  
		currGL.glEnable(GL.GL_TEXTURE_2D);     
		
		currGL.glBindTexture(GL.GL_TEXTURE_2D, texture.getTextureObject(currGL));

		currGL.glVertexPointer(3, GL.GL_FLOAT, 5 * 4, 0);                     //Each Vertex has 3 Coords...
		currGL.glTexCoordPointer(2, GL.GL_FLOAT, 5 * 4, 3 * 4);            //... and 2 Texture Coordinates, packed interleaving
		
		currGL.glDrawArrays(GL.GL_TRIANGLES, 0, 4 * Buffer.capacity()); //Render the whole Buffer

		currGL.glDisableClientState(GL2.GL_VERTEX_ARRAY);
		currGL.glDisableClientState(GL2.GL_TEXTURE_COORD_ARRAY);  

		currGL.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
		currGL.glBindTexture(GL.GL_TEXTURE_2D, 0);
		
		double tmp = System.nanoTime();
		double mspassedwhilerendering = (tmp - LastCall) / 1000000;
		System.out.println("MS Passed on CPU:	" + mspassedwhilerendering + "	FPS (CPU):	" + 1000/mspassedwhilerendering);

		currGL.glFinish();
		drawable.swapBuffers();
	}

This is how I initialize things:


@Override
	public void init(GLAutoDrawable drawable) {
		drawable.setAutoSwapBufferMode(false);
		
		GL2 gl = drawable.getGL().getGL2();
		glu = new GLU();
		gl.glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
		gl.glClearDepth(1.0f);
		gl.glShadeModel(GL2.GL_SMOOTH);
		gl.glEnable(GL.GL_DEPTH_TEST);
		gl.glDepthFunc(GL.GL_LEQUAL);
		gl.glEnable(GL.GL_TEXTURE_2D);

		// reshape
		glu.gluLookAt(posx, posy, posz, lookatx, lookaty, lookatz, 0, 1, 0);
		glu.gluPerspective(45.0, SCREEN_WIDTH / SCREEN_HEIGHT, 1, 100);
		// fov, aspect ratio, near & far clipping planes
		
		if (vbo_handle <= 0) {
			int[] tmp = new int[1];
			gl.glGenBuffers(1, tmp, 0);
			vbo_handle = tmp[0];
		}
		Buffer = Buffers.newDirectFloatBuffer(90*65536);	//==65k Cubes == 6M Floats == 24 MBytes

		int numBytes = Buffer.capacity() * 4;	// Allocate the Buffer (data is set on a per-Octree-Leaf basis!)
		gl.glBindBuffer(GL.GL_ARRAY_BUFFER, vbo_handle);
		gl.glBufferData(GL.GL_ARRAY_BUFFER, numBytes, null, GL.GL_DYNAMIC_DRAW);
		gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
	}

And for each Leaf in my Octree, I do this (just to be clear: this is done exactly once per leaf, and never repeated):


int numBytes = 30 * 3 * cubes * 4;            //cubes < 512 in every case (this is performed in the Leaf of the Octree)
WWV.gl.glBindBuffer(GL.GL_ARRAY_BUFFER, WWV.vbo_handle);
WWV.gl.glBufferSubData(GL.GL_ARRAY_BUFFER, myPos * 4, numBytes, WWV.Buffer);
WWV.gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
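
(A bounds check like the following could be added right before that upload; this is only a sketch using the names above, and it assumes myPos is counted in floats, which is why it gets multiplied by 4:)

// Sketch: make sure this leaf's slice fits inside the buffer allocated in init()
int allocatedBytes = WWV.Buffer.capacity() * 4;   // same size that was passed to glBufferData
int offsetBytes = myPos * 4;

if (offsetBytes + numBytes > allocatedBytes) {
    throw new IllegalStateException("Leaf upload overruns the VBO: offset " + offsetBytes
            + " + " + numBytes + " bytes > " + allocatedBytes + " bytes allocated");
}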

Do you see anything wrong? Or is there an example which I could look at?

All tips/suggestions welcome ;), and thanks for the help.

Your card “only” has 2GB of memory. You’re probably not gonna get realtime performance with over 1GB of vertices. >_>

The whole point of VBOs is that the data should be stored in VRAM, unless GL_STREAM_DRAW is passed when the memory is allocated/uploaded with glBufferData in which case the driver is allowed to not store the data in VRAM. I experimented with this value on my NVidia GPU and there was no difference at all, regardless of what I chose, so it seems like at least NVidia ignores this value.

I’m pretty sure your problem is not a vertex bottleneck but a fragment bottleneck. Try to make your objects cover a smaller area (scale them smaller or something), disable anti-aliasing, disable lighting, etc., to reduce the per-pixel cost. My laptop’s GPU can draw around 1.4 million triangles per frame at 60 FPS without any texturing or anything. Considering you have a desktop computer and a desktop GPU, I’d estimate it to be 3-4x as fast, so around 5 million triangles per frame at 60 FPS would make sense. This again points at a fragment bottleneck.

Of course I wasn’t expecting real-time performance with 1.5GB VRAM occupied :wink:

Okay, so we agree that the data is/should be residing in VRAM, that’s good. I used GPU-Z 0.5.8 to look at my VRAM usage: the correct amount is used. But this doesn’t show whether it resides there and there is a fragment bottleneck, or whether it is retransmitted and there is a bug in my implementation…

Now, I don’t really know anything about a “fragment bottleneck”. Could you give me some link?
Currently I use textured Triangles, without any lighting or transparency or normals etc. With deactivated textures the 24 MB scenario improves from 60 to 63 FPS. Not really what I hoped^^
No Triangles overlap (they do touch, though).
But there are many triangles hidden behind other triangles; could this cause problems? I always imagined not rendering triangles would be faster than rendering them^^

Maybe you could also give me some tips on how to implement your suggestions?
-cover a smaller area: You mean a smaller area on the screen? The cubes are the same size as in Minecraft, and I run Minecraft without stutters :wink:
-I never enabled VSync or anything else. I just did:


public Main() {
		super("");
		//
		GLProfile glp = GLProfile.get(GLProfile.GL2);
		caps = new GLCapabilities(glp);
		canvas = new GLCanvas(caps);

		canvas.addGLEventListener(render_VBO);
		canvas.addKeyListener(render_VBO);
		canvas.addMouseListener(render_VBO);
		canvas.addMouseMotionListener(render_VBO);

		getContentPane().add(canvas);

		setResizable(false);
		// setSize(1920, 1080);

		if (!isDisplayable())
			setUndecorated(true);
		setExtendedState(JFrame.MAXIMIZED_BOTH);
		// TODO: True Fullscreen
		// GraphicsDevice gd =
		// GraphicsEnvironment.getLocalGraphicsEnvironment().getDefaultScreenDevice();
		// gd.setFullScreenWindow(this);

		//Make Cursor Invisible
		Toolkit tk = Toolkit.getDefaultToolkit();
		Image image = tk.createImage("");
		Point point = new Point(0, 0);
		Cursor cursor = tk.createCustomCursor(image, point, "");
		setCursor(cursor);

		setVisible(true);
		canvas.requestFocus();
		canvas.requestFocusInWindow();

		FPSAnimator animator = new FPSAnimator(canvas, 1000);
		animator.add(canvas);
		animator.start();
	}

Could it be that Vsync or anything else is active by default and needs to be deactivated? If so, is there a list of what starts as active?
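
For what it’s worth, the only related switch I know of in JOGL is the swap interval; I haven’t checked what the default is, so this is just a guess:

// Guess: explicitly request vsync off. As far as I know this must be called
// while the context is current, e.g. inside init() or display().
// 0 = vsync off, 1 = vsync on.
drawable.getGL().setSwapInterval(0);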

Thanks again for the tips. When googling for fragment bottleneck, I found this:


I’ll read through it and let you know.

Greetings

The count argument to glDrawArrays() represents the number of elements to be combined into primitives, not the number of bytes.

I think since you’ve packed 2-tuple tex coords and 3-tuple vertices, the number of elements to render is Buffer.capacity() / 5.

If this is indeed an error, it is likely the cause of your underwhelming performance. I have often experienced undefined and non-deterministic performance when there are errors like this. Some examples include weird slow downs, missing vertices, segfaults, etc. It all depends on where the memory is, and what the GPU tries to do to prevent invalid accesses, and how it recovers.
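
In code, using the names from your display() method, the corrected call would look something along these lines (just a sketch of the /5 idea, untested against your project):

// 3 position floats + 2 texcoord floats per vertex, interleaved
final int FLOATS_PER_VERTEX = 5;

// glDrawArrays wants the number of vertices, not bytes and not floats
int vertexCount = Buffer.capacity() / FLOATS_PER_VERTEX;

currGL.glVertexPointer(3, GL.GL_FLOAT, FLOATS_PER_VERTEX * 4, 0);
currGL.glTexCoordPointer(2, GL.GL_FLOAT, FLOATS_PER_VERTEX * 4, 3 * 4);
currGL.glDrawArrays(GL.GL_TRIANGLES, 0, vertexCount);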

Years ago I remember running into a problem where pushing 1 million overlapping triangles (i.e. they were hidden behind other triangles) was much slower than when the triangles were more evenly distributed around the screen. I think GPUs have a fast path for performing quick depth checks when there aren’t that many on the same pixel, or if the depths are far apart. If all of your triangles are packed together, it might have to go into slower, more accurate floating point checks.

@lhkbob: YES. That was it. (with the /5 instead of *4) Silly me, not reading the doc!

And I’ll make sure to use a special algorithm to ensure no triangles are hidden. Already have something in mind^^

Thanks again guys, I think that should allow me to continue forward with my game :wink:

Well, just make sure that whatever special algorithm you’re using isn’t more expensive than relying on the GPU. In my story about hidden triangles causing slow downs, it was 1 million triangles packed into a 200x200 area in a larger window.

That situation is pretty contrived and probably wouldn’t show up in a real game.

Okay, approximating the cost of drawing fragments is pretty easy:

  • “Fragments” that are outside the screen cost nothing since the triangle is clipped to the screen edges.
  • Triangles that do not cover any pixels (or MSAA sample positions) do not cost anything.
  • Triangles that pass the depth test have a cost depending on what shader/fixed functionality you’re running.
  • Triangles that do NOT pass the depth test still have a cost:
    • This cost mostly depends on whether Early-Z was used. Shaders that output a custom depth value per fragment have this disabled, meaning that the shader has to be run before the depth test. In this case the cost is almost the same as if it had passed the depth test.
    • With Early-Z the cost is lower, but still not free. I’d estimate it to about half the cost of simple shading.

Your GPU can fill a huge number of pixels per frame. My little laptop can handle around 79 million colored pixels per frame at 60 FPS, but this number drops insanely fast if you add texturing, lighting, etc. For reference, a 1920x1080 screen is approximately 2 million pixels, so with 4 million triangles each one should only cover a few pixels for the total number of fragments to stay low enough not to be a bottleneck; overdraw is something you want to avoid. And again, your GPU is around 3-4 times faster than mine. xd
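
As a quick back-of-envelope check (the numbers are the rough estimates from above, not measurements of your HD 6950):

// Rough fill-rate budget from the figures above (laptop GPU, plain coloring)
long fillBudgetPerFrame = 79000000L;     // pixels the laptop can shade per frame at 60 FPS
long screenPixels = 1920L * 1080L;       // ~2.07 million pixels on a 1080p screen

// How many times the whole screen could be overdrawn before fill rate alone
// becomes the limit on that laptop; texturing/lighting shrink this quickly.
double maxOverdraw = (double) fillBudgetPerFrame / screenPixels;   // ~38x
System.out.println("Rough max overdraw factor: " + maxOverdraw);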

Hello again

So far everything runs great and fast, too. Unless it crashes, that is :wink:

The error report is here: http://pastebin.de/23611
According to this report, the error happens in renderBuffer(). Here’s the source:


void renderBuffer() {
		
		glBindTexture(GL2.GL_TEXTURE_2D, WWV.tex_handles[LODLevel]);	//It's not the texture, that's almost guaranteed
		glBindBuffer(GL.GL_ARRAY_BUFFER, vbo_handle);					//The VBO Handle is good as well (see glGetBufferSubData below)
		
		System.out.print("Rendering Chunk	" + start_x + "	" + start_y + "	" + start_z);
		System.out.println("	vbo_handle:	" + vbo_handle + "	Buffer.capacity():	" + Buffer.capacity() + "	WWV.tex_handles[LODLevel]:	" + WWV.tex_handles[LODLevel]);
			
		Buffer.clear();
			
		glGetBufferSubData(GL.GL_ARRAY_BUFFER, 0, Buffer.capacity()*4, Buffer);
			
		for (int i = 0; i < Buffer.capacity(); i++) {
			System.out.print(Buffer.get(i) + " ");
		}
		System.out.println();

		glVertexPointer(3, GL.GL_FLOAT, 5 * 4, 0);
		glTexCoordPointer(2, GL.GL_FLOAT, 5 * 4, 3 * 4);
		
		System.out.println("Before drawing");
		glDrawArrays(GL.GL_TRIANGLES, 0, Buffer.capacity() / 5);
		
		System.out.println("Done");
	}

Buffer is of type FloatBuffer, and is used to initially create the VBO. The crashes always happen when Buffer.capacity() == 180 and WWV.tex_handles[LODLevel] == 5. Following is one output (crashes happen randomly):
`
Rendering Chunk 352 768 448 vbo_handle: 11378 Buffer.capacity(): 180 WWV.tex_handles[LODLevel]: 5
368.0 784.0 448.0 0.0 0.0 368.0 768.0 448.0 0.0 0.0625 368.0 768.0 464.0 0.0625 0.0625 368.0 784.0 448.0 0.0 0.0 368.0 784.0 464.0 0.0625 0.0 368.0 768.0 464.0 0.0625 0.0625 352.0 784.0 448.0 0.0 0.0 352.0 768.0 448.0 0.0 0.0625 352.0 768.0 464.0 0.0625 0.0625 352.0 784.0 448.0 0.0 0.0 352.0 784.0 464.0 0.0625 0.0 352.0 768.0 464.0 0.0625 0.0625 368.0 768.0 464.0 0.0 0.0625 368.0 784.0 464.0 0.0 0.0 352.0 768.0 464.0 0.0625 0.0625 368.0 784.0 464.0 0.0 0.0 352.0 768.0 464.0 0.0625 0.0625 352.0 784.0 464.0 0.0625 0.0 368.0 768.0 448.0 0.0 0.0625 368.0 784.0 448.0 0.0 0.0 352.0 768.0 448.0 0.0625 0.0625 368.0 784.0 448.0 0.0 0.0 352.0 768.0 448.0 0.0625 0.0625 352.0 784.0 448.0 0.0625 0.0 368.0 784.0 448.0 0.0 0.0625 352.0 784.0 448.0 0.0 0.0 368.0 784.0 464.0 0.0625 0.0625 352.0 784.0 448.0 0.0 0.0 368.0 784.0 464.0 0.0625 0.0625 352.0 784.0 464.0 0.0625 0.0 368.0 768.0 448.0 0.0 0.0625 352.0 768.0 448.0 0.0 0.0 368.0 768.0 464.0 0.0625 0.0625 352.0 768.0 448.0 0.0 0.0 368.0 768.0 464.0 0.0625 0.0625 352.0 768.0 464.0 0.0625 0.0
Before drawing

A fatal error has been detected by the Java Runtime Environment:

(Rest of the error message, see pastebin)
`

Does anyone have an idea what could cause this?
It isn’t the texture; it works with other VBOs (I draw about 1800 VBOs each frame, and about 1500 of them use this texture).
It isn’t the VBO itself; the data I grabbed seems okay (or do you see something wrong?).
It shouldn’t be the draw command, right?

Fun fact: As long as I just look around in my world, the game never crashes. Only when I move does the program sometimes crash (as I said, it is random; it’s not movement = crash).
If that could be the problem: I use a KeyListener, and if some key is pressed a boolean is set to true, and when it is released it is set to false. There is no actual movement during the rendering of the scene. It is strictly one thread performing movement and view, and THEN starting to render the world… These should not interfere with each other, right?

Any Ideas? Maybe something I do completely wrong?
Thanks again for any hints and tips :wink:

Seems like an access violation problem. The functions are slightly different in JOGL, but shouldn’t

glGetBufferSubData(GL.GL_ARRAY_BUFFER, 0, Buffer.capacity()*4, Buffer);

be

glGetBufferSubData(GL.GL_ARRAY_BUFFER, 0, Buffer.capacity(), Buffer);

since that method takes a FloatBuffer (?), it will automatically multiply the size by 4 bytes per float. That could produce a crash at random times.

From the JVM crash, it looks like the call to glDrawArrays() is what is causing the problem, not glGetBufferSubData.

Usually a JVM crash during a call to glDrawArrays or glDrawElements is because you are attempting to reference a vertex that is outside the valid range of the vertex attribute VBOs or arrays.

The arguments in your pointer setup look fine, so the only thing I can think of is that when you allocate the VBO in vbo_handle, it has a size smaller than 4 * 180 (which is what you’d expect if you called glBufferData with Buffer). Looking at your original code, it appears as though you are packing multiple octree leaf data into a large VBO, so there is a chance that this is screwed up.

I would also recommend using a DebugGL wrapper around your GL to check for errors.
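
If you haven’t used it yet: in JOGL 2 the usual way (as far as I know) is to wrap the pipeline once in init(), something like:

// Wrap the GL pipeline in DebugGL2: every call is followed by glGetError(),
// and errors are thrown as GLException instead of failing silently later.
public void init(GLAutoDrawable drawable) {
    drawable.setGL(new DebugGL2(drawable.getGL().getGL2()));
    // ... rest of your init code ...
}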

The VBOs are no longer packed together.

But I followed your advice with DebugGL, and the call to glGetBufferSubData caused an error: GL_INVALID_OPERATION. If I read the specs right, this is only thrown if “the reserved buffer object name 0 is bound to target”.
http://www.opengl.org/sdk/docs/man/xhtml/glGetBufferSubData.xml

It’s already late in the evening, so my brain is half asleep - but I still can’t figure out why the reserved buffer object name 0 should be bound to GL.GL_ARRAY_BUFFER!?

Hm, on second thought I removed all the “debug” code, and this is what remained:


		glBindTexture(GL2.GL_TEXTURE_2D, WWV.tex_handles[LODLevel]);
		glBindBuffer(GL2.GL_ARRAY_BUFFER, vbo_handle);
		
		glVertexPointer(3, GL2.GL_FLOAT, 5 * 4, 0);
		glTexCoordPointer(2, GL2.GL_FLOAT, 5 * 4, 3 * 4);

		glDrawArrays(GL2.GL_TRIANGLES, 0, Buffer.capacity() / 5);

Clearly, this will no longer throw an error on glGetBufferSubData - but it should still fail (this was my original code before debugging, btw).
And: It still fails, but WITHOUT an error message from DebugGL. Any ideas why OpenGL would fail so hard that not even the debugger gets a glimpse of what went wrong?

As always, thanks for any tips :wink:

EDIT: @theagentd: Even if I remove the *4 in glGetBufferSubData I still get the same error. The error seems to be caused by the state of the VBO or something like that, not the call itself…

Offsets, strides, and the contents of VBOs are not checked by the debugger. As an example of a contrived situation, I can have a texture vbo with half the elements of the vertex vbo. If both are configured as vertex attributes, and I attempt to render all of the vertices, the driver will start pulling in “texture” information from past the end of the shorter texture vbo.

Depending on how the VBOs are laid out, you can walk into garbage VBO data, or get access violations that cause the JVM to crash. That is the case you’re seeing, and why it tends to be unpredictable. Although that’s the cause, I unfortunately don’t have much advice for solving it except to very carefully walk through the rendering and make sure the values passed to OpenGL are what you’d expect.
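
One cheap check along those lines (a sketch, assuming the 5-floats-per-vertex layout and a GL2 object `gl` like in your init code): ask the driver how big the bound VBO actually is and compare it with what the draw call is about to read.

// With the VBO bound to GL_ARRAY_BUFFER, query its real size from the driver
int[] bufferSize = new int[1];
gl.glGetBufferParameteriv(GL.GL_ARRAY_BUFFER, GL.GL_BUFFER_SIZE, bufferSize, 0);

int vertexCount = Buffer.capacity() / 5;   // what glDrawArrays will be told to draw
int bytesNeeded = vertexCount * 5 * 4;     // 5 floats per vertex, 4 bytes each

if (bytesNeeded > bufferSize[0]) {
    System.err.println("Draw would read past the end of the VBO: needs "
            + bytesNeeded + " bytes, buffer only has " + bufferSize[0]);
}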

Ah, thank you. That at least explains why I don’t get an error message :wink:

I’ll try working out why I can’t get data with glGetBufferSubData(); the error is probably hidden inside the VBO data in this case (as you said). I’ll let you know what I screwed up when I find it.

Found it.

Due to a logic bug, 2 chunks in my world got the same VBO handle (say, 292). One chunk put x bytes into the VBO, the other y bytes. Of course this crashed :wink:
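
In case anyone runs into the same thing: a throwaway check like this (a hypothetical helper, not something from my actual code) would have caught it immediately:

// Hypothetical debug helper: remember every VBO handle a chunk claims and
// complain loudly if two chunks end up sharing one.
private static final java.util.Set<Integer> usedVboHandles = new java.util.HashSet<Integer>();

static void registerVboHandle(int handle) {
    if (!usedVboHandles.add(handle)) {
        throw new IllegalStateException("VBO handle " + handle + " was assigned to two chunks!");
    }
}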

Thanks again for the invaluable help. Especially that thingy with DebugGL; I have a feeling it will help me a lot in the future…