Performance

WhiteHexagon · July 4, 2006, 10:31pm

Hi All,

I’m just testing out a profiler to try to get some more performance out of my game. I can already see that I need to add some more display lists for parts of the rendering, but one thing that surprised me was the following:


method name                                             time (ms)   share   invocation count
Model.draw(GL, int, int, int, boolean)                     24,687    93 %            426,877
  com.sun.opengl.impl.GLImpl.glBegin(int)                   4,937    19 %          2,561,262
  com.sun.opengl.impl.GLImpl.glEnd()                        3,265    12 %          2,561,262
  MyUtils.calcNormal(float[], float[], float[])             2,812    11 %          2,561,262
  com.sun.opengl.impl.GLImpl.glNormal3fv(float[], int)      2,453     9 %          2,561,262
  ...some other calls

Are glBegin and glEnd really so expensive? It seems so. Anyway, I can fix this issue no problem; I just thought the numbers were quite interesting.

Cheers

Peter
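The invocation counts in the table allow a quick sanity check. Below is a minimal sketch that uses only the figures quoted above; the class and method names are made up for illustration. Dividing the glBegin count by the Model.draw count gives the number of glBegin/glEnd pairs per draw, and dividing time by count gives a rough per-call cost. Profiler instrumentation probably inflates the absolute timings, so the per-call numbers are best read as rough indications only.

```java
/** Back-of-the-envelope figures derived from the profiler table above. */
final class ProfileArithmetic {
    // Values copied from the profiler output in the post.
    static final long MODEL_DRAW_CALLS = 426877L;
    static final long GL_BEGIN_CALLS = 2561262L;
    static final double GL_BEGIN_MS = 4937.0;
    static final double GL_END_MS = 3265.0;

    /** Number of glBegin/glEnd pairs issued by each Model.draw call. */
    static double beginsPerDraw() {
        return (double) GL_BEGIN_CALLS / MODEL_DRAW_CALLS;
    }

    /** Average profiled cost of one call, in microseconds. */
    static double microsPerCall(double totalMs, long calls) {
        return totalMs * 1000.0 / calls;
    }

    public static void main(String[] args) {
        System.out.printf("glBegin per Model.draw: %.1f%n", beginsPerDraw());
        System.out.printf("glBegin: %.2f us/call%n", microsPerCall(GL_BEGIN_MS, GL_BEGIN_CALLS));
        System.out.printf("glEnd:   %.2f us/call%n", microsPerCall(GL_END_MS, GL_BEGIN_CALLS));
    }
}
```

With these figures each Model.draw issues six glBegin/glEnd pairs, which would be consistent with one pair per face of a box, although Model.draw itself is not shown in the thread.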

emzic · July 5, 2006, 8:15am

what profiler did you use?

Spasi · July 5, 2006, 8:27am

In immediate mode rendering the real work happens on glBegin and glEnd. Usually everything in between is buffered and submitted as a batch at glEnd. The glBegin overhead is probably GL state validation, pipeline flushing, etc.

WhiteHexagon · July 5, 2006, 8:28am

YourKit. It’s a bit pricey, but they offer a free 30-day evaluation.

WhiteHexagon · July 5, 2006, 9:07am

Thanks Spasi, that would explain why none of the other gl methods show up, but it makes it kind of hard to know what’s slowing things down.

Anyway, I got my scene rendering 3x faster already, but maybe someone has some tips for a novice OGL programmer.

The specific scene I was having problems with was displaying only around 5600 Lego-style bricks. Not that much to ask, I thought, but it was really killing my fps.


loop 5600 times:
    //draw solid brick
    gl.glPolygonMode(GL.GL_FRONT_AND_BACK, GL.GL_FILL);
    drawSingleBrick using GL_QUADS and textured.
    //draw wireframe to highlight the edges in black
    gl.glPolygonMode(GL.GL_FRONT_AND_BACK, GL.GL_LINE);
    drawSingleBrick using GL_LINES

I’ve now split this into two iterations (one for solid drawing, one for wireframe highlight drawing), each of which gets compiled into a display list. Then I just call the two display lists. This approach seems to triple the performance, but I know that some of this is because the normal calculations and color lookups are only done once, during compilation.

But I’m still a bit confused. I would have thought that with a display list the code then lives on the graphics card, but my CPU is still maxing out at 98% during rendering. Is JOGL running these lists on the CPU? Should I be using vertex arrays for this type of stuff, or would I have the same problem?

Any tips are really appreciated.

Cheers

Peter

java (build 1.5.0_06-b05)
jogl beta4
Win2k
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=GeForce 6200/AGP/SSE2
DRAWABLE_GL=com.sun.opengl.impl.GLImpl

bahuman · July 5, 2006, 10:15am

I’m curious, what is your current FPS?

Oh, and maybe this might get you on your way:
http://www.opengl.org//resources/faq/technical/performance.htm

What is the OpenGL code for drawing the Lego block? How many triangles does it contain?

WhiteHexagon · July 5, 2006, 11:42am

Well, it’s quite a low-spec graphics card. I was down to 7fps and am back to 20fps now after the changes, but I think I’m CPU bound for some reason.

Thanks for the link.

This is the code for drawing the solid bricks. For the wireframe I call the same code but use GL_LINES:

        
gl.glBindTexture(GL.GL_TEXTURE_2D, textureStud);
gl.glEnable(GL.GL_TEXTURE_2D);
gl.glPolygonMode(GL.GL_FRONT_AND_BACK, GL.GL_FILL);

loop 5600 times:
    float tx = drawQx + (QUEST_TILE_LENGTH * x);
    float ty = drawQy + drawY;
    int height = mapData[qx][qy].height[x][y];
    float tz = ClientConstants.STD_BRICK_HEIGHT * height;
    // adjust height to be zero based
    gl.glColor3fv(questColorTable[height - CorkConstants.MIN_MAP_HEIGHT], 0);

    // solid brick
    gl.glTranslatef(tx, ty, tz);
    gl.glBegin(GL.GL_QUADS);
    drawQuestTile(gl, false);
    gl.glEnd();

    gl.glTranslatef(-tx, -ty, -tz);

And this is the code for a single brick:

    
private static final void drawQuestTile(final GL gl, final boolean wireframe) {
        float BASE = 0.0f; // base height...
        float HEIGHT = STD_BRICK_HEIGHT;
        float WIDTH = 4;
        float LENGTH = 4;
        float[] normal;

        // front
        normal = Utils.calcNormal(new float[] { 0, 0, BASE }, new float[] { LENGTH, 0, BASE }, new float[] { LENGTH, 0, HEIGHT });
        gl.glNormal3fv(normal, 0);
        gl.glVertex3f(0, 0, BASE);
        gl.glVertex3f(LENGTH, 0, BASE);
        gl.glVertex3f(LENGTH, 0, HEIGHT);
        gl.glVertex3f(0, 0, HEIGHT);

        // back
        normal = Utils.calcNormal(new float[] { LENGTH, WIDTH, BASE }, new float[] { 0, WIDTH, BASE }, new float[] { 0, WIDTH, HEIGHT });
        gl.glNormal3fv(normal, 0);
        gl.glVertex3f(LENGTH, WIDTH, BASE);
        gl.glVertex3f(0, WIDTH, BASE);
        gl.glVertex3f(0, WIDTH, HEIGHT);
        gl.glVertex3f(LENGTH, WIDTH, HEIGHT);

        // w end
        normal = Utils.calcNormal(new float[] { 0, WIDTH, BASE }, new float[] { 0, 0, BASE }, new float[] { 0, 0, HEIGHT });
        gl.glNormal3fv(normal, 0);
        gl.glVertex3f(0, WIDTH, BASE);
        gl.glVertex3f(0, 0, BASE);
        gl.glVertex3f(0, 0, HEIGHT);
        gl.glVertex3f(0, WIDTH, HEIGHT);

        // e end
        normal = Utils.calcNormal(new float[] { LENGTH, 0, BASE }, new float[] { LENGTH, WIDTH, BASE }, new float[] { LENGTH, WIDTH, HEIGHT });
        gl.glNormal3fv(normal, 0);
        gl.glVertex3f(LENGTH, 0, BASE);
        gl.glVertex3f(LENGTH, WIDTH, BASE);
        gl.glVertex3f(LENGTH, WIDTH, HEIGHT);
        gl.glVertex3f(LENGTH, 0, HEIGHT);

        // bottom
        normal = Utils.calcNormal(new float[] { 0, 0, 0 }, new float[] { LENGTH, 0, 0 }, new float[] { LENGTH, WIDTH, 0 });
        gl.glNormal3fv(normal, 0);
        gl.glVertex3f(0, 0, BASE);
        gl.glVertex3f(LENGTH, 0, BASE);
        gl.glVertex3f(LENGTH, WIDTH, BASE);
        gl.glVertex3f(0, WIDTH, BASE);

        // top
        normal = Utils.calcNormal(new float[] { 0, 0, HEIGHT }, new float[] { LENGTH, 0, HEIGHT }, new float[] { LENGTH, WIDTH, HEIGHT });
        gl.glNormal3fv(normal, 0);
        if(!wireframe)gl.glTexCoord2f(0.0f, 0.0f);
        gl.glVertex3f(0, 0, HEIGHT);
        if(!wireframe)gl.glTexCoord2f(QUEST_TILE_STUD_COUNT, 0.0f);
        gl.glVertex3f(LENGTH, 0, HEIGHT);
        if(!wireframe)gl.glTexCoord2f(QUEST_TILE_STUD_COUNT, QUEST_TILE_STUD_COUNT);
        gl.glVertex3f(LENGTH, WIDTH, HEIGHT);
        if(!wireframe)gl.glTexCoord2f(0.0f, QUEST_TILE_STUD_COUNT);
        gl.glVertex3f(0, WIDTH, HEIGHT);

    }
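The Utils.calcNormal helper is not shown in the thread. A minimal sketch of what such a helper typically does, assuming counter-clockwise winding and a cross product of two edge vectors, might look like the following. The class name FaceNormals, the brick height of 1, and the FRONT_NORMAL constant are assumptions for illustration only.

```java
/** Hypothetical stand-in for the Utils.calcNormal helper, which the post does not show. */
final class FaceNormals {
    /** Unit normal of the plane through p0, p1, p2 (counter-clockwise winding). */
    static float[] calcNormal(float[] p0, float[] p1, float[] p2) {
        float ax = p1[0] - p0[0], ay = p1[1] - p0[1], az = p1[2] - p0[2];
        float bx = p2[0] - p0[0], by = p2[1] - p0[1], bz = p2[2] - p0[2];
        // The cross product a x b is perpendicular to both edges.
        float nx = ay * bz - az * by;
        float ny = az * bx - ax * bz;
        float nz = ax * by - ay * bx;
        float len = (float) Math.sqrt(nx * nx + ny * ny + nz * nz);
        if (len == 0f) {
            throw new IllegalArgumentException("degenerate triangle");
        }
        return new float[] { nx / len, ny / len, nz / len };
    }

    // Computed once at class-load time instead of once per brick per frame.
    // The height of 1 is an assumed value; STD_BRICK_HEIGHT is not given in the post.
    static final float[] FRONT_NORMAL = calcNormal(
            new float[] { 0, 0, 0 }, new float[] { 4, 0, 0 }, new float[] { 4, 0, 1 });

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(FRONT_NORMAL));
    }
}
```

Since every brick has the same six axis-aligned faces, the six normals could be held in constants like FRONT_NORMAL rather than being recomputed, with three array allocations each, on every call.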

Spasi · July 5, 2006, 2:37pm

There are three reasons for the low performance you’re seeing:

1. drawQuestTile is making immediate mode calls (glVertex/Normal/TexCoord), which is the slowest way to submit vertices. You’re creating a lot of arrays too, which contribute to bad performance. With display lists, consider this problem solved.
2. You’re submitting too many low polygon batches. 5600 objects is a big number, even for a high-end CPU. The overhead of each draw call is considerably larger than the GPU effort to render six quads. The GPU is basically sitting idle and waiting for the CPU. You may be able to solve this by packing groups of bricks (say 100 at a time) in a vertex array and drawing all of them at once.
3. GL_LINE drawing is not hardware accelerated on consumer-level GPUs.
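The batching idea in the second point can be sketched roughly as follows. This is only an illustration under stated assumptions: the class BrickBatcher, the brick dimensions, and the origins array are hypothetical, and only positions are stored. Each brick's translation is baked into the vertex data, so a group of 100 bricks becomes one buffer that could be drawn with one call.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.ArrayList;
import java.util.List;

final class BrickBatcher {
    static final int VERTS_PER_BRICK = 24;  // 6 faces x 4 corners, as GL_QUADS
    static final int FLOATS_PER_VERT = 3;   // x, y, z only in this sketch
    // Assumed dimensions; the real STD_BRICK_HEIGHT is not given in the thread.
    static final float LENGTH = 4f, WIDTH = 4f, HEIGHT = 1f;

    // Corner selectors (0 or 1 per axis), in the face order used by drawQuestTile.
    private static final int[] CORNERS = {
        0,0,0, 1,0,0, 1,0,1, 0,0,1,   // front
        1,1,0, 0,1,0, 0,1,1, 1,1,1,   // back
        0,1,0, 0,0,0, 0,0,1, 0,1,1,   // west
        1,0,0, 1,1,0, 1,1,1, 1,0,1,   // east
        0,0,0, 1,0,0, 1,1,0, 0,1,0,   // bottom
        0,0,1, 1,0,1, 1,1,1, 0,1,1,   // top
    };

    /** Packs bricks into direct buffers of at most bricksPerBatch bricks each. */
    static List<FloatBuffer> buildBatches(float[][] origins, int bricksPerBatch) {
        List<FloatBuffer> batches = new ArrayList<FloatBuffer>();
        for (int start = 0; start < origins.length; start += bricksPerBatch) {
            int end = Math.min(start + bricksPerBatch, origins.length);
            int floats = (end - start) * VERTS_PER_BRICK * FLOATS_PER_VERT;
            FloatBuffer buf = ByteBuffer.allocateDirect(floats * 4)
                    .order(ByteOrder.nativeOrder()).asFloatBuffer();
            for (int b = start; b < end; b++) {
                for (int c = 0; c < CORNERS.length; c += 3) {
                    // Bake the translation in, replacing the glTranslatef pair per brick.
                    buf.put(origins[b][0] + CORNERS[c] * LENGTH);
                    buf.put(origins[b][1] + CORNERS[c + 1] * WIDTH);
                    buf.put(origins[b][2] + CORNERS[c + 2] * HEIGHT);
                }
            }
            buf.flip();
            batches.add(buf);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<FloatBuffer> batches = buildBatches(new float[5600][3], 100);
        System.out.println(batches.size() + " batches");
        // Per batch, rendering would then be roughly (not run here, needs a GL context):
        //   gl.glVertexPointer(3, GL.GL_FLOAT, 0, buf);
        //   gl.glDrawArrays(GL.GL_QUADS, 0, buf.remaining() / 3);
    }
}
```

With 5600 bricks and groups of 100 this yields 56 buffers, so 56 draw calls per pass instead of 5600.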

Spasi · July 5, 2006, 2:44pm

For more details about #2, google for “instancing”. It’s the method provided by Direct3D to solve this problem. OpenGL does not support it because GL’s overhead is generally much lower than D3D’s, but it’s still a problem in situations like yours. A technique called “pseudo-instancing” can be used in OpenGL, but that requires vertex shaders, which is probably too advanced for you right now.

WhiteHexagon · July 5, 2006, 3:48pm

Thanks for the great feedback!

#1 The code I posted also now has two display lists wrapping it, one for the solid bricks and one for the wireframe. That’s where I got my first increase from 7fps to 20fps, but I still suffer 100% CPU load. Maybe I should try vertex arrays instead? Or do you think this might be a JOGL issue? From my reading I thought display lists were compiled on the GPU and then used directly from there, so shouldn’t I be getting almost 0 CPU load?

#2 If I’m using display lists or vertex arrays, I assume this is no longer an issue?

I’d like to look more at vertex shaders in the near future, they sound very powerful, but my priority is to get something basically playable and then start to improve it. I didn’t want to worry too much about the performance, but the client has gone down from 60fps a few months ago to yesterday’s low of 5fps, so I thought I’d better take a break and find out where the problem was before I go too far down the wrong path. The info here is really helping, thanks.

#3 Interesting. Is there a better technique for highlighting the edges of the bricks than just drawing a black wireframe over the brick? (It doesn’t seem to work very well anyway because of the z-buffer, unless I make the line width equal to 2.) You can see the effect I have on my front page: http://whitehexagon.com

Thanks

Peter

Spasi · July 5, 2006, 5:44pm

#1 Yes, DLs are compiled and (probably) stored on the GPU and they are very fast. You’ve solved this problem; I just wanted to help you understand why this was an issue before moving to DLs. #2 is your big problem now.

#2 Unfortunately it is. Actually, it’s a problem no matter how you’re rendering (DLs, VBOs, vertex arrays). It is caused by the pipelined way GPUs work. Each time you make a draw call, a lot of stuff happens (from state validation to, worst case, pipeline flushing). The overhead of each such call piles up to the point that the CPU struggles to keep up with the GPU (and usually fails). The problem isn’t how you’re rendering, but the massive number of 5600 draw calls.

FYI, most of the redesign in DX10 was done because of exactly this problem. Even the new geometry shaders, apart from the unique possibilities they offer, are meant to improve this situation.

So, you have to accept the fact that you can’t possibly make 5600 draw calls. You either design your game around that, or use techniques like the one I described in my previous post. IIRC, current GPUs work better in batches of more than 500-1000 triangles and the number of draw calls should be lower than 1000 (there are certain papers that have exact numbers).

#3 Yeah, I know what you mean. In our terrain editor, I’ve solved the artifacts problem with a vertex shader (it can be done without a VS). I’m just pushing the terrain grid by a small amount along the normal of each vertex and it looks great. I couldn’t be bothered to search for non-GL_LINE line rendering (I’m using lines only in the editor for the terrain grid and debugging), but I think there are certain techniques you can investigate (using clever texturing IIRC).

darkprophet · July 5, 2006, 5:50pm

You could use a vertex/fragment shader pair to do edge detection and blend that over the scene. GPGPU has a few code snippets for edge detection filters on the GPU.

DP

kitfox · July 5, 2006, 6:29pm

I’m working on a terrain editor myself. At the moment, I’m sending everything to the graphics card with individual calls to glVertex, glNormal and glTexCoord, and getting a pretty decent frame rate. Even when I throw in GL_LINEs to highlight the edges in my editor, the performance is reasonable. I don’t think I have a super speedy machine, but 3200 triangles the slow way seems to be working for me.

Anyhow, I wrote it this way just to let me start debugging things quickly. I plan to move everything to VBOs. Now, will my VBO render faster if I write an algorithm to stripify my terrain, or should I just leave them as individual triangles?

I’m implementing a ROAM style algorithm, and it seems to be working, but I’m getting odd artifacts on the tessellated terrain. When smooth shaded with normals and a solid color, there are these star shapes surrounding concave or convex vertices. I’m pretty sure the normals are correct, but having these shapes in an otherwise smooth terrain breaks the visual continuity. Do I need to tweak the normals somehow?

I’m also curious about your idea of using a vertex shader to display geometry. Does this mean you would just upload a square grid of points as a single object and write a clever vertex shader to fold it into a terrain shape? Does this really give faster performance? How do you adjust for level of detail?

WhiteHexagon · July 5, 2006, 9:21pm

#2 For those numbers, what do you define as a draw call? Is one draw call = one gl.glDoSomething method, or one Begin/End block?
Would splitting the display list into 5 smaller lists help in any way? Or is it really down to the number of gl.glDoSomething calls inside the display list? I guess that is currently 5600 * number of calls in drawQuestTile (35ish?) = 196,000.

I think I’m missing something here, so I shall do some more reading on this topic, because I need somehow to solve having this many bricks on screen, if not for the landscape then for sure once I have all the other scenery and creatures on screen. Maybe I will also try some vertex arrays here; I use them already for another part of the game and they seem to perform quite well. Since I only need one surface of the brick textured, maybe I can also draw those surfaces separately and render the sides of the bricks untextured. Lots of ideas… But it’s been a long day and my head is buzzing with all this. Thanks to everyone for the feedback. I shall have another read of this over the weekend and hopefully all will be clearer.

Niwak · July 6, 2006, 9:06am

To WhiteHexagon:
A draw call in your situation is “gl.glCallList”. It doesn’t matter how many glVertex,… are in the display list.
The thing you should minimize is the number of gl.glCallList calls. Therefore splitting the display list is not a solution; you would make things worse. Regarding your display list, there are “good habits” given by card manufacturers that perhaps you are not applying. Here are some:

- Don’t perform state changes in a display list (like glBindTexture, glTranslate,…). This can render the display list rather inefficient, since it forces the driver to perform a state validation when you call the display list even if you did not change the state.
- Use a uniform vertex format, i.e. when you specify a vertex you should always provide the same information (for example: a normal + a texture coordinate + a vertex). You are not doing this, since in your snippet normals are specified once per face, texture coordinates just for one of the faces, and color seems to be one per model.

Anyway, your model is composed of only 6x4 = 24 vertices, which is very low. I’m not sure using 5600 display lists is a very efficient technique. You could try to create one FloatBuffer, put interleaved data for all your blocks in it, update a block’s coordinates directly in the FloatBuffer when it moves, and submit this to the GPU with a single glDrawArrays call. I think you would get a considerably higher frame rate (at least if not all blocks are moved each frame).
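A rough sketch of the single interleaved buffer described above, assuming a layout of texture coordinate, normal, then position per vertex (the ordering OpenGL calls GL_T2F_N3F_V3F). The class InterleavedBricks and its methods are hypothetical names for illustration; filling in the actual brick geometry and the GL submission are left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class InterleavedBricks {
    // Layout per vertex: texcoord (2) + normal (3) + position (3) = 8 floats.
    static final int FLOATS_PER_VERT = 8;
    static final int VERTS_PER_BRICK = 24;
    static final int POSITION_OFFSET = 5;  // position follows 2 + 3 floats

    final FloatBuffer data;

    InterleavedBricks(int brickCount) {
        // Direct, native-order buffer so it can be handed to the GL later.
        data = ByteBuffer.allocateDirect(brickCount * VERTS_PER_BRICK * FLOATS_PER_VERT * 4)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
    }

    /** Index of the first float belonging to the given vertex of the given brick. */
    static int vertexBase(int brick, int vertex) {
        return (brick * VERTS_PER_BRICK + vertex) * FLOATS_PER_VERT;
    }

    /** Moves one brick by (dx, dy, dz), touching only its 24 position triples. */
    void translateBrick(int brick, float dx, float dy, float dz) {
        for (int v = 0; v < VERTS_PER_BRICK; v++) {
            int p = vertexBase(brick, v) + POSITION_OFFSET;
            data.put(p, data.get(p) + dx);
            data.put(p + 1, data.get(p + 1) + dy);
            data.put(p + 2, data.get(p + 2) + dz);
        }
    }

    public static void main(String[] args) {
        InterleavedBricks bricks = new InterleavedBricks(5600);
        bricks.translateBrick(42, 4f, 0f, 0f);
        System.out.println("floats in buffer: " + bricks.data.capacity());
    }
}
```

Because every vertex carries the same eight floats, this layout also satisfies the uniform vertex format advice, at the cost of repeating the normal for each corner of a face.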

To KitFox:
I have spent some time implementing a terrain algorithm for my game. In this process, I initially tried ROAM. The result was that it was somewhat inefficient; the fact that you have to generate a new index array for each frame, with all the stripping problems, made it too CPU intensive for my game. I have moved to a very straightforward system similar to geomipmapping, which performs really well and was much easier and faster to implement. So, before wasting too much time on ROAM, I would suggest quickly trying a brute force system like geomipmapping to see whether it fits your needs.

Vincent

cylab · July 6, 2006, 9:08am

You could texture your quad using an image containing those highlight lines.

WhiteHexagon · July 7, 2006, 9:54pm

Hi All,

I’ve done some work on this the last couple of days with some interesting results. I tried Niwak’s idea of using a vertex array. For code simplicity I split the rendering into two VAs: one for the brick tops with a studded texture, and another for the brick sides (including the edges drawn into the texture, as cylab suggested).

So the results: when displaying just the tops of the bricks the VA runs as I would expect, 5% CPU and around 60fps (I presume this is just limited by the vsync rate of my TFT, which is currently also 60). Is there a way to disable that? I remember with GL4Java there used to be a special call to disable the vsync limit; does JOGL have something similar?

When I try to display the second VA (the sides) as well, the CPU jumps up to 100% and the frame rate drops to 30fps (but that’s still better than the 20fps from using display lists!).

So I took out the rendering of the tops of the bricks for now, and just have the single vertex array for the sides of the bricks to try to optimize that. I found that by adjusting the count parameter on glDrawArrays I could find some switchover point where I start to become CPU bound. I’m drawing 2025 bricks, which worked out to about 32,400 vertices.


vertices  CPU%
10000     6
15000     15
17000     20
20000     25
21000     50
22000     90
23000     98

So it seems I can only draw around half the brick sides I need taking this approach. I was thinking that maybe I could use a QUAD_STRIP for the 4 sides which would reduce the vertex count from 16 down to 10 per brick.

Or I could even try to calculate a quad strip for a complete row of bricks across the whole map. Question: would I still be able to change the color of each face using a quad strip, or will I end up with just a mess of blended colors because the vertices are shared? How would that work with textures if I wanted different textures per face? I realise I can have both textures in a single 256x256 texture and just cut out the piece I need for each face, but how would I specify this while drawing a quad strip? It seems impossible.
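To make the 16 versus 10 vertex count concrete, here is a minimal sketch of the four sides of one brick as a closed GL_QUAD_STRIP. The class SideStrip is a hypothetical name, and the top-then-bottom ordering is chosen on the assumption that front faces use the default counter-clockwise winding.

```java
final class SideStrip {
    /** Returns 10 vertices (x, y, z) forming a closed GL_QUAD_STRIP around a box. */
    static float[] sideStrip(float length, float width, float base, float height) {
        // Footprint corners walked once around, returning to the start to close the loop.
        float[][] ring = { {0, 0}, {length, 0}, {length, width}, {0, width}, {0, 0} };
        float[] out = new float[ring.length * 2 * 3];
        int i = 0;
        for (float[] c : ring) {
            // Top vertex first, then bottom, so each side winds counter-clockwise
            // when seen from outside the box.
            out[i++] = c[0]; out[i++] = c[1]; out[i++] = height;
            out[i++] = c[0]; out[i++] = c[1]; out[i++] = base;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] strip = sideStrip(4f, 4f, 0f, 1f);
        System.out.println((strip.length / 3) + " vertices instead of 16");
    }
}
```

The saving comes from each corner pair being shared by two neighbouring sides. The flip side is that a shared vertex carries only one normal, one color and one texture coordinate, which is exactly the per-face difficulty raised in the question; keeping separate quads avoids it at the price of the extra vertices.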

Anyway overall things are getting better. I’m just still confused over why the VA rendering starts to impact the CPU performance so drastically, and not even linearly.

Cheers

Peter

Spasi · July 8, 2006, 9:50am

Hi WhiteHexagon,

I told you what to do in my second post:

[quote]You may be able to solve this by packing groups of bricks (say 100 at a time) in a vertex array and drawing all of them at once.
[/quote]
There are two pieces of information in that sentence, a) use vertex arrays and b) don’t pack all the bricks in a single VA, but rather groups of them.

bahuman · July 8, 2006, 12:27pm

WhiteHexagon:

So the results: when displaying just the tops of the bricks the VA runs as I would expect, 5% CPU and around 60fps (I presume this is just limited by the vsync rate of my TFT, which is currently also 60). Is there a way to disable that? I remember with GL4Java there used to be a special call to disable the vsync limit; does JOGL have something similar?

When I try to display the second VA (the sides) as well, the CPU jumps up to 100% and the frame rate drops to 30fps (but that’s still better than the 20fps from using display lists!).

So I took out the rendering of the tops of the bricks for now, and just have the single vertex array for the sides of the bricks to try to optimize that. I found that by adjusting the count parameter on glDrawArrays I could find some switchover point where I start to become CPU bound. I’m drawing 2025 bricks, which worked out to about 32,400 vertices.

When you try the same measurement with only the top of the bricks, do you get the same result?

Also: drawing each brick separately is not the most efficient approach. You can easily optimize this by constructing new display lists each time the user attaches a new brick to the construction. For example, if the user built a wall, you can put the entire wall in a display list, even if the layout of the wall may change within the next 30 seconds (another mouse click). 30 seconds is, ideally, about 1800 frames, so you’ll have saved yourself a lot of transfers over the AGP (or PCI-e) bus, even if it looks like a lot of code to execute. Once you decide to put an entire wall in a display list, you could even cheat and use fewer vertices than you would for every brick separately, as long as you tile your texture correctly! (If your texture coordinates wrap rather than clamp, the texture will repeat itself.)
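The texture tiling trick can be sketched as follows, assuming a row of identical bricks laid side by side along x and a texture whose wrap mode is GL_REPEAT. The class MergedWall and its vertex layout (x, y, z, s, t) are hypothetical and only illustrate the idea.

```java
final class MergedWall {
    /**
     * One quad covering the front face of a row of bricks.
     * Each vertex is x, y, z, s, t. The s coordinate runs from 0 to the brick
     * count, so a repeating texture tiles once per brick.
     */
    static float[] frontFace(int bricks, float brickLength, float height) {
        float w = bricks * brickLength;
        return new float[] {
            0, 0, 0,       0,      0,
            w, 0, 0,       bricks, 0,
            w, 0, height,  bricks, 1,
            0, 0, height,  0,      1,
        };
    }

    public static void main(String[] args) {
        float[] quad = frontFace(10, 4f, 1f);
        System.out.println((quad.length / 5) + " vertices for 10 brick fronts");
    }
}
```

Ten brick fronts drawn individually would need 40 vertices; the merged quad needs 4. This only works where neighbouring faces are coplanar and share the same texture and color.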

WhiteHexagon · July 8, 2006, 10:09pm

Hi Spasi, I appreciate your help with this, but I didn’t understand your earlier tip at first (see reply #13). But I’m learning slowly. I was going to split my display list and that’s where I got confused. I’ve now tried your approach of batching the VA data. My data breaks down into 9 chunks quite nicely, so that’s what I’ve tried first. I can see though that this is probably still too much data for the 100 or so items you mentioned, but now I’m drawing less data and no outlines. So I’m drawing 9x225 brick parts as below.

bind top texture
loop 9x:
    glDrawArrays[i] (225 single textured quads)

bind side texture:
loop 9x:
    glDrawArrays[i]  (225 x 4 brick sides (no base))

So I presume that with this approach I’m only doing 18 ‘draw calls’? For the brick tops I presume that quads will be split internally into two triangles, so that would be 450 triangles per call, right? And 1800 triangles for the 4 sides, which is more than the 1000 you mentioned. So as expected the CPU was still maxing out for the complete scene.

So next I tried to split the side drawing into two batches, east & north, and south & west.

bind top texture
loop 9x:
    glDrawArrays[i] (225 single textured quads)

bind side texture:
loop 9x:
    glDrawArrays[i]  (225 x 2 brick sides (no base))
    glDrawArrays[i]  (225 x 2 brick sides (no base))

That brings the triangle count down to 900 in each of those vertex arrays. But sadly I’m still seeing a maxed out CPU and 30fps (the same as a single VA). Do you think I need to make these batches even smaller?
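The counts quoted in this post can be checked with a little arithmetic. A minimal sketch follows, assuming each quad is split into two triangles as the post presumes; the class BatchCounts is a made-up name.

```java
final class BatchCounts {
    static final int CHUNKS = 9;
    static final int BRICKS_PER_CHUNK = 225;

    /** Triangles in one draw call, assuming each quad becomes two triangles. */
    static int triangles(int bricks, int quadsPerBrick) {
        return bricks * quadsPerBrick * 2;
    }

    /** Draw calls per frame: one per chunk for the tops plus the side calls per chunk. */
    static int drawCalls(int sideCallsPerChunk) {
        return CHUNKS + CHUNKS * sideCallsPerChunk;
    }

    public static void main(String[] args) {
        System.out.println("tops per call:        " + triangles(BRICKS_PER_CHUNK, 1));
        System.out.println("4 sides per call:     " + triangles(BRICKS_PER_CHUNK, 4));
        System.out.println("2 sides per call:     " + triangles(BRICKS_PER_CHUNK, 2));
        System.out.println("draw calls, 1st try:  " + drawCalls(1));
        System.out.println("draw calls, 2nd try:  " + drawCalls(2));
        System.out.println("side vertices total:  " + CHUNKS * BRICKS_PER_CHUNK * 4 * 4);
    }
}
```

The figures match the post: 450, 1800 and 900 triangles per call, 18 draw calls in the first layout, and 32,400 side vertices in total. Splitting the sides raises the number of draw calls to 27 while the total triangle count per frame stays the same.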

To bahuman: I could display all 2025 tops in a single VA at 60fps and 4%CPU. The problem was when I started drawing the bricks sides as well. Then I seemed to reach some switch over point where the CPU started taking load.

Thanks again to everyone who’s helping out on this. I hope I can show something nice at the end of it all.

Cheers

Peter