OpenGL Perf Tuning

Hi all,
In my opinion and limited experience, Java programmers come up with weird and wonderful algorithms to squeeze performance out of OpenGL, while C/C++ programmers don’t need to do that just yet. But when simple Java benchmarks become as fast as their C/C++ counterparts, Java’s scenegraphs and such will have all these cool tricks already built in, and hence will be faster than the equivalent C/C++ applications.

So feel free to post any ideas for getting performance out of OpenGL that you think are uber cool. My contribution is state sorting:

From what I’ve read (very little), a state change flags the server to do a complete validation of the server and client states so that they both conform, and OpenGL seems to take this time to do some memory juggling too, which can take a while. So a good way to avoid this cost is to minimise the number of state changes. The obvious solution is to have some sort of bucket: clear the bucket every frame, sort your objects with some cool sorting algorithm, then loop through the bucket and render. All is well and the kittens are drinking milk…

However, all this sorting every frame can be computationally expensive, especially when you have a lot of objects, and from what I’ve seen the fastest sort is radix sort, which runs in O(n). My contribution has an init-time sorting cost of O(n log m), where n is the number of objects and m is the number of states, but at run time (i.e. frame to frame) there is no sorting overhead at all. It does come with a limitation, which I’ll discuss later.

As an example, let’s say you have four states: texture state, lighting state, pixel/vertex shader state, and depth state. They vary in cost, but for simplicity’s sake let’s say the list above is ordered from most to least expensive (nothing beats actual measurements, so do your own benchmarks and see which really costs more).

Now, at init time you build a tree where each node represents one state change (i.e. an enable and a disable of that state) and holds a list of objects (see the picture). You create your material with its fancy states and such, then recurse down the tree until no node below the current one supports the material (a simple boolean “isStateAvailable(Material)” method in each node will do). In the diagram, if your object has texture state and lighting, it is stored in the lighting node (on the left); as you can see, the tree is ordered top-to-bottom and left-to-right. All of this happens at init time: once you have found the right node, store that node of the render tree in the material.

When rendering, do “material.getStateNode().add(material);” for every object, then recurse the entire tree, drawing the objects held at every node. No sorting at run time.
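To make the idea concrete, here’s a minimal Java sketch of the tree, assuming a hypothetical Material interface with a state bitmask and a draw call (DP’s actual classes will differ):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical material: exposes which states it needs and how to draw itself.
interface Material {
    int getStateMask();       // bitmask of states this material requires
    StateNode getStateNode(); // the node cached from findNode() at init time
    void draw();              // issue the actual GL draw call
}

class StateNode {
    private final int stateBit; // e.g. one bit standing in for GL_TEXTURE_2D
    private final List<StateNode> children = new ArrayList<StateNode>();
    private final List<Material> bucket = new ArrayList<Material>();

    StateNode(int stateBit) { this.stateBit = stateBit; }

    boolean isStateAvailable(Material m) {
        return (m.getStateMask() & stateBit) != 0;
    }

    // Init time: recurse until no child supports the material. Children are
    // ordered most-expensive state first, matching the diagram's ordering.
    // Cache the returned node in the material so no search happens per frame.
    StateNode findNode(Material m) {
        for (StateNode child : children) {
            if (child.isStateAvailable(m)) {
                return child.findNode(m);
            }
        }
        return this;
    }

    // Render time: material.getStateNode().add(material) lands here.
    void add(Material m) { bucket.add(m); }

    // Enable this node's state, draw the bucket, recurse, disable, clear.
    void render() {
        // glEnable(...) for stateBit would go here
        for (Material m : bucket) m.draw();
        for (StateNode child : children) child.render();
        // glDisable(...) for stateBit would go here
        bucket.clear(); // empty the bucket for the next frame
    }
}
```

The depth-first traversal means a child’s state is enabled while its parent’s is still active, which is exactly the top-to-bottom ordering in the diagram.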

Now the only limitation of this is texture IDs: glBindTexture can also trigger a validation, at a cost equivalent to a state switch, so the only thing left to do is a proper sort on the children of the texture state node. Note the diagram is just a little subtree; next to texture you would also have lighting with its own depth and shader children, since the nodes describe glEnable calls.

What’s your clever cool idea/algorithm?

DP

My clever algorithm uses RadixSort too, to sort a triangle pump. You pour in as many Triangles as you like, and each Triangle is associated with a RenderingState.

The triangle pump then sorts the triangles on rendering state, in no concretely defined order, over up to “n” passes. The advantage of radix sort is that, being stable, it can be run sequentially any number of times, y’see. The final sorting pass at the very least sorts triangles by their first index to generally get the best automatic stripification possible (hint to GL programmers: triangle strips are no longer worth doing; just use GL_TRIANGLES and you’ll find the hardware vertex cache does it all for you).

Because figuring out which state changes are expensive and which aren’t is largely guesswork in OpenGL, the trick is not to think too hard about it. I think in terms of only three levels: memory uploads to GL, the most expensive - which basically means texture changes; then server-side state changes; then client-side state changes.

There are also levels above and below those that relate to the order in which things should be drawn - a simple integer for a rendering order, for example.
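For illustration, here’s roughly what one chainable pass of such a sort could look like (a sketch only; Tri and its fields are my stand-ins, not SPGL’s types). Because each pass is a stable counting sort, equal keys keep the order earlier passes gave them, so whichever pass runs last becomes the primary grouping:

```java
// A sketch of a chainable stable sorting pass over triangles, not SPGL's code.
class TriSort {
    interface Key { int of(Tri t); } // extracts a small non-negative sort key

    static class Tri {
        int stateId;    // index into a table of rendering states
        int firstIndex; // first vertex index of the triangle
    }

    // One stable counting-sort pass on keys in [0, keyRange).
    static Tri[] pass(Tri[] in, int keyRange, Key key) {
        int[] start = new int[keyRange + 1];
        for (Tri t : in) start[key.of(t) + 1]++;
        for (int i = 1; i <= keyRange; i++) start[i] += start[i - 1];
        Tri[] out = new Tri[in.length];
        for (Tri t : in) out[start[key.of(t)]++] = t;
        return out;
    }

    // First-index pass first (so it breaks ties within each state batch),
    // state pass last (so state grouping dominates and changes are minimised).
    static Tri[] sortForRendering(Tri[] tris, int numStates, int numVerts) {
        tris = pass(tris, numVerts, new Key() {
            public int of(Tri t) { return t.firstIndex; }
        });
        return pass(tris, numStates, new Key() {
            public int of(Tri t) { return t.stateId; }
        });
    }
}
```

A rendering-order integer would just be one more pass in the chain.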

Code’s all buried in SPGL somewhere.

Cas :slight_smile:

I thought the reason for using triangle strips was to reduce the data you send down the pipe? It would be around a third of the data, since after the first triangle each further triangle only needs one extra point…

Anyways, keep them coming!

DP

If you think about how the driver actually sends triangles you’ll realise why strips are an anachronism. These days people don’t use glBegin(GL_TRIANGLES); instead they use glDrawElements with an array of indices. The GPU keeps the last few converted and transformed vertices in a hardware cache, associated with their index numbers. So if you send a triangle 1, 2, 3 followed by another triangle 2, 3, 4, vertices 2 and 3 are already transformed and sitting in the cache, and the GPU needs no vertex data from the client apart from vertex 4 - in other words, exactly the same operation as if you’d sent a triangle strip, except the hardware did the work for you. So sort on one of the triangle indices and you’ll find that 90% of your triangles are automatically stripified. Result :slight_smile:
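A minimal sketch of that on-the-fly sort, assuming a plain GL_TRIANGLES index array (not Cas’s actual code):

```java
import java.util.Arrays;
import java.util.Comparator;

class AutoStrip {
    // Reorder the triangles of a GL_TRIANGLES index array by each triangle's
    // first index, so consecutive triangles tend to share cached vertices.
    static int[] sortByFirstIndex(final int[] indices) {
        Integer[] tri = new Integer[indices.length / 3];
        for (int i = 0; i < tri.length; i++) tri[i] = i;
        Arrays.sort(tri, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return indices[a * 3] - indices[b * 3];
            }
        });
        int[] out = new int[indices.length];
        for (int i = 0; i < tri.length; i++) {
            out[i * 3]     = indices[tri[i] * 3];
            out[i * 3 + 1] = indices[tri[i] * 3 + 1];
            out[i * 3 + 2] = indices[tri[i] * 3 + 2];
        }
        return out;
    }
}
```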

Cas :slight_smile:

It’ll only approach a third of the data if you get perfect stripification. If you look at the output from Stripe, that’s obviously not likely. Every strip begin/end adds quite a bit of overhead, as you either make another draw call or add in a whole bunch of degenerate tris.

[quote=“princec,post:4,topic:26859”]
sort on one of the triangle indices and you’ll find that 90% of your triangles are automatically stripified
[/quote]
90% sounds excessive. Finding the optimal index order for an arbitrary mesh is an NP-complete problem. There’s also the problem of different cache sizes on different GPUs.

The way I deal with this is a bit tricky, but might be interesting to some of you. When a dotXSI mesh is converted to our internal format, I calculate the average cache miss rate per triangle (ACMR). This is calculated by going through the indices, counting how many cache misses we had (for a given cache size) and dividing by the triangle count (IIRC). The result is a number between 0.5 and 3.0 (the lower the better). Then I run the indices through a little program that is used for solving NP-complete sorting problems (can’t remember which exactly, will re-post when I get back from Crete). The result is a new index order, which produces a new ACMR. This is repeated until it can’t lower the ACMR any further (a user-specified threshold applies here).

With the above technique I’ve seen models with a >2.0 ACMR fall to under 1.0. And even though the ACMR is calculated using a specific cache size, the final order is independent of it, so the optimization is general.
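For anyone wanting to reproduce the measurement, here’s a sketch of an ACMR calculation along those lines (the FIFO replacement policy and cache size are my assumptions; real GPU caches differ):

```java
import java.util.Arrays;

class Acmr {
    // Simulate a FIFO post-transform cache and count misses per triangle.
    static float acmr(int[] indices, int cacheSize) {
        int[] cache = new int[cacheSize];
        Arrays.fill(cache, -1);
        int head = 0, misses = 0;
        for (int index : indices) {
            boolean hit = false;
            for (int cached : cache) {
                if (cached == index) { hit = true; break; }
            }
            if (!hit) {
                misses++;
                cache[head] = index; // FIFO replacement
                head = (head + 1) % cacheSize;
            }
        }
        return misses / (indices.length / 3f); // lower is better, worst case 3.0
    }
}
```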

The nifty thing about simply sorting on triangle index A is that it can be done on the fly, is incredibly fast, amazingly simple to understand, and 95% as effective in 95% of situations :slight_smile:

Cas :slight_smile:

Just wondering: would that work with lots of polys? Isn’t it a bit expensive to sort tens of thousands of tris every frame?

Nope, it’s a very inexpensive operation to perform considering the performance gains on the GPU, and the fact that it’s just the last step in a chain of radix sorts (imagine you had 6 state layers: you’d add this sort at the end to make it 7, so total sorting time only goes up by a sixth). O’course you might want to benchmark it to be sure.

Cas :slight_smile:

Link to paper and software: Universal Rendering Sequences for Transparent Vertex Caching of Progressive Meshes. I’m using the recursive cut algorithm, although the MLA one should be a little better.

Cas, you’re right, sorting on index A is amazingly simple and fast, especially if it needs to be done at runtime. But it doesn’t even come close to what the above algorithms can do. The difference is tiny on simple meshes, but my tests on more complex meshes (> 1000 triangles) showed significant gains. For example, I didn’t see any ACMR under ~1.5 with index A sorting, whereas the recursive cut could easily get around 0.9.

Highly recommended for offline use. :wink:

I bet the sort might be even more effective if you sorted on the mean of the three vertex indices rather than just the first. Anyway, it’s the fastest technique I can think of that works in real time. Very handy if the entire scene is constructed from dynamic geometry.

Cas :slight_smile:

I was thinking of keeping two such sorted lists of OpenGL objects: one for opaque objects and one for blended objects.
My reasoning is that the blended objects always need to be drawn last, so you only have to depth-sort the blended list and do the other kinds of state sorting on the opaque list. You then simply draw the sorted opaque list first and the blended list last. Does this make sense?

Yes, that’s even pretty normal these days.

You can/should go even further and coarsely sort the objects in your opaque queue front-to-back, so that overdraw is reduced (the depth test rejects occluded fragments cheaply).

This sort can be as coarse as putting every object nearer than N metres in queue 1 and all others in queue 2, as in the sketch below.
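Put together, the two queues plus the coarse front-to-back bucketing might look like this (SceneObject and its fields are hypothetical, not from anyone’s actual engine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class RenderQueues {
    static final float NEAR = 50f; // coarse bucket boundary, arbitrary choice

    // Hypothetical scene object with a precomputed state key and view depth.
    static class SceneObject {
        int stateKey;    // encodes texture/state ids for state sorting
        float depth;     // distance from the camera, updated per frame
        boolean blended;
        void draw() { /* issue GL calls */ }
    }

    static void drawFrame(List<SceneObject> scene) {
        List<SceneObject> opaque = new ArrayList<SceneObject>();
        List<SceneObject> blended = new ArrayList<SceneObject>();
        for (SceneObject o : scene) (o.blended ? blended : opaque).add(o);

        // Opaque: near bucket before far bucket (less overdraw), state-sorted
        // within each bucket (fewer state changes).
        Collections.sort(opaque, new Comparator<SceneObject>() {
            public int compare(SceneObject a, SceneObject b) {
                int bucketA = a.depth < NEAR ? 0 : 1;
                int bucketB = b.depth < NEAR ? 0 : 1;
                if (bucketA != bucketB) return bucketA - bucketB;
                return a.stateKey - b.stateKey;
            }
        });

        // Blended: strictly back-to-front so blending composes correctly.
        Collections.sort(blended, new Comparator<SceneObject>() {
            public int compare(SceneObject a, SceneObject b) {
                return Float.compare(b.depth, a.depth);
            }
        });

        for (SceneObject o : opaque) o.draw();
        for (SceneObject o : blended) o.draw();
    }
}
```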

Another tip:
If you have lots of small low-poly items (like grass, small rocks, etc.), don’t render them on a per-item basis; instead use the data from the vertex arrays and the indices to construct one new geometry, with new indices, containing all items of that type (see the sketch below). It’s relatively easy to do, and in my scenes it gave a performance boost of a factor of 20. If you want controllable waving grass (no wicked vertex-shader stuff) without uploading new data to the GPU, try hardware-accelerated keyframe interpolation, which is also pretty easy to figure out.
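Here’s roughly what the batching step could look like in Java (Item and Batch are my own hypothetical holders; positions are assumed to be pre-transformed into world space):

```java
class Batcher {
    static class Item {
        float[] vertices; // x,y,z per vertex, already in world space
        int[] indices;
    }

    static class Batch {
        float[] vertices;
        int[] indices;
    }

    // Concatenate per-item vertex data and rebase each item's indices by the
    // running vertex offset, so all items go down in a single draw call.
    static Batch merge(Item[] items) {
        int floatCount = 0, indexCount = 0;
        for (Item it : items) {
            floatCount += it.vertices.length;
            indexCount += it.indices.length;
        }
        Batch b = new Batch();
        b.vertices = new float[floatCount];
        b.indices = new int[indexCount];
        int fOff = 0, vOff = 0, iOff = 0;
        for (Item it : items) {
            System.arraycopy(it.vertices, 0, b.vertices, fOff, it.vertices.length);
            for (int k = 0; k < it.indices.length; k++) {
                b.indices[iOff + k] = it.indices[k] + vOff; // rebase into big array
            }
            fOff += it.vertices.length;
            vOff += it.vertices.length / 3; // 3 floats per vertex
            iOff += it.indices.length;
        }
        return b;
    }
}
```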

[quote]hardware-accelerated keyframe-interpolation
[/quote]

What’s that? (Any pointers welcome.)

Use your creativity to send two vertex-arrays and two normal-arrays to the GPU, and interpolate between them in a vertex-shader. :-*
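One way to read that hint (my own sketch, not Overkill’s actual shader): feed keyframe A through the standard gl_Vertex/gl_Normal arrays, smuggle keyframe B in through spare texture coordinate units, and blend with a uniform weight. The attribute bindings here are assumptions:

```java
// Hypothetical GLSL 1.10 vertex shader, held as a Java string constant.
static final String KEYFRAME_LERP_VS =
      "uniform float weight;                      // 0.0 = frame A, 1.0 = frame B\n"
    + "void main() {\n"
    + "    // frame A arrives in gl_Vertex/gl_Normal, frame B as texcoords 1, 2\n"
    + "    vec4 pos    = mix(gl_Vertex, vec4(gl_MultiTexCoord1.xyz, 1.0), weight);\n"
    + "    vec3 normal = mix(gl_Normal, gl_MultiTexCoord2.xyz, weight);\n"
    + "    gl_FrontColor  = vec4(normal * 0.5 + 0.5, 1.0); // or feed your lighting\n"
    + "    gl_TexCoord[0] = gl_MultiTexCoord0;             // pass real UVs through\n"
    + "    gl_Position = gl_ModelViewProjectionMatrix * pos;\n"
    + "}\n";
```

On the Java side you’d point texcoord units 1 and 2 at frame B’s position and normal arrays, and set the weight uniform once per frame.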

P.S. Knowing there *is* a solution to a problem makes the brain more creative in finding ways to discover it.

P.P.S. Don’t waste your time on Google; nobody seems to have figured it out and published it yet.

Now I feel cheated, because you wrote “no wicked vertex-shader stuff” :stuck_out_tongue:

Using that kind of shader would ramp up the system requirements a bit, wouldn’t it? Can those Intel integrated shitsets (oops, typo) or something like a GF2 do it?

I meant ‘randomly’ offsetting vertices, or some sin() function that normally looks very artificial. That was the “wicked” part.

And yes, it might ramp up the requirements, but heck, you can have a fallback mode. With such gfx cards the poly count would normally be so low that the CPU might as well do the interpolation and resend the vertex array every frame, per item. Err, we were talking about stuff like animated grass; just make it static then.

Fallback modes = 2x the work. Far better to just stick to some nice minimum guaranteed set of specs and work to it. Well, maybe not better, just more effective use of your time. Or better still get someone at Xith / JME to do it :wink:

Cas :slight_smile:

Yea… I also think that there is very little point in making something work even faster on fast machines. It’s not like anyone would care whether they get 200 or 400 fps.

I just wondered about it because Kev said that keyframe interpolation was quite the performance sucker… even if it’s only a handful of low-poly models. And, well, even Q2 managed to do that just fine on really lame hardware. So I thought there might be some better way to do it (w/o using anything fancy).

[quote]So, I thought that there might be some better way to do it
[/quote]
Compare inefficient Java code (working on arrays / vec objects) to lean-and-mean SIMD-enabled C++ code: the performance difference is roughly a factor of 2-3.
Edit: this is based on actual real-world cases, using native libraries in Java.

Try rendering >256 keyframe-animated guys/trees, each with >2000 polys. The CPU will just be swamped with LERPs (a quad-core comes in handy), while the GPU laughs at it, easily pumping over 100 fps. At least that’s what my fairly ‘old’ 9700 Pro thinks of it. It beats bone-based animation by far, as the vertex shader is only 6 lines of trivial code (fetching data + 2x lerp (V+N)).

[quote]“Fallback modes = 2x the work”
[/quote]
It’s only a few minutes’ work to set up the fallback mode: you already have the indices of frame X and frame Y, and the weight (from the original code). You lerp the two FloatBuffers into a ‘global’ third one and update a VBO (or upload a VA), as in the sketch below. Double the peanuts.
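A sketch of that lerp, assuming all three buffers share the same layout (plain NIO, no GL calls):

```java
import java.nio.FloatBuffer;

class CpuLerp {
    // Blend frame A and frame B into dst; all three buffers must have the
    // same number of remaining floats. dst is then uploaded as VBO/VA data.
    static void lerp(FloatBuffer a, FloatBuffer b, FloatBuffer dst, float weight) {
        int n = a.remaining();
        for (int i = 0; i < n; i++) {
            float va = a.get(a.position() + i);
            float vb = b.get(b.position() + i);
            dst.put(dst.position() + i, va + (vb - va) * weight);
        }
    }
}
```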