Mithrandir's state sorting post

mithrandir · October 20, 2004, 1:41pm

After doing your transparency sorting, when dropping in the objects for transparency rendering, continue to apply the basic state sorting to it. Large outdoor scenes that have hundreds of the same object (eg tree) don’t need to go into texture state changes most of the time. We’d turned state sorting off completely for the transparent objects (as is recommended by every text that you read about it). Using a partial sort test after the depth sorting made a huge difference.

Edit: Doh… button, wanted preview not send…

Anyway, what I was saying… We run a lot of the Planet9 data on our applications for various things. In the Elumens dome mode (which does 3 or 4 render passes per final output pass) we jumped from averaging 20 FPS to 33FPS just through this simple change. In non-dome mode we’ve jumped from 65FPS to 100+FPS. These are the really big outdoor scenes like Washington DC with the 2Kx2K textures etc. Smaller stuff like the various downtown locations (San Diego, San Fran etc) got less of a speed boost because there’s less transparent objects. In the D.C test, about 50% of the objects have some form of transparency.

princec · October 20, 2004, 2:36pm

Interesting insight.

Cas

shawnkendall · October 20, 2004, 3:25pm

[quote]Just use whatever collection is suitable (Set, ArrayList, whatever) and use Collections.sort(). Then either write a Comparator class to sort your rendering objects or build it in to the primatives themselves.
[/quote]
BTW, we did this in the engine in the book, and used a ZDepthComparator class for comparing - you guessed it - Z-depth (this is for transparency sorting). It worked famously and was very fast in our all alpha and mixed alpha scene tests.

princec · October 20, 2004, 5:43pm

I’ve got a triangle sorter buried in SPGL somewhere. I tried using Collections.sort() and comparators and it was basically dog slow. I instead went for radix sort and it was an order of magnitude faster, especially as the number of triangles increased.

Cas

mithrandir · October 20, 2004, 7:42pm

Well, you need to compare like algorithms. The collections built in sort() is just qsort, which is the fastest implementaiton of a comparsion sorter (O(nlogn)). Radix sort is not comparison based, and thus can be a linear-time sort (O(n)). It’s not an order of magnitude faster, but there’s a non-linear difference in speed as the number of objects increases.

Depth sorting, because of the single variable, is easy to turn into a linear sort using any of the counting-based sorting algorithms. However, state sorting is another thing entirely. I’ve not yet come across any way of generating a meaningful hashing strategy that would allow the use of a counting-sort for state sorting.

Orangy_Tang · October 20, 2004, 8:11pm

[quote]However, state sorting is another thing entirely. I’ve not yet come across any way of generating a meaningful hashing strategy that would allow the use of a counting-sort for state sorting.
[/quote]
I always get my sorting algos confused, but I don’t see why you can’t crowbar rendering state into a format suitable for radix sort? Just packing various state ids into one larger 32bit value, something like:
SSSS PPPP SSSS RRRR
where S is your bits for shader id, then primary texture, secondary texture, then specific rendering state (alpha test etc.). I think a single 32 bit value would be enough (unless you’ve got silly amounts of textures?).

Feel free to point out the great big gaping hole in my idea now.

princec · October 20, 2004, 8:23pm

No need, that’s exactly how I’ve done it, and it works well.

Cas

shawnkendall · October 21, 2004, 12:43am

If you limit the number of possible render state combinations to something like a palette system, then you can bin objects by your palette id in something like bit wise blocks but…

In practice, that is just too limiting for the general case. Evetually you’ll want to do it with “shader” objects (shader code, support texture and vertex data etc) then 32-bit index palette should work pretty good BUT…

You still want to minimize the state changes between shader changes and often trying to get that to happen too can make the original bulk shader objects look like they should be split into render state chunks that can be mixed and matched.

Again in the truly general case, it’s pretty ugly to get near optimal state sorting. If you have a domain specific engine (i.e. terrain, characters etc) you can force some short cuts, which is why you still see things like terrain rendering engines as a seperate part of render engines.

[mod spelling]

mithrandir · October 21, 2004, 1:02am

OrangyTang, it’s close to impossible to do it meaningfully as far as I can tell. For example, let’s look at material state itself.
Firstly you have say at least 4 different colour values - emissive, ambient, difffuse, specular, two sided and a few others. So you need a minimum of 4 bits for say 4 colour values that can be set. Each of those colour values is a floating point number, and has 4 of them. Now you can hash a single floating point number into an in (between 0 and 1) bot now you have 4 per colour that you need to hash. Why? Well, you want to state sort for any material state change. Just changing the emissiveColor and leaving the other 4 alone is a costly state change, let alone change all 4.

One of the primary goals of state sorting it is minimise the amount of change that you need to make when you sort. Thus, ifyou have three material objects where two have the same emissive and a different two have the same diffuse, then the most efficient state traversal is to start with one pair, put the middle that has the common emissive and diffuse, and then final object is the one that didn’t have the first state. Trying to encapsulate 4 different state values for two different floats as hashed values already blows you past the 32 available bits - even if you decided to cheat and quantise float colours to 8bit int (that’s still a single 32bit int for a single colour). Now, you’ve got hundreds of other states to start considering for state sorting - you can’t just compare texture objects, for example. A lot of stuff ends up actually using the same texture bytes, just with different filtering states or blend or wrap etc etc. So you end up needing to, potentially compare thousands of different things every frame. In the end, it just doesn’t work out into anything sane, so comparison-sorts are the only way to do this. Luckily, using good culling strategies, you can minimise the amount of stuff that you have to sort, so the O(nlogn) timing doesn’t really hit that hard for reasonably culled scenes.

princec · October 21, 2004, 6:53am

[quote]In practice, that is just too limiting for the general case.
[/quote]
In practise there is no general case, there is a specific case to solve. For any given instance of a major scene graph node I decide on how many kinds of state there are to worry about ahead of time.

[quote]Trying to encapsulate 4 different state values for two different floats as hashed values already blows you past the 32 available bits
[/quote]
My radix sorting is already based on 128 bits in little ol’ Alien Flux! The radix sort algorithm is very cunningly adaptable.

I know it’s not a perfect solution but the sort speed vastly outweighs occasional overlap in GL state changes. Many GL state changes are actually much less expensive than you might think. The only truly expensive ones are texture binds as they may incur a texture upload and otherwise zap any hardware memory cacheing.

I realise my fixed-known-problem approach may not cut it for a truly general purpose scenegraph but it would work well for a situation such as, say, Doom3.

Cas

Spasi · October 21, 2004, 10:06am

Packing states in ints should not be necessary in a general case. You can take advantage of the radix sort properties and call it successively, sth like:


radixReset(); // Reset the indices
radixSortFloat(objects, distanceKey);
radixSortInt(objects, state0);
radixSortInt(objects, state1);
...
radixSortInt(objects, stateN);
radixApplyIndices(object); // Apply the final indices

It looks a little ugly though, because you need to modify the algorithm to accept “Sortable” objects that can be sorted upon a certain “sort key” (distanceKey, state0, etc, in the above pseudo-example).

We use sth like this in Marathon, it’s faster than anything else we’ve tried and avoids the need for packing states.

shawnkendall · October 21, 2004, 10:11am

I guess is depends on what your practice is. Java3D is a general case graph and renderer, and does “unlimited” (not limited to palettes) state sorting, just like Open Scene Graph, Performer, OpenInventor, RenderWare, etc.

[mod]
Additionally, I can say within one execution session, our apps have switched quite drastically the number and arrangement of Appearance states, and therefore needed a general solution. Our 3D worlds are heavily data driven, giving as much control as possible to the content creation people, emulating that trend in professional game dev.

Already our artists are making the actual shader code in Maya with shader authoring tools. Very soon the “general” case will be the only case for pro dev, game middleware, etc, etc. (It’s already there for companies like E.A. - they bought Renderware are using it across there new line of games)

mithrandir · October 21, 2004, 12:55pm

[quote]Many GL state changes are actually much less expensive than you might think.
[/quote]
Right… and when did you last benchmark your code. In order to determine the best orders for sorting within each object, we took every class that we had in the scene graph and individually benchmarked each state change. Compared to setting it once and never changing again, versus setting every object, you get a minium of a 20% performance penalty - not including the differences of method call overhead into the basic object itself. A lot of this cost is the OpenGL call cost - that is, the pure fact that you’re even making the function call in the first place, when there was no need to (not to mention the JNI overheads for JOGL). If you’re only rendering under 100 objects per frame, then you probably won’t notice this, but for us, we’re typically rendering 5-10K objects every frame, even after culling and state sorting. Any state change is bade because that means extra objects on the render queue, so extra iterations of the render loop, more data to copy across every frame, more method/function calls to be made etc. There is far more to state sorting than just the simple opengl function call cost that you make it out to be.

princec · October 21, 2004, 1:02pm

Unfortunately that benchmark is invalidated under many conditions One version of a driver to the next, or one vendor to the next, or one OS to the next… the number of permutations is horrendous. So instead I have to apply a little “lowest common denominator” logic to it (eg. what’s the worst case? Uploading 12mb of textures? etc.) You can also wrap your states up clientside with a boolean guard and attempt to track them and ignore dud state changesthough this can be fraught with difficulty if you don’t have very thorough control of what’s going on and I suspect many of the later drivers might already optimize this case.

So you’re dead right but in practise you can get away with some leeway as the variations to be found in the field can invalidate carefully benchmarked conclusions.

Cas

mithrandir · October 21, 2004, 4:15pm

We have about 10 different video cards here from 3 different manufacturers - ATI, nVidia and 3DLabs. There’s 3 different desktop boxes, including 2 multiple CPU machines (one is a dual Xeon, running hyperthreading, so the appearance of 4 CPUs) and 3 different laptops (obviously these are fixed video cards so the combo can’t be tested). We got a darn good sampling of what the differences are and there isn’t much between them. The relative costs are still about the same between states. The absolute values are going to be different, but it’s the items that are more expensive on one card/cpu combo is stil the same “more expensive” on a different CPU/card combo. When state sorting, it’s the relative factors that matter most, not absolute.

Orangy_Tang · October 21, 2004, 5:35pm

[quote]The relative costs are still about the same between states. The absolute values are going to be different, but it’s the items that are more expensive on one card/cpu combo is stil the same “more expensive” on a different CPU/card combo. When state sorting, it’s the relative factors that matter most, not absolute.
[/quote]
Gonna tell us what they are then?

I only tend to sort on shaders then textures, but then again I’m generally using the same setup for all objects and don’t tinker with things like material colour much. I’d guess thats generally true for most games.

cylab · October 21, 2004, 9:01pm

Both can be true. If you hava some specialized application (game) where you are able to make some assumtions prior to rendering you can get away with state-ids and there might even be circumstances where the performance gain from using a faster sort algorithm based on such assumtions can outrun the performance gain from “proper” state sorting. Thats of course an opinion and cannot be proven with numbers right now. But it seems logical, since there a two interdepent variables in this equation…

I also tend to the position, that one should enforce “shortcuts” in rendering where ever applicable. I doubt shawnkendalls predication that “the general case will be the only case for pro dev”, or at least my definition of “pro dev” may differ But taken into account what seems to be the very nature of computer graphics - faking images with a lot of tricks, so that you have the illusion of real physics - it is… well… natural to tailor your application to the one special case so that you can get the best out of it.

See, i’ve got a job in J2EE development and if you know EJBs, you know that there are lots of “right” things to do that are a f**** pain in the ass, just to be prepared for an eventual requirement in the unforseen future…

I am not saying that EJBs are crap, they have their usage, but simpler approches have them, too. In my opinion this can be projected to software development in all areas, including computer graphics.

cya
cylab