z-sorting versus z-buffering

Just saw a tweet by Notch:
“z sorting lots of sprites is slow. z buffers are faster. z sorting when you have space partitioning is even faster as the buckets are tiny.”

When I need z-layering in a landscape like Minicraft or Pokemon, I’ve always just sorted a list every frame. Maybe that’s not what the tweet is saying, but isn’t that the fastest way? What are z-buffers, and how do we use them?

Does anyone mind elaborating on this very tersely worded tweet?

Z-buffering is a technique used in 3D rendering which allows you to render your stuff in any order you want, and only the stuff closest to the screen will be visible. It works by keeping a z-value for each pixel, and every time a polygon covers a pixel a “depth test” is performed and only if the new sample is closer to the screen than the already stored sample will the pixel be updated.
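To make that concrete, here is a toy software-rasterizer-style sketch of the per-pixel depth test (the buffer size, names and the smaller-z-is-closer convention are just for illustration):

```java
/** Toy software depth buffer; smaller z = closer (the convention is arbitrary). */
class DepthBufferDemo {
    static final int W = 320, H = 240;
    static final float[] depth = new float[W * H]; // one z-value per pixel
    static final int[] color = new int[W * H];

    static { java.util.Arrays.fill(depth, Float.POSITIVE_INFINITY); }

    // The "depth test": the pixel only changes if the new sample is closer
    // than the already stored one.
    static void writePixel(int x, int y, float z, int rgb) {
        int i = y * W + x;
        if (z < depth[i]) {
            depth[i] = z;
            color[i] = rgb;
        }
    }

    public static void main(String[] args) {
        writePixel(10, 10, 5.0f, 0xFF0000); // far red sample is stored
        writePixel(10, 10, 2.0f, 0x00FF00); // closer green sample wins
        writePixel(10, 10, 8.0f, 0x0000FF); // farther blue sample is rejected
        System.out.printf("pixel = %06X%n", color[10 * W + 10]); // 00FF00
    }
}
```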

He’s talking about using the depth buffer (“z-buffer” is an archaic term for the depth buffer) vs using the painter’s algorithm and manually painting the furthest objects first, closest last. For fully opaque geometry there’s simply no question: the depth buffer will run circles around everything else.
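For comparison, the painter’s algorithm is literally just a sort followed by back-to-front drawing; a minimal sketch where Sprite and the draw call are made-up stand-ins for whatever the engine actually uses:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Painter's algorithm sketch: sort back-to-front, then draw in that order.
class PainterDemo {
    static class Sprite {
        final float z; final String name;
        Sprite(float z, String name) { this.z = z; this.name = name; }
    }

    static void render(List<Sprite> sprites) {
        // Farthest (largest z) first, so closer sprites paint over it.
        Collections.sort(sprites, new Comparator<Sprite>() {
            public int compare(Sprite a, Sprite b) { return Float.compare(b.z, a.z); }
        });
        for (Sprite s : sprites) System.out.println("draw " + s.name);
    }

    public static void main(String[] args) {
        List<Sprite> list = new ArrayList<Sprite>();
        list.add(new Sprite(1f, "near"));
        list.add(new Sprite(9f, "far"));
        list.add(new Sprite(5f, "mid"));
        render(list); // draws far, mid, near
    }
}
```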

Sprites, however, are rarely fully opaque. The problem comes with blending, which doesn’t interact well with depth testing and for many blend modes requires z-sorting to look right (particle systems are the big culprit here).

There are also advanced techniques (over my head) using the depth buffer to render “soft particles” (read: fire and smoke) but I don’t think he was talking about that, and you can’t really get away with z-sorting every smoke particle anyway.

When you have static sprites you can presort them quite easily.
If, for example, your sprites are laid out only in 2D, there are only 8 possible sorting orders, and you can even halve that number because of symmetry. In 3D you would have 32 => 32/2 = 16 sorting orders.

Also, this helps when you have something where the sorting order changes very slowly; presorting can then be faster than anything else. Remember that you won’t have to re-sort when the camera changes, you just switch to the matching presorted order.
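As a sketch of what such presorting could look like, assuming static 2D sprites and, for brevity, only the 4 axis-aligned view directions rather than all 8 (positions are made up):

```java
import java.util.Arrays;
import java.util.Comparator;

// Presorting sketch: compute one draw order per view direction up front;
// a camera change then just selects another presorted order.
class PresortDemo {
    static final float[] xs = {3, 1, 2};
    static final float[] ys = {0, 5, 2};
    static final int[][] orders = new int[4][]; // one order per direction

    // "Depth into the screen" of sprite i for a camera looking along
    // +x, -x, +y or -y respectively.
    static float depth(int dir, int i) {
        switch (dir) {
            case 0:  return xs[i];
            case 1:  return -xs[i];
            case 2:  return ys[i];
            default: return -ys[i];
        }
    }

    static void presortAll() {
        for (int dir = 0; dir < 4; dir++) {
            final int d = dir;
            Integer[] idx = {0, 1, 2};
            Arrays.sort(idx, new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    return Float.compare(depth(d, b), depth(d, a)); // farthest first
                }
            });
            int[] order = new int[idx.length];
            for (int i = 0; i < idx.length; i++) order[i] = idx[i];
            orders[dir] = order;
        }
    }

    public static void main(String[] args) {
        presortAll();
        int cameraDir = 2; // camera now looks along +y: no re-sort needed,
        System.out.println(Arrays.toString(orders[cameraDir])); // just look up [1, 2, 0]
    }
}
```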

Maybe you could just do a single iteration (or a few) of bubble sort each frame and hope that it stays sorted well enough?
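Something like this, perhaps; a minimal sketch with made-up data, doing one compare-swap pass per frame over an index list that is kept farthest-first:

```java
// "One bubble pass per frame" sketch: if depths only drift slightly between
// frames, a single pass per frame keeps the draw order nearly correct.
class BubblePassDemo {
    // order[] holds sprite indices, farthest-first; depth[] may change a
    // little every frame.
    static void bubblePass(float[] depth, int[] order) {
        for (int i = 0; i + 1 < order.length; i++) {
            if (depth[order[i]] < depth[order[i + 1]]) { // out of order: swap
                int t = order[i]; order[i] = order[i + 1]; order[i + 1] = t;
            }
        }
    }

    public static void main(String[] args) {
        float[] depth = {2f, 9f, 5f};
        int[] order = {1, 2, 0};   // correct last frame
        depth[2] = 10f;            // one sprite moved
        bubblePass(depth, order);  // order becomes {2, 1, 0}
        System.out.println(java.util.Arrays.toString(order));
    }
}
```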

I’m doing a lot of GPU sorting at the moment.

On GPUs radix sort is the fastest way to sort; there are plenty of open-source implementations out there for CUDA and OpenCL (e.g. libCL). You can sort around 500 million elements per second.
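For reference, here is what a plain single-threaded LSD radix sort looks like in Java, just to show the algorithm those CUDA/OpenCL libraries parallelize (this sketch handles non-negative ints, one counting pass per byte):

```java
import java.util.Arrays;

// LSD radix sort over 8-bit digits: 4 stable counting-sort passes, one per byte.
class RadixSortDemo {
    static void radixSort(int[] a) {
        int[] tmp = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;   // histogram
            for (int i = 0; i < 256; i++) count[i + 1] += count[i]; // prefix sum
            for (int v : a) tmp[count[(v >>> shift) & 0xFF]++] = v; // scatter
            System.arraycopy(tmp, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] a = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(a);
        System.out.println(Arrays.toString(a)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```

The histogram and scatter steps are exactly the parts the GPU versions run in parallel, bucket by bucket.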

Also, here’s what I like to do with particles. When I have a lot of particle emitters in my world, the particles normally come in a bunch of chunks, like when you have some fireplaces. In that case you can sort all particles hierarchically:
Each fire emitter has a bounding volume, so you sort the volumes first and after that the particles inside each volume. When some volumes intersect, you can merge-sort them together.
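A rough sketch of that hierarchical idea in Java, with made-up Emitter and particle types, and ignoring the intersecting-volume case for brevity:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hierarchical sort sketch: order the emitters' bounding volumes first,
// then only the particles inside each volume (each bucket is tiny).
class HierarchicalSortDemo {
    static class Emitter {
        float depth;                 // depth of the bounding volume's center
        List<Float> particles = new ArrayList<Float>(); // particle depths
    }

    static List<Float> sortAll(List<Emitter> emitters) {
        Collections.sort(emitters, new Comparator<Emitter>() {
            public int compare(Emitter a, Emitter b) {
                return Float.compare(b.depth, a.depth); // farthest volume first
            }
        });
        List<Float> out = new ArrayList<Float>();
        for (Emitter e : emitters) {
            Collections.sort(e.particles, Collections.reverseOrder());
            out.addAll(e.particles); // each tiny bucket sorted on its own
        }
        return out;
    }

    public static void main(String[] args) {
        Emitter near = new Emitter(); near.depth = 1f;
        near.particles.add(0.5f); near.particles.add(1.5f);
        Emitter far = new Emitter(); far.depth = 9f;
        far.particles.add(8f); far.particles.add(10f);
        List<Emitter> world = new ArrayList<Emitter>();
        world.add(near); world.add(far);
        System.out.println(sortAll(world)); // [10.0, 8.0, 1.5, 0.5]
    }
}
```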

PS: merge sort can be parallelized, just like bubble sort, when the merging step is done with pivot buckets.

Very interesting! I’ll definitely try to implement a radix sort in OpenCL some time! 500 million elements per second is probably enough, but that mostly depends on whether that figure is for a high-end card or a mid-range one. With 2 million particles at 60 FPS I’d need to sort 120 million particles per second, which is about 1/4 of what the GPU can manage if it works solely on sorting (right?). I also need to render them, which obviously takes time too. Since I can update and render my particles in about 14-15 ms, I only have 1-2 ms left over for sorting. If radix sort scales linearly with the number of elements, as the Wikipedia article says, then sorting at 1/4 of the card’s throughput means spending 1/4 of every frame sorting: 1/4 × 16.666 ms ≈ 4 ms. By these numbers, getting 2 million sorted particles running at 60 FPS seems impossible on my laptop, but I’d still like to try. xD

You also mentioned a parallel merge sort. I’ve written a single-threaded CPU merge sort, but how in the world can it be implemented with threads/OpenCL? What are pivot buckets?

Merge-sort is trivial to multithread (although I don’t know anything about OpenCL)

You just divide your set into N parts, sort them concurrently (using any sorting algorithm) and push the results into a queue.
Another (set of) thread(s) pops off pairs of sets, merges the ordered sets and pushes the result back into the queue.

In the end, you have 1 ordered set left.
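A quick Java sketch of that queue scheme, assuming a fixed number of runs so we know up front how many merges remain (the thread count and data are arbitrary):

```java
import java.util.Arrays;
import java.util.concurrent.*;

// Queue-based parallel merge sort: split into N runs, sort them concurrently,
// then repeatedly pop pairs of sorted runs, merge, and push the result back
// until one run remains.
class QueueMergeSortDemo {
    public static void main(String[] args) throws Exception {
        final int[] data = new int[1_000_000];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextInt();

        final int parts = 8;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        final BlockingQueue<int[]> queue = new LinkedBlockingQueue<int[]>();

        // Phase 1: sort each chunk concurrently and push it into the queue.
        int chunk = (data.length + parts - 1) / parts;
        for (int p = 0; p < parts; p++) {
            final int from = p * chunk, to = Math.min(data.length, from + chunk);
            pool.execute(new Runnable() {
                public void run() {
                    int[] run = Arrays.copyOfRange(data, from, to);
                    Arrays.sort(run);
                    queue.add(run);
                }
            });
        }

        // Phase 2: exactly parts-1 merges turn the runs into one sorted array.
        for (int m = 0; m < parts - 1; m++) {
            final int[] a = queue.take(), b = queue.take();
            pool.execute(new Runnable() {
                public void run() { queue.add(merge(a, b)); }
            });
        }
        int[] sorted = queue.take();
        pool.shutdown();
        System.out.println("sorted: " + isSorted(sorted));
    }

    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) if (a[i - 1] > a[i]) return false;
        return true;
    }
}
```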

Yes, I see how I can merge different pairs of sets in parallel, but at the end I just have two huge sets to merge. How do I merge those using multiple threads?

The last step is so fast that it doesn’t really matter that only 1 thread is merging them.

My reasonably informed guess is that memory is the bottleneck anyway. Throwing a bunch of threads at it won’t solve that problem.

If I have 2048 stream processors to keep busy, the last step sure as hell is a bottleneck. Memory bottlenecks can be improved by threading too if your CPU has hyperthreading. I’ve gotten slightly OVER 2.0x scaling in memory-limited programs on a dual-core! =D That should be because even if one thread stalls on a slow memory fetch from RAM, the core can work on the other one.

I’ve got a radix sort running. How do I thread it? =D

Say you have some kind of merge algorithm that merges two groups into one with a single thread.

Now, if you want to merge two big groups with more than one thread, you have to partition the two groups into as many sub-groups as there are threads to merge with.

Then you merge these new pairs of sub-groups with your normal merge algorithm.

Here’s a little example; I hope I can explain it.

group1:
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

group2:
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

In this example I want to merge with 4 threads, so I have to split up each group into 4 sub-groups.

I do it like this. First, from each group I pick 3 values which partition the group equally:
group1: 7 15 23
group2: 8 16 24

Now I sort these 6 values and get 7 8 15 16 23 24.

Now I want to merge the 1st sub-group of group1 with all elements of group2 which are smaller than 7.
I know that those values of group2 will be in the first sub-group of group2, because of how the partition values sort. So with a binary search I get the index of the last value in that sub-group which is smaller than 7, merge those two sub-groups, and put the result at the beginning of the output.

I do this with each sub-group of group1 and then copy the remaining elements of group2 to the output.

To get the index of the sub-group of group2 in which to search for a border value from group1’s partitioning, one can just subtract the border value’s index within its own group from its index in the sorted list.

E.g. the border 15 from group1 has index 1 among group1’s border values, but in the sorted list its index is 2. So 2 - 1 = 1 tells us that the biggest value of group2 which is smaller than 15 will be found in the 2nd sub-group of group2.

This should scale very well, because the overhead from sorting the border values and doing the binary searches is tiny: the number of values to sort and search is very small.

When doing a full merge sort, the number of groups halves with each merge pass, so one would double the number of sub-groups per merge to keep all threads busy.
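Going by that description, a sketch of the pivot-bucket merge in Java might look like this; the border values, binary searches and per-slice merges follow the example above, and it assumes group1 has at least as many elements as there are threads:

```java
import java.util.Arrays;

// Parallel merge of two sorted arrays: partition 'a' into T equal slices,
// binary-search each slice border in 'b', then merge the matching slice
// pairs independently, each at its own known output offset.
class ParallelMergeDemo {
    static int[] parallelMerge(final int[] a, final int[] b, int threads)
            throws InterruptedException {
        final int[] out = new int[a.length + b.length];
        Thread[] workers = new Thread[threads];
        int aFrom = 0, bFrom = 0;
        for (int t = 0; t < threads; t++) {
            final int af = aFrom, bf = bFrom;
            // Border values partition 'a' equally (assumes a.length >= threads).
            final int at = (t == threads - 1) ? a.length
                                              : a.length * (t + 1) / threads;
            // Everything in 'b' smaller than a's border belongs to this slice.
            final int bt = (t == threads - 1) ? b.length : lowerBound(b, a[at - 1]);
            workers[t] = new Thread(new Runnable() {
                public void run() { merge(a, af, at, b, bf, bt, out, af + bf); }
            });
            workers[t].start();
            aFrom = at; bFrom = bt;
        }
        for (Thread w : workers) w.join();
        return out;
    }

    // Ordinary sequential merge of a[aFrom..aTo) and b[bFrom..bTo) into out at o.
    static void merge(int[] a, int aFrom, int aTo,
                      int[] b, int bFrom, int bTo, int[] out, int o) {
        while (aFrom < aTo && bFrom < bTo)
            out[o++] = (a[aFrom] <= b[bFrom]) ? a[aFrom++] : b[bFrom++];
        while (aFrom < aTo) out[o++] = a[aFrom++];
        while (bFrom < bTo) out[o++] = b[bFrom++];
    }

    // First index whose value is >= key (binary search on a sorted array).
    static int lowerBound(int[] arr, int key) {
        int lo = 0, hi = arr.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (arr[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] g1 = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31};
        int[] g2 = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32};
        System.out.println(Arrays.toString(parallelMerge(g1, g2, 4)));
    }
}
```

With the groups from the example and 4 threads, the border 7 sends {2, 4, 6} to the first slice and the border 15 lands in group2’s 2nd sub-group, just as described.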

@Danny02
Awesome! I’ll need some time to let this sink in though… I think I understand some of it. Maybe.

The problem is that many advanced sorting algorithms become less effective on smaller sets. If you’re really dealing with sets of size 6, you probably want bubble sort, even though it’s O(n²).

Maybe it wasn’t clear enough, but I was speaking about a CPU solution. :slight_smile:

Of course, one would tackle such small problem sizes as 6 with something like an insertion sort, a bitonic sort, or a perfect sorting network.
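For instance, a perfect sorting network for exactly 4 elements needs just 5 compare-exchanges; a minimal sketch:

```java
import java.util.Arrays;

// A 5-comparator sorting network for exactly 4 elements: a fixed sequence of
// compare-swaps, which is branch-friendly and ideal for tiny fixed sizes.
class SortingNetworkDemo {
    static void cmpSwap(int[] a, int i, int j) {
        if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }

    static void sort4(int[] a) {
        cmpSwap(a, 0, 1); cmpSwap(a, 2, 3); // sort both pairs
        cmpSwap(a, 0, 2); cmpSwap(a, 1, 3); // min of mins, max of maxes
        cmpSwap(a, 1, 2);                   // fix the middle
    }

    public static void main(String[] args) {
        int[] a = {7, 2, 9, 4};
        sort4(a);
        System.out.println(Arrays.toString(a)); // [2, 4, 7, 9]
    }
}
```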

On the other hand, one would of course use the parallel merge only for bigger problems, because its only purpose is to let you sort a very large set by splitting the big problem into smaller chunks and keeping everything busy while doing it.

Radix sort also depends on some kind of splitting of the problem into buckets, which likewise makes it suitable for parallel sorting.

PS: I don’t have exact numbers, but one wouldn’t sort sets smaller than a few thousand elements on a GPU or in parallel anyway :slight_smile:

Well, that’s a lot better than a sort that gets less effective the more elements you have. :wink:

If you’re sorting something with less than a few hundred elements in it why worry about the performance anyway…

FWIW Radixsort for the win. Don’t bother threading it, it’s already fast as hell.

Cas :slight_smile:

Inb4 you realize I want to sort on the GPU… I’d prefer to use all of my 192 stream processors, not just 1. The Radeon HD 7970 even has 2048 stream processors… =3

Why not use them for something else?

… and how come you’ve got so much stuff to sort anyway?

Cas :slight_smile: