z-sorting versus z-buffering

See this for an MT RadixSort and see how you get on: http://www.codercorner.com/blog/?p=90

Cas :slight_smile:

Since particles have high temporal coherence in their sorted order between frames (as long as the camera’s not teleporting around), just a single bubble-sort pass per execution unit per frame seems more than reasonable, where on each odd execution the chunks are offset by 50% to handle the borders between chunks.
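For illustration, a single chunked pass could look something like this (a sketch, not from the post; `BubblePassDemo` and the depth array are made-up names):

```java
class BubblePassDemo {
   // One bubble pass over a chunk of an almost-sorted depth array. With high
   // temporal coherence a pass or two per frame keeps particles nearly in
   // order; running odd frames with chunk boundaries offset by half a chunk
   // lets values migrate across the borders between chunks.
   static boolean bubblePass(float[] depth, int from, int to) {
      boolean swapped = false;
      for (int i = from; i < to - 1; i++) {
         if (depth[i] > depth[i + 1]) {
            float t = depth[i];
            depth[i] = depth[i + 1];
            depth[i + 1] = t;
            swapped = true;
         }
      }
      return swapped;
   }
}
```

Repeating the pass until it reports no swaps fully sorts the chunk, but the point here is that one pass per frame is usually enough to keep an almost-sorted array almost sorted.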

The radix sort implementation I’m using seems pretty good at dealing with temporal coherence, but I don’t know if it’s more or less efficient than a bubble sort. Then again I do more than just particles - the whole sprite engine goes through it and sorts on about 8 different factors. I think the fact that it sorts, hm, 5000-odd sprites on 8 variables in something like a millisecond means it’s fast enough for my purposes though :slight_smile:

Cas :slight_smile:
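Sorting on several factors like that can be done by chaining stable passes from least- to most-significant key - the same idea an LSD radix sort is built on. A minimal sketch with a hypothetical `Sprite` type (not the poster’s actual engine code):

```java
import java.util.Arrays;
import java.util.Comparator;

class MultiKeySortDemo {
   // Hypothetical sprite with two of the ~8 sort keys mentioned above.
   static class Sprite {
      final int layer, depth;
      Sprite(int layer, int depth) { this.layer = layer; this.depth = depth; }
   }

   // Arrays.sort on objects is stable, so chaining passes from the
   // least-significant key (depth) to the most-significant key (layer)
   // yields the combined (layer, then depth) ordering.
   static void sort(Sprite[] sprites) {
      Arrays.sort(sprites, Comparator.comparingInt((Sprite s) -> s.depth));
      Arrays.sort(sprites, Comparator.comparingInt((Sprite s) -> s.layer));
   }
}
```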

In Java 7 the dual-pivot quicksort algorithm became the default implementation of Arrays.sort(…) for primitive arrays.

Having said that, I gave CPU sorting a try on this massive dataset:

Random r = new Random();
int[] data = new int[256 * 1024 * 1024]; // 1GB RAM
r.setSeed(1337L);
for (int i = 0; i < data.length; i++) {
   data[i] = r.nextInt();
}

Results:


Arrays.sort(data); // 31.4s
QuadCoreArrays.sort(data); // 12.2s

The input is split in 4 ranges and sorted concurrently on 4 cores (5220ms).
Two pairs are merged on 2 cores (3599ms).
Final pair is merged on 1 core (3381ms).
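The first of those three stages could be sketched like this - a guess, since QuadCoreArrays itself isn’t shown (the class and method names here are made up):

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelSortDemo {
   // Stage 1 of the scheme above: sort equal ranges of the array concurrently.
   // Stages 2 and 3 (the pairwise merges) would follow on the result.
   static void sortRanges(final int[] data, int parts) {
      ExecutorService pool = Executors.newFixedThreadPool(parts);
      int chunk = data.length / parts;
      for (int i = 0; i < parts; i++) {
         final int from = i * chunk;
         final int to = (i == parts - 1) ? data.length : from + chunk;
         pool.execute(new Runnable() {
            public void run() {
               Arrays.sort(data, from, to); // dual-pivot quicksort for int[]
            }
         });
      }
      pool.shutdown();
      try {
         pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all ranges
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
      }
   }
}
```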

It uses an auxiliary buffer for the merging, doubling the memory usage. In single-threaded code it would look like:

int[] aux = new int[data.length];
mergeInAux(data, data.length / 4 * 0, data.length / 4 * 1, data.length / 4, aux);
mergeInAux(data, data.length / 4 * 2, data.length / 4 * 3, data.length / 4, aux);
mergeInAux(aux,  data.length / 2 * 0, data.length / 2 * 1, data.length / 2, data);
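mergeInAux isn’t shown in the post; judging from the call sites, its signature looks like (src, offsetA, offsetB, runLength, dst), merging two equal-length sorted runs of src into dst starting at offsetA. A guess at what it might look like:

```java
class MergeDemo {
   // Inferred from the call sites above (not the poster's actual code):
   // merges the sorted runs src[fromA, fromA+len) and src[fromB, fromB+len)
   // into dst[fromA, fromA+2*len).
   static void mergeInAux(int[] src, int fromA, int fromB, int len, int[] dst) {
      int a = fromA, b = fromB, out = fromA;
      int endA = fromA + len, endB = fromB + len;
      while (a < endA && b < endB) {
         dst[out++] = (src[a] <= src[b]) ? src[a++] : src[b++];
      }
      while (a < endA) dst[out++] = src[a++]; // drain whichever run remains
      while (b < endB) dst[out++] = src[b++];
   }
}
```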

31.4s -> 12.2s = 2.57x

That’s pretty awesome! Are you using Arrays.sort(…) to sort the data on each CPU? Multithreading the merging too would probably improve the scaling a bit more though.

Also, minor typo in the merging code you posted. The second argument on the second line should be data.length / 4 * 3, not 2… :wink:

Yes, Arrays.sort(values, fromIndex, toIndex) is pretty much the most convenient thing to use in this case.

Multithreading merging is actually not that hard, but not trivial either.

You probably want to generalize the situation, ending up with something like:


public static void streamMerge(Queue<int[]> input1, Queue<int[]> input2, Queue<int[]> output, int outputChunkSize)

It would probably cut the final stage in half, shaving off 1.6s from the total 12.2s. It’s hardly worth the effort.

Typo indeed. This was not part of the actual code :slight_smile:

What?! That’s like a 10-15% speed increase! It’s a crime not to implement it!!! xDDD

I just tried. The overhead of the additional data-copy is so large that it’s slightly slower than not multithreading the merge.

For your entertainment: (and maybe to point at the stupidity / performance hog…)


   static void streamMerge(SimpleBlockingQueue<int[]> input1, SimpleBlockingQueue<int[]> input2, SimpleBlockingQueue<int[]> output, int maxOutputChunkSize) {
      int[] data1 = input1.take();
      int[] data2 = input2.take();
      
      int[] aux = new int[maxOutputChunkSize];
      int off = 0;
      int read1 = 0;
      int read2 = 0;
      int end1 = data1.length;
      int end2 = data2.length;
      
      while (true) {
         
         while (((maxOutputChunkSize - off - 1) | (end1 - read1 - 1) | (end2 - read2 - 1)) >= 0) {
            int a = data1[read1];
            int b = data2[read2];
            if (a < b) {
               aux[off++] = a;
               read1++;
            } else {
               aux[off++] = b;
               read2++;
            }
         }
         
         if (off > 0) {
            output.put(Arrays.copyOf(aux, off));
            off = 0;
         }
         
         if (read1 == end1) {
            data1 = input1.take();
            if (data1 == null) {
               do {
                  output.put(Arrays.copyOfRange(data2, read2, data2.length));
                  read2 = 0;
               } while ((data2 = input2.take()) != null);
               break;
            }
            read1 = 0;
            end1 = data1.length;
         }
         
         if (read2 == end2) {
            data2 = input2.take();
            if (data2 == null) {
               do {
                  output.put(Arrays.copyOfRange(data1, read1, data1.length));
                  read1 = 0;
               } while ((data1 = input1.take()) != null);
               break;
            }
            read2 = 0;
            end2 = data2.length;
         }
      }
      
      // signal done
      output.put(null);
   }
   private static void sort3(final int[] values) {
      
      final SimpleBlockingQueue<int[]>[] inputs = new SimpleBlockingQueue[4];
      for (int i = 0; i < inputs.length; i++) {
         inputs[i] = new SimpleBlockingQueue<int[]>();
      }
      
      final SimpleBlockingQueue<int[]>[] outputs = new SimpleBlockingQueue[2];
      for (int i = 0; i < outputs.length; i++) {
         outputs[i] = new SimpleBlockingQueue<int[]>();
      }
      
      final SimpleBlockingQueue<int[]> target = new SimpleBlockingQueue<int[]>();
      
      for (int i = 0; i < 4; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               
               int from = values.length / 4 * (j + 0);
               int to = values.length / 4 * (j + 1);
               
               Arrays.sort(values, from, to);
               
               inputs[j].put(Arrays.copyOfRange(values, from, to));
               inputs[j].put(null);
            }
         });
      }
      
      final int chunkSize = 1024 * 1024;
      
      for (int i = 0; i < 2; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               streamMerge(inputs[j * 2 + 0], inputs[j * 2 + 1], outputs[j], chunkSize);
            }
         });
      }
      
      for (int i = 0; i < 1; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               streamMerge(outputs[j + 0], outputs[j + 1], target, chunkSize);
            }
         });
      }
      
      int offset = 0;
      for (int[] data; (data = target.take()) != null; offset += data.length) {
         System.arraycopy(data, 0, values, offset, data.length);
      }
   }

I am entertained. =D

I’ve got a bigger better faster multi-threaded merger now.

It does something like this:

  • It concurrently sorts 4 chunks of the input
  • It samples the chunks at a fixed interval, to see how it’s best to split each chunk into 8 smaller parts
  • It takes the average of the previous step and actually splits each chunk into 8 parts:
      • chunks[n].parts[0] holds all smallest values,
      • chunks[n].parts[7] holds all largest values
  • It concurrently merges all chunks[n].parts[k] into temporary array temps[k]:
      • temps[0] holds all (ordered) smallest values
      • temps[7] holds all (ordered) largest values
  • It concatenates temps[n] into input

Time: 11.4s :cranky:

31.4s -> 11.4s = 2.75x

–err.

Going to post new results ‘soon’ !

I bow before Thee, OH Great Sorting God !

bows

Very interesting thread, now I just have to do something that requires sorting :confused:

Signed shifting will give you -1 & 0, then the multiplies can be converted into ANDs.
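Applied to the merge’s inner comparison, that trick looks something like this (a sketch; it’s only safe while a - b can’t overflow, e.g. for bounded keys):

```java
class BranchlessDemo {
   // mask is -1 (all bits set) when a < b, else 0, so the ANDs select the
   // smaller of the two values without a branch.
   static int selectSmaller(int a, int b) {
      int mask = (a - b) >> 31;        // signed shift of the sign bit
      return (a & mask) | (b & ~mask);
      // in the merge loop the same mask advances the indices branch-free:
      // read1 -= mask;  read2 += 1 + mask;
   }
}
```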

I already changed it to be like that, but there was a serious bug, so I removed the code temporarily.