Parallel insertion sort for almost sorted data sets

Hello.

Sorting has several important uses in games, but performance usually isn't critical since the data sets being sorted tend to contain only a small number of elements. However, there are cases where good performance is necessary. One of those cases is the broad phase of collision detection. The idea is simple: Sort your objects along one of the axes (say the X axis) and only process objects that overlap along this axis. In this case, sorting performance is obviously very important since the sorting part can easily get more expensive than the collision detection it’s supposed to optimize.

Sorting is difficult to do in parallel on multiple cores though. There are a few algorithms that can be partially threaded on the CPU (merge sort) and even some that can be run on GPUs (radix sort), but they generally do not scale that well with more cores. They are however much faster than a single-threaded solution. Sorting for a broad phase has a very important quirk though: We can assume that the order of the objects does not change drastically between each collision detection pass due to the nature of how the objects move around. In the case of a space game with objects moving slowly through space, this is even more true.

Although O(n*log(n)) sorting algorithms like merge sort are fast for large data sets with randomly ordered elements, they’re not optimal for almost sorted data sets. A simple insertion sort is actually a very good choice in this case, since insertion sort runs in O(n) if the list is already perfectly sorted. Moving elements a small distance is also cheap, which fits my problem description perfectly. Insertion sort cannot, however, be easily parallelized, since a single element may need to be moved from the end of the array to the beginning, and the algorithm relies on all elements before the one being processed already being sorted. However, we can make the assumption that almost all elements will only need to be moved a small distance, and we can optimize for this special case.
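As a reminder, that adaptive behaviour comes from the inner loop doing no work when an element is already in place — here is a minimal, generic-free sketch (not the InsertionSort.java linked below, just the textbook algorithm):

```java
import java.util.Arrays;

public class InsertionSortSketch {
    // Standard insertion sort over a[from, to). Runs in O(n) on an
    // already-sorted range, because the inner while-loop exits
    // immediately for every element.
    static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int value = a[i];
            int j = i - 1;
            // Shift larger elements one slot to the right.
            while (j >= from && a[j] > value) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 2, 5, 4};
        insertionSort(a, 0, a.length);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```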

The idea is relatively simple. We split up the array into evenly sized chunks and use insertion sort to sort these chunks in parallel. Then we identify “overlaps” between chunks that need to be sorted to fix the ordering, and if necessary we run a second pass of insertion sorts on these overlaps. A simple algorithm is used to identify overlaps which conservatively sorts the array to guarantee correct order. The worst case is if the array is completely random, in which case the algorithm identifies the whole array as a single overlap area and the algorithm breaks down to a slow single-threaded insertion sort.
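To make the structure concrete, here is a rough sketch of the two passes using a plain ExecutorService instead of my threading library. The overlap pass is deliberately simplified: it just re-sorts across any out-of-order chunk boundary and repeats until no boundary is violated, whereas the real algorithm computes minimal overlap regions once. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class ChunkedSortSketch {

    // Pass 1: insertion-sort evenly sized chunks in parallel.
    // Pass 2 (simplified): wherever a chunk boundary is out of order,
    // re-sort across the two neighbouring chunks, repeating until no
    // boundary is violated. This conservatively guarantees a sorted
    // result; a fully random array degenerates to (repeated)
    // single-threaded insertion sorts, as described above.
    static void sortChunked(int[] a, int numChunks, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        int n = a.length;
        List<Future<?>> futures = new ArrayList<>();
        for (int c = 0; c < numChunks; c++) {
            int from = (int) ((long) n * c / numChunks);
            int to   = (int) ((long) n * (c + 1) / numChunks);
            futures.add(pool.submit(() -> insertionSort(a, from, to)));
        }
        for (Future<?> f : futures) f.get(); // wait for all chunk sorts

        boolean changed = true;
        while (changed) {
            changed = false;
            for (int c = 1; c < numChunks; c++) {
                int b = (int) ((long) n * c / numChunks);
                if (b > 0 && b < n && a[b - 1] > a[b]) {
                    int from = (int) ((long) n * (c - 1) / numChunks);
                    int to   = (int) ((long) n * (c + 1) / numChunks);
                    insertionSort(a, from, to);
                    changed = true;
                }
            }
        }
    }

    static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int v = a[i], j = i - 1;
            while (j >= from && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }
}
```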

Here is a visualization of the algorithm.

To generate test data, we start with a sorted array.

We then shuffle around the data a bit to simulate the changing order of elements.

We split it up into chunks…

and sort them concurrently on multiple cores.

We identify the overlaps…

and sort them as well.

We’re done!

Performance of this algorithm is excellent. The following are the results of sorting 4 identical arrays using 4 different algorithms. The first value is for an array shuffled in the same manner as the visualization above, while the second one simply processes an already sorted list. The result is verified to be identical for all 4 algorithms (this verification is not included in the timings below). The test was run on an i7-4770K with Hyperthreading enabled. When using a single core, the processor is clocked to 3.9GHz. When using all 4 cores (8 threads for Hyperthreading) the processor drops to 3.7GHz.

Despite the lower clock speed when using all 4 cores, the parallel insertion sort achieves an almost ridiculous 6x performance boost over a single threaded insertion sort.

I also implemented the algorithm in a simple space simulation with 200 000 ships being affected by the gravity of a nearby planet. The ships first had their velocity and position updated and were then sorted by the parallel insertion sort. Below are the timings in milliseconds when using different numbers of cores. Both the ship updating and the ship sorting are parallelized to run on any number of cores. Also note that the single-core test used a non-parallel standard insertion sort to avoid the overhead of sorting the overlaps.

where is the code? :wink:

I’m still working on the overlap detection code. Although it currently always sorts the array correctly, it’s not optimal.

I believe I have managed to perfect the overlap detection and also optimized the overlap sorting slightly by skipping elements that are known to be sorted already.

My implementation of insertion sort: InsertionSort.java
My implementation of a parallel insertion sort: ParallelInsertionSort.java

All thread management is handled by the threading library I’ve developed (which is a powerful library for multithreading your whole game). The sorting code is well documented and should be easy to understand when combined with the visualization in the first post, so you can probably implement this with your own threading system if you use my code as reference.

If you wish to try out the sorting algorithm yourself without having to write your own implementation, you need to download my threading library (a massive 70 kilobyte download).

Threading library download link.

Running the sorting algorithm is as simple as this:



ParallelInsertionSort<SortedObjectTypeHere> sorter = new ParallelInsertionSort<>(numChunks);
MultithreadedExecutor executor = new MultithreadedExecutor(numThreads);

sorter.sort(arrayToSort, startIndex, endIndex, myComparator, executor);
//Of course, both the sorter and the executor can be reused any number of times.


Finally, here is a sorting performance test program I used to compare InsertionSort and ParallelInsertionSort with Java’s Arrays.sort() and Arrays.parallelSort(). This requires the insertion sort and parallel insertion sort code above and my threading library on the classpath.

SortTest.java

EDIT: The previous test had some peculiarities due to the fact that it was using Integer wrappers around ints. Here’s an improved test program:
SortTest.java

Sample results:

Unsorted: 5.34x faster
Sorted: 3.29x faster

For almost sorted data you might want to look at some of the bubble sort variants. I haven’t thought at all about the pains it might take to make them massively parallel.

Just from a quick skim of your initial diagrams, I’m curious what the algorithm does when an element belongs in a chunk that isn’t a neighbouring chunk?
E.g. if an element supposed to be in the last chunk was placed in the first.

Are the overlaps rechecked until the list is completely sorted?

Various flavors of mostly sorted is a pretty common case.

EDIT: Warm the methods.

My priority here was correctness over performance, so if the algorithm detects that a single element needs to be moved farther than 1 chunk away it will simply include all affected chunks in a single overlap. If the array is “too random”, it will actually end up being a single-threaded insertion sort over almost the whole buffer. Here is an example where I have 8 chunks and 1000 elements being shuffled around a bit too much. Note how the 3rd overlap extends over a total of 3 chunks due to the dip in the third chunk overlapping the top in the first chunk.

Here’s a visualization of the overlap sorting optimization I made. It simply skips the first elements that are already sorted in each chunk. The slightly darker red parts of each overlap are simply skipped over. On average this will skip half the elements being sorted (although the ones that were the cheapest to sort) as long as there are no overlaps over more than 2 chunks.

Yeah, I’m still working on improving the testing. More realistic numbers seem to point to around 4.1x scaling with 8 threads.

Also, the algorithm currently uses a very inefficient linear search for finding out the overlap sorting bounds (although this search is threaded per overlap). A better version would use a binary search to locate those bounds since the search is inside a single chunk and all chunks are already sorted.
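Since each chunk is already sorted, that linear scan could be replaced with a standard upper-bound binary search. A minimal sketch (illustrative names; the key would be whatever value the neighbouring chunk spills over):

```java
public class OverlapBounds {
    // Returns the index of the first element in the sorted range
    // [from, to) that is greater than key — i.e. where the overlap
    // with the neighbouring chunk begins. O(log n) instead of a
    // linear scan over the chunk.
    static int upperBound(int[] a, int from, int to, int key) {
        int lo = from, hi = to;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
            if (a[mid] <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```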

I’ll toss together some generators for you to try.

Oh, sweet! That’d be really useful! I’m currently doing something like this to get more consistent scaling measurements:


long normal = 0;
long parallel = 0;

while(true){
    normal += testInsertionSort();
    parallel += testParallelInsertionSort();
    System.out.println((double)normal / parallel + "x scaling");
}

This seems to (and should) approach a stable and reproducible value, around 4.1x (with this particular data set and shuffling algorithm).

[quote]Scaling: 4.088887102760233
[/quote]
I’m gonna do some detail profiling on each part of the algorithm tonight.

EDIT: Left it running over dinner. Stabilized at around 4.26x scaling.

EDIT2: Here are some preliminary profiling results:

Note that the only single-threaded part of the algorithm is the “Update overlaps” part, which only depends on the number of chunks, not the number of elements in the array.

OK. A first pass is here: https://github.com/roquendm/JGO-Grabbag/blob/master/src/roquen/math/seq/TestIntSequence.java

Some expectations. The original test bed didn’t warm the sorting. If anyone bothered, I’d expect that with and without warming the timings are going to be pretty much the same. Sorting is going to be dominated by stalls, notably branch misprediction and memory-architecture effects (reads: waiting on cache fills; writes: full write buffers). Hopefully the parallel sorts are working far enough apart to not cause thrashing.

As such these test generators don’t copy data but fill or modify in place using a static PRNG.

So teleporting entities from one end of the domain to the other would be a catastrophe for the efficiency of this algorithm.
Cylindrical/toroidal coordinate systems are a big no no then!

No problem. Don’t represent the ‘universe’ in a single data structure.

If the universe is not a single data structure, then I can sort the separate data structures on separate threads, rather than multi-thread sorts on each structure.

It’s possible to work around this. You’ll always have to add and remove stuff to the list, so this needs to be addressed. Usually, you’d do the sorting after updating each object. So first you can loop through all objects, mark all objects that need to be either removed or teleported, and queue up all teleporting or newly created objects into a queue. You can then combine my parallel sorting algorithm with a double buffered list. Each chunk needs to know how many objects we currently have in it, how many of those we are going to remove and how many new/teleporting objects we are going to insert into it. Then we can simply run a merge between the (sorted) insertion queue and the chunk’s old objects simultaneously with the insertion sort.


//Before (normal insertion sort)
for each object in array {
    insert currentObject into its correct position in array
}

//After
int writeIndex = 0;
for each object in array1 {
    if object is dead or teleported {
        continue;
    }

    while next object in insertion queue should be before the current object{
        write queued object to array2[writeIndex];
        writeIndex++;
    }
    
    insert currentObject into its correct position in array, but write to array2!!!
    writeIndex++;
}
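Assuming the chunk has already been insertion-sorted, the remaining work per chunk is a filtered merge into the second buffer. Here is a minimal concrete sketch of just that merge step, with illustrative names (`dead` marks removed/teleported objects; ints stand in for game objects):

```java
public class ChunkMerge {
    // Merges a sorted insertion queue into a sorted chunk, skipping
    // dead/teleported elements, writing the result into a second
    // buffer (the "array2" of the pseudocode above). Returns the
    // number of elements written.
    static int merge(int[] chunk, boolean[] dead, int[] queue, int[] out) {
        int w = 0, q = 0;
        for (int i = 0; i < chunk.length; i++) {
            if (dead[i]) {
                continue; // removed or teleported away
            }
            // Drain queued objects that sort before the current one.
            while (q < queue.length && queue[q] <= chunk[i]) {
                out[w++] = queue[q++];
            }
            out[w++] = chunk[i];
        }
        while (q < queue.length) {
            out[w++] = queue[q++]; // tail of the insertion queue
        }
        return w;
    }
}
```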

You have to consider that a merge operation is incredibly slow, as you (by definition) trash your cache by walking two input arrays and writing to one output array. A trivial merge is likely to ruin your performance.

Why exactly would a simple merge trash the cache? Shouldn’t the data easily fit the cache? Besides, branch prediction should not be a real problem since the number of objects being inserted will most likely be far lower than the number of elements that are already in the chunk, so almost all iterations will evaluate the nested while-loop as false which should be an easy pattern for branch prediction to guess (default to false).

Modern caches work well for multiple input arrays. I just don’t see the problem?

Regardless of how much slower this is, you need to make a tradeoff between the cost of adding the merge operation and the reduced parallelism that occurs as a result of elements being moved large distances in the array. Depending on how rarely this occurs, it may be faster to simply insert the object at its correct position using a binary search to find the index and then inserting it by shifting all following elements to make room for it.
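That rare-case insert-by-shift could look like this; a sketch assuming the array has spare capacity at the end (names are illustrative):

```java
import java.util.Arrays;

public class FarInsert {
    // Inserts value into the sorted prefix a[0, size), shifting the
    // tail one slot right with a single arraycopy. Returns the new
    // size. O(log n) search plus O(n) shift, but only paid for the
    // rare element that moved a large distance.
    static int insertSorted(int[] a, int size, int value) {
        int idx = Arrays.binarySearch(a, 0, size, value);
        if (idx < 0) {
            idx = -(idx + 1); // binarySearch encodes the insertion point
        }
        System.arraycopy(a, idx, a, idx + 1, size - idx);
        a[idx] = value;
        return size + 1;
    }
}
```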

EDIT: The data caches in my CPU (i7-4770K) are as follows:
L1: 32 KB, 8-way set associative, 64-byte line size per core
L2: 256 KB, 8-way set associative, 64-byte line size per core
L3: 8192 KB, 8-way set associative, 64-byte line size shared