TL;DR: I’m making a multi-threaded server for a 2D space game, and it was performing very badly on AMD CPUs. I used LibStruct, a library developed by Riven, to replace my Java instances with structs whose memory locations I could control, and managed to improve performance on AMD CPUs by almost 50%.
EDIT: A BIG thanks to Spacebeans for letting me test the program on his 8-core AMD CPU! Throw a post in here and I’ll give you a medal! =D
Lately I’ve been experimenting with a server for a game I’m trying to make, to see how feasible my idea is. The game is supposed to be a 2D space game with a massive number of entities (preferably at least 100 000), with accurate gravity calculations and collision detection. To allow for extremely large worlds with high and uniform precision, all coordinates and velocities are stored as fixed-point 96-bit integers instead of doubles. To do real-time collision detection I decided to go with Broad-Phase Collision Detection. This technique basically involves sorting all the entities along the X axis and then using a smart traversing algorithm to find out which entities potentially collide. For maximum performance, I decided to thread the engine as well.
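The sweep idea can be sketched roughly like this. This is a simplified sketch of a sort-and-sweep broad phase, not my actual engine code: it uses plain long coordinates instead of the 96-bit fixed-point values, and a radius instead of real bounds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified broad-phase sweep along the X axis. Entities are sorted by
// minimum X; a pair can only overlap if their X intervals overlap, so the
// inner loop stops as soon as the next entity starts past our right edge.
public class BroadPhase {
    public static class Entity {
        public final long x, y, radius;
        public Entity(long x, long y, long radius) {
            this.x = x; this.y = y; this.radius = radius;
        }
    }

    public static List<Entity[]> findPairs(Entity[] entities) {
        Entity[] sorted = entities.clone();
        Arrays.sort(sorted, Comparator.comparingLong((Entity e) -> e.x - e.radius));
        List<Entity[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.length; i++) {
            Entity a = sorted[i];
            long rightEdge = a.x + a.radius;
            for (int j = i + 1; j < sorted.length; j++) {
                Entity b = sorted[j];
                if (b.x - b.radius > rightEdge) break; // no overlap possible
                pairs.add(new Entity[] { a, b });      // potential collision pair
            }
        }
        return pairs;
    }
}
```

The payoff is that on n entities you do far fewer than n² interval checks, because the sorted order lets each entity bail out early.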
The test program I developed currently consists of 4 main passes:
- Updating: Updates each entity by computing gravity acceleration and updating the entity based on its velocity.
- Sorting: Sorts all entities along the X-axis.
- Collision detection: Identifies collision pairs.
- Rendering: Transforms each ship for rendering. This is mainly for visualizing the results.
I wanted to see if I could multithread it all and get good scaling.
- Updating: Threading updating was simple since each entity can be updated independently of all other entities.
- Sorting: I decided to go with a parallel insertion sort algorithm that I developed specifically for data sets that are almost sorted (the order of objects barely changes in a single update).
- Collision detection: I managed to modify the traversing algorithm to be able to split up the work into independent chunks that could be processed in parallel.
- Rendering: Simple to thread as well since all entities can be transformed in parallel.
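The sorting pass leans on the fact that insertion sort runs in close to linear time on nearly-sorted input. My parallel variant is custom and I won’t reproduce it here, but the serial core of the idea looks like this:

```java
// Insertion sort: close to O(n) when the input is already nearly sorted,
// which is the case here since entity order barely changes per frame.
// (The parallel variant splits the work across threads; this is just the
// serial core idea.)
public class NearlySorted {
    public static void insertionSort(long[] keys) {
        for (int i = 1; i < keys.length; i++) {
            long key = keys[i];
            int j = i - 1;
            // Shift larger elements right; on nearly-sorted data this
            // inner loop usually runs zero or one times.
            while (j >= 0 && keys[j] > key) {
                keys[j + 1] = keys[j];
                j--;
            }
            keys[j + 1] = key;
        }
    }
}
```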
The first version of my program worked pretty well. The scene tested was a planet with a large number of ships in fast low orbits, orbiting both clockwise and counter-clockwise, which gave a decent, realistic sorting load. Collision detection worked well and could detect collisions between objects that were less than a meter across and travelling at several kilometers per second. The only problem was performance. I could not even reach 30 FPS with 50 000 entities, far from my goal of 60 FPS with 100 000 entities. After that, I switched my focus to implementing threading, and DAYUM did it make a difference. I was running my test on an i7-4770K processor with Hyperthreading.
Results of Intel i7-4770K:
Task <timing of 1 thread> / <timing of 8 threads> = <scaling>
Update: 10.886969 / 2.430073 = 4.48x
Sort: 5.645249 / 0.888223 = 6.36x
Collision: 13.476599 / 2.127155 = 6.34x
Render: 5.608626 / 1.026621 = 5.46x
Total: 35.706997 / 6.523294 = 5.47x
What the hell? 6.36x scaling on a quad core?! Yep! These timings are the averages from running the test for over a minute, and they’re 100% consistent. This additional performance is gained by utilizing Hyperthreading. Hyperthreading allows each core to have TWO threads “loaded in” at the same time, and if one of the threads stalls for some reason (a cache miss or a branch misprediction, for example) the core can very quickly switch to the other thread. If the program is extremely heavy on cache misses, this can indeed grant a 2x performance boost, but that is extremely rare of course. So basically, my code was extremely heavy on cache misses, which led to Hyperthreading making a huge difference. Before I attempted to look into this issue more, I asked Spacebeans to run the test on his 8-core AMD FX-8320 CPU and give me the timings. His results were much less optimistic:
Results of AMD FX-8320:
Task <timing of 1 thread> / <timing of 8 threads> = <scaling>
Update: 16.018526 / 2.856606 = 5.61x
Sort: 6.851715 / 2.446308 = 2.80x
Collision: 16.008198 / 3.536307 = 4.53x
Render: 6.617009 / 2.420236 = 2.73x
Total: 45.587364 / 11.337404 = 4.02x
The only part of the test that scales reasonably well is the updating, which is the part that consists mostly of brute-force computations. Sorting and rendering have abysmal scaling, under 3x despite the program utilizing all 8 cores. His AMD CPU had pretty much the opposite scaling profile of my Intel CPU. AMD CPUs are known for having worse per-core performance than Intel CPUs, in large part due to Intel having a better memory controller. If Hyperthreading helps a lot with sorting, collision detection and rendering performance, and AMD’s CPUs are weaker at exactly those tasks, it further implies that there are a shitload of cache misses going on in those parts of the test. Taking a look at my Ship class revealed the problem rather quickly. Each Ship has 2 instances of Vector2 for position and velocity, and each Vector2 has 2 instances of Int96 for X and Y, which finally hold the actual primitive data. Java doesn’t let us allocate all these objects at the same location in memory, and may even move things around during garbage collection. What this means is that each entity had its data spread out over (1+2+4) = 7 different locations in RAM! When sorting the array, the CPU would first hit a cache miss following the reference to each Ship instance, then hit a cache miss following the position Vector2 of the Ship, and finally hit a cache miss following the X coordinate stored as an Int96. Each Ship is essentially a linked list with 3 elements, which is horribly inefficient!
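In plain Java the layout looks roughly like this (class names match the post; the hi/lo split of Int96 is my own sketch). Every `new` is a separate heap allocation that the GC is free to place, and later move, anywhere:

```java
// Following ship -> position -> x touches three potentially distant
// heap objects: a pointer chase, not a linear memory walk.
class Int96 {          // 96-bit fixed-point value, sketched as hi/lo words
    long hi;           // upper 64 bits
    int  lo;           // lower 32 bits
}

class Vector2 {
    Int96 x = new Int96();   // two more allocations per vector...
    Int96 y = new Int96();
}

class Ship {
    Vector2 position = new Vector2();  // 1 Ship + 2 Vector2 + 4 Int96
    Vector2 velocity = new Vector2();  // = 7 separate heap objects
}
```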
Looking into how each task works, I could tell the following:
- Updating involves reading from all those 7 addresses, but the large amount of number crunching makes this task mostly dependent on arithmetic performance.
- Sorting involves almost no number crunching, but does cause a large number of cache misses. In addition, we have a huge number of branch prediction failures, making this a very good case for Hyperthreading.
- Collision detection has fewer cache misses and branch prediction failures than sorting, but also involves number crunching.
- Rendering has a small amount of number crunching, but seems to be dominated by a large number of cache misses.
But what can we do to solve the cache miss issue? Java doesn’t let us decide where our instances are stored!
Enter LibStruct! LibStruct allowed me to redefine my Ship, Vector2 and Int96 classes to be allocated as basic structs instead of real Java instances. Using the malloc() and free() methods provided by the library, I could essentially force a much better memory layout for my entities, so that each Ship struct was lying (most of the time) right next to its Vector2s and Int96s. When the CPU hits a cache miss, it loads a whole cache line of adjacent data, which means that accessing a Ship often also fetches its Vector2s and Int96s in the same load. The question is, how much does this cache locality matter?
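I won’t reproduce LibStruct’s actual API here, but the locality it buys can be imitated in plain Java with a structure-of-arrays layout: one primitive array per field, so a pass over all X coordinates walks memory linearly instead of chasing references. This is my own illustrative sketch, not LibStruct code, and the field names are made up:

```java
// Structure-of-arrays layout: all X coordinates sit adjacent in one
// primitive array, so sorting/sweeping along X is a linear memory walk.
// (This imitates the cache locality LibStruct provides; it is NOT
// LibStruct's API.)
public class ShipPool {
    public final long[] posXHi, posYHi;   // upper words of the fixed-point coords
    public final int[]  posXLo, posYLo;   // lower words

    public ShipPool(int capacity) {
        posXHi = new long[capacity];
        posXLo = new int[capacity];
        posYHi = new long[capacity];
        posYLo = new int[capacity];
    }

    // Compare ship i and j by X: one cache-friendly array access per word.
    public boolean lessThanX(int i, int j) {
        if (posXHi[i] != posXHi[j]) return posXHi[i] < posXHi[j];
        return Integer.compareUnsigned(posXLo[i], posXLo[j]) < 0;
    }
}
```

The trade-off is the same one the results below show: field access gets slightly clumsier for compute-heavy passes, but memory-bound passes stop stalling on pointer chases.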
Results of Intel i7-4770K (with structs):
Task <timing of 1 thread> / <timing of 8 threads> = <scaling>
Update: 12.405540 / 2.489921 = 4.98x
Sort: 4.348568 / 0.722247 = 6.02x
Collision: 13.828989 / 2.353232 = 5.88x
Render: 4.341144 / 0.971539 = 4.47x
Total: 34.966988 / 6.574033 = 5.32x
Damn, looks like it didn’t help much here. We see some overhead in the updating and collision detection parts due to the structs, but this is outweighed by better sorting and rendering performance in both the single-threaded and multi-threaded tests, implying that the goal of improving cache performance was successful to some extent. The reduced scaling of sorting, collision detection and rendering also implies that Hyperthreading now has less of an impact. Let’s take a look at the AMD FX-8320!
Results of AMD FX-8320 (with structs):
Task <timing of 1 thread> / <timing of 8 threads> = <scaling>
Update: 20.080358 / 3.403189 = 5.90x
Sort: 6.043065 / 0.941988 = 6.42x
Collision: 17.833417 / 2.317607 = 7.69x
Render: 6.068055 / 0.952332 = 6.37x
Total: 49.985323 / 7.626102 = 6.55x
Holy shit. We see a MASSIVE performance boost in the most memory-intensive parts of the test. Sorting performance is up by 160%, collision detection performance is up 53% and rendering performance is up 154%, but updating is 16% slower. In total, this adds up to a 49% boost in overall performance, making the AMD FX-8320 competitive with the Intel i7-4770K, as it should be. Scaling is also massively improved, mostly hovering at around 6x but achieving almost linear scaling for collision detection.
So there you have it!