Cache locality and LibStruct - fixing Java's horrible memory layout

TL;DR: I’m making a multi-threaded server for a 2D space game, and it was performing very badly on AMD CPUs. I used a library developed by Riven called LibStruct to replace my Java instances with structs whose memory location I could control, and managed to improve the performance on AMD CPUs by almost 50%.

EDIT: A BIG thanks to Spacebeans for letting me test the program on his 8-core AMD CPU! Throw a post in here and I’ll give you a medal! =D

Lately I’ve been experimenting with a server for a game I’m trying to make, to see how feasible my idea is. The game is supposed to be a 2D space game with a massive number of entities (preferably at least 100 000), with accurate gravity calculations and collision detection. To allow for extremely large worlds with high and uniform precision, all coordinates and velocities are stored as fixed-precision 96-bit integers instead of doubles. To do real-time collision detection I decided to go with broad-phase collision detection. This technique basically involves sorting all the entities along the X axis and then using a smart traversing algorithm to find out which entities potentially collide. For maximum performance, I decided to thread the engine as well.
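To make the broad phase concrete, here is a naive serial sketch of a sort-and-sweep along X. All names here are mine for illustration; the actual traversing algorithm in the thread is smarter than this forward scan.

```java
import java.util.*;

// Hypothetical entity with an X extent; not the author's actual Ship class.
class Body {
    double minX, maxX;
    Body(double minX, double maxX) { this.minX = minX; this.maxX = maxX; }
}

public class SweepX {
    // Returns index pairs whose X extents overlap (collision candidates).
    static List<int[]> candidatePairs(Body[] bodies) {
        Integer[] order = new Integer[bodies.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort entities along the X axis by their left edge.
        Arrays.sort(order, (a, b) -> Double.compare(bodies[a].minX, bodies[b].minX));

        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < order.length; i++) {
            Body a = bodies[order[i]];
            // Only scan forward while the next body starts before this one ends.
            for (int j = i + 1; j < order.length && bodies[order[j]].minX <= a.maxX; j++) {
                pairs.add(new int[] { order[i], order[j] });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(0, 2), new Body(1, 3), new Body(5, 6) };
        System.out.println(candidatePairs(bodies).size()); // bodies 0 and 1 overlap -> 1
    }
}
```

Anything the sweep reports still needs a narrow-phase test; the sort only prunes pairs that cannot possibly touch.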

The test program I developed currently consists of 4 main passes:

  • Updating: Updates each entity by computing gravity acceleration and updating the entity based on its velocity.
  • Sorting: Sorts all entities along the X-axis.
  • Collision detection: Identifies collision pairs.
  • Rendering: Transforms each ship for rendering. This is mainly for visualizing the results.

I wanted to see if I could multithread it all and get good scaling.

  • Updating: Threading updating was simple since each entity can be updated independently of all other entities.
  • Sorting: I decided to go with a parallel insertion sort algorithm that I developed specifically for data sets that are almost sorted (the order of objects barely change in a single update).
  • Collision detection: I managed to modify the traversing algorithm to be able to split up the work into independent chunks that could be processed in parallel.
  • Rendering: Simple to thread as well since all entities can be transformed in parallel.
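The sorting bullet relies on a well-known property: plain insertion sort degrades to O(n²) on random data but runs in roughly O(n) when the input is already almost in order, because each element only shifts past the few neighbours that overtook it since the last frame. The author's parallel variant isn't shown in the thread; this is just the serial core such a sort would build on.

```java
import java.util.Arrays;

public class NearlySorted {
    // Insertion sort: cheap on almost-sorted data, which is exactly the case
    // when entity order along X barely changes between updates.
    static void insertionSort(double[] a) {
        for (int i = 1; i < a.length; i++) {
            double key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j]; // shift the few out-of-place neighbours right
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 2.1, 2.0, 3.0, 5.0, 4.9 }; // almost sorted
        insertionSort(xs);
        System.out.println(Arrays.toString(xs));
    }
}
```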

The first version of my program worked pretty well. The scene tested was a planet with a large number of ships in fast low orbits, orbiting both clockwise and counter-clockwise, which gave a decent, realistic sorting load. Collision detection worked well and could detect collisions between objects that were less than a meter large and travelling at several kilometers per second. The only problem was performance. I could not even reach 30 FPS with 50 000 entities, far from my goal of 60 FPS with 100 000 entities. After that, I switched my focus to implementing threading, and DAYUM did it make a difference. I was running my test on an i7-4770K processor with Hyperthreading.

Results of Intel i7-4770K:


Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    10.886969 / 2.430073 = 4.48x
Sort:       5.645249 / 0.888223 = 6.36x
Collision: 13.476599 / 2.127155 = 6.34x
Render:     5.608626 / 1.026621 = 5.46x

Total:     35.706997 / 6.523294 = 5.47x

What the hell? 6.36x scaling on a quad core?! Yep! These timings are the average timings when running the test for over a minute, and they’re 100% consistent. This additional performance is gained by utilizing Hyperthreading. Hyperthreading allows each core to have TWO threads “loaded in” at the same time, and if one of the threads stalls for some reason (a cache miss or a branch prediction miss, for example) the core can very quickly switch to the other thread. If the program is extremely heavy on cache misses, this can indeed grant a 2x performance boost, but that is extremely rare of course. So basically, my code was extremely heavy on cache misses, which led to Hyperthreading making a huge difference. Before I attempted to look into this issue more, I asked Spacebeans to run the test on his 8-core AMD FX-8320 CPU and give me the timings. His results were much less optimistic:

Results of AMD FX-8320:


Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    16.018526 /  2.856606 = 5.61x
Sort:       6.851715 /  2.446308 = 2.80x
Collision: 16.008198 /  3.536307 = 4.53x
Render:     6.617009 /  2.420236 = 2.73x

Total:     45.587364 / 11.337404 = 4.02x

The only part of the test that scales reasonably well is the updating, which is the part that consists mostly of brute-force computations. Sorting and rendering have abysmal scaling, under 3x despite the program utilizing all 8 cores. His AMD CPU had pretty much the opposite scaling results compared to my Intel CPU. AMD CPUs are known for having worse per-core performance than Intel CPUs, and this is in large part due to Intel having a better memory controller. If Hyperthreading is helping a lot when it comes to sorting, collision detection and rendering performance, and AMD’s CPUs are weaker in those tasks, it further implies that there is a shitload of cache misses going on in those parts of the test. Taking a look at my Ship class revealed the problem rather quickly. Each Ship has 2 instances of Vector2 for position and velocity, and each Vector2 has 2 instances of Int96 for X and Y, which finally hold the actual primitive data types. Java doesn’t let us allocate all these objects at the same location in memory, and may even move stuff around during garbage collections. What this means is that each entity had its data spread out over (1+2+4) = 7 different locations in RAM! When sorting the array, it would essentially first hit a cache miss when following the reference to each Ship instance, then hit a cache miss when following the position Vector2 of each Ship, and finally hit a cache miss when following the X coordinate stored as an Int96. Each Ship is essentially a 3-level linked list, which is horribly inefficient!
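The pointer chasing described above can be sketched like this (field names are mine; the real classes differ):

```java
// Illustrative sketch of the layout problem; not the author's actual classes.
class Int96 { int hi, mid, lo; }                 // one heap object per coordinate
class Vector2 { Int96 x = new Int96(); Int96 y = new Int96(); }
class Ship { Vector2 position = new Vector2(); Vector2 velocity = new Vector2(); }

public class Layout {
    public static void main(String[] args) {
        // Reading ship.position.x.hi chases three references:
        // Ship -> Vector2 -> Int96 -> primitive fields.
        // Each hop can be a cache miss, because the JVM is free to place the
        // 7 objects (1 Ship + 2 Vector2 + 4 Int96) anywhere on the heap.
        Ship ship = new Ship();
        int hi = ship.position.x.hi; // three dependent loads before the data arrives
        System.out.println(hi);
    }
}
```

The three loads are *dependent*: the address of each one comes from the previous load, so the CPU cannot overlap them and must eat each miss in full.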

Looking into how each task works, I could tell the following:

  • Updating involves reading from all those 7 addresses, but the large amount of number crunching makes this task mostly dependent on arithmetic performance.
  • Sorting involves almost no number crunching, but does cause a large number of cache misses. In addition, we have a huge number of branch prediction failures, making this a very good case for Hyperthreading.
  • Collision detection also causes cache misses and branch prediction failures, though fewer than sorting, but it involves more number crunching.
  • Rendering has a small amount of number crunching, but seems to be dominated by a large number of cache misses.
But what can we do to solve the cache miss issue? Java doesn’t let us decide where our instances are stored!

Enter LibStruct! LibStruct allowed me to redefine my Ship, Vector2 and Int96 classes to be allocated as basic structs instead of real Java instances. Using the malloc() and free() methods provided by the library, I could essentially force a much better memory layout for my entities, so that each Ship struct was lying (most of the time) right next to its Vector2s and Int96s. When the CPU hits a cache miss, it loads in a large block of data, which means that it often also fetches the Vector2s and Int96s of a Ship when the Ship is accessed. The question is, how much does this cache locality matter?

Results of Intel i7-4770K:


Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    12.405540 / 2.489921 = 4.98x
Sort:       4.348568 / 0.722247 = 6.02x
Collision: 13.828989 / 2.353232 = 5.88x
Render:     4.341144 / 0.971539 = 4.47x

Total:     34.966988 / 6.574033 = 5.32x

Damn, looks like it didn’t help much here. We see some overhead in the updating and collision detection parts due to the structs, but this is outweighed by better sorting and rendering performance in both single-threaded and multi-threaded tests, implying that the goal of improving cache performance was successful to some extent. The reduced scaling of sorting, collision detection and rendering also implies that Hyperthreading now has less of an impact. Let’s take a look at the AMD FX-8320!

Results of AMD FX-8320:


Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    20.080358 / 3.403189 = 5.90x
Sort:       6.043065 / 0.941988 = 6.42x
Collision: 17.833417 / 2.317607 = 7.69x
Render:     6.068055 / 0.952332 = 6.37x

Total:     49.985323 / 7.626102 = 6.55x

Holy shit. We see a MASSIVE performance boost in the most memory-intensive parts of the test. Sorting performance is up by 160%, collision detection performance is up 53% and rendering performance is up 154%, but updating is 16% slower. In total, this adds up to a 49% boost in total performance, making the AMD FX-8320 competitive with the Intel i7-4770K, as it should be. Scaling is also massively improved, mostly hovering at around 6x but achieving almost linear scaling for collision detection.

So there you have it!

I’d want to stick with primitive-sized values. Use 32-bit integers. But I have a 64-bit machine and a 64-bit VM? Sure, but HotSpot devs have spent less time writing transforms for 64-bit values, so you’ll get better performance with 32-bit. So you might want to think about partitioning your space and using local coordinates.

Another thing to think about is some data partitioning and laying out spatial data in a flat double buffered array. That way the collision detection could be multithreaded. One array is read-only for sim step ‘n’ and the other write only for sim-step ‘n+1’ so no false sharing, then flip.
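A minimal sketch of that double-buffering scheme (class and field names are mine, not from the thread): one flat array is read-only for step n, the other write-only for step n+1, then the roles flip.

```java
public class DoubleBuffered {
    // Positions stored flat as [x0, y0, x1, y1, ...] in primitive arrays,
    // so consecutive entities are consecutive in memory.
    double[] read, write;

    DoubleBuffered(int entities) {
        read = new double[entities * 2];
        write = new double[entities * 2];
    }

    // One simulation step; worker threads would each take a disjoint index
    // range. They only ever read `read` and write `write`, so there is no
    // read/write sharing between steps.
    void step(double dx, double dy) {
        for (int i = 0; i < read.length; i += 2) {
            write[i] = read[i] + dx;
            write[i + 1] = read[i + 1] + dy;
        }
        double[] t = read; read = write; write = t; // flip buffers
    }

    public static void main(String[] args) {
        DoubleBuffered sim = new DoubleBuffered(2);
        sim.step(1.0, 0.5);
        System.out.println(sim.read[0] + " " + sim.read[1]); // 1.0 0.5
    }
}
```

Note that threads whose write ranges meet on the same cache line can still false-share at the boundary; padding chunk edges to cache-line size avoids that.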

Why do you need to compute 100 000 entities if it’s impossible to have all of them on screen? Compute only what is on screen and you will get 60 FPS easily.

I think it is because it is simulating all the entities at once. Since it is an online game, I believe faking it won’t cut it for the accuracy of all the spaceships.

Exactly how precise does the collision detection have to be? Even though I love the idea of controlling the data through structs, possibly just packing the relevant data into one integer would yield better results. Honestly, the GC always worked against me when dealing with multiple creations and deletions even on the smallest scale.

No, that’s not quite how it works. Each core has two threads and only partial hardware support for each (each has its own register set, for example). The “first” thread gets to use any available resource as if it were running by itself (not quite true, but close enough) at a given clock tick. Once the first thread has chosen what execution units it’s using that tick, the second looks at what execution units it wants to use, and if any are “free” (the first thread isn’t using them) it jumps in and uses them. So even if the first thread isn’t stalled, the second thread can (usually) make progress. Of course, if the “first” thread stalls and isn’t swapped out, then the “second” thread will be running at full rate. This is the hand-wavey high-level view.

The other thing about hyperthreading is that quite often one core will stall completely waiting for memory accesses, and while it’s waiting it’s not using any processing units at all so the theory is the other hardware thread gets to jump in a bit more than might otherwise be expected.

Cas :slight_smile:

What exactly are you proposing? I currently have an Int96 class which holds three 32-bit integers and emulates 96-bit math (I only need addition and subtraction). Having coordinates relative to certain points in space would possibly simplify certain calculations, but problems are introduced when sorting entities and running collision detection, since I essentially need to be able to determine their order. Although this can be solved, it does carry a certain amount of overhead. I have yet to experiment with this though, but it might be worth looking into.
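The 96-bit addition mentioned above can be emulated with three ints and manual carry propagation. This is a hedged sketch under my own naming; the thread never shows the real Int96 class, so field and method names here are assumptions.

```java
// Illustrative 96-bit integer: three 32-bit words, lo is least significant.
public class Int96 {
    int hi, mid, lo;

    Int96(int hi, int mid, int lo) { this.hi = hi; this.mid = mid; this.lo = lo; }

    // In-place addition, treating each word as unsigned and rippling the carry.
    void add(Int96 o) {
        long sum = (lo & 0xFFFFFFFFL) + (o.lo & 0xFFFFFFFFL);
        lo = (int) sum;
        sum = (mid & 0xFFFFFFFFL) + (o.mid & 0xFFFFFFFFL) + (sum >>> 32); // carry in
        mid = (int) sum;
        hi = hi + o.hi + (int) (sum >>> 32); // top word, wraps on overflow
    }

    public static void main(String[] args) {
        Int96 a = new Int96(0, 0, -1);  // lo = 0xFFFFFFFF
        a.add(new Int96(0, 0, 1));      // carry ripples from lo into mid
        System.out.println(a.hi + " " + a.mid + " " + a.lo);
    }
}
```

Subtraction works the same way with a borrow instead of a carry (or by adding the two's complement).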

I already have very good multithreading working for my collision detection. If you’re interested, feel free to look at my code: http://www.java-gaming.org/?action=pastebin&id=944. It’s inspired by the Bentley-Ottmann algorithm.

This is for a server, so there is no “screen”. The server needs to run accurate computations of all entities all the time. The clients on the other hand will not need to run any collision detection or sorting, and will only need to show an approximation of the entities that the client can see.

The collision detection currently has at least millimeter precision, depending on the speed difference between the two colliding entities. Acceleration and collision detection are actually done using doubles instead of Int96s. Certain values, such as the distance between an entity and a planet (for calculating gravity), are a much better fit for floating point since the precision requirements shrink as the distance grows, so I only calculate what I need to calculate with my Int96 class. Here’s some of the code being run when updating the game, calculating the gravity a certain planet applies to a ship:


		StructVector2 temp = new StructVector2(new StructInt96(), new StructInt96()); //Stack allocated! No garbage! =D
		
		temp.set(planet.getPosition()).sub(ship.getPosition());
		
		//Convert dx and dy to doubles
		double dx = temp.getXAsDouble();
		double dy = temp.getYAsDouble();
		
		double distSqrd = dx*dx + dy*dy;
		double dist = Math.sqrt(distSqrd);
		
		double mul = SpaceSimulation.GRAVITATIONAL_CONSTANT * mass / (dist*distSqrd);
		
		ship.addVelocity(dx*mul, dy*mul); //Convert back to Int96

I’m not sure how packing all the data of a ship into an integer would work when just the position and velocity of the ship require around 384 bits, far exceeding what I can fit in a 32-bit integer. GC performance shouldn’t be a problem if Riven has done his job right with the structs’ malloc() and free() functions. The test program at the moment does not generate any garbage at all in the game loop (except for Strings when printing stuff). As a third point, although the game is real-time, it’s far from twitch-based. The game is meant to be played over much longer periods of time. I was once a fan of Ogame, but was annoyed to death by the fact that it was basically a new skin over any other generic browser game in that category. It’d be perfectly acceptable to occasionally get spikes on the server side, since the clients will be around 10 seconds behind the server at any time.

Interesting! That means that Hyperthreading is even more advanced than I thought, which makes sense considering the performance increase I got in sorting and collision detection.

I did some optimizations. I realized that my Ship structs were 4 bytes larger than they were supposed to and fixed that. I also moved some stuff that didn’t have to be repeated for each iteration of a loop outside of the loop.

i7-4770K:


Update:    11.785027 / 2.362425 = 4.99x
Sort:       4.363652 / 0.675312 = 6.46x
Collision: 12.803392 / 2.093591 = 6.12x
Render:     4.018574 / 0.779708 = 5.15x

Total:     32.999605 / 5.943186 = 5.55x

Got a decent 10.6% speed up, and at least when using 8 threads the struct version is now faster in all four parts of the game loop!

I have absolutely no idea what this is, but thanks for letting me test it!

Could LibStruct be useful when making a raytracer in Java?
The biggest problems in my raytracer are the creation/management of vectors (and I am already reusing them), and the use of methods which I can’t manually inline without causing a big mess.

That could be a good example for Multithreading when I think about it.
Gotta try that out, with LibStruct.

Have a nice day.

  • Longor1996

How about showing some code so I can see the difference with and without LibStruct without putting in much effort? :smiley:

Synthetic benchmark results:

Making 10M instances of Vec3:


tInstance1               47ms   209M/s   // new RegularVec3()
tStackAlloc1             66ms   149M/s   // new Vec3() once in 1 method (slow: per-method overhead to fetch the struct stack)
tStackAlloc1N            16ms   610M/s   // new Vec3() 10 times in a loop in 1 method
tStackAlloc10N           14ms   702M/s   // new Vec3() 10 times unrolled in 1 method
tStackAllocArr           31ms   319M/s   // new Vec3[10] (implies stack allocation of 10 structs and a container)
tMemoryAllocFree        111ms    89M/s   // Struct.free(Struct.malloc(Vec3.class));
tMemoryAllocFreeArr      60ms   165M/s   // Vec3[] vecs = Struct.malloc(Vec3.class, 100); for(Vec3 vec: vecs) Struct.free(vec);
tMemoryAllocBulkFreeArr  35ms   280M/s   // Struct.free(Struct.malloc(Vec3.class, 100));

To reproduce, ‘just’ run the StructTest class with the LibStruct agent.

@Nate: I’ll work on that, later today.

How does one showcase a technology that most likely slightly complicates your sourcecode, which reduces memory latency and thwarts GC pauses… I’m hesitant to write two full fledged visual tech-demos :emo:

Is this way over anyone else’s head? haha

I get a little bit of it however he is “number dumping” a lot.

The thing is that performance is all about numbers - it aint sexy, so it’s hard to convince people to join in.

What the hell is wrong with you? >:(

Poor grammar skills. Let me correct myself: “it is not heartthrobish”

I added 2 versions of a demo app to the LibStruct repo: one based on objects, one based on structs.

With structs: calculation takes ~45ms, GC never runs (!) after init - we allocate about 64MB per frame.
With objects: calculation takes ~75ms, GC pauses about twice per second with ~30ms collections.

It needs to be mentioned that this demo does not take advantage of memory locality - all structs are scattered over memory - therefore it only tests allocation performance and I/O throughput while being just as cache-trashing as the version based on objects.


A screenshot of the SoftlyLit demo with 4096 (64²) randomly placed light sources and 64 moving/rotating walls:
You can also peek at this poor quality GIF video (40MB!)

To run the demo implemented with objects:


Main Class: 
   test.net.indiespot.demo.softlylit.objects.Demo

JVM args:
   -Djava.library.path=./lib/lwjgl-2.9.1/native/windows

To run the demo implemented with structs:


Main Class: 
   test.net.indiespot.demo.softlylit.structs.Demo

JVM args:
   -javaagent:struct-agent.jar=test/net/indiespot/demo/softlylit/structs/struct-def.txt
   -Djava.library.path=./lib/lwjgl-2.9.1/native/windows

As a related side-note: in days of yore it was claimed that escape analysis could only mark an object as non-escaping (and thus stack-allocate it and make scalar replacement possible) IF it’s allocated in an effective method that doesn’t pass it to any other method. By “effective” here I mean after any inlining of methods has occurred. I haven’t had the chance to look into this.
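For illustration, this is the kind of allocation pattern that claim is about (names are mine; whether scalar replacement actually fires depends on the JIT and cannot be observed from the code itself):

```java
public class EscapeDemo {
    static final class Tmp {
        double x, y;
        Tmp(double x, double y) { this.x = x; this.y = y; }
    }

    // `t` is never stored anywhere or passed to another (non-inlined) method,
    // so it does not escape: after inlining, HotSpot may scalar-replace it,
    // keeping x and y in registers with no heap allocation at all.
    static double lengthSquared(double x, double y) {
        Tmp t = new Tmp(x, y);
        return t.x * t.x + t.y * t.y;
    }

    public static void main(String[] args) {
        System.out.println(lengthSquared(3, 4)); // 25.0
    }
}
```

The semantics are unchanged either way; escape analysis only removes the allocation and GC pressure, which is exactly what LibStruct achieves explicitly when the optimization cannot be relied on.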