Any multithreading people out there? Multithreaded loading of a huge array.

Hey all, so I’ve been looking for ways to accelerate my loading process. I came up with an idea to break up my massive int[1024][1024][8] array into smaller chunks to be stored on the hard disk. Each chunk would be an int[256][256][8] array, so there would be a total of 16 arrays. (currently the entire int array is compressed and stored as one massive file)

My game saving is already multithreaded: it makes a deep copy of my array on the main thread and sends it off to another thread to be serialized. For loading, though, I still do it in the main OpenGL context thread, because the game needs the map loaded to continue anyway. But I had an idea: what if I used 16 different threads to load each of the [256][256][8] parts all at once, then, once all 16 are loaded, whatever thread finishes last reassembles the entire map and unlocks the OpenGL context thread?

I have to do a decent amount of refactoring just to test this, so while I play with the idea I was hoping to get some feedback on it from someone that has more multithreading experience than I do.

Basically, it would work like this:

  1. The OpenGL context thread flags for the map to load, sets [icode]mapLoading = true[/icode] and locks itself with a while loop [icode]while(mapLoading){Sit at loading screen, maybe animate something?}[/icode]
  2. 16 new threads are created, one for each 256x256 part of my map. From what I understand of multithreading, if free cores don’t exist, a thread will just wait until one opens up. So in theory I can start 16 threads regardless of whether the CPU is a single, dual, quad, 6 or 8 core processor; it’ll just use what it can, when it can.
  3. Each thread loads its part of the map; when it’s done it sets [icode]mapLoaded[x]=true;[/icode] (where x is one of the 16 pieces).
  4. Each thread then checks whether all 16 mapLoaded booleans are true, so the last thread to finish should pass the check and continue.
  5. The last thread reassembles the entire 1024x1024 map from the 16 pieces then flags [icode]mapLoading = false[/icode], thus unlocking the openGL context and allowing the game to continue.
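Those five steps could be sketched roughly like this (the names [icode]loadChunk[/icode] and [icode]assemble[/icode] are made up, and a CountDownLatch stands in for the hand-rolled mapLoaded booleans and the busy-wait loop):

```java
import java.util.concurrent.CountDownLatch;

public class ChunkedMapLoader {
    static final int CHUNKS = 16;                         // 4 x 4 grid of 256x256x8 pieces
    static final int[][][] map = new int[1024][1024][8];

    public static void loadMap() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(CHUNKS);
        for (int i = 0; i < CHUNKS; i++) {
            final int chunk = i;
            new Thread(() -> {
                int[][][] piece = loadChunk(chunk);   // stand-in for the real disk read
                assemble(piece, chunk);               // copy into the big array
                done.countDown();                     // this piece is finished
            }).start();
        }
        done.await();   // blocks the calling (OpenGL) thread until all 16 pieces are in
    }

    // Stand-in for the real deserialization; fills a 256x256x8 piece with dummy data.
    static int[][][] loadChunk(int chunk) {
        int[][][] piece = new int[256][256][8];
        for (int x = 0; x < 256; x++)
            for (int y = 0; y < 256; y++)
                for (int z = 0; z < 8; z++)
                    piece[x][y][z] = chunk;
        return piece;
    }

    // Copies a piece into its quadrant of the 1024x1024 map; each thread
    // writes a disjoint region, so no locking is needed on the big array.
    static void assemble(int[][][] piece, int chunk) {
        int ox = (chunk % 4) * 256, oy = (chunk / 4) * 256;
        for (int x = 0; x < 256; x++)
            for (int y = 0; y < 256; y++)
                System.arraycopy(piece[x][y], 0, map[ox + x][oy + y], 0, 8);
    }

    public static void main(String[] args) throws InterruptedException {
        loadMap();
        System.out.println(map[0][0][0] + " " + map[1023][1023][0]); // prints 0 15
    }
}
```

[icode]await()[/icode] also gives you the memory-visibility guarantee for free, which the plain boolean flags in steps 3-4 would not.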

Visual illustration:

http://sixtygig.com/junk/multithreaded2.png

Loading files on multiple threads is a no-no; that’s the fastest way to trash your performance, as the disk is a serial device, best used for long continuous reads/writes. Skipping around on the disk is like skipping around in memory and nullifying your cache, except thousands of times worse :frowning:

Best bet is the high-level optimization: do you really need that much data? 1024 * 1024 * 8 * 4 bytes = 32 MB
Perhaps a [icode]byte[][/icode] would suffice (or, if you do need ints, make sure you’re utilizing all 32 bits of each one)

Next opt after that would be data compression. Something fast like Snappy or LZO can actually be faster to read + decompress than an uncompressed read (disks are that slow), especially if you multithread the decompression.
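Snappy and LZO need third-party jars, but the same speed-versus-ratio dial exists in the JDK via [icode]Deflater.BEST_SPEED[/icode]; this sketch just illustrates the tradeoff with stdlib code (the single-call deflate assumes the output buffer is big enough, which holds for modest inputs):

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FastCompress {
    // Compresses with the fastest (lowest-ratio) zlib setting; Snappy/LZO are
    // faster still, but the principle is the same: trade ratio for CPU time.
    static byte[] compress(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];   // assumed large enough here
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] input, int originalLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(input);
        byte[] out = new byte[originalLength];      // caller must know the raw size
        inf.inflate(out);
        inf.end();
        return out;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] tiles = new byte[1 << 16];           // runs of zeros compress well
        byte[] packed = compress(tiles);
        byte[] restored = decompress(packed, tiles.length);
        System.out.println(packed.length < tiles.length);        // true
        System.out.println(Arrays.equals(tiles, restored));      // true
    }
}
```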

Blast! You’re correct, I totally forgot about concurrent loading on a hard drive. I wasn’t even factoring that in when I had the idea. Well damn! :wink:

Sadly though, I can’t use a byte[][][] array, because I need int IDs up in the millions (quite literally). Part of the way I designed the engine was to assign each tileset a collection of tiles based on what type of tile it is. So currently I am already using ints as large as 2,000,000 to represent global tile IDs in the tile loader.

(note that I don’t actually have 2 million tile IDs, but there are purposely gaps in the tileID system so I can inject global tile IDs in between others later. Less than 1 million are terrains, 1-2 million are deco objects and walls, and 2 million+ are buildings and functional objects. Each primary tileset has a 10,000 tile gap, with 400 tile gaps between each tileset subtype)
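Roughly, that ID math works out to something like this (a hypothetical reconstruction; the method and parameter names are made up, with the category bases and strides taken from the description above):

```java
public class TileIds {
    // Each primary tileset gets a 10,000-wide block of IDs,
    // and each subtype within it gets a 400-wide slot.
    static final int TILESET_STRIDE = 10_000;
    static final int SUBTYPE_STRIDE = 400;

    // category: 0 = terrain (< 1M), 1 = deco/walls (1-2M), 2 = buildings (2M+)
    static int globalId(int category, int tileset, int subtype, int local) {
        return category * 1_000_000
             + tileset * TILESET_STRIDE
             + subtype * SUBTYPE_STRIDE
             + local;
    }

    public static void main(String[] args) {
        // building category, 4th tileset, 2nd subtype, 8th tile in the slot
        System.out.println(globalId(2, 3, 1, 7)); // prints 2030407
    }
}
```

The gaps cost nothing on disk once the data is compressed, since they never appear in the array; they only widen the numeric range, which is exactly why a byte[] won’t fit.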

Really, the load time itself isn’t all that awful. My master plan was to take it from 3-5 seconds to nearly instant (perceptually), but I completely forgot to factor in that the hard drive can’t read concurrently anyway, so loading multiple files at once is actually slower than one at a time.

I guess I may have to go back to my old idea: break it up into 16 files anyway, load them one by one on a single thread, and progress a loading bar/animation every time a piece is loaded. I’ll also look into Snappy and LZO; currently I’m just using GZIPInputStream/GZIPOutputStream.
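That fallback might look roughly like this (the piece file names and the progress callback are made up for illustration):

```java
import java.io.*;
import java.util.function.IntConsumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PiecewiseLoader {
    static final int SIZE = 256 * 256 * 8;   // ints per piece

    // Writes one piece as a gzipped stream of raw ints.
    static void savePiece(File f, int[] piece) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(f))))) {
            for (int v : piece) out.writeInt(v);
        }
    }

    static int[] loadPiece(File f) throws IOException {
        int[] piece = new int[SIZE];
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(f))))) {
            for (int i = 0; i < SIZE; i++) piece[i] = in.readInt();
        }
        return piece;
    }

    // Loads all 16 pieces sequentially on one thread, reporting progress
    // after each so a loading bar can advance 1/16 at a time.
    static int[][] loadAll(File dir, IntConsumer onProgress) throws IOException {
        int[][] pieces = new int[16][];
        for (int i = 0; i < 16; i++) {
            pieces[i] = loadPiece(new File(dir, "map_piece_" + i + ".gz"));
            onProgress.accept(i + 1);   // i+1 of 16 pieces done
        }
        return pieces;
    }
}
```

The disk still gets one long sequential read per file, so nothing here fights the drive head; only the progress reporting changes.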

Depending on how the data is structured, it might be worthwhile to try a simple run-length encode, dictionary, or even a Huffman pass over it instead of gzip; it might actually wind up faster. Profile and see whether the IO waits on the gzip decompression or the CPU is actually starved.
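A naive run-length pass over tile data could look like this (a sketch, not tuned; tile maps with large uniform regions are exactly where RLE wins):

```java
import java.util.ArrayList;
import java.util.List;

public class TileRLE {
    // Encodes an int array as (runLength, value) pairs.
    static int[] encode(int[] tiles) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < tiles.length) {
            int v = tiles[i], run = 1;
            while (i + run < tiles.length && tiles[i + run] == v) run++;
            out.add(run);
            out.add(v);
            i += run;
        }
        int[] res = new int[out.size()];
        for (int j = 0; j < res.length; j++) res[j] = out.get(j);
        return res;
    }

    static int[] decode(int[] encoded, int originalLength) {
        int[] out = new int[originalLength];
        int pos = 0;
        for (int j = 0; j < encoded.length; j += 2) {
            int run = encoded[j], v = encoded[j + 1];
            for (int k = 0; k < run; k++) out[pos++] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] tiles = {5, 5, 5, 5, 9, 9, 2};
        int[] enc = encode(tiles);                       // [4, 5, 2, 9, 1, 2]
        System.out.println(java.util.Arrays.toString(enc));
    }
}
```

Worst case (no repeats) this doubles the data, so it’s worth measuring against a real map before committing to it.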

No Java multithreading experience but I used to do a lot of work with databases on disk. Often it turned out that the disk head movement was the bottleneck, so multithreading small reads didn’t help. Up to a point what did help was reading really large chunks (i.e. many sectors at a time), because heads can sometimes grab a few in one rotation. Also keeping stuff together in one file is usually better because the head can stay still rather than dodging back and forth between several file locations.

Also as BurntPizza says, compressing the data is worth trying. Maybe even actually use a database. I’m fairly new to Java so I don’t know which databases are simple and available, but they often have highly tuned and effective compression options built in.

I actually thought about doing that, but I’d need to code a bunch of overhead just to test it (mainly recoding the saving/loading functions). A friend of mine is a database programmer, so I was going to bug him about some of his database solutions if I go that route. I know there are several low-overhead SQL databases out there for Java, but I don’t need all the fluff SQL has; all I’m storing is a whole mess of ints. I also know there are some really fast “NoSQL” solutions, I just can’t recall their names at 6:30AM :slight_smile:

Does nobody get taught anything in college these days? :emo:

Loading multiple files in multiple threads is exactly the use case of threads… load each file in a separate thread! Do it!

Cas :slight_smile:

Start college in August :stuck_out_tongue:

Although, correct me if I’m wrong: while loading files asynchronously is a good idea, since the CPU can do stuff while you wait for the disk, performing simultaneous IO to different parts of the disk (non-contiguous operations) is counter-productive, is it not?

EDIT: I guess really, as always, an actual test is necessary before optimizations can be assumed!

Orly? Guess I’ll have to try it and see what happens. :wink:

A thread is generally doing one of two things*: waiting on I/O or running. I/O arrives in little dribs and drabs, and when you get a bit, you do a little bit of work with it, and then go back to waiting for more I/O to occur. So imagine you’ve got a pool of say 8 threads doing your I/O reading… chances are one of them will be reading a bit from the disk, and the other 7 will be either doing a little bit of work with the last bit of data they received or sat idle. End result is your disk is flat out, and you’re using as much CPU as you can… which is the goal.

Cas :slight_smile:

  • unless it’s otherwise waiting on synchronisation or whatever

I get that, but if multiple threads are “requesting” reads on the disk, then I would think the disk wastes time jumping between somewhat random locations instead of reading big blocks at a time, in sequential order, as you would with a single thread. I guess I’m saying you should be able to get the disk flat out on one thread; adding threads keeps it maxed out but behaving less efficiently, similar to how walking a large array in random order is way slower than a linear scan, even if the reason for the slowdown is different (cache misses vs. mechanical seek latency).

Waiting on a bench, searching for some evidence…

The third test case, a single-drive laptop, is probably the most applicable of the three and the most representative of a video game’s target audience. On the second page you can see that in the first test, “Sequential Read” of files, performance drops sharply with even just 2 threads and continues to decline as more are added, and I believe that test is the same situation Ray is in.

EDIT: however you can also see that if you can detect certain RAID configurations, you can reap a nice speedup via multithreading!

Look more at the use-case that you’re representing: lots of relatively small reads from many small files, rather than attempting multiple large simultaneous sequential reads. You can see the laptop actually performs best with 8 threads in this situation.

Cas :slight_smile:

Hmm… it might be superior with certain tuning (file size vs. read size vs. thread count, etc.), although I feel it would be quite easy to lose all the improvement if the balance isn’t quite right for a particular machine/device setup. For now I’m going to maintain my position that it’s easier to optimize for a single-threaded read than to juggle many threads and risk massacring perf, though more thorough testing is in order to determine how risky it actually is. Either way, this is not the lowest fruit; I suspect gains can still be made more easily elsewhere.

Setting up a non-blocking file loading system (like Node.JS does) with callbacks is beneficial because the game loop thread won’t have to wait for the IO operation to finish: you just supply a callback and it gets fired after the file has been loaded. However, for file loading, the one-thread-per-file approach will only give you performance benefits if your file sizes are huge. If the files are small and there are a lot of them, multiple threads might end up being slower than a single thread because of the context switching. But who knows, give it a try and you’ll see what’s best for your game. :point:
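In Java, that callback style can be sketched with a CompletableFuture (the method name and callback shape here are made up; the read runs on a pool thread while the caller carries on):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class AsyncLoader {
    // Reads a file on a background thread and fires the callback when done,
    // leaving the calling (game loop) thread free in the meantime.
    static CompletableFuture<Void> loadAsync(Path file, Consumer<byte[]> callback) {
        return CompletableFuture
                .supplyAsync(() -> {
                    try {
                        return Files.readAllBytes(file);   // the blocking IO, off-thread
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                })
                .thenAccept(callback);                     // fired once the bytes arrive
    }
}
```

The returned future lets you [icode]join()[/icode] if some later step genuinely must wait, which is the non-blocking analogue of the mapLoading busy-wait from the original plan.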

It seems the opposite is true (I’m guessing it’s related to the disk cache). Either way, the context switching is not at all the bottleneck; it’s always going to be the disk.

Indeed, I hope you’ve actually identified this as an actual problem that needs solving rather than just you tinkering away making things faster that no-one knows about, or cares about, or even notices in the first place. Watch that talk given by Jon Blow. If you can find it.

Cas :slight_smile:

Hint: it’s almost certainly not something you should be coding… probably ever.

Cas :slight_smile:

This is completely false in the first place. Both HDDs and SSDs perform better when you read multiple things from them using multiple threads. For HDDs, this is because the drive can optimize seeking by reordering the reads so that the read head takes a more efficient “path”. It can also queue up and pipeline the requests better, similar to how OpenGL works.


CPU: request file-------------------------------file received, request next--------------
HD:  -----------start reading~~~~~~~~~~~~~~~done----------------start reading~~~~~~~~~~~~

As you can see, a small stall occurs once the HD has finished reading a file. If you have multiple threads, it immediately knows what to do after.

For an SSD, using multiple threads is even more beneficial. SSDs have the amazing ability to read from and write to multiple parts of themselves in parallel, so if the files are in different places on the SSD, you can read several times faster by having multiple threads request files.

Look up any hard drive or SSD benchmark and you’ll see that they test with different “queue depths”, i.e. how many outstanding requests (often one per thread) are used to read or write.

Compression is your friend.