Threads, games and running on all CPUs

I’m currently creating a game where the player flies above a raging sea. The sea is a big part of the game so I’m trying to simulate it all (just in 2D) and have realistic waves etc. There’s also a fair amount of particles flying about that need to be accounted for. It’s just single threaded at the moment and runs fine (so far) on a fast processor (i5-3570K), but I’m pretty sure it will come unstuck on older processors. The obvious solution is to use separate threads, but older, slower processors only have one or two cores, so it seemed pointless going for too many threads.

Possible threads might be:

  • thread to play the game
  • thread to handle the sea, waves, splashes, general collisions
  • optional thread for sound?
  • thread to do initialisation and read/write high scores to the web

Would I do best with static arrays for those bits that are used by more than one thread? Any other opinions?

Mike

Giving individual parts of the game (physics, rendering, input, particles, etc.) their own thread sounds nice in theory, but usually doesn’t improve much: the workload per thread is very unbalanced, and it doesn’t scale beyond the number of tasks you have. The best solution is almost always to use an embarrassingly parallel algorithm and split up the workload between N threads, where N = Runtime.getRuntime().availableProcessors(). I personally dislike static variables. To actually get good scaling you need to avoid all synchronization, which means that your threads should do work that doesn’t write to memory shared with other threads.
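
For example, a minimal sketch of that idea (the particle arrays here are made-up names, not Mike’s actual data): partition the indices into N disjoint slices and let each worker write only to its own slice, so no locking is needed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelUpdate {
    // Hypothetical particle state stored as plain arrays (nothing is shared for writing).
    static final int COUNT = 100_000;
    static final float[] x = new float[COUNT], y = new float[COUNT];
    static final float[] vx = new float[COUNT], vy = new float[COUNT];

    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(n);

        // Each task owns a disjoint slice of the arrays, so no synchronization is needed.
        int chunk = (COUNT + n - 1) / n;
        for (int t = 0; t < n; t++) {
            final int from = t * chunk;
            final int to = Math.min(COUNT, from + chunk);
            pool.execute(() -> {
                for (int i = from; i < to; i++) {
                    x[i] += vx[i];
                    y[i] += vy[i];
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

In a real game you’d keep the pool alive and hand it the slices every frame (e.g. via invokeAll, which blocks until all submitted tasks finish) rather than shutting it down each time.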

Examples of things that are threadable for an almost linear increase in performance if coded right (these are pretty meaningless to thread unless you have more than 1 core):

  • Particle physics simulations. Very easy if particles don’t interact with each other.
  • Physics/collision detection. Requires some clever tricks, and you may need to implement a less efficient algorithm that nevertheless scales much better with the number of threads.
  • AI, assuming each unit makes its own decisions without any coordination with other objects.
  • Skeleton animation.

Things that can be threaded not for performance but to maintain responsiveness/quality regardless of core count:

  • Network code, to avoid blocking the main thread, although there are non-blocking ways of handling networking.
  • Sound playing should always be done from a different thread, to prevent stuttering in case the game runs slowly and to prevent the main thread from getting stuck loading a file from disk during gameplay (see the sketch after this list).
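
For the sound case, something like this is enough: one dedicated audio thread, and the game loop only hands work off to it and returns immediately. A rough sketch using javax.sound.sampled; a real game would cache the Clips and pick a suitable thread priority.

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;

public class SoundPlayer {
    // Single dedicated audio thread; disk access and decoding never touch the game loop.
    private final ExecutorService audioThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "audio");
        t.setDaemon(true);
        return t;
    });

    public void play(File soundFile) {
        audioThread.execute(() -> {
            try (AudioInputStream in = AudioSystem.getAudioInputStream(soundFile)) {
                Clip clip = AudioSystem.getClip();
                clip.open(in);   // loads the whole sound, potentially blocking on the hard drive
                clip.start();    // playback itself runs on the sound system's own thread
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```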

What’s your current organization like for your data?

Are you modeling things w/ OOP for the “sea, waves, splashes, general collisions”?

If yes to the above, then restructuring things toward a more DOD (data-oriented design) approach and ditching OOP may give you a lot more performance, i.e. primitive arrays of data.

You mention “static arrays for those bits”. That kind of suggests you might already be organizing things as primitive arrays of data.
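
To make the OOP vs DOD distinction concrete, a rough sketch (the names are hypothetical, not from Mike’s game):

```java
// OOP-style: one object per splash, each allocated separately and scattered around the heap.
class Splash {
    float x, y, vx, vy, life;
}

// DOD-style: a single "structure of arrays" for all splashes. The data is contiguous,
// cache-friendly, and trivial to split into per-thread slices later on.
class Splashes {
    final int count;
    final float[] x, y, vx, vy, life;

    Splashes(int count) {
        this.count = count;
        x = new float[count];
        y = new float[count];
        vx = new float[count];
        vy = new float[count];
        life = new float[count];
    }

    void update(float dt) {
        for (int i = 0; i < count; i++) {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
            life[i] -= dt;
        }
    }
}
```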

If memory is not a problem, you can give each worker thread a copy of the relevant data and split the work up so it can be processed by multiple threads. Using a simple ArrayBlockingQueue (other options are available under java.util.concurrent) to transfer data between threads is usually fine, because the processing done in the worker threads is likely to be much heavier than any synchronization overhead of the concurrent data structure used.
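
A hand-rolled worker fed through ArrayBlockingQueues might look roughly like this; the WaveChunk type and the “simulation” are placeholders, just to show the hand-off between threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class WaveWorker {
    // Hypothetical work unit: a slice of the sea's height field.
    static class WaveChunk {
        final float[] heights;
        final int from, to;
        WaveChunk(float[] heights, int from, int to) {
            this.heights = heights; this.from = from; this.to = to;
        }
    }

    final BlockingQueue<WaveChunk> pending  = new ArrayBlockingQueue<>(64);
    final BlockingQueue<WaveChunk> finished = new ArrayBlockingQueue<>(64);

    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    WaveChunk chunk = pending.take(); // blocks until the game thread submits work
                    simulate(chunk);                  // the heavy part; queue overhead is negligible
                    finished.put(chunk);              // hand the result back to the game thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "wave-worker");
        worker.setDaemon(true);
        worker.start();
    }

    private void simulate(WaveChunk chunk) {
        for (int i = chunk.from; i < chunk.to; i++) {
            chunk.heights[i] *= 0.99f; // stand-in for the real wave update
        }
    }
}
```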

You can roll your own system with an Executor or you could even use fork / join in your main render thread. See this JGO thread:
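
As a rough illustration of the fork/join route (this isn’t from the linked thread, and the particle arrays are hypothetical): split the index range recursively until the chunks are small, then let the pool’s work stealing keep all cores busy.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class UpdateTask extends RecursiveAction {
    static final int THRESHOLD = 4096; // below this size, just do the work directly
    final float[] x, vx;
    final int from, to;

    UpdateTask(float[] x, float[] vx, int from, int to) {
        this.x = x; this.vx = vx; this.from = from; this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                x[i] += vx[i]; // leaf work: update our slice of the (hypothetical) particles
            }
        } else {
            int mid = (from + to) >>> 1;
            invokeAll(new UpdateTask(x, vx, from, mid),
                      new UpdateTask(x, vx, mid, to));
        }
    }

    // Demo: in a game this invoke() call would sit in the render/game loop.
    public static void main(String[] args) {
        float[] x = new float[100_000], vx = new float[100_000];
        ForkJoinPool.commonPool().invoke(new UpdateTask(x, vx, 0, x.length));
    }
}
```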

My instinct (based on no hard evidence whatsoever :wink: ) would be to go with a maximum of N-1. There are going to be lots of other things going on that you don’t directly control: OS stuff, VM stuff (GC, JIT?), audio playback, etc. Some (most?) of those are going to have priorities higher than your threads.

Note: availableProcessors returns the number of virtual (logical) CPUs.

If you’ve coded the OpenGL parts correctly, the driver will use a second thread to actually process OpenGL commands so the main thread isn’t blocked for too long. Even with that thread, sound threads, etc., it is faster to use all processors in my experience. Sound threads should already be running at a high priority to prevent stuttering, and they’ll spend most of their time waiting for the hard drive or idling with full buffers.

Your point being? Why would you not want to run your code on all logical CPU cores?

Like I said, instinctive thought! :wink: Still, it’d be interesting to measure that in terms of throughput and latency / response time.

What dull sound threads! ;D https://youtu.be/lK94qu1iObo

There’s quite a difference between, say, 2 full cores with hyperthreading and 4 full cores without hyperthreading. Ideally you want both pieces of information.

True. Could do a JNI library providing that information using libcpuid.
I just test-built it under Windows x86 and x86-64 flawlessly.
Ideally, we would do without any third-party library; under Windows there is at least GetLogicalProcessorInformation, but there is apparently no equally easy equivalent under Linux.
Anyone interested?

EDIT: For anyone wanting to try it out under Windows: https://github.com/httpdigest/jlibcpuid
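
For what it’s worth, under Linux a rough estimate of the physical core count can be scraped from /proc/cpuinfo by counting distinct (physical id, core id) pairs. A quick sketch, nowhere near as robust as libcpuid or hwloc; it returns -1 where those fields don’t exist (e.g. on non-x86 or non-Linux systems).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PhysicalCores {
    /** Counts distinct (physical id, core id) pairs in /proc/cpuinfo, or returns -1 on failure. */
    public static int linuxPhysicalCores() {
        try {
            Set<String> cores = new HashSet<>();
            String physicalId = "0";
            for (String line : Files.readAllLines(Paths.get("/proc/cpuinfo"))) {
                if (line.startsWith("physical id")) {
                    physicalId = line.substring(line.indexOf(':') + 1).trim();
                } else if (line.startsWith("core id")) {
                    cores.add(physicalId + "/" + line.substring(line.indexOf(':') + 1).trim());
                }
            }
            return cores.isEmpty() ? -1 : cores.size();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("Physical cores: " + linuxPhysicalCores()
                + ", logical: " + Runtime.getRuntime().availableProcessors());
    }
}
```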

I will be adding hwloc bindings to LWJGL soon.

If I remember correctly that has performance counter reading and a bunch of other good stuff.

From my experience of benchmarking, I found that going for more threads than processors gave higher and more reliable performance on Windows, but the “extra” threads hurt performance slightly on Linux. For example:

http://www.headline-benchmark.com/results/75cd37e4-1ebb-413d-aaa6-7defd480c05b/39377a27-a0af-4245-8b11-a0cf78202cf5

See how the performance is spotty until a large number of threads is being used. It seems the Windows scheduler often dumps threads from the same application onto the same processor…

My last try: Why? In what situation would you want to treat those differently?

My guess would be based on what I said above - throughput vs latency. I know various low-latency / soft-realtime audio projects, such as JACK (for which I wrote the Java bindings), recommend switching off hyperthreading in the BIOS. We won’t talk about their opinions on me using a garbage collecting VM to bind to JACK … :wink: Googling gaming, latency and hyperthreading turns up some similar opinions.

[quote=“theagentd,post:13,topic:55527”]
HT is different from having a separate core. The main reason is that full use of HT can only be made if you have relatively little code and a lot of parallel data to process. That’s why it’s more suited to video and audio work, assuming the code is suitably written. In most games it’s far better to have a separate core than HT, and this shows in most benchmarks, where performance is similar between Intel i5’s (4 cores, 4 threads) and i7’s (4 cores, 8 threads). If HT can’t be used by a game then it’s important for the game to know how many real processors it has rather than virtual processors.

That’s not to say all cores are equal. Intel chips have been better than AMD’s since Sandy Bridge and AMD looks unlikely to catch up any time soon. A 4-core i5 will usually beat an 8-core FX chip - I’ve also seen a budget 2-core G3258 outperform the FX chips. AMD chips are good value though, which is why people go for them. The big problem with having more cores is that they generate more heat, so you need to run them slower. The other problem for AMD is that their FX chips share certain components (i.e. the floating-point unit and instruction decoding) between each pair of cores, so you’re effectively running half as many cores as they state.

So is it important to know the difference between HT and real cores? Certainly not to me.

Mike

I’m looking forward to AMD’s Zen, which should be interesting; it has an equivalent of hyperthreading among other substantial improvements.

[quote=“mike_bike_kite,post:15,topic:55527”]
I know all that. I’ve done extensive benchmarking with Hyperthreading and have personally written code that gets 6.5x scaling on a hyperthreaded quad core. What I don’t know is why you would NOT want to use Hyperthreading if it’s available. Hyperthreading especially helps Java programs as we don’t have the same control over memory as C has. Let me rephrase the question: Why would I care if the cores are logical or physical? Why should that change the behavior of my program?

[quote=“theagentd,post:17,topic:55527”]
The programmer is trying to find out how many cores are available but is being told how many threads are available. If your program can use a new thread as easily as it can a core then it doesn’t matter, i.e. for most video processing, audio processing and your program. Most games don’t benefit from HT, so the program is being given misleading info by the call, i.e. twice as many cores as there really are. If you then kick off additional processes to make use of those “cores” then the processor has to swap those processes in and out. This may not be a huge overhead but it is an overhead.

The last time I did anything heavy in this area was ~5-6 years ago on 4-execution-port hardware. What I was seeing in my use cases was a hyperthreaded core hitting around 0.10-0.15 of the computational throughput of a full core. In some specialized cases which involved heavy tweaking it could be bumped up to around 0.30. These numbers are more or less in line with what others were seeing at the time.

I have no specific expectation about how these numbers might look on newer 6-execution-port hardware… too many variables, but as a guess it’s probably a bit higher. So attempting to develop a scaling scheme where you don’t know if the cores are all full or if half are virtual sucks, because it’s a huge difference in expected computational power. Of course I’m not blowing off adaptive scaling. Perhaps of more interest is that in the hyperthreaded case I switched to using thread affinities, which helped there, while blindly doing the same hurt in the all-full-core case.

But really I don’t understand your point. I’m saying that having more (and accurate) information is a better situation…even if any individual never makes use of it.

[quote=“mike_bike_kite,post:18,topic:55527”]
With HT, each core can have 2 threads loaded into two different sets of registers at the same time and work on either one of them. One of the main points of HT is to AVOID the overhead of swapping threads. The other is to allow more efficient use of the CPU’s hardware, since it can sometimes execute two instructions in parallel. That most games don’t benefit from HT is a symptom of them not being able to utilize multiple threads efficiently in the first place, something that as far as I know only the Battlefield series does at all, and those games get HUGE wins from HT. My old laptop with a hyperthreaded dual-core gained around 50% higher framerates from HT as it was CPU limited. The only time Hyperthreading would hurt performance is if you’re thrashing the CPU cache. In that case doing so from twice as many threads may actually make things worse, but that very rarely happens if you know what you’re doing.

Scaling in WSW:


Only physical cores:
1 core: 12 FPS (1x)
2 cores: 22 FPS (1.83x)
4 cores: 36 FPS (3x)

Using Hyperthreading:
1 core: 15 FPS (1.25x)
2 cores: 27 FPS (2.25x)
4 cores: 43 FPS (3.58x)

In some very specific cases the scaling is far better than that, but on average you’re right. HT helps a lot when you have a lot of cache misses or branch prediction failures, and cache misses are much more abundant in Java than in C, so Hyperthreading helps hide those problems.

If AMD’s next architecture implements a similar technology, that could actually do wonders for them, considering how hard their shitty memory controller holds them back. Or it won’t. Who knows?

I’m not entirely sure what you mean, but the thread scheduler knows the difference between virtual and physical cores and prefers physical cores.

We have a name for unused information: useless information. I am asking you because I genuinely can’t see when you would ever be able to change anything to take that into consideration.