Hello. The game loop of any Java game is a thread in itself. Since processors are getting more and more cores, would it be an idea to divide your update and render methods into several threads, so that all cores can be fully utilized?
no
Only if you’re software rendering.
Well… I’ve successfully done it and achieved a pretty tidy factor of speed increase… but as usual the thing with threads is, if you actually have to ask this sort of question then the answer for you, at this time, is “no”.
Cas
Well, I understand that it can be done in an update loop if one knows what one’s doing and not messing up the data. But what about a render loop? Theoretically, if you have an image you want to display on the screen, would it yield an increase in optimization if the image is divided into four squares and four threads gets one piece each, which they in turn render? Or will it all be clogged up in some bottleneck on its way to the display, rendering the effort useless?
Don’t forget your 4 CPU threads are feeding 1 GPU.
I’d be curious if Cas would like to share more info on what aspects of his engine efforts were split between threads.
Typically, an update / render loop should be one thread. Where multiple threads really come into play is marshalling data to and from the network and, depending on the use case, journalling (which can be useful for debugging / replaying game sequences, especially network data). I’d say look at the Disruptor architecture for a solution and technical info in this regard:
http://lmax-exchange.github.io/disruptor/
For game engine use cases I can see streaming data to the GPU (say, new texture or geometry data for large levels) being facilitated via threads on the CPU and use of the GL threading API.
In my OpenGL ES video engine for Android I handle encoding and rendering to the screen in separate threads that share two GL contexts for texture / FBO data. This allows 30FPS+ of rendering to the screen and the encoding at the same time where if these operations occurred sequentially in the same CPU thread then performance for both drops to ~15FPS.
As a baseline you’ll want to examine the threading API of OpenGL ES 3.0 for the essentials of coordinating GL operations. ARM has a good tutorial:
http://malideveloper.arm.com/downloads/deved/tutorial/SDK/android/1.6/thread_sync.html
In general though, as has been mentioned already: unless you have a specific use case the answer is usually no; a single thread for update / render.
update() and render() run on the same thread, but their implementations can do a fork-join to spread some specific heavy workload over all available cores.
Examples: AI / pathfinding for N units, filling multiple VBOs with vertex data.
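As a hypothetical sketch of that fork-join approach, here is one way to spread a per-frame workload over all cores with Java’s ForkJoinPool. The array and the per-element update are stand-ins for real per-unit AI / vertex-filling work; all names are invented for illustration:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelUpdate {
    // Hypothetical per-unit state; stands in for AI / pathfinding / vertex data.
    public static final float[] positions = new float[100_000];

    static class UpdateTask extends RecursiveAction {
        static final int THRESHOLD = 10_000; // below this, just do the work inline
        final int from, to;
        UpdateTask(int from, int to) { this.from = from; this.to = to; }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                for (int i = from; i < to; i++) {
                    positions[i] += 1.0f; // stand-in for the real update logic
                }
            } else {
                // Split the range in half and process both halves, possibly in parallel.
                int mid = (from + to) >>> 1;
                invokeAll(new UpdateTask(from, mid), new UpdateTask(mid, to));
            }
        }
    }

    public static void update() {
        // The common pool sizes itself to the number of available cores.
        ForkJoinPool.commonPool().invoke(new UpdateTask(0, positions.length));
    }
}
```

The key point is that update() is still called from the single main loop thread; the parallelism lives entirely inside it and has joined completely by the time update() returns.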
In my case the sprite engine I use animates, sorts, transforms, and writes out all the sprite vertex data using a thread-per-core and achieves a very tidy speedup as a result (when we’re talking tens of thousands of sprites). I do a few other things using multithreading such as particles and emitters, but unfortunately my current game requires deterministic processing so I couldn’t do AI on multiple threads (which would have been great).
Cas
Chatting a bit about data synchronization may give some bread crumbs to the folks interested in multithreading CPU side.
To achieve synchronization do you use double the memory, keeping active render buffers and buffers to fill, and swap them when ready with an AtomicReference, etc.?
I’ve found that for general synchronization use cases in game / video engine dev, CAS (compare-and-swap) operations are sufficient CPU side without having to go full bore with a ring buffer or the direction the Disruptor takes for really high-performance throughput. Because one is dealing with full buffers of data rather than many discrete individual events / data updates, CAS operations are plenty quick to handle synch for typical game engine use cases.
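A minimal sketch of that double-buffer-and-swap idea, assuming one worker thread filling buffers and one render thread reading them (names invented; a real engine would likely want a third buffer or a hand-back protocol so the worker never overwrites a buffer the renderer is still mid-read on):

```java
import java.util.concurrent.atomic.AtomicReference;

public class DoubleBuffer {
    private final float[] bufferA = new float[1024];
    private final float[] bufferB = new float[1024];
    // The render thread only ever reads whatever is currently published here.
    private final AtomicReference<float[]> ready = new AtomicReference<>(bufferA);
    private float[] filling = bufferB;

    // Worker thread: fill the back buffer, then publish it atomically.
    public void fillAndPublish(float value) {
        for (int i = 0; i < filling.length; i++) {
            filling[i] = value; // stand-in for real buffer-filling work
        }
        // Publish the freshly filled buffer and reclaim the old front buffer.
        filling = ready.getAndSet(filling);
    }

    // Render thread: pick up the most recently published buffer, oblivious
    // to any threading going on behind the scenes.
    public float[] current() {
        return ready.get();
    }
}
```

The AtomicReference swap is the only synchronization point; the render loop just chugs along reading current() each frame.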
The Disruptor architecture though is pretty badass because it allows creation of a producer -> multiple-consumer dependency graph; i.e. really useful for network marshalling and journalling with many discrete events / data leading to one business logic / game logic thread, then back out via another Disruptor chain. And that is a basic example.
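Underneath, the Disruptor is, as admitted later in this thread, a fancy ring buffer: fixed-size preallocated slots gated by sequence counters. A stripped-down single-producer / single-consumer sketch of just that underlying idea (all names invented, nothing like production-ready, no batching or wait strategies):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SpscRing {
    private final long[] slots;
    private final int mask; // capacity must be a power of two for cheap modulo
    private final AtomicLong head = new AtomicLong(); // next slot to write
    private final AtomicLong tail = new AtomicLong(); // next slot to read

    public SpscRing(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Producer thread only. Returns false when the ring is full.
    public boolean offer(long value) {
        long h = head.get();
        if (h - tail.get() == slots.length) return false; // full
        slots[(int) (h & mask)] = value;
        head.lazySet(h + 1); // publish the new head to the consumer
        return true;
    }

    // Consumer thread only. Returns null when empty.
    public Long poll() {
        long t = tail.get();
        if (t == head.get()) return null; // empty
        long v = slots[(int) (t & mask)];
        tail.lazySet(t + 1); // free the slot for the producer
        return v;
    }
}
```

The real Disruptor generalizes this to dependency graphs of consumers and avoids the boxing here, but the sequence-counter idea is the same.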
I thought we were talking about rendering loop. Everything changes if you’re talking about the engine as a whole. If you need disruptor, then you’re over-engineering.
Yeah, well, I was just curious whether multithreading is something that’s done in Java games, whether there’s an advantage to be gained, or whether it’s handled in the JVM itself. It makes sense, since four threads each producing a large prime and printing it to the console will be faster than one thread producing four large primes…
I wouldn’t know a disruptor if it came up and bit me on the arse, but I think what I’m doing here for performance is basically “rendering” (less the particle logic and such).
Cas
I’m speaking kinda generally here. You might need something like the Disruptor if you’re doing an MMO. You might use multiple threads for rendering… a case that comes to mind is software occlusion queries (I don’t really think of that as rendering either… just a rendering-related task, like simulation). Generally I’d say you want a thread to be as independent a task as possible… and if you ever start thinking about moving beyond single-producer/single-consumer… you might want to step back, re-think, and make sure that’s what you really-really-really want to do.
Yes, yes, yes; a little less misdirection… :-* Rendering is the main discussion… re:
I posited a direction of least resistance in regard to how one can handle synchronizing a multithreaded CPU-side buffer-filling mechanism as you outlined. For others, and heck even myself, are you interested in commenting on the synch mechanisms you use? I.e. two buffers, swapping an AtomicReference when ready, while the render loop / thread just chugs along picking up the current buffers stored in the AtomicReference, oblivious to any multithreading going on to fill those buffers. The question was generally how you are solving the synch issues between the worker threads and the render thread, since it’s not a fork / join type task. This will help others think about the problem and come up with a solution.
While I don’t have an optimized sprite engine, I do have a general computing demo (grid counting) implemented in OpenCL and in various CPU implementations from serial to multithreaded, using a similar mechanism to the one I described above. The multithreaded CPU version gives about a 6x speedup over the single-threaded solution when mapped to approximately the same number of threads as cores on the CPU. Is that about what you are seeing? It’s nice because I can use a software OpenCL implementation to test whether my Java multithreaded implementation is efficient, and it’s pretty close.
So was I here, on general patterns. The Disruptor pattern works well for high-throughput discrete event passing. The way LMAX uses it for even higher performance is that the slots in the ring buffer are actually byte arrays instead of objects. The reason I’m keen on the Disruptor architecture (let’s be honest here; it’s a fancy ring buffer; are we scared of talking about ring buffers because there’s a name attached to a particular use case / implementation?) is that what I’m doing is highly event-driven via the EventBus pattern (though a different implementation to Otto or Guava’s), meaning it’s a good fit for my architecture in general for data coming on and off the wire. Stop me now before I talk about the “DisruptorBus”. I’d have dug into this area a lot more already if it ran on Android and didn’t depend on sun.misc.Unsafe aspects, which aren’t supported there. Like all things this is infrastructure for particular use cases / problems at hand (certainly leaning toward MMO scale, as Roquen mentioned), and as an engine developer it potentially allows me to provide APIs to general developers, you know, game developers, who won’t have to think about the complexities of threading under the hood.
My uses of CPU side multithreading presently are much more close to the complete buffer filling and swap scenario as originally discussed.
I use a circular buffer of VBOs and just pick the next one each frame. The only synchronising is done with OpenGL depending on what version of OpenGL I’ve got.
But basically it’s
glMapBufferRange(... GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT ...);
Using a circular buffer of several VBOs means I don’t seem to fall foul of synchronisation issues. Ideally I’d use the newfangled way of doing things where you permanently map the buffer and then manage it yourself, but that’s only available on pretty new GPUs and I’m sticking to OpenGL 3.0 as a baseline.
There’s no other synching going on anywhere else - it’s all just parallel workloads made by chopping stuff up into equal size chunks, one for each core. It’s not the very most efficient way to do things but it’s easy and simple and gives great results for the effort.
Cas
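In outline, the round-robin VBO scheme described above might look something like this. The class and buffer ids are invented for illustration, and the actual GL calls are shown only in comments since they need a live context:

```java
public class VboRing {
    private final int[] vboIds;
    private int frame;

    public VboRing(int[] ids) {
        // In a real engine these ids would come from glGenBuffers.
        vboIds = ids;
    }

    // Each frame just advances to the next buffer in the ring. With enough
    // buffers in flight, the GPU has finished with a VBO by the time we come
    // back around to it, so no explicit fence or lock is needed even when
    // mapping with GL_MAP_UNSYNCHRONIZED_BIT.
    public int next() {
        int id = vboIds[frame++ % vboIds.length];
        // Roughly, the caller would then do:
        // glBindBuffer(GL_ARRAY_BUFFER, id);
        // ByteBuffer bb = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
        //     GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
        return id;
    }
}
```

The trade-off is memory (several copies of the buffer) for the complete absence of CPU-side synchronisation, which matches the “easy and simple, great results for the effort” description.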
This sounds similar to the generally naive approach I take with the render / encoder threads in my video engine for the OpenGL ES 2.0 implementation except replace VBO w/ FBO; just use a blocking queue that the encoder thread waits upon. It works in practice on every device I’ve tested, but could potentially be unsafe. I’ll be beefing things up a bit w/ OpenGL ES 3.0 threading API quite likely soon.
What you describe sounds fine for one worker thread filling buffers and one render thread; likely the worker thread will always have the next buffer full before the render thread renders it. I’m trying to figure out the multiple-producer / single-consumer scenario you mention without synch. Say 100k sprites and 5 worker threads each dealing with 20k sprites, filling a single buffer (sync issues). Or is it 5 worker threads each filling their own buffer of all 100k sprites into a unique VBO in the circular buffer of VBOs (starvation in rendering, or skipping to the most recent filled buffer when the render occurs, i.e. wasting CPU on worker threads)?
Obviously it’s working and you see an improvement. Just curious and all…
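One plausible answer to the multiple-producer question, and a common pattern for sprite batching, is that there is no per-element synchronization at all: each worker writes a disjoint slice of the same vertex array, and the only sync point is a join before the buffer is handed to GL. A sketch under that assumption (names, thread counts, and the per-element write are all invented):

```java
import java.util.concurrent.CountDownLatch;

public class ChunkedFill {
    // Fill `out` in parallel, one disjoint slice per worker. No locks are
    // needed because the slices never overlap; the latch is the single
    // synchronization point before the buffer goes to the GPU.
    public static void fill(float[] out, int workers) {
        CountDownLatch done = new CountDownLatch(workers);
        int chunk = (out.length + workers - 1) / workers; // ceiling division
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(from + chunk, out.length);
            new Thread(() -> {
                for (int i = from; i < to; i++) {
                    out[i] = i * 2.0f; // stand-in for sprite transform / vertex write
                }
                done.countDown();
            }).start();
        }
        try {
            done.await(); // join: all slices written before upload
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real engine the threads would come from a persistent pool rather than being created per frame; the disjoint-slice idea is the part that removes the need for synch.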
Sprite engine render method:
http://pastebin.java-gaming.org/4816e91800a12
VBO class:
http://pastebin.java-gaming.org/816e1009a0211
Threadulator:
http://pastebin.java-gaming.org/16e101a920115
It’s nothing fancy… took me weeks to perfect it mind. Maybe someone can find some bugs or ways to make it faster.
Cas
(Disclaimer: I hadn’t heard of a ForkJoinPool until now)
The [icode]ForkJoinPool[/icode] just blew my mind! Now I just want to find a place where I can add something similar to your Threadulator!
CopyableCougar4
VBO class, line 169: check for availability