Threaded game engines

We had this discussion before and many seemed to believe that threading a game engine in Java was a waste of time and developers should stick to the standard single-loop model.

Since my game is getting a bit further along and pushing things a lot harder, I thought I would share some numbers to get the discussion going again for those who are performance junkies like myself.

Threaded:
FPS: 102
Engine: 62
Verts: 201,492

Non-Threaded:
FPS: 81
Engine: n/a
Verts: 201,492

Both were run on an AMD2600 with a GeForce4 MX, looking at the same scene with the same game options.

When running with threads I am getting about 26% better performance (102 vs. 81 FPS, i.e. 126% of the non-threaded frame rate), which is well beyond my expectations. That really illustrates just how much time is wasted in a single game loop.
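For reference, the size of the gap follows directly from the two FPS figures quoted above. A minimal sketch (the class and method names are made up for illustration):

```java
// Relative frame-rate gain computed from the two FPS figures quoted above.
final class FpsGain {
    /** Returns how many percent faster `fast` is than `slow`. */
    static double percentGain(double fast, double slow) {
        return (fast / slow - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double ratio = 102.0 / 81.0;          // threaded FPS / non-threaded FPS
        double gain = percentGain(102, 81);   // same comparison as a percentage gain
        System.out.println("ratio = " + ratio + ", gain = " + gain + "%");
    }
}
```

The ratio is roughly 1.26, so the threaded run is at about 126% of the non-threaded frame rate, which is a gain of roughly 26%.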

Here are screen shots of the tests as well as a preview of the in-cockpit view of “Exigent” (tentatively named):

Threaded:

http://www.imagehosting.us/himages/ihnp-226544thumb.jpg

Non-Threaded:

http://www.imagehosting.us/himages/ihnp-226541thumb.jpg

Let the debate commence…or agreement :slight_smile:

The question is: what is your single-threaded version wasting its time on? In theory it should be flat out 100% CPU at all times if you’re performance testing…

Also it’s worth noting that you use JOGL which is already a bit strange, threadwise. If you were doing it with LWJGL I suspect there may be a subtle difference.

Cas :slight_smile:

It's not doing anything more than the threaded version. In fact it is actually doing less (though only slightly) and never has to wait on synchronization blocks.

When in non-threaded mode, the renderer loop calls the same dispatcher/validator that the engine thread calls when it's in threaded mode. That is why the game functions the same and switching the modes is a flip of a final static boolean (at compile time).
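A rough sketch of what such a compile-time switch could look like. Every name here (GameLoopSketch, THREADED, Dispatcher, Renderer) is hypothetical and not taken from the actual engine, and the synchronization of shared state between the two threads is left out entirely:

```java
// Hypothetical names throughout; a sketch of the mode switch, not the actual engine.
final class GameLoopSketch {
    // Compile-time constant: flip it and recompile to change modes.
    static final boolean THREADED = true;

    /** Stand-in for the shared dispatcher/validator. */
    interface Dispatcher { void dispatchAndValidate(); }

    /** Stand-in for the renderer. */
    interface Renderer { void renderFrame(); }

    static void run(final Dispatcher engine, Renderer renderer, int frames) {
        Thread engineThread = null;
        if (THREADED) {
            // Threaded mode: the engine dispatches on its own thread.
            engineThread = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    engine.dispatchAndValidate();
                }
            }, "engine");
            engineThread.start();
        }
        for (int i = 0; i < frames; i++) {
            renderer.renderFrame();
            if (!THREADED) {
                engine.dispatchAndValidate(); // same call, made from the render loop
            }
        }
        if (engineThread != null) {
            engineThread.interrupt(); // ask the engine loop to stop
            try {
                engineThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Because both modes call the same dispatchAndValidate(), the game logic stays identical and only the thread it runs on changes.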

They both do the same amount of work, basically issuing timed callbacks to any animation handlers that have expired and recalculating invalidated geometry resulting from those callbacks, input handling, AI, collision detection, etc… The difference is when they can do the work. In the single-threaded version, any IO blocking blocks everything; in the threaded version, the engine gets a chance to reclaim that time and put it to use. The more geometry I add (thus more IO), the bigger the gap seems to become between threaded and non-threaded.

I am partially wondering if this gap has something to do with Java 1.4.2 though. When I was using 1.5 the threaded version was still faster, but the gap was only about 15% (115% of the non-threaded rate), though there wasn’t as much geometry so the comparison might be invalid. Better performance in the NIO and JNI calls in 1.5? Not sure.

This could be related to JOGL. I may make the OGL layer swappable later on; it will be interesting to see if LWJGL has similar results.

I don’t get where your game could be blocking. What I/O could you be doing that blocks?

Cas :slight_smile:

A few things I can think of that would be IO or synchronization bound:

Inside the rendering loop:

  • Transferring large amounts of vertices
  • Replacing textures very often (20-30 times a second in my case for some of them)
  • Use of glGet for retrieving the view frustum
  • The buffer swap (not directly blocking, but it can result in GL calls right after it being blocked until it’s finished, especially with smaller AGP apertures or crappy video cards…like mine)

Inside the engine loop, all operations are blocking because it’s not writing to a buffer like the GL calls (it’s just executing necessary code).

I believe that the gains are realized when the GL code is causing a wait of some kind (those things above), because the engine code can get more dispatching/calculating done. If the engine can dispatch some of its animation callbacks (resulting in potentially several thousand vertices being transformed, bounding boxes calculated, collisions determined, AI, etc) during the GL synchronizations, then the resulting effect would be pretty dramatic, because those objects will now be ready to push down the wire on the next frame instead of having to be calculated at the start of the next frame.
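The overlap argument above can be illustrated with a toy model. This is only a sketch under loud assumptions: Thread.sleep stands in for a GL call that blocks, a counting loop stands in for engine work, and all names are invented. Whether real GL calls block like this is exactly what is in dispute here, so the model shows the mechanism, not the cause.

```java
// Toy model only: Thread.sleep stands in for a blocked GL call and a
// counting loop stands in for engine work. None of this is the poster's code.
final class OverlapDemo {
    static final int FRAMES = 20;
    static final long GL_WAIT_MS = 5;

    static volatile long sink; // keeps the JIT from discarding the busy loop

    static void engineWork() {
        long acc = 0;
        for (int i = 0; i < 2_000_000; i++) acc += i % 7;
        sink = acc;
    }

    static void blockedGlCall() {
        try {
            Thread.sleep(GL_WAIT_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** One loop: the wait and the work happen back to back. */
    static long singleLoopNanos() {
        long start = System.nanoTime();
        for (int f = 0; f < FRAMES; f++) {
            blockedGlCall();
            engineWork();
        }
        return System.nanoTime() - start;
    }

    /** Two threads: engine work proceeds while the render thread waits. */
    static long threadedNanos() {
        long start = System.nanoTime();
        Thread engine = new Thread(() -> {
            for (int f = 0; f < FRAMES; f++) engineWork();
        }, "engine");
        engine.start();
        for (int f = 0; f < FRAMES; f++) blockedGlCall();
        try {
            engine.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        System.out.println("single loop: " + singleLoopNanos() / 1_000_000 + " ms");
        System.out.println("threaded:    " + threadedNanos() / 1_000_000 + " ms");
    }
}
```

In this model the single loop pays for the waits plus the work, while the threaded version pays for roughly whichever of the two is longer. If nothing in the render path really waits, the two timings converge and threading only adds overhead.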

Nice work Vorax.
I really would like to see results from others to make sure that this can be easily repeated.

I’m still not entirely convinced that multi-threading is the way to go for games in all circumstances.

None of that is actually “I/O”, in the context of Threads. In fact threading will normally slow these operations down considerably from cache pollution. glGet might conceivably block but it only needs to be done once per frame typically and it’s very fast really. Not anything like enough to give you a 25% framerate boost.

That’s dead right about the buffer swap - in fact this is probably where you’re losing all your efficiency. The buffer swap should be the very last thing you do before you start doing “logic” again. During “logic” there should be no need to call any GL API commands. Not that all GL commands actually are blocked - clientside GL commands can be executed without any trouble.

If vsync is enabled this command may block, though, and that is certainly somewhere you might waste a considerable number of cycles doing nothing, unless the vsync has been implemented correctly.

Cas :slight_smile:
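The frame ordering described above (all GL work, then the swap, then logic with no GL calls) can be sketched as follows. The Stage interface and the class name are hypothetical; real code would call into JOGL or LWJGL where the lambdas are:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; real code would call into JOGL or LWJGL here.
final class FrameOrderSketch {
    interface Stage { void run(); }

    /** Runs one frame in the order described: draw, swap, then logic with no GL calls. */
    static void frame(Stage issueGlCommands, Stage swapBuffers, Stage gameLogic) {
        issueGlCommands.run(); // all GL work for this frame
        swapBuffers.run();     // last GL-related call of the frame
        gameLogic.run();       // CPU-only work; no GL calls that could stall behind the swap
    }

    public static void main(String[] args) {
        List<String> order = new ArrayList<>();
        frame(() -> order.add("draw"), () -> order.add("swap"), () -> order.add("logic"));
        System.out.println(order); // prints [draw, swap, logic]
    }
}
```

The point of the ordering is that the logic phase gives the driver time to finish the swap before the next GL command is issued.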

I agree those wouldn’t be nearly enough to explain a 25% boost; I only thought they may be part of it along with the buffer swap. I can’t see how transferring megs of data around per second wouldn’t incur some waste in CPU time. Unless there is some magic in JOGL and NIO that I don’t know about. :slight_smile:

The non-threaded version does the buffer swap, then makes the call to the dispatcher for calculations, just as it should to avoid the potential swap blocking.

I just did another experiment that proves there are definite waits going on within the JOGL and/or NIO calls.

I removed all calculations and reduced the game to nothing but scene rendering. Just geometry pumping.

Non-Threaded:
FPS 187

Threaded:
FPS 187

This means there have to be lost CPU cycles that can only be regained by using another thread. In both cases I removed the call to begin dispatching for calculations. If there were no loss, then the FPS rates would be equal when the calculations were done as well, but they aren’t.

Time to wheel out that profiler and find out what’s blocking.

Cas :slight_smile:
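Short of a full profiler, one low-tech way to see where frame time goes is to accumulate wall-clock time per phase of the loop. A minimal sketch, assuming a hypothetical PhaseTimer helper (not part of JOGL, LWJGL, or the engine discussed here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JOGL or the poster's engine.
final class PhaseTimer {
    private final Map<String, Long> totals = new LinkedHashMap<>();

    /** Runs the phase and adds its wall-clock time to the running total for that name. */
    void time(String phase, Runnable body) {
        long start = System.nanoTime();
        body.run();
        totals.merge(phase, System.nanoTime() - start, Long::sum);
    }

    /** Total nanoseconds recorded for a phase, or 0 if it never ran. */
    long totalNanos(String phase) {
        return totals.getOrDefault(phase, 0L);
    }

    /** Prints the accumulated time per phase in insertion order. */
    void report() {
        totals.forEach((phase, nanos) ->
            System.out.println(phase + ": " + nanos / 1_000_000.0 + " ms"));
    }
}
```

Usage would be along the lines of timer.time("swap", () -> ...) around each phase, then timer.report() every few hundred frames. One caveat: GL commands are buffered, so a stall may show up in a later call than the one that caused it.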