Render/Update Threads

Except that most games don’t do this, and most AI algorithms are specifically designed (or tweaked from their original, non-gaming, versions) not to do this.

So, “needs” is not quite the right word :).

I’m a fan of QoS processing for games, so I rather like the AI algorithms designed to be multipass or time-limited (i.e. so that you can control, to some extent from within the main game loop, how much time the AI code will spend executing, responding in real time to peaks and troughs in frame creation time).
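For illustration, here’s a minimal sketch of a time-limited AI pass, assuming a hypothetical Agent interface whose think() method does one small, bounded slice of work; the main loop hands update() whatever budget the current frame can spare:

import java.util.List;

interface Agent {
  // Performs one small, bounded slice of thinking.
  void think();
}

class TimeSlicedAI {
  private final List<Agent> agents;
  private int next; // resume position, so no agent is permanently starved

  TimeSlicedAI(List<Agent> agents) {
    this.agents = agents;
  }

  // Called once per frame with a budget chosen from the frame's spare time.
  void update(long budgetMillis) {
    long deadline = System.currentTimeMillis() + budgetMillis;
    int processed = 0;
    while (processed < agents.size()
        && System.currentTimeMillis() < deadline) {
      agents.get(next).think();
      next = (next + 1) % agents.size();
      processed++;
    }
  }
}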

To be honest, Java’s threading APIs are far too weak to do what you suggest IMHO, and the same is true for vanilla native threading, unless you don’t mind writing a different version of your game for each different OS version (cos they keep changing from version to version). This is one of the other reasons why single-threaded games dev is the norm: threading APIs are very poor in general, in terms of expressive power, with far too little control of the scheduler.

For instance, there’s no API call for me to say “The AI thread should be given between X ms and Y ms every Z ms, except when the frame-render thread has used more than P ms in the current frame, or more than Q ms total over the 3 previous frames”. (Note: this is the kind of logic that the scheduler has internally; it’s just that the OS doesn’t expose it to you as a programmer. There are some funky cool OSes that do expose it, in great detail, and I love them, but… they’re research OSes, or ultra-niche embedded systems, etc.)

Which is exactly the kind of logic you need. The expressivity of plain threading APIs is way too poor. There are plenty of threading APIs that do provide this level of support, for specialized niches, but IIRC none of them are mainstream yet (please someone correct me if I’m wrong here; it’s been a while since I checked what the standard threading APIs offered). Given the amount of interest in recent years in massively improving Linux’s threading systems, there’s a possibility that some big leaps could be about to appear. But… who cares until it’s also there on Windows?
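Just to make that concrete: here’s roughly what hand-rolling such a policy inside your own loop looks like, since the OS won’t do it for you. Every name and constant is invented for the example, and the “Q ms over the 3 previous frames” clause would just mean keeping a short history of render times:

class FrameBudget {
  static final long PERIOD_MS = 16;     // Z: the loop period
  static final long MIN_AI_MS = 2;      // X: minimum AI slice per period
  static final long MAX_AI_MS = 6;      // Y: maximum AI slice per period
  static final long RENDER_CAP_MS = 12; // P: renderer budget per frame

  // Given how long rendering took this frame, decide the AI slice.
  static long aiBudget(long renderMs) {
    if (renderMs > RENDER_CAP_MS) {
      return 0; // renderer blew its budget: skip AI this frame
    }
    long spare = PERIOD_MS - renderMs;
    return Math.max(MIN_AI_MS, Math.min(MAX_AI_MS, spare));
  }
}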

Like I keep trying to explain to people over and over again, threads are an operating system concept designed to keep processors busy when they are waiting for I/O to complete. They’re not for scheduling tasks, for which you either need: an RTOS and appropriate APIs (JSR-001, the Real-Time Specification for Java?); or algorithms designed to work in sequential steps.

Give the AI to a thread and the next thing you’ll find is that the gidrahs just sit around doing nothing because the AI thread is mysteriously never scheduled to run.

Cas :slight_smile:

[quote]Like I keep trying to explain to people over and over again, threads are an operating system concept designed to keep processors busy when they are waiting for I/O to complete.
[/quote]
Right - because multi-CPU machines are a figment of everyone’s imagination, and no one would ever write an application to take advantage of them. cough cough

Cas, I think you’re living too much in the past. Take a look at http://www.gotw.ca/publications/concurrency-ddj.htm. (Btw, in case you didn’t know, Herb Sutter has some pretty serious creds.) CPU performance just isn’t scaling like it used to. It’s why ALL the big names (AMD, Intel, IBM, Sun, etc.) have already gone multicore. Sun has special plans for massive parallelism via CMT (Niagara).

This isn’t way off in the future, Cas. It’s already happened. You’ll probably be programming on a multicore machine in two years.

God bless,
-Toby Reyelts

What WORA programming is about is guarantees of service, and until Threads are guaranteed to execute precisely as much as we want them to, they’re no use to those of us who want to write code that runs reliably everywhere. The concept we want is not Threads, it’s “Tasks”, or whatever the realtime boys call 'em.

Case in point: in a LWJGL app, merely changing the main thread priority to Thread.MAX_PRIORITY causes a total freeze of the display under some ATI drivers because the ATI drivers stop getting any time (inexplicably).

Then of course there’s the whole synchronisation and complexity issue.

But who am I to tell anyone, eh, I’ve only written a few games :wink:

Cas :slight_smile:

[quote]What WORA programming is about is guarantees of service, and until Threads are guaranteed to execute precisely as much as we want them to, they’re no use to those of us who want to write code that runs reliably everywhere.
[/quote]
That’s bullhockey Cas. I write multi-threaded code that executes on many platforms (Windows, Linux, Solaris, AIX, and Irix to name a few).

[quote]Case in point: in a LWJGL app, merely changing the main thread priority to Thread.MAX_PRIORITY causes a total freeze of the display under some ATI drivers because the ATI drivers stop getting any time (inexplicably).
[/quote]
Wow, you set a Thread to the highest possible priority, and then you’re surprised when it starts starving out other threads? First rule of thumb, Cas: you are almost always doing the wrong thing if you are setting a thread’s priority. You should manage allocation of time to threads through synchronization mechanisms, not through thread priorities.
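To make that concrete, here’s a minimal sketch (all names invented) of pacing an AI thread with a synchronization primitive instead of priorities: the render thread releases one permit per frame, and the AI thread does one unit of work per permit, so it can never run ahead of the renderer or starve it:

import java.util.concurrent.Semaphore;

class PacedAI implements Runnable {
  private final Semaphore frameTick = new Semaphore(0);

  // Called by the render thread once per frame.
  void onFrameRendered() {
    frameTick.release();
  }

  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        frameTick.acquire(); // block until the renderer ticks
        updateAI();          // hypothetical per-frame AI step
      }
    } catch (InterruptedException e) {
      // shutting down
    }
  }

  private void updateAI() { /* ... */ }
}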

[quote]Then of course there’s the whole synchronisation and complexity issue.
[/quote]
Yes, multithreading is not simple, and we’re seeing the proliferation of tools to deal with that complexity - new languages (Erlang), new language tweaks (Flow Java), and new libraries (java.util.concurrent). Aside from that, if you think people are going to prefer your simpler non-concurrent game, when it’s only using one-half or one-quarter of the processing power other more-complex concurrent games are using, I have a news flash for you.
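As one small example of what java.util.concurrent gives you, here’s a hedged sketch (Entity is invented for the example) that fans entity updates out across however many processors the machine has and waits for them all before the frame continues:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface Entity {
  void update(); // assumed independent of all other entities
}

class ParallelUpdate {
  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  void updateAll(List<Entity> entities) throws InterruptedException {
    List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
    for (final Entity e : entities) {
      tasks.add(new Callable<Object>() {
        public Object call() {
          e.update();
          return null;
        }
      });
    }
    pool.invokeAll(tasks); // blocks until every update has finished
  }
}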

[quote]But who am I to tell anyone, eh, I’ve only written a few games :wink:
[/quote]
Who am I to talk about multithreading? I’ve only been doing it professionally for a decade on apps ranging from medical imaging to remote app servers. Oh, and another tip, Cas: realtime scheduling is about anything but full-blown performance. It’s totally about QoS. Most apps that require realtime scheduling (typically embedded devices) don’t need any kind of real performance (’cause they’re running on a 2 MHz microcontroller as it is).

God bless,
-Toby Reyelts

I thought it was apropos that Slashdot just posted this to its front page: http://slashdot.org/article.pl?sid=05/01/10/1839246&tid=142&tid=118

God bless,
-Toby Reyelts

Aren’t we just talking about different scales of complexity here?

Multithreading is positively worth it if you’re going to get multiple CPUs involved. However, for the style of games Cas is talking about, that’s simply not the target market. If you’re using a single-processor machine, threading isn’t going to give you a significant performance increase, since there will always be overhead in swapping threads. By segmenting your main loop into sections, all you’re actually doing is creating a specialized scheduler yourself instead of relying on the more generic one (something like the sketch below).
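For instance, a hand-rolled “scheduler” of that sort might look like the following; every method here is an illustrative stub, and the point is that the domain knowledge lives in the counters rather than in the OS:

class SegmentedLoop {
  boolean running = true;
  int frame = 0;

  void run() {
    while (running) {
      readInput();             // every frame
      updatePhysics();         // every frame
      if (frame % 2 == 0)
        updateAI();            // AI is happy at half the frame rate
      if (frame % 10 == 0)
        updatePathfinding();   // expensive and latency-tolerant
      render();
      frame++;
    }
  }

  void readInput() {}
  void updatePhysics() {}
  void updateAI() {}
  void updatePathfinding() {}
  void render() {}
}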

You could rely on the thread management of the OS using its heuristics to decide which thread should get more time, or you could use some domain knowledge to decide beforehand.

Other than the lack of control over scheduling (and the added complexity in getting that control), I’m struggling to see what the significant difference is between a custom game loop and threading everything separately.

I’m probably just dense but I’d quite like to understand.

Kev

Basically, that’s about the size of it. You have to code to assume a single processor and a crap OS scheduler. Well, not even crap - just not guaranteed predictable. This is totally about QoS - if the AI doesn’t work on some computer then it doesn’t work.

And what’s more we have to write to assume a certain amount of work is done on every task irrespective of the power of the target CPU.

By day, I write big, big multithreaded multitier client server apps.

Cas :slight_smile:

[quote]Basically, that’s about the size of it. You have to code to assume a single processor…
[/quote]
No you don’t. Well-written multi-threaded code runs fine on an arbitrary number of processors, Inf > N > 0, just like well-written graphics code scales from a Radeon 9000 to a Radeon 9800 Pro.

Again, your programs are just going to be outclassed Cas, when the average desktop computer is a multicore system.

God bless,
-Toby Reyelts

But meanwhile the multithreaded one is going to fail. And fail it will. But we must now agree to disagree, as we are on a level about 3 layers of Heaven above the OP and waaay off topic!

Cas :slight_smile:

If you are wondering how you can easily schedule time for AI, collision detection, animation etc. with threads, here is a copy-paste from my games engine thread code.

This is the main thread loop for the engine (not the renderer)


while (!stopped && !this.isInterrupted()) {
  newTime = System.currentTimeMillis();

  // Integrate mouse and keyboard effects on player position
  engine.player_camera.update();

  // Update active entity AI
  updateEntityAI();

  // Execute geometry updates for next frame
  // (executeNext obtains geometry lock if renderer has finished buffer swap)
  executeNext();

  // Did we have leftover time to give back to the renderer?
  // TODO: for J15, use nanoTime
  long delta = System.currentTimeMillis() - newTime;

  if (delta < desired_time) {
    // Give unused time back to renderer or awt threads
    while (System.currentTimeMillis() < newTime + (desired_time - delta)) {
      Thread.yield();
    }
  } else {
    // Renderer thread or other CPU process is starving us
    // (Share the CPU as best we can)
    Thread.yield();
  }

  // Figure out how long this all took
  // (clamp to 1 ms to avoid divide-by-zero on very fast frames)
  // TODO: use nanoTime for J15
  long elapsed = Math.max(1, System.currentTimeMillis() - newTime);
  currentFPS = (int) (1000 / elapsed);

  // Target elapsed time should be 15 ms (approx 62 FPS)
  // Adjust lagMultiplier accordingly for the next
  // engine frame
  engine.setLagMultiplier(currentFPS / optimal_rate);
}

The call to executeNext() is important because that is where a synchronized block happens to pass reference geometry to the renderer, after all collision detection, animation, etc. has happened. executeNext is a command-pattern executor that determines which objects are invalid based on how often they wanted to be woken up. Each active entity can operate at up to 62 game frames per second (they get called back if it is time for them to work on their movement or their geometry… or whatever they do).
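The idea behind that executor, as a rough sketch (this is the concept, not the actual engine code; Wakeable and WakeScheduler are invented names), is that each active entity declares how often it wants to be woken, and each engine frame only the entities that are due get a callback:

import java.util.ArrayList;
import java.util.List;

interface Wakeable {
  long getIntervalMillis(); // how often this entity wants to be woken
  void wake();              // movement, animation, geometry updates...
}

class WakeScheduler {
  private final List<Wakeable> entities = new ArrayList<Wakeable>();
  private final List<Long> nextDue = new ArrayList<Long>();

  void add(Wakeable w) {
    entities.add(w);
    nextDue.add(System.currentTimeMillis());
  }

  // Called once per engine frame; skips anything not yet due.
  void executeNext() {
    long now = System.currentTimeMillis();
    for (int i = 0; i < entities.size(); i++) {
      if (now >= nextDue.get(i)) {
        entities.get(i).wake();
        nextDue.set(i, now + entities.get(i).getIntervalMillis());
      }
    }
  }
}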

This is my main rendering method (not the loop… see below for that). The synchronization block ensures that the engine cannot pass a new reference to geometry (inside engine.world) while actual drawing is taking place.


public void display(GLDrawable drawable) {
  if (!engine.engine_running)
    return;

  // Let the engine know we are about to draw a new frame.
  engine.newFrame();

  gl.glClear(GL.GL_COLOR_BUFFER_BIT | GL.GL_DEPTH_BUFFER_BIT);

  // Per-frame things that are non-visual and non-thread-dependent
  // can be done within this call
  nonVisualPrep();

  // Prevent geometry from being updated while we are doing
  // the actual rendering
  synchronized (engine.frame_sync) {
    // Take ownership
    engine.setActiveThread(VoraxEngine.RENDERER);

    // Clear the draw-later queue every frame
    clearDrawLaterQue();

    // Need orientation help?
    if (VoraxEngine.DEBUG)
      drawGrid(gl);

    // Apply the camera
    engine.player_camera.apply(gl);

    // Render the world
    engine.world.render(gl, matStack);

    // Draw items in the draw-later queue
    for (int i = 0; i < drawLaterQue.size(); i++) {
      ((GeoObject) drawLaterQue.elementAt(i)).drawObject(gl);
    }

    // Show FPS?
    if (engine.show_fps)
      drawFPSText(gl);

    // Show engine FPS?
    if (engine.show_fps)
      drawEngineFPSText(gl);

    // Geometry stats?
    if (engine.tris_stats)
      drawTrisCountText(gl);
  }
}

This is the main loop that calls display() above:

while (engine_running) {

  // Render...
  renderer.display(drawable);

  // swap the buffers
  drawable.swapBuffers();
  
  // Don't overdraw and waste CPU, give it to the
  // engine or awt threads
  if (gl_fps > config.refresh_rate  || engineThread.currentFPS < engineThread.optimal_refresh)
     Thread.yield();      
}

Two things to note here are that the buffer swap is not within a synchronization block. This is safe because only the renderer writes to the context. By having it unsynchronized, the engine can start working on the next frame as soon as the renderer has finished the last GL call.

The other is the yield within this loop. If the renderer thread has already reached the refresh rate, it will not draw the next frame without yielding some CPU time to the engine or awt threads. There is no point, as it would only waste CPU time which would be better spent on preparing the data for the next frame.

Those are the highlights. The engine thread runs at a constant 62 FPS and the renderer thread goes from a worst case of 54 FPS (with 28K geometry being rendered) up to 1800 FPS when looking at nothing. (I said 400 before… I found some bad code that was causing a chained recalc of bounding boxes down the SG path… big difference with it removed.)

That code isn’t really complicated. And I can test whether the engine thread is actually gaining me anything at all by simply calling executeNext() inside the main display() method. When I do, I get this:

(unknown engine FPS)
Worst-case scenario is now 42 FPS for the renderer. So as it stands, the engine thread is gaining me 12 FPS under the heaviest geometry conditions. 12 FPS is a pretty substantial gain to me because I am already well below the target FPS of 85 for the renderer thread.

As I add more geometry, this will get even worse. I will balance it out by using a stripifier on the data and vertex lists where possible (done for some static geometry now). So, if I double my geometry rendering speed with a stripifier, I will hit around 100 FPS. Without the engine thread I would only be hitting around 84 (best case), and I wouldn’t be able to add any more geometry. The AI is also going to get a lot more complicated, but it will have an almost unnoticeable effect on rendering FPS because it happens in the engine thread. With the performance gains from the threading I will be able to add approximately 15% more geometry to the worst-case scenario and still achieve my desired FPS.

edit: I almost forgot the other point of relevance to this thread… the engine FPS is constant and separate from the rendering, so all players of the game (regardless of hardware, or network if I add it) will have the same experience.

…are the advantages becoming clearer yet?

Nah, what’s going to happen is that AMD, Intel, etc. are going to have to fund much, much better threading APIs that make it feasible to give rich custom scheduling hints, or else see adoption of their processors by developers falter. There are far too many people around who are going to carry on just coding to a single thread (much as I wish it were otherwise) for a long time to come.

Isn’t the state of play already that a lot of work goes into hardware and OS mechanisms to make single threaded programs run a little faster on MT processors? ISTR considerable effort from compiler writers (Intel, IBM) because of the obvious problem that most coders won’t change habits easily.

[quote]If you are wondering how you can easily schedule time for AI, collision detection, animation etc. with threads, here is a copy-paste from my games engine thread code.
[/quote]
Looks to me like co-routines. I suggest you grab a co-routine library and see if it would improve your code at all.

Ya, it’s the same idea. This is very lean and does only what I need, but if more threads get involved I might grab the library and adapt the code.

[quote]Isn’t the state of play already that a lot of work goes into hardware and OS mechanisms to make single threaded programs run a little faster on MT processors?
[/quote]
You’re not making much sense to me. The only “magic” in any current compilers is some auto-vectorization that Intel has been working on for quite some time. That has nothing to do with MT.

[quote]the obvious problem that most coders won’t change habits easily.
[/quote]
For “most coders”, MT isn’t much of an issue, because they don’t need MT or they’re already using a framework that takes care of the big MT issues for them - like J2EE. The same thing is probably happening for gaming companies, where the big game engines take care of most of the MT design for you. The rest of us can get along just fine writing MT code ourselves.

God bless,
-Toby Reyelts

Well, I just did some more refinements and performance tuning on the engine and things look a little better for non-threaded mode.

I also modified the engine so it can run in threaded or non-threaded with a static final boolean setting (before I had to copy and paste some code into the rendering loop).

New engine with worst-case scenario:

Non-threaded Mode:
Renderer FPS: 65 (improved by 23 FPS)
Engine FPS (N/A)

Threaded Mode:
Renderer FPS: 70 (improved by 16 FPS)
Engine FPS: 62

That brings the non-threaded version to within 5 FPS of the threaded version (it was 12 FPS slower). Which is not bad.

That said… I still want the constant game engine rate of the threaded version, and of course those 5 other FPS, which will be more like 10 FPS once the geometry is stripified.

Having a constant engine rate makes everything much easier and predictable IMHO.

Either way, I am very happy with the performance gains.

Years ago, I wrote that crappy terrain demo to show off Java to various people, and it had a multithreading switch in it that took advantage of a second CPU. It took a reasonable amount of code to make it work correctly (properly correctly, not just “it happened to work on my machine”), and for a gain of just 10% in performance on a dual CPU machine.

However, being a realist, I find that the extra development time over the top of the single-threaded baseline (and it’s huge if you do it right) does not justify pandering to the 1% of the market who currently have dual CPUs, and will for the next 5 years, with a 10% performance boost.

ISTR Carmack gave Q3 a multithreaded switch too, and he reported similar results.

Cas :slight_smile:

Part of the problem there, though, is that he was multithreading multiple intel chips.

Intel chips are so bad at multi-processor work it’s scandalous. Double the number of processors and you get +40% performance. Huh? Yeah… god-awful cheap-ass mobo design (no crossbar switch, so each CPU needs exclusive access to RAM and they have to fight each other for it).

Would be worth trying again on an AMD multiprocessor mobo and seeing what difference it makes.

Then again… to what extent was that demo graphics-card limited or CPU limited?

[quote]Years ago, I wrote that crappy terrain demo to show off Java to various people, and it had a multithreading switch in it that took advantage of a second CPU. It took a reasonable amount of code to make it work correctly (properly correctly, not just “it happened to work on my machine”), and for a gain of just 10% in performance on a dual CPU machine.
[/quote]
In C, threads are a pain in the arse (or were when I last used them with it… long ago), in C++ they are not too bad at all… in Java they are so simple it would be hard for me to justify not using them. Especially if they can get me 10% better performance (or even 2%) while greatly simplifying timing issues, which will reduce code complexity later. Since Java is starting off with a speed disadvantage compared to C and C++, it’s even harder to ignore.

Carmack was using C and multi-processor support was only for server IIRC.

If you can reliably confirm a minimum number of scheduler timeslices per second that your threaded code will get to execute your AI, on Linux, Win32, MacOS X, on both 1.4 and 1.5, then you’ve sold the idea to me.

But you can’t, so you won’t :slight_smile:

Cas :slight_smile: