Performance Slowdown (LWJGL, OSX)

Morgan_Allen · May 9, 2015, 8:28pm

Hi there- Quite a while back I was posting about an SF citybuilder project of mine. It’s not quite dead yet, but I’ve run into a significant performance bottleneck on larger maps and I couldn’t seem to track down a consistent culprit for the slowdown.

Anyway, I ran the program under a couple of conditions using VisualVM and came up with the following fairly consistently:

http://s21.postimg.org/d5g65varr/Screen_shot_2015_05_09_at_18_38_16.png

Which seems to be pointing the finger at stuff LibGDX is doing internally. Now I know this is only happening on larger maps, so it’s probable that I’m inefficiently allocating some kind of repetitive task to the engine where the cost is only evident later in the update loop. But I don’t quite know enough about the engine to pinpoint how I might be misusing it. Do those methods look familiar to anyone?

(There are public releases of the game available at the second link, so if anyone wants to take a gander, feel free.)

Morgan_Allen · May 9, 2015, 9:43pm

Sorry, my mistake- I just realised those are LWJGL methods being called, so nothing in particular to do with LibGDX. Still, any pointers on what might be causing the slowdown would be very helpful.

theagentd · May 9, 2015, 10:31pm

The fact that those two functions take a lot of time indicates a GPU bottleneck or a driver CPU bottleneck. If you have an Nvidia GPU, you can disable Threaded Optimization in the Nvidia Control Panel, which will disable the driver multithreading and might give you more information about a possible CPU bottleneck. You can also check the GPU load with GPU-Z and see if it’s at 95%+, in which case your GPU is the bottleneck. Lastly, to figure out what exactly is slow, you can use GPU timer queries.

Morgan_Allen · May 10, 2015, 10:29am

I appreciate the tips, but I’m using an older MacBook running OSX 10.6.8 and an NVIDIA GeForce 320M. I don’t think there is an NVIDIA control panel by default on the Mac (and alternative drivers seem to need a more recent OS version). GPU-Z also seems to be Windows-only. I will try looking into GPU timer queries, though.

I kinda need to upgrade my machine.

Morgan_Allen · May 10, 2015, 12:54pm

Alright. I tried running on WinXP using Boot Camp to see if this was specific to mac drivers. There was still some pretty noticeable ‘jittering’, but I don’t think it was quite as bad as on OSX. Could be my imagination, though.

GPU-Z seemed to indicate that the graphics card was running at 30-50% capacity most of the time, with occasional spikes to 60-70%. So… room for improvement there, but probably not the bottleneck? I’ll have to install the JDK on Boot Camp before I can run VisualVM, though.

I might get one of my testers to run the game on a larger map, just to see if it’s specific to my hardware setup.

Morgan_Allen · May 10, 2015, 3:18pm

I got VisualVM installed on WinXP and a second macbook running OSX Yosemite.

The results on Yosemite didn’t show time wasted on nSwapBuffers, but I was still seeing noticeable jittering, especially when creatures/citizens were more active (e.g., during daytime hours). And a huge chunk of CPU is still being spent on nUpdate.

http://s23.postimg.org/ailmw6udn/osx_yosemite_machine_specs.png

http://s8.postimg.org/csiwgjrpx/osx_yosemite_profile_methods.png

The results from WinXP (same old macbook, running on Boot Camp) show a somewhat different picture. Total CPU consumption by LWJGL is only 48% (versus 82% on OSX.) The pattern was similar though- slowdowns/jittering, especially during game-daytime.

http://s23.postimg.org/nb9qw45zf/osx_leopard_specs.png

http://s23.postimg.org/ob01rtl57/win_xp_profile_methods.png

I think it’s reasonable to conclude that driver differences are accounting for part of the slowdown on OSX, but my simulation code must be causing some additional spikes in usage, even if the average CPU burden on that side is low.

Still, I’d like to be able to address those driver problems if possible. I’ll try to integrate those GPU-timer-queries next and see what happens.

theagentd · May 10, 2015, 4:37pm

If you have a CPU bottleneck, the GPU timer queries are worthless. The GPU will be limited by how fast you can provide commands to it and will idle when no commands are present, hence the values will be inflated by this.

This looks like a pretty clear LWJGL bug/problem. Try to update LWJGL.

Stuttering on the other hand could be coming from something in your code. VisualVM will give you the average performance of the application, but it won’t show you spikes. If suddenly you have a 20-40ms spike every second, it’ll only take 2-4% of the CPU time but cause easily noticeable stuttering. TerrainSet.refreshAllMeshes() seems like the obvious place to start looking. Time it using System.nanoTime() and see if the spikes are your problem.
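
A minimal sketch of that kind of timing, in case it helps. SpikeTimer is a made-up helper (not part of LWJGL, LibGDX or the game), and the 10 ms threshold in the usage line is an arbitrary assumption. It measures each call with System.nanoTime() and logs only the calls that exceed the threshold, since an average would hide them:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: times individual calls and reports the slow ones. */
final class SpikeTimer {
    private final String label;
    private final long thresholdNanos;
    private final List<Long> spikes = new ArrayList<>();
    private long totalNanos;
    private int calls;

    SpikeTimer(String label, double thresholdMillis) {
        this.label = label;
        this.thresholdNanos = (long) (thresholdMillis * 1_000_000L);
    }

    /** Runs the task and measures its wall-clock duration with System.nanoTime(). */
    void time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        record(System.nanoTime() - start);
    }

    /** Records an already-measured duration and logs it if it exceeds the threshold. */
    void record(long elapsedNanos) {
        totalNanos += elapsedNanos;
        calls++;
        if (elapsedNanos > thresholdNanos) {
            spikes.add(elapsedNanos);
            System.out.printf("%s spike: %.1f ms%n", label, elapsedNanos / 1e6);
        }
    }

    int spikeCount() { return spikes.size(); }

    int callCount() { return calls; }

    /** Mean duration per call in milliseconds; this is the figure that hides spikes. */
    double averageMillis() {
        return calls == 0 ? 0.0 : totalNanos / 1e6 / calls;
    }
}
```

Usage would be something like `timer.time(() -> terrainSet.refreshAllMeshes());` once per frame, where `terrainSet` is whatever instance the game already holds (assumed here). If spikeCount() stays at zero while the stutter continues, the spikes are coming from somewhere else.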

KaiHH · May 10, 2015, 5:12pm

I haven’t looked at your code, but some drivers like postponing the rendering work up to the point when the result needs to be displayed, and that would be at swapBuffers, since then the rendering results must be produced and made visible on the screen.

Drivers can hold back draw calls and state changes until swapBuffers, because by then they have recorded the most information about what you are actually doing in a frame and can optimize the draw calls and state changes accordingly.
I once had an incident where not calling swapBuffers would lead to a crash of the application, because the driver was flooded with draw calls and state change recordings and gave up after some time. That was only explainable by assuming that the driver did in fact record some information between frame boundaries (where a frame boundary would be when swapping buffers) and held back the actual rendering until swapBuffers.

So, unless you are rendering to an FBO and feeding the rendering results back into other draw calls, this might be the reason for swapBuffers taking the most time.

Some other drivers (your Windows XP) seem to avoid that batching and do some of the work already in glDrawElements.
So my guess would be that you just have a lot (in absolute numbers) of draw calls per frame and neither LWJGL nor the drivers can do anything about it.

You might consider using instancing or batching multiple geometries into one, if that fits your application to reduce the number of draw calls or state changes.
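
As a rough sketch of the batching idea, and only under some assumptions: the geometry is static, every mesh uses the same vertex layout (three position floats per vertex here) and the same texture/shader state. SimpleMesh and MeshBatcher are made-up names, not LibGDX or LWJGL classes. The merged mesh can then be uploaded once and drawn with a single call instead of one call per mesh:

```java
import java.util.List;

/** Hypothetical CPU-side mesh: x, y, z positions plus triangle indices. */
final class SimpleMesh {
    static final int FLOATS_PER_VERTEX = 3;
    final float[] vertices;
    final int[] indices;

    SimpleMesh(float[] vertices, int[] indices) {
        this.vertices = vertices;
        this.indices = indices;
    }

    int vertexCount() {
        return vertices.length / FLOATS_PER_VERTEX;
    }
}

final class MeshBatcher {
    /** Concatenates the meshes, shifting each mesh's indices by the vertices placed before it. */
    static SimpleMesh merge(List<SimpleMesh> meshes) {
        int totalFloats = 0;
        int totalIndices = 0;
        for (SimpleMesh m : meshes) {
            totalFloats += m.vertices.length;
            totalIndices += m.indices.length;
        }

        float[] vertices = new float[totalFloats];
        int[] indices = new int[totalIndices];
        int vertexPos = 0;   // write position in the merged vertex array
        int indexPos = 0;    // write position in the merged index array
        int baseVertex = 0;  // number of vertices already written

        for (SimpleMesh m : meshes) {
            System.arraycopy(m.vertices, 0, vertices, vertexPos, m.vertices.length);
            for (int index : m.indices) {
                indices[indexPos++] = index + baseVertex;
            }
            vertexPos += m.vertices.length;
            baseVertex += m.vertexCount();
        }
        return new SimpleMesh(vertices, indices);
    }
}
```

Positions would need to be pre-transformed into a common space before merging, since a single draw call shares one model matrix. Anything that moves every frame is a better fit for instancing.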

theagentd · May 10, 2015, 5:41pm

I have not heard about any drivers doing this. What I have heard of is that drivers offload all OpenGL calls to a separate internal driver thread, which essentially makes the draw calls almost free on the game’s thread. When the driver determines that the driver thread has fallen too far behind (either if the driver thread can’t process the commands in time or if the GPU is not fast enough to consume them), it forces the game’s thread to wait for the driver thread, which in my experience happens in nSwapBuffers(). Most drivers seem to implement this as a busy loop. I have never seen nUpdate() appearing in VisualVM since that one is usually free, which indicates an LWJGL problem.

KaiHH · May 10, 2015, 6:10pm

I’ve heard about it here: http://renderingpipeline.com/2014/06/whats-the-big-deal-with-apples-metal-api/

[quote]…This is one reason why most graphic drivers collect all draw calls (and other tasks which should be executed on the GPU, e.g. data transfer, state changes etc.) for the whole frame before sending them to the GPU. Those buffered command will then be send at the beginning of the next frame and thus use the GPU as efficiently as possible.
[/quote]
And it kind of makes sense, in order to utilize the GPU in the most optimal way, regardless of how wickedly the OpenGL or Direct3D API gets abused by the application developer.
But I have yet to see any driver code for this.

theagentd · May 10, 2015, 6:20pm

Yes, the driver collects a number of draw calls and ships them all off after a certain threshold is reached, but it does not usually buffer all commands for a whole frame. On mobile, this may be different though. In those cases, the GPUs are often tile-based deferred renderers, which means that they do buffer up all rendered draw calls and state and then render the frame when it is resolved, but this can also happen on FBO switches and on read-backs of FBO textures, etc., so they still do not usually buffer a whole “frame”. That being said, my knowledge of mobile is limited.

What I can say for certain is that in no way does the driver always defer all draw commands until you finish the frame, as there are numerous ways to trigger driver thread flushes (mapping a buffer) and entire GPU flushes (reading back data).

KaiHH · May 10, 2015, 6:43pm

Thanks for the info on mobile!
Would be interesting, though, to know to what extent the driver would buffer up commands if no commands are being issued that depend on the effects of previous calls.

Regarding MacOSXContextImplementation.update, it indeed looks like a bug (or misbehaviour) on the LWJGL2 side.
That update() method eventually calls into the Cocoa API [NSOpenGLContext update] method, and according to the OS X documentation this method should not be called every frame, only when the window resizes or moves.
But as it seems this method is being called whenever one calls Display.update(), which one would do every frame to also swap buffers.
Gonna tell Spasi about this one.

The (reverse) stack trace is:
Display.update()
-> Display.update(boolean) <- also calls swapBuffers()
-> Display.processMessages()
-> MacOSXDisplay.update()
-> ContextGL.update()
-> MacOSXContextImplementation.update()
-> Java_org_lwjgl_opengl_MacOSXContextImplementation_nUpdate
-> NSOpenGLContext.update
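
To illustrate the "only when the window resizes or moves" idea, here is a rough sketch. It is not LWJGL's code or an actual fix: ContextUpdateGuard is a made-up name, and the Runnable stands in for whatever ends up calling [NSOpenGLContext update]. Running it once on the first frame is also just an assumption on my part.

```java
/** Hypothetical guard: runs the context update only when the window geometry changes. */
final class ContextUpdateGuard {
    private final Runnable contextUpdate;
    private boolean first = true;
    private int x;
    private int y;
    private int width;
    private int height;

    ContextUpdateGuard(Runnable contextUpdate) {
        this.contextUpdate = contextUpdate;
    }

    /** Call once per frame with the current window geometry; returns true if the update ran. */
    boolean onFrame(int newX, int newY, int newWidth, int newHeight) {
        boolean changed = first
                || newX != x || newY != y
                || newWidth != width || newHeight != height;
        if (changed) {
            first = false;
            x = newX;
            y = newY;
            width = newWidth;
            height = newHeight;
            contextUpdate.run();
        }
        return changed;
    }
}
```

With something like this in front of it, a per-frame Display.update() would pay for the context update only on the frames where the window actually moved or resized.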

theagentd · May 10, 2015, 7:26pm

Nice catch! Like I said, swapBuffers() generally works as a “filler” when the GPU and/or driver thread still has work to do, so the fact that it barely appears in the OSX profile log points to something else taking a lot of time. There’s a big chance that this bug is in LWJGL 3 too. Would be nice if someone could confirm that. Still, the stuttering that Morgan is getting is probably coming from his own code, although the CPU performance handicap from that bug is probably making the problem more noticeable since he has a smaller time budget to play with.

Spasi · May 10, 2015, 8:32pm

The code that calls [NSOpenGLContext update] was written in 2004 and seems to be related to AWT integration. It indeed seems wasteful to call on each Display.update(). If anyone feels comfortable enough to fix the code, please submit a pull request.

LWJGL 3 does not call [NSOpenGLContext update]. GLFW does not call it either, it uses NSOpenGLView, which handles geometry/display updates automatically.

theagentd · May 10, 2015, 8:51pm

Alright, thanks for the info.

Morgan_Allen · May 10, 2015, 8:55pm

@Spasi, KaiHH- I fear the tech-level here is a little above my head, but I appreciate the explanations.

@theagentd- I suspect that’s what’s happening as well. (I would have thought that even suboptimal use of the GPU wouldn’t make a huge difference to my sim-logic, since I’m not calling glFinish() anywhere and it should be working away asynchronously.)

I might try integrating those GPU-queries anyway: my simulation doesn’t happen on a different thread, but it does happen later in the render cycle from the graphics calls- the two shouldn’t be waiting on each other much on a call-to-call basis.

I did some profiling earlier on TerrainSet.refreshAllMeshes(), and AFAICT that’s all concentrated in initial map setup- the meshes stay almost entirely stable afterward.

What I did notice was that, every few seconds, one particular creature/person would take a crazy long time (like 400 ms) to update within a given frame. Thing is, if I cut out the section of code responsible, the delay would jump somewhere else- to a particular type of building, or the base(colony)-update call- a bit like whack-a-mole. (I was wondering if these might be garbage collection, since I’m not using Pools anywhere, but I’ve never seen GC cause stuttering like this before.)
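
One way I could check the GC theory is sketched below, using the standard java.lang.management beans. GcWatch is just a name I made up. The idea is to poll it once per frame (or right after a suspiciously slow update) and see whether a collection happened in that window. Running the JVM with -verbose:gc should show similar information in the console. Note that getCollectionTime() is an approximate accumulated figure, so this is a hint rather than a precise pause measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reports garbage collections that happened between two polls. */
final class GcWatch {
    private long lastCount = totalCount();
    private long lastTimeMillis = totalTimeMillis();

    /** Total number of collections across all collectors so far. */
    static long totalCount() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) sum += count; // -1 means the collector does not report a count
        }
        return sum;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalTimeMillis() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) sum += time; // -1 means the collector does not report a time
        }
        return sum;
    }

    /** Returns the GC time in ms accumulated since the previous poll, logging any collections. */
    long poll(String label) {
        long count = totalCount();
        long time = totalTimeMillis();
        long deltaCount = count - lastCount;
        long deltaTime = time - lastTimeMillis;
        lastCount = count;
        lastTimeMillis = time;
        if (deltaCount > 0) {
            System.out.printf("%s: %d collection(s), ~%d ms%n", label, deltaCount, deltaTime);
        }
        return deltaTime;
    }
}
```

If the 400 ms frames line up with logged collections, that would explain why the delay jumps around when I cut code out; if they don't, GC is probably off the hook.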

EDIT: Oh, I also asked my tester-buddy to run the game on a large map on his home machine. He didn’t see any slowdown, but that was on a 3.9 GHz PC with an AMD R7 260X. Much more brute force. Might ask him to profile it later.

KaiHH · May 10, 2015, 9:17pm

You also might want to try the gDEBugger GL.
It is available for Windows and Linux (sadly not OS X).
This tool intercepts OpenGL calls with its own shim driver DLL/.so file and analyzes everything about your app with various perf counters.
You must give that tool the native executable that it should invoke, which would be java.exe, together with the JVM arguments to start up your application (likely -jar if your app is nicely assembled into a single runnable jar file and -Djava.library.path for LWJGL’s native files).
I just tried it and it works nicely.

Morgan_Allen · May 16, 2015, 11:16am

I appreciate the tip, but might not get the chance to delve into this until next week. Scheduling. Bah.