Proper GPU profiling with LWJGL

theagentd · May 11, 2014, 12:46am

Profiling is important when optimizing games. When using OpenGL through LWJGL, it’s important to know that the CPU and GPU work asynchronously on very different tasks, so depending on which one of them that is your bottleneck you may get very confusing results. You will not get correct values if you simply measure the time it takes to run an OpenGL command since the command is just added to the OpenGL queue until the queue is full, which is very fast. If the GPU is unable to keep up with the commands submitted, it’ll eventually block the application until the GPU has caught up and there’s room in the queue. That means that OpenGL may block on random OpenGL commands or even on Display.update(), which means that you could get a huge CPU time spike for something as simple as glBindTexture() or a similar command.

The solution is to go directly to the source and ask the GPU how long time stuff took to execute for the GPU. This is tricky to do without affecting performance negatively. One idea is to call an OpenGL command, then use glFinish() to force the CPU to block while the GPU executes all queued up commands, but this causes both a CPU stall and then a GPU stall since the command queue will be completely empty once glFinish() returns leading to massively reduced performance. A better solution is to use OpenGL asynchronous queries to query the time at which the GPU reaches certain points in the command queue. We then delay reading back the result until the GPU has finished the frame we’re profiling.

When profiling on the CPU we can do the following:


long startTime = System.nanoTime();

//run code to profile

long timeTaken = System.nanoTime() - startTime; //Time taken in nano seconds

We can do the same thing with GPU queries, but the asynchronous nature of queries complicates it all a bit:



int startQuery = glGenQueries();
glQueryCounter(startQuery , GL_TIMESTAMP);

//Run OpenGL commands to profile

int endQuery = glGenQueries();
glQueryCounter(endQuery, GL_TIMESTAMP);

//Make sure enough time passes so the above OpenGL commands have been processed by the GPU

if(glGetQueryObjectui(endQuery, GL_QUERY_RESULT_AVAILABLE) == GL_TRUE){ //The result is available
    
    long startTime = glGetQueryObjectui64(startQuery, GL_QUERY_RESULT);
    long endTime = glGetQueryObjectui64(endQuery, GL_QUERY_RESULT);
    long timeTaken = endTime - startTime;
}

I have developed two small static classes that can take care of GPU profiling. The use of these two classes is as follows (extra brackets added for clarity):


//No special initialization required

public void render(){
    
    GPUProfiler.startFrame();
    
    GPUProfiler.start("Render game");
    {
        GPUProfiler.start("Render tiles");
        renderTiles();
        GPUProfiler.end();

        GPUProfiler.start("Render objects");
        renderObjects();
        GPUProfiler.end();
    }
    GPUProfiler.end();

    GPUProfiler.start("Render UI");
    renderUI();
    GPUProfiler.end();

    GPUProfiler.endFrame();

    /*
    The GPUProfiler keeps track of previous frames until 
    they've been rendered by the GPU and their results are
    ready to be read back. The following line dumps the 
    timings of all completed frames to System.out.
    */

    GPUTaskProfile tp;
    while((tp = GPUProfiler.getFrameResults()) != null){
        
        tp.dump(); //Dumps the frame to System.out.
        //or use the functions of GPUTaskProfile to extract information about the frame:
        //getName(), getStartTime(), getEndTime(), getTimeTaken(), getChildren()
        //Then you can draw the result as fancy graphs or something.

        GPUProfiler.recycle(tp); //Recycles GPUTaskProfile instances and their OpenGL query objects.
    }
}

Here’s some example output of dump() when using my GPU profiler with the engine I’m writing. Note that the CPU time taken to execute all my OpenGL functions is around 1.5ms, with Display.update() blocking for around 25 ms after each frame making it impossible to tell what’s actually keeping the GPU so busy.

Frame 510 : 29.112ms
WSWGraphics.render() : 29.105ms
Render scene : 5.953ms
Clear G-buffer : 0.061ms
Render tiles : 2.81ms
Render models : 1.623ms
Render ocean : 0.07ms
Render decals : 0.048ms
Render glowing models : 0.936ms
Linearize depth : 0.085ms
Scene lighting : 17.37ms
Unshadowed point lights : 0.0ms
Unshadowed cone lights : 0.0ms
Unshadowed directional lights : 0.0ms
Shadowed point lights : 17.345ms
Shadow map pass 0 : 2.038ms
Shadow map initializing : 0.044ms
Shadow map rendering : 0.88ms
Shadow map prefiltering : 0.0ms
Light rendering : 1.11ms
Shadow map pass 1 : 5.408ms
Shadow map initializing : 0.037ms
Shadow map rendering : 4.701ms
Shadow map prefiltering : 0.0ms
Light rendering : 0.666ms
Shadow map pass 2 : 1.834ms
Shadow map initializing : 0.036ms
Shadow map rendering : 1.152ms
Shadow map prefiltering : 0.0ms
Light rendering : 0.641ms
Shadow map pass 3 : 1.303ms
Shadow map initializing : 0.034ms
Shadow map rendering : 1.053ms
Shadow map prefiltering : 0.0ms
Light rendering : 0.213ms
Shadow map pass 4 : 4.298ms
Shadow map initializing : 0.036ms
Shadow map rendering : 3.143ms
Shadow map prefiltering : 0.0ms
Light rendering : 1.116ms
Shadow map pass 5 : 2.458ms
Shadow map initializing : 0.034ms
Shadow map rendering : 1.757ms
Shadow map prefiltering : 0.0ms
Light rendering : 0.662ms
Shadowed cone lights : 0.0ms
Shadowed directional lights : 0.0ms
Render transparency : 0.442ms
Render effects : 0.267ms
StreakRenderer : 0.011ms
ParticleRenderer : 0.011ms
ShockwaveRenderer : 0.0ms
Resolve : 0.173ms
Render distortions : 0.219ms
Render effects : 0.012ms
Blur : 0.206ms
Postprocessing : 2.988ms
MergeProcessor : 0.438ms
MotionBlurProcessor : 0.333ms
BloomRenderer : 0.84ms
ToneMapper : 0.416ms
FXAAProcessor : 0.52ms
TemporalSSAAProcessor : 0.436ms
Screen copy : 0.141ms
Render UI : 0.005ms
Render UI elements : 0.003ms
Distort UI : 0.0ms

Seems like I have too many shadow mapped lights. =P

Here are the two classes:
GPUProfiler: http://www.java-gaming.org/?action=pastebin&id=930
GPUTaskProfile: http://www.java-gaming.org/?action=pastebin&id=931

ra4king · June 28, 2014, 9:56pm

I’d like to note that this feature is for OpenGL 3.3+ only.

Very invaluable feature though, thank you for bringing it to my attention and thank you for your utility classes!

EDIT: You used your engine.etc.pool.Pool class for this, I’m assuming this is a normal object pool with a set number of initial instances?
Secondly, what’s the difference between java.util.ArrayList and your FastArrayList?

EDIT2: More researched has led me to the GL_TIME_ELAPSED query. Why not use that instead of getting the time before and after?

theagentd · June 29, 2014, 1:09am

Shiiit, I had no idea I had those dependencies.

FastArrayList is functionally identical to ArrayList, but has a few performance pros. The most important one is that clear() does not null out each element (it works similarly to ByteBuffer.clear() in that it just sets the size to 0). It’s definitely not required in any way, and it should be possible to simply change all FastArrayLists to normal ArrayLists as the method names are identical.

Pool and Poolable simply provide a normal object pool to avoid generating too much garbage, and should be easy to implement. Again, not necessary; it’s just a (possibly premature) optimization.

GL_TIME_ELAPSED has a big problem. You cannot run nested GL_TIME_ELAPSED queries.


glBeginQuery(GL_TIME_ELAPSED, query);

...

    glBeginQuery(GL_TIME_ELAPSED, nestedQuery); //ILLEGAL!
    ...
    glEndQuery(GL_TIME_ELAPSED);

...

glEndQuery(GL_TIME_ELAPSED);

Hence, emulating GL_TIME_ELAPSED with GL_TIMESTAMP is much more flexible.

PandaMoniumHUN · June 29, 2014, 9:23am

This is awesome, thank you very much for your work theagentd!
One question though, what about the license? Can I integrate these into my engine if I give credit?

theagentd · June 29, 2014, 1:12pm

It’s here to be used, knock yourself out. It’s not much, but if you feel like giving me credit then I’m not gonna say no to ending up on a credits screen. =P

BurntPizza · June 29, 2014, 1:26pm

Maybe the Don’t Ask Me About It License?

ra4king · June 29, 2014, 8:06pm

I really like the WTFPL

roland · November 9, 2014, 9:16pm

Sorry for the late reply. This is a great tool, thanks theagentd!
Is there any way to modify this for a graphics card that only supports OpenGL 2.1?

basil · November 9, 2014, 10:26pm

nice! i too did not see this post.

[icode]glGetQueryObjecti()[/icode] is available since opengl 1.5 but will return integer-values while the new tokens [icode]GL_TIME_ELAPSED[/icode] and [icode]GL_TIMESTAMP[/icode] are ment to return 64bit values. they work only with [icode]glGetQueryObjecti64()[/icode] which is gl33+, sorry.

btw, @agentd : did you consider double/tripple-buffering ? i figured, query-objects can stall when used too early. could be a nice addition to your tool.

it’s also nice to still track the cpu-times together with the timer-query, which gives us a better feel for cpu/gpu work balance.

here’s a question : did you get the timer-query to work with multithreading ? i get all kind of funky results, even more when using doublebuffering. :persecutioncomplex:

theagentd · November 10, 2014, 12:22am

Not really. You could check for support for the ARB_timer_query extension specifically, which is supported by Nvidia’s OGL 2.1 GPUs, but not AMD’s or shiver Intel’s older GPUs. It’s worth noting that according to the Steam Hardware Survey, the number of people having GPUs that don’t support OGL3 is 2.22%, and a large chunk of those either don’t have powerful enough hardware to run your game and/or have Nvidia’s OGL 2.1 GPUs.

The profiler already avoids stalls completely by checking if the result is already available, so it never has to freeze the CPU thread. It buffers query objects for as long as it needs before reading them back, which is generally 1 or 2 frames.

It makes no sense to combine this with a CPU time profiler, as that information is 1) useless if we’re GPU limited (random OGL commands will stall giving inaccurate results) and 2) we’re usually not interested in profiling both CPU and GPU performance at the same granularity. It makes no sense to do GPU profiling of your CPU-side game logic. I do have a separate CPU profiler which does the same but for CPU time, minus the delayed readback, and my threading library has per-thread profiling of tasks.

I have no idea why this is relevant. You most likely only make your OpenGL calls from a single thread, so why worry about multithreading? You insert these time stamp queries into the command queue of OpenGL so that you can get at what time THE GPU reaches that part of the command queue, so even if you make your queries from multiple OpenGL contexts/threads, you’d still be inserting all these queries into the same command queue. I don’t get what your problem is. Of course, there’s no guarantee of the ordering of OpenGL calls made from different threads unless you force it using glFinish() or so.

basil · November 10, 2014, 12:48am

i’m not talking about sending queries form different threads to a shared que, that’s pretty much pointless and asking for trouble. a shared gl-context has it’s own command que. same point profiling that as any other.

ofc, profiling the cpu at the same time is only interesting when you do cpu heavy stuff side by side, or chase performance hogs.

theagentd · November 10, 2014, 12:51am

You can have thread-local profiler instances in that case.