Proper GPU profiling with LWJGL

Profiling is important when optimizing games. When using OpenGL through LWJGL, it’s important to know that the CPU and GPU work asynchronously on very different tasks, so depending on which one of them that is your bottleneck you may get very confusing results. You will not get correct values if you simply measure the time it takes to run an OpenGL command since the command is just added to the OpenGL queue until the queue is full, which is very fast. If the GPU is unable to keep up with the commands submitted, it’ll eventually block the application until the GPU has caught up and there’s room in the queue. That means that OpenGL may block on random OpenGL commands or even on Display.update(), which means that you could get a huge CPU time spike for something as simple as glBindTexture() or a similar command.

The solution is to go directly to the source and ask the GPU how long time stuff took to execute for the GPU. This is tricky to do without affecting performance negatively. One idea is to call an OpenGL command, then use glFinish() to force the CPU to block while the GPU executes all queued up commands, but this causes both a CPU stall and then a GPU stall since the command queue will be completely empty once glFinish() returns leading to massively reduced performance. A better solution is to use OpenGL asynchronous queries to query the time at which the GPU reaches certain points in the command queue. We then delay reading back the result until the GPU has finished the frame we’re profiling.

When profiling on the CPU we can do the following:


long startTime = System.nanoTime();

//run code to profile

long timeTaken = System.nanoTime() - startTime; //Time taken in nano seconds

We can do the same thing with GPU queries, but the asynchronous nature of queries complicates it all a bit:



int startQuery = glGenQueries();
glQueryCounter(startQuery , GL_TIMESTAMP);

//Run OpenGL commands to profile

int endQuery = glGenQueries();
glQueryCounter(endQuery, GL_TIMESTAMP);

//Make sure enough time passes so the above OpenGL commands have been processed by the GPU

if(glGetQueryObjectui(endQuery, GL_QUERY_RESULT_AVAILABLE) == GL_TRUE){ //The result is available
    
    long startTime = glGetQueryObjectui64(startQuery, GL_QUERY_RESULT);
    long endTime = glGetQueryObjectui64(endQuery, GL_QUERY_RESULT);
    long timeTaken = endTime - startTime;
}


I have developed two small static classes that can take care of GPU profiling. The use of these two classes is as follows (extra brackets added for clarity):


//No special initialization required

public void render(){
    
    GPUProfiler.startFrame();
    
    GPUProfiler.start("Render game");
    {
        GPUProfiler.start("Render tiles");
        renderTiles();
        GPUProfiler.end();

        GPUProfiler.start("Render objects");
        renderObjects();
        GPUProfiler.end();
    }
    GPUProfiler.end();

    GPUProfiler.start("Render UI");
    renderUI();
    GPUProfiler.end();

    GPUProfiler.endFrame();

    /*
    The GPUProfiler keeps track of previous frames until 
    they've been rendered by the GPU and their results are
    ready to be read back. The following line dumps the 
    timings of all completed frames to System.out.
    */

    GPUTaskProfile tp;
    while((tp = GPUProfiler.getFrameResults()) != null){
        
        tp.dump(); //Dumps the frame to System.out.
        //or use the functions of GPUTaskProfile to extract information about the frame:
        //getName(), getStartTime(), getEndTime(), getTimeTaken(), getChildren()
        //Then you can draw the result as fancy graphs or something.

        GPUProfiler.recycle(tp); //Recycles GPUTaskProfile instances and their OpenGL query objects.
    }
}

Here’s some example output of dump() when using my GPU profiler with the engine I’m writing. Note that the CPU time taken to execute all my OpenGL functions is around 1.5ms, with Display.update() blocking for around 25 ms after each frame making it impossible to tell what’s actually keeping the GPU so busy.

Seems like I have too many shadow mapped lights. =P

Here are the two classes:
GPUProfiler: http://www.java-gaming.org/?action=pastebin&id=930
GPUTaskProfile: http://www.java-gaming.org/?action=pastebin&id=931

I’d like to note that this feature is for OpenGL 3.3+ only.

Very invaluable feature though, thank you for bringing it to my attention and thank you for your utility classes!

EDIT: You used your engine.etc.pool.Pool class for this, I’m assuming this is a normal object pool with a set number of initial instances?
Secondly, what’s the difference between java.util.ArrayList and your FastArrayList?

EDIT2: More researched has led me to the GL_TIME_ELAPSED query. Why not use that instead of getting the time before and after?

Shiiit, I had no idea I had those dependencies.

FastArrayList is functionally identical to ArrayList, but has a few performance pros. The most important one is that clear() does not null out each element (it works similarly to ByteBuffer.clear() in that it just sets the size to 0). It’s definitely not required in any way, and it should be possible to simply change all FastArrayLists to normal ArrayLists as the method names are identical.

Pool and Poolable simply provide a normal object pool to avoid generating too much garbage, and should be easy to implement. Again, not necessary; it’s just a (possibly premature) optimization.

GL_TIME_ELAPSED has a big problem. You cannot run nested GL_TIME_ELAPSED queries.


glBeginQuery(GL_TIME_ELAPSED, query);

...

    glBeginQuery(GL_TIME_ELAPSED, nestedQuery); //ILLEGAL!
    ...
    glEndQuery(GL_TIME_ELAPSED);

...

glEndQuery(GL_TIME_ELAPSED);

Hence, emulating GL_TIME_ELAPSED with GL_TIMESTAMP is much more flexible.

This is awesome, thank you very much for your work theagentd! :wink:
One question though, what about the license? Can I integrate these into my engine if I give credit?

It’s here to be used, knock yourself out. It’s not much, but if you feel like giving me credit then I’m not gonna say no to ending up on a credits screen. =P

Maybe the Don’t Ask Me About It License? :stuck_out_tongue:

I really like the WTFPL :slight_smile:

Sorry for the late reply. This is a great tool, thanks theagentd!
Is there any way to modify this for a graphics card that only supports OpenGL 2.1?

nice! i too did not see this post. :slight_smile:

[icode]glGetQueryObjecti()[/icode] is available since opengl 1.5 but will return integer-values while the new tokens [icode]GL_TIME_ELAPSED[/icode] and [icode]GL_TIMESTAMP[/icode] are ment to return 64bit values. they work only with [icode]glGetQueryObjecti64()[/icode] which is gl33+, sorry.

btw, @agentd : did you consider double/tripple-buffering ? i figured, query-objects can stall when used too early. could be a nice addition to your tool.

it’s also nice to still track the cpu-times together with the timer-query, which gives us a better feel for cpu/gpu work balance.

here’s a question : did you get the timer-query to work with multithreading ? i get all kind of funky results, even more when using doublebuffering. :persecutioncomplex:

Not really. You could check for support for the ARB_timer_query extension specifically, which is supported by Nvidia’s OGL 2.1 GPUs, but not AMD’s or shiver Intel’s older GPUs. It’s worth noting that according to the Steam Hardware Survey, the number of people having GPUs that don’t support OGL3 is 2.22%, and a large chunk of those either don’t have powerful enough hardware to run your game and/or have Nvidia’s OGL 2.1 GPUs.

The profiler already avoids stalls completely by checking if the result is already available, so it never has to freeze the CPU thread. It buffers query objects for as long as it needs before reading them back, which is generally 1 or 2 frames.

It makes no sense to combine this with a CPU time profiler, as that information is 1) useless if we’re GPU limited (random OGL commands will stall giving inaccurate results) and 2) we’re usually not interested in profiling both CPU and GPU performance at the same granularity. It makes no sense to do GPU profiling of your CPU-side game logic. I do have a separate CPU profiler which does the same but for CPU time, minus the delayed readback, and my threading library has per-thread profiling of tasks.

I have no idea why this is relevant. You most likely only make your OpenGL calls from a single thread, so why worry about multithreading? You insert these time stamp queries into the command queue of OpenGL so that you can get at what time THE GPU reaches that part of the command queue, so even if you make your queries from multiple OpenGL contexts/threads, you’d still be inserting all these queries into the same command queue. I don’t get what your problem is. Of course, there’s no guarantee of the ordering of OpenGL calls made from different threads unless you force it using glFinish() or so.

i’m not talking about sending queries form different threads to a shared que, that’s pretty much pointless and asking for trouble. a shared gl-context has it’s own command que. same point profiling that as any other.

ofc, profiling the cpu at the same time is only interesting when you do cpu heavy stuff side by side, or chase performance hogs.

You can have thread-local profiler instances in that case.