TL;DR: This is a ramble about how a GPU works and why GPU performance is so hard to predict across different GPUs and OpenGL settings. It’s not meant for people who aren’t interested in 3D or in how GPUs work at a lower level than OpenGL.
It’s worth noting that vertex performance does not scale linearly with GPU power. A better GPU is of course faster, but not proportionally so. Pixel fill rate scales a bit better, but what scales best is probably shader throughput. (Note: that comes from my own personal experience, so it might not be accurate across the board.) Most games don’t need to draw 10 million triangles per frame or handle 50x overdraw over the whole screen. They want expensive render targets (usually 3 or 4 x 16-bit float RGBA), complex lighting calculations, more texture bandwidth and the like, while MINIMIZING overdraw. BF3 even uses compute shaders for its lighting, which has proven a lot more effective than rasterizing light geometry through the normal pipeline for deferred shading. GPUs are obviously built for the games that use them, so a pathological case with lots of blending, lots of vertices, lots of cheap pixels, etc. is not going to perform as well as a workload the GPU was designed for. We run into much the same problem as with CPU microbenchmarking, only for GPUs. We can also bottleneck one part of the pipeline while leaving other units idle, since the hardware is a lot more specialized.
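Just to make “expensive render targets” a bit more concrete, here’s a minimal LWJGL-style sketch of creating a single 16-bit float RGBA color attachment, the kind of thing a deferred G-buffer is built from. The class and method names are made up for illustration, and a real G-buffer would attach 3-4 of these plus a depth buffer:

```java
import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL30.*;

import java.nio.ByteBuffer;

public class GBufferSketch {
    // Assumes a current OpenGL 3.0+ context already exists.
    public static int createRGBA16FTarget(int width, int height) {
        int texture = glGenTextures();
        glBindTexture(GL_TEXTURE_2D, texture);
        // 16-bit float RGBA: twice the storage and bandwidth of plain 8-bit RGBA.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, width, height, 0,
                     GL_RGBA, GL_FLOAT, (ByteBuffer) null);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

        int fbo = glGenFramebuffers();
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, texture, 0);
        // A real deferred setup would attach several of these plus a depth buffer.
        return fbo;
    }
}
```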
The GPU also does lots of optimizations based on what you enable. Enabling the depth test can actually increase performance for 3D games (possibly by as much as 2-3x, depending on how many pixels get rejected), since the depth test can be done before the color of the pixel is calculated. Face culling can also improve performance a lot. However, some of these optimizations become unusable with certain combinations of settings. For example, for blending to be accurate the polygons need to be processed in the order they are submitted, while without blending we’re only interested in the closest pixel, so the GPU can in theory process them in any order it wants to (this is speculation). Other things like enabling alpha testing or modifying the depth of a pixel in a shader also force the shader/fixed functionality to run before the depth/alpha test. It’s easy to accidentally produce a case where the GPU cannot use such optimizations. Even worse, the flexibility of the GPU varies between GPU generations and even more between vendors, so what works for you might crawl on another GPU, or vice versa.
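To make that a bit more concrete, here’s a rough LWJGL-style sketch of the state I’m talking about. It assumes a context and shaders are already set up, and the method names are just made up for illustration:

```java
import static org.lwjgl.opengl.GL11.*;

public class DepthStateSketch {
    // State that lets the GPU reject hidden pixels early.
    public static void setupOpaquePass() {
        glEnable(GL_DEPTH_TEST);   // hidden pixels can be rejected before shading
        glDepthFunc(GL_LESS);
        glEnable(GL_CULL_FACE);    // skip rasterizing back-facing triangles entirely
        glCullFace(GL_BACK);
        glDisable(GL_BLEND);       // no blending -> submission order of opaque triangles doesn't matter
    }

    // State that tends to disable those optimizations.
    public static void setupTransparentPass() {
        // Blending needs back-to-front order, and alpha testing / writing gl_FragDepth
        // in the shader forces the shader to run before the depth test can reject anything.
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
        glDepthMask(false);        // typically keep depth writes off for transparents
    }
}
```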
I don’t even know a fraction of what my GPU does, but at least I know that I don’t know much about it, and I take that into account. Making it as easy as possible for the GPU to use its optimizations is important for real-world performance. Ever heard of a z pre-pass? It’s when you draw everything in the game twice to increase performance. Makes sense, doesn’t it? By first drawing only the depth of the scene to the depth buffer, we can then use the depth test to run the shader only on the pixels that are actually visible. We might double the vertex cost of the frame to reduce the amount of SHADED overdraw to 0, which can be a perfectly valid tradeoff if your pixel shaders are expensive enough.
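Here’s roughly what a z pre-pass could look like with LWJGL-style calls, as a minimal sketch; drawScene stands in for whatever issues your draw calls, and the shader binding is left out:

```java
import static org.lwjgl.opengl.GL11.*;

public class ZPrePassSketch {
    // Assumes a current OpenGL context; drawScene issues the actual draw calls.
    public static void renderFrame(Runnable drawScene) {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glEnable(GL_DEPTH_TEST);

        // Pass 1: depth only. Color writes off, cheapest possible fragment work.
        glColorMask(false, false, false, false);
        glDepthMask(true);
        glDepthFunc(GL_LESS);
        drawScene.run();

        // Pass 2: full shading. Depth writes off; only fragments matching the
        // pre-pass depth survive, so shaded overdraw drops to ~0.
        glColorMask(true, true, true, true);
        glDepthMask(false);
        glDepthFunc(GL_LEQUAL);
        drawScene.run();
    }
}
```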
GPUs are massively parallel processors. A GT430 “only” has:
- 96 unified shader processors. They can switch between processing vertices and pixels, to adapt to some extent to an uneven workload. This was introduced around the time deferred shading became big: when doing deferred shading you first have a very vertex-heavy geometry pass, but then switch to light processing, which is 100% pixel-limited instead. Ancient cards with separate vertex shaders and pixel shaders would have half their shaders stalled when doing deferred shading.
- 16 texture mapping units. I’m not sure about everything they do, but they handle bilinear filtering and spatial caching of texture samples. Bilinear filtering is free for 8-bit RGBA textures on today’s GPUs thanks to these. The texture cache also helps hugely when we’re only sampling a small, local part of a texture. In this thread I made a program that benefited a lot from this cache. By zooming out too much, I could see what happens when we start to sample the tile map more or less randomly, and the FPS dropped from 1350 to 450. The shader workload remains identical, but we get a texture bottleneck! Wooh! That means we could do math in the shader for free as long as it isn’t dependent on the texture samples! Confused yet? (There’s a small filtering sketch after this list showing one way to keep those fetches cache-friendly.)
- 4 ROPs. You made me a bit curious, and it seems like “The ROPs perform the transactions between the relevant buffers in the local memory - this includes writing or reading values, as well as blending them together.” (Wikipedia). That would mean the ROPs handle the depth test and stencil test in addition to blending. You learn something new every day!
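As a side note on that tile map cache behavior: mipmapping is one way to keep texture fetches local when zoomed out, since the GPU then samples a smaller, pre-filtered version of the texture instead of skipping across the full-size one. This is just an illustration of the cache-locality point, not necessarily what that particular program did; the helper name is made up and the texture is assumed to be uploaded already:

```java
import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL30.*;

public class TextureFilterSketch {
    // Assumes a current OpenGL 3.0+ context and an already-uploaded texture.
    // When zoomed out, neighbouring pixels sample texels far apart in the full-size
    // texture, which defeats the texture cache; sampling a smaller mip level keeps
    // the fetches local again.
    public static void enableMipmaps(int texture) {
        glBindTexture(GL_TEXTURE_2D, texture);
        glGenerateMipmap(GL_TEXTURE_2D);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    }
}
```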
That’s usually written as 96:16:4. On the other hand, a Radeon HD 7970 has a core configuration of 2048:128:32. That’s 21.3x the number of shaders, 8x the number of texture mapping units and 8x the number of ROPs. What’s up with the number of shaders?! Well, Radeon cards have traditionally had a higher number of shaders. NVidia countered this with a separate shader clock which ran at double the clock of everything else on the card, meaning that the HD 7970 “only” has about 10.7x the shader power in practice. (With the Kepler-based GTX 600 series, NVidia ditched the separate shader clock and roughly tripled the number of shaders to match AMD’s setup, which is more power efficient.)

On top of that, a ROP on one card may not equal a ROP on another card. For example, NVidia’s ROPs are famous for being able to pump out more pixels per clock than AMD’s. Memory bandwidth also affects texture performance, ROP performance, how well the game scales with higher resolutions, etc. On top of THAT, we also have a GPU clock and a memory clock, which affect performance the same way they do for CPUs. Yes, it’s common to overclock your GPU if you have the cooling for it. NVidia’s newest cards even have built-in overclocking that kicks in when the card isn’t using its full power budget, similar to how a quad-core CPU increases its clock rate when not all cores are used. This was a great idea, since GPUs have so many hardware features that may not be used to their full potential in a given game. Here’s a diagram of a GTX 680, NVidia’s most powerful GPU at the moment:
http://www1.pcmag.com/media/images/285620-nvidia-geforce-gtx-680-block-diagram.jpg
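For the curious, here’s the back-of-the-envelope arithmetic behind those ratios as a tiny runnable sketch; the only assumption is the 2x shader “hot clock” on the GT430 mentioned above:

```java
public class ShaderRatioSketch {
    public static void main(String[] args) {
        // Core configurations from above: shaders : TMUs : ROPs.
        int gt430Shaders = 96, gt430Tmus = 16, gt430Rops = 4;
        int hd7970Shaders = 2048, hd7970Tmus = 128, hd7970Rops = 32;

        System.out.println("Shader ratio: " + (double) hd7970Shaders / gt430Shaders); // 21.3x
        System.out.println("TMU ratio:    " + (double) hd7970Tmus / gt430Tmus);       // 8x
        System.out.println("ROP ratio:    " + (double) hd7970Rops / gt430Rops);       // 8x

        // The GT430's shaders run on a separate clock at ~2x the core clock,
        // so per core clock it behaves like 96 * 2 = 192 shaders:
        System.out.println("Effective shader ratio: "
                + (double) hd7970Shaders / (gt430Shaders * 2)); // ~10.7x
    }
}
```

Anyway, back to that GTX 680 diagram.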
Only the small green boxes are unified shader units. What’re all those other things?! Here’s a zoomed-in picture of one of the SMX units:
Polymorph Engines (yellow): DX11-generation cards (your GT430 included) have hardware tessellators which can cut up a triangle into smaller triangles, which can then be displaced to create genuinely uneven surfaces from a single triangle. NVidia does this with their Polymorph Engines, among other things as you can see (the 400 and 500 series use the first generation, the GTX 600 series the second, hence 2.0). Does your game use tessellation? If not, we have idle hardware on your card. Too heavy tessellation of triangles can easily turn into a bottleneck too (there’s a tiny draw-call sketch after this list showing what it takes to engage the tessellator at all).
SFUs (dark green): These handle special floating-point math (Special Function Units), presumably things like trigonometric functions and maybe even square roots. They’re shared by a few shader units each, so they could be a bottleneck too.
Raster Engines (yellow): I believe these take in vertices and optionally indices, assemble triangles and output which pixels are covered by each triangle (= rasterizing). Could these be the bottleneck when we have too much overdraw of cheap pixels?
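And since I mentioned tessellation above: this is roughly what it takes to engage the tessellator at all. A minimal sketch, assuming a GL 4.0 context with a program that includes tessellation control/evaluation stages already bound; vertexCount and the VAO setup are left out:

```java
import static org.lwjgl.opengl.GL11.glDrawArrays;
import static org.lwjgl.opengl.GL40.*;

public class TessellationSketch {
    // Draw patches instead of plain triangles so the hardware tessellator is used.
    public static void drawTessellatedPatches(int vertexCount) {
        glPatchParameteri(GL_PATCH_VERTICES, 3);   // treat every 3 vertices as one patch
        glDrawArrays(GL_PATCHES, 0, vertexCount);  // the tessellator subdivides each patch
    }
}
```

Drawing with plain GL_TRIANGLES skips the tessellation stages entirely, which is exactly the “idle hardware” case I mentioned.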
Conclusion: I could go on and on, but I think you get it by now, or rather, you get that you don’t get it. xD What’s bottlenecking your program? I don’t know. =D I wasn’t saying that the ROPs were bottlenecking your program; I meant that it could be anything, and that you’re not utilizing your GPU in the way its makers expected.