PBO performance problems

Hi,

I want to improve the texture update speed of my GL renderer by using PBOs. Right now, with standard glTexSubImage2D() updates on a 1024x1024 BGRA texture, I’m getting around 125fps, which isn’t too bad. However, for some reason, when I use a PBO as a pixel unpack buffer, map it to a ByteBuffer, and copy the contents into the mapped buffer, I get only 20fps??? The bizarre thing is, the same PBO path coded in C++ performs extremely well (up to 250fps), so what have I done wrong in the Java/JOGL version…

Please if someone could help that would be fab. I have attached the GLEventListener implementation to this post.
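In case it helps, here’s a stripped-down sketch of the PBO path I’m describing (it’s not the attached listener itself; TEX_W, TEX_H and pixelData are placeholders, and depending on the JOGL build the tokens may be the ARB-suffixed ones shown here or the core GL_PIXEL_UNPACK_BUFFER ones):

// Once, at init time: create the PBO and allocate room for one 1024x1024 BGRA frame.
int[] pbo = new int[1];
gl.glGenBuffers(1, pbo, 0);
gl.glBindBuffer(GL.GL_PIXEL_UNPACK_BUFFER_ARB, pbo[0]);
gl.glBufferData(GL.GL_PIXEL_UNPACK_BUFFER_ARB, TEX_W * TEX_H * 4, null, GL.GL_STREAM_DRAW);

// Every frame: map the PBO, copy the new pixels in, unmap, then let
// glTexSubImage2D() source from the bound PBO (an offset of 0 instead of a Buffer).
gl.glBindBuffer(GL.GL_PIXEL_UNPACK_BUFFER_ARB, pbo[0]);
java.nio.ByteBuffer mapped = gl.glMapBuffer(GL.GL_PIXEL_UNPACK_BUFFER_ARB, GL.GL_WRITE_ONLY);
pixelData.rewind();              // pixelData: a direct ByteBuffer holding the BGRA image
mapped.put(pixelData);
gl.glUnmapBuffer(GL.GL_PIXEL_UNPACK_BUFFER_ARB);
gl.glTexSubImage2D(GL.GL_TEXTURE_2D, 0, 0, 0, TEX_W, TEX_H,
                   GL.GL_BGRA, GL.GL_UNSIGNED_BYTE, 0L);
gl.glBindBuffer(GL.GL_PIXEL_UNPACK_BUFFER_ARB, 0);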

PS: I’m using JSR-231 beta 01 - October 27…

You shouldn’t create a new DebugGL object each time; make one in your init() method and call drawable.setGL(). That will ensure it is used in your reshape, display, etc. methods.
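For example (a sketch of what I mean, assuming the JSR-231 GLAutoDrawable callback signature):

public void init(GLAutoDrawable drawable) {
    // Install the DebugGL pipeline once; reshape, display, etc. then pick it up
    // automatically via drawable.getGL().
    drawable.setGL(new DebugGL(drawable.getGL()));
    GL gl = drawable.getGL();
    // ... rest of your init code ...
}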

I don’t think that should be the cause of the slowdown though. From your example you’re using static texture data, so there shouldn’t be any issue with the setup of the data taking longer in Java than in C++. Are you 100% sure you’ve set up all of the pixel unpack modes, etc. which might be needed to ensure your texture data isn’t undergoing any conversion after you send it down each time? I’m pretty ignorant of PBOs so I apologize if this suggestion is meaningless.
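For example, something along these lines (these happen to be the GL defaults, so this is just to rule out an accidental override somewhere):

gl.glPixelStorei(GL.GL_UNPACK_ALIGNMENT, 4);    // BGRA rows are naturally 4-byte aligned
gl.glPixelStorei(GL.GL_UNPACK_ROW_LENGTH, 0);   // rows tightly packed
gl.glPixelStorei(GL.GL_UNPACK_SKIP_PIXELS, 0);
gl.glPixelStorei(GL.GL_UNPACK_SKIP_ROWS, 0);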

If you have side-by-side C++ and Java code with a significant performance difference, could you please zip it up, file a bug with the JOGL Issue Tracker (you’ll probably need to be an Observer of the project to do so), and attach your test cases?

Hi Ken,

First of all, I’m absolutely amazed by your endeavour (and that of other forum members) in supporting us users!

OK, so I’ve changed the DebugGL creation as per your advice, but you were quite right: this was not the cause of the performance problem. In terms of setting up unpack modes, I don’t think there’s any difference between PBOs and traditional glTexSubImage2D; the way the data is interpreted should be exactly the same… Also, there’s my C++ test case, which performs so well. I get the performance hit the moment the PBO is mapped, via glMapBuffer(). It almost feels like JOGL is taking a copy from VRAM (despite the GL_WRITE_ONLY)…

https://jogl.dev.java.net/issues/show_bug.cgi?id=188

Cheers,
Matt.

PS: I tried the latest nightly build, makes no difference.

Thanks for the report. I will try to look at it soon after the holidays, or earlier if possible, but in the meantime I would strongly encourage you to look for any differences between your C++ and Java source. There is very little work being done by JOGL, and unless something like vsync (see GL.setSwapInterval(0)) is getting in the way, there should be no performance difference between JOGL and C++. There is absolutely no weird behind-the-scenes memory management being done by JOGL.
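To be explicit about the vsync point, that’s just a one-liner in init(), e.g.:

// Disable vsync so the buffer swap doesn't cap the frame rate.
drawable.getGL().setSwapInterval(0);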

I’ve got the swap interval set (see attachment to original post), and also in the NVidia driver property page. The C++ and Java versions issue identical OpenGL commands, so I really don’t see where the difference in speed comes from. I’ve now tested this on two different PCs (one has an NVidia GeForce 4 Ti 4600, the other a 7800 GT), and both give the same indication. I would be very grateful if you could look into this.

This may be a stupid question but would there be any difference in the way the default OpenGL context is set up between JOGL and the C GLUT library?

Yes, the pixel format selection code in JOGL is completely different from that in GLUT; however, in most if not all cases it should produce identical results. By default it delegates down to the platform’s pixel format selection routine, like ChoosePixelFormat.

I don’t think I can reproduce the slowdown on my machine (Quadro FX Go700, 81.85 drivers, current JOGL from the CVS repository). Here are the results:

C++ version:
Average frame time: 16.7571ms
Average frame time: 16.7511ms
Average frame time: 16.757ms
Average frame time: 17.3289ms
Average frame time: 16.7543ms
Average frame time: 16.7545ms
Average frame time: 16.7536ms
Average frame time: 16.7542ms
Average frame time: 16.7563ms
Average frame time: 16.7517ms

Java version (1.5.0_06):
Average frame time: 14.457143ms
Average frame time: 14.098592ms
Average frame time: 14.521739ms
Average frame time: 14.940298ms
Average frame time: 14.521739ms

So on my machine it looks like the Java version is actually faster than the C++ version. What driver version, etc. are you running? Could you try the current JOGL nightly build?

I took my numbers using the latest NVidia drivers, 81.98 (81.95), but also with the 78.01.

Just to confirm, does your card support PBOs? I will try running with the latest nightly build in a moment.

Also, dunno how important this might be, but I am running a dual display setup. One display is set to 1920x1200x32bpp, the other to 1600x1200x32bpp. Having said that, the “other” machine with the GeForce 4600 has just one display (1600x1200x32).

Thanks,
Matt.

Guess what: on the dual display machine, turning off the second display (under Display Properties) helped, and I now get similar timings between Java/JOGL and C++/GLUT, for both the PBO and the traditional glTexSubImage paths… Unfortunately I will need to be able to support dual display configurations!

Ken, you mentioned a difference in the way that JOGL and GLUT choose their pixel formats. I’m not experienced in this, but maybe this is where the problem lies?

Note I’m using the JOGL nightly build that is currently available, dated 20th December.

Try printing out glGetString(GL_VENDOR), GL_VERSION, and GL_RENDERER for the two programs in dual-head mode. Is there any difference in e.g. the renderers? Do you have one or two cards in the dual-head machine? I suspect the difference lies more in how GLUT and the AWT set up their windows than in how GLUT and JOGL choose pixel formats.
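Something like this in your init() would do it:

GL gl = drawable.getGL();
System.out.println("GL_VENDOR   = " + gl.glGetString(GL.GL_VENDOR));
System.out.println("GL_RENDERER = " + gl.glGetString(GL.GL_RENDERER));
System.out.println("GL_VERSION  = " + gl.glGetString(GL.GL_VERSION));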

You can specify -Djogl.debug.WindowsGLDrawable to see which pixel format JOGL chooses; I don’t know how you could get the same information out of GLUT.

OK so here’s what I get with the Java side (see attached updated code):

dual-head (1920x1200;1600x1200):

AWT-EventQueue-0: Using ChoosePixelFormat because multisampling not requested
AWT-EventQueue-0: Chosen pixel format (6):
GLCapabilities [DoubleBuffered: true, Stereo: false, HardwareAccelerated: true, DepthBits: 24, StencilBits: 0, Red: 8, Green: 8, Blue: 8, Alpha: 0, Red Accum: 16, Green Accum: 16, Blue Accum: 16, Alpha Accum: 16 ]
GLEventHandler.init(): GL_VERSION = 2.0.1
GLEventHandler.init(): GL_VENDOR = NVIDIA Corporation
GLEventHandler.init(): GL_RENDERER = GeForce 7800 GT/PCI/SSE2
GLEventHandler.init(): streaming texture image using PBO
Average frame time: 49.095238ms
Average frame time: 49.857143ms
Average frame time: 50.75ms
Average frame time: 50.8ms
Average frame time: 51.55ms

single-head (1920x1200):

AWT-EventQueue-0: Using ChoosePixelFormat because multisampling not requested
AWT-EventQueue-0: Chosen pixel format (6):
GLCapabilities [DoubleBuffered: true, Stereo: false, HardwareAccelerated: true, DepthBits: 24, StencilBits: 0, Red: 8, Green: 8, Blue: 8, Alpha: 0, Red Accum: 16, Green Accum: 16, Blue Accum: 16, Alpha Accum: 16 ]
GLEventHandler.init(): GL_VERSION = 2.0.1
GLEventHandler.init(): GL_VENDOR = NVIDIA Corporation
GLEventHandler.init(): GL_RENDERER = GeForce 7800 GT/PCI/SSE2
GLEventHandler.init(): streaming texture image using PBO
Average frame time: 4.932039ms
Average frame time: 4.8846154ms
Average frame time: 4.7652583ms
Average frame time: 4.815166ms
Average frame time: 4.787736ms

Well as you can see there is no difference in the renderer, pixel format, etc. being chosen when your system is in single- vs. dual-head mode. This points to something lower-level going on, possibly in the AWT. I’m not really sure how to best diagnose this problem further. Ideally we would go into the GLUT sources, instrument them similarly to make sure the same pixel format is being chosen, and then check to see if there are significant differences in how the HWND is being set up. Do you have the time / interest in digging deeper into this? It may take some time but it would probably help others doing dual-head work.

Can you still reproduce the slowdown on the single-head machine? Could you print out the same output from that machine? It might be easier to figure out what’s going wrong there.

Hey,

I have now managed to hack GLUT to output its PIXELFORMATDESCRIPTOR, in a format similar to what JOGL outputs with ‘-Djogl.debug.WindowsGLDrawable’.

Chosen pixel format (7): DoubleBuffered: 1 Stereo: 0 HardwareAccelerated: 0 DepthBits: 24 StencilBits: 0 Red: 8 Green: 8 Blue: 8 Alpha: 0 Red Accum: 16 Green Accum: 16 Blue Accum: 16 Alpha Accum: 16
The only difference I can make out is HardwareAccelerated=0 for GLUT, =1 for JOGL.

That’s great. Have you checked both the JOGL and hacked GLUT code to see whether they are really using exactly the same PIXELFORMATDESCRIPTOR? JOGL reports it is using index 6 while GLUT reports 7, but I think JOGL is using 0-based indices while GLUT is using the Windows default 1-based indices.

My only suggestion here would be to write your own GLCapabilitiesChooser to force JOGL to use exactly the same index as GLUT and see if it changes the behavior. However I suspect that both are already really using the same pixel format.
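Something along these lines (a rough sketch; FORCED_INDEX is whatever zero-based index you want to force, and the four-argument GLCanvas constructor is from memory, so adjust to whatever your build exposes):

GLCapabilitiesChooser chooser = new GLCapabilitiesChooser() {
    public int chooseCapabilities(GLCapabilities desired,
                                  GLCapabilities[] available,
                                  int windowSystemRecommendedChoice) {
        // Ignore the recommendation and force a specific entry; note that some
        // entries in 'available' may be null.
        return FORCED_INDEX;
    }
};
GLCanvas canvas = new GLCanvas(new GLCapabilities(), chooser, null, null);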

I really don’t know what the issue could be. It might be related to properties in the window class set up by the AWT or may have something to do with how the AWT handles its multi-monitor support. If you already have a hacked GLUT then maybe the easiest thing to do would be to modify it to look more like the AWT’s window setup code and see if you can induce the slowdown in your modified version of GLUT. Hacking the JDK is less easy although in theory doable. You can find the current Mustang JDK sources at http://mustang.dev.java.net/ and the native code in question is in src/windows/native/sun/windows/ . If you get a chance to download and briefly look at these sources I can probably help you tweak the GLUT code if you need it.

I tried various PFDs in the hacked GLUT code, and all of them performed well (apart from occasional single-buffer screen flicker problems). I also added a custom CapsChooser to my GLCanvas ctor, returning various values; none of them improved performance in the slightest.

I have since been wondering whether this could be a threading issue in the NVidia drivers. After all, I am actually having severe stability issues with this setup too, but that’s entirely OT. In case you’re interested, NVidia drivers don’t like dual-core CPUs at all (http://www.google.co.uk/search?q=forceware+dual+core+problem&start=0&ie=utf-8&oe=utf-8). Anyway, I am not convinced this is causing the original problem: instead of using an animator, I tried ticking the scene from the event dispatch thread, and that makes it no faster.

I haven’t got the time just now to dig into the JDK sources. I never tried building it, for starters. Is that relatively easy? By the way, GLUT’s Win32 WNDCLASS setup is quite straightforward, I didn’t see anything out of the ordinary in there.

If I had a decent ATI gfx card I’d give it a try but the one I have (Radeon 7000) won’t do PBOs. Anyone out there willing to help?

I went to do some timings on “the other” machine with just one display attached to it, running off a GeForce 4 Ti 4600. On that setup, JOGL and GLUT achieve comparable PBO timings of 10…12ms. This is in agreement with my findings on my primary machine when it is configured for single display mode (except that the primary machine is much faster).

However, I couldn’t resist attaching a second display to that “other” box, and guess what: the JOGL PBO timings go bad, same story as on my primary box. I have attached the output of running the GLUT and JOGL proggies for this scenario, including the PFD attributes.

I have just downloaded Mustang build 65, will install that first and see if it makes any difference. If not then I’m going to try and build it from source, and then study the area in there that you pointed me at.

On the GeForce 4600 the chosen pixel formats are definitely different, which you can see just from looking at the attributes. Could you please check whether you have indices which match between the two pieces of code? I think JOGL’s is probably zero-based while the one being printed from GLUT is one-based. If so, please change the GLUT one to be zero-based by subtracting 1 before printing it, so we know what we’re talking about. It would be instructive to write your own GLCapabilitiesChooser which forcibly chooses a particular pixel format, so you can make JOGL’s match GLUT’s.

Again, I don’t think that’s the root cause of the slowdown.

I also doubt that multithreading issues are the cause of the slowdown. JOGL explicitly forces all OpenGL work onto one thread internally (when using the GLEventListener callback model) because of stability problems with multiple vendors’ drivers in the face of multithreading. I think it’s probably something going on in the AWT.

I did write my own GLCapsChooser, forcibly choosing any one of the available PFDs. Admittedly not on the GeForce 4600 system, but the 7800GT one. Anyway it makes no difference to performance when using PBOs.

I tried building Mustang but am getting swamped here with strange error messages. I don’t want to bore you with the details, but just in case you’re interested: somehow, I can only run ‘make’ from within Cygwin’s bash shell (not from cmd.exe directly), even though everything should be on the PATH. I have also set up all the other env vars correctly (I think), but am not getting past the sanity check. Also, it is trying to build the 64-bit target (am running XP x64 here), whereas I really only want the Win32 target (haven’t had time to try and build JOGL for 64-bit). Needless to say, VC7 won’t build 64-bit binaries, so god knows why it’s trying to do that.

Apologies if all this sounds very incoherent…

I’m not sure that building Mustang is the most expedient way to track down what’s going on, but I’d be glad to try to help you get it working. You might also consider posting on the Mustang feedback forum. I think it is to be expected that you have to run make from within the Cygwin shell on Windows. Please post the output from the sanity check failure. I think you can override which architecture the builds produce with “make ARCH_DATA_MODEL=32”. I think you will need to first build the HotSpot sub-portion of the workspace (unless you started with the “control” build, which I have less experience with).

Cool. As far as I can see, AwtComponent (src/windows/native/sun/windows/awt_Component) is what I should be looking at. Roundabout 8,000 lines of C(++) code!

There is AwtComponent::FillClassInfo(WNDCLASSEX *) which looks vaguely familiar, and other stuff too. But I clearly lack the expertise to even remotely understand where the problem could be. You mentioned you could help me to hack the GLUT code. Do you have any concrete ideas?

I am tempted to write a non-GLUT Win32 testbed for my PBO problem, to have one code base less to worry about.

If you can point me either to your hacked GLUT or where you started from I can look at it with you. I think the first thing to try is to add all of the flags from the AWT’s WNDCLASS to GLUT’s to see if it’s one of those which is causing the problem.