JOGL JSR implementation performance with JNI

The JNI part of JOGL would become part of the standard JDK, so I assume Sun will allow inlining and optimisation of JOGL’s JNI layer, making it a lot faster?
Not that I have performance issues with JOGL, but I’d like to know whether JOGL’s JNI code will be inlined like the rest of Sun’s JNI code.

ALSO: Ken Russell, you have a PM.
Thanks.

The performance of the current JOGL, and of JNI in general, is just fine in my opinion. We built a prototype of JOGL earlier this year using a radically different and more efficient native method calling interface than JNI and weren’t able to see any significant speedups on Jake2 on modern processors, so it was hard to justify pushing the prototype further. If you have a real-world application showing a significant performance difference between some C/C++ OpenGL code and JOGL-based OpenGL code which is attributable to JNI overhead, then please post or file a bug about it and we’ll be glad to look into it.

A recent change in JOGL (JSR-231), the one which made all NIO Buffer pointers significant, made me recode a good chunk of GL invocations from within a tight context to C++.

In the process, I was also wondering whether a noticeable performance improvement would result from this change.

Surprisingly, the performance impact of this change was insignificant. (I had been wondering lately how JNI affects performance; it appears JNI has improved.)

On a quick note, if it’s not secret, what is this radically new native calling interface of which you speak, Ken?
I have looked at CNI briefly, but after determining for myself that the improvement from moving GL invocations in tight loops to C++ was almost negligible, I was pretty much left with the impression that JNI does the job.

Hi Ken,

nice to hear that you use Jake2 as a benchmark.

Have you tested your prototype with the renderer named “jogl”?
This one produces a lot of simple OpenGL calls. (ca. 25000 per frame)
If you want to stress the JNI layer, you can run the demos with 32 player models.
Type to console:
timedemo 1
cl_testentities 1
map q2demo1.dm2

The “fastjogl” renderer uses FloatBuffers and vertex arrays. (ca. 5000 OpenGL calls per frame)
That’s why it’s possible that you can’t see any significant speedups.
But this impl uses a lot of FloatBuffer puts and gets.
(for FloatBuffer stress tests you can use this renderer (or lwjgl) and the same commands as above)
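For anyone who hasn’t looked at the two renderers, the difference is roughly this; the sketch below is hypothetical (written against the JSR-231 javax.media.opengl.GL interface, not the actual Jake2 code): the “jogl”-style path crosses JNI for every vertex, while the “fastjogl”-style path packs the same geometry into a direct FloatBuffer and submits it with a handful of calls.

[code]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import javax.media.opengl.GL;

public class SubmitTriangles {
    // "jogl"-style: one native call per glVertex3f, thousands per frame.
    static void immediateMode(GL gl, float[] verts) {
        gl.glBegin(GL.GL_TRIANGLES);
        for (int i = 0; i < verts.length; i += 3) {
            gl.glVertex3f(verts[i], verts[i + 1], verts[i + 2]);
        }
        gl.glEnd();
    }

    // "fastjogl"-style: the same geometry in a direct FloatBuffer,
    // submitted with a handful of calls.
    static void vertexArray(GL gl, float[] verts) {
        FloatBuffer buf = ByteBuffer.allocateDirect(verts.length * 4)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
        buf.put(verts).rewind();
        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
        gl.glVertexPointer(3, GL.GL_FLOAT, 0, buf);
        gl.glDrawArrays(GL.GL_TRIANGLES, 0, verts.length / 3);
        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);
    }
}
[/code]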

Are there any optimizations for FloatBuffers in jdk1.6.0?
(like the MappedByteBuffer-cast-trick for direct ByteBuffers)

bye
Carsten

We used the jogl renderer and the timedemo mode, but didn’t try / know about the cl_testentities command. We tested on a few different processors and for any recent processor with good branch prediction (Pentium 4, Pentium M, or Opteron) the speedup with the faster native interface layer was no more than 5% to the best of my recollection. This was several months ago however so I don’t remember all of the details. I do remember that for any processor without full SSE2 support the speed difference between Java and native C code was huge due to the inefficient code that must be generated in order to make the Intel x87 floating-point unit produce Java-compliant floating point results.

As mentioned above we deliberately used the jogl renderer, not the fastjogl renderer, to do our comparisons.

The Java HotSpot server compiler implements bimorphic call inlining in Mustang, meaning that if there are, for example, two hot data types at a particular call site, it will inline both with a type test at the top. This speeds up certain uses of NIO where direct and non-direct buffers are mixed in the same application.
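To illustrate what that buys (a hypothetical example, not taken from the JDK or JOGL): a single hot call site that sees both a direct and a heap FloatBuffer can have both get() implementations inlined behind a type check instead of paying a virtual dispatch on every element.

[code]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BimorphicBufferSum {
    // One hot call site: buf.get(i) sees exactly two receiver types
    // (direct and heap FloatBuffer), which bimorphic inlining can handle.
    static float sum(FloatBuffer buf) {
        float s = 0f;
        for (int i = 0; i < buf.limit(); i++) {
            s += buf.get(i);
        }
        return s;
    }

    public static void main(String[] args) {
        FloatBuffer direct = ByteBuffer.allocateDirect(1024 * 4)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
        FloatBuffer heap = FloatBuffer.allocate(1024);
        float total = 0f;
        for (int iter = 0; iter < 100000; iter++) {
            total += sum(direct);   // hot type #1 at the call site
            total += sum(heap);     // hot type #2 at the same call site
        }
        System.out.println(total);
    }
}
[/code]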

Hi Ken, would you care to elaborate on what you mean by the above? AFAIK there are still a lot of CPUs out there that do not have SSE2, and we’ll have to support those and make sure there’s an acceptable level of performance for them.

So my questions are:

  1. How is SSE2 used in JSR-231?
  2. How huge are the performance differences and what’s the context?
  3. What do you mean by ‘inefficient code that must be generated’?

Many thanks in advance!

.rex

This is a red herring; we are talking about sub-5% performance degradation, and if the machine doesn’t have SSE2 it’s also unlikely to have a particularly speedy graphics card, which is probably going to be the bottleneck anyway.

Cas :)

A 5% JNI performance impact would equate to a 5% hit in a very small part of the code (the JNI layer).
It’s added latency at worst.

The Quake II engine is fairly floating-point intensive, and when the x87 floating-point stack is used for floating-point computations there are frequent stores of intermediate floating-point values to and from main memory in order to make the values the program sees IEEE-compliant. It is a well-known fact that this incurs a significant amount of overhead in FP-intensive Java programs, and while I don’t have any concrete numbers in front of me, it can easily be much more than 5%. Do a Google search for “java floating point x87” for references. The good news is that on the Pentium III and later, single-precision Java FP computations can use the IEEE-compliant SSE registers, and on the Pentium 4 and later, both single- and double-precision Java FP computations can use the IEEE-compliant SSE2 registers. The use of these registers eliminates this overhead.
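For concreteness, here is a minimal, hypothetical sketch of the kind of FP-heavy inner loop being discussed (not taken from Jake2). Every intermediate result must behave like a true 32-bit float, which on x87-only code generation typically forces stores to memory and reloads, while SSE/SSE2 registers produce the required rounding directly.

[code]
public class VectorNormalize {
    // Normalize packed xyz triples in place.
    static void normalize(float[] v) {
        for (int i = 0; i < v.length; i += 3) {
            // Each product and sum below is an intermediate that Java
            // requires to be rounded to 32-bit float precision.
            float lenSq = v[i] * v[i] + v[i + 1] * v[i + 1] + v[i + 2] * v[i + 2];
            float len = (float) Math.sqrt(lenSq);
            if (len != 0f) {
                v[i]     /= len;
                v[i + 1] /= len;
                v[i + 2] /= len;
            }
        }
    }
}
[/code]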

As I stated above on a PIII the overhead of using the x87 expression stack and associated stores to main memory appeared to be very high; Jake2 ran at roughly half the speed of the C version. On this processor eliminating the JNI overhead yielded roughly a 15% speedup to the best of my recollection, but the scores were still pretty far from those of the C version. On a Pentium M, Pentium IV or Opteron processor (all of which support SSE2) the differences between the Java and C versions of Quake II were very small (actually, I think the Java version ran faster on some of the processors, possibly because it was using the SSE registers and the C version wasn’t) and eliminating the JNI overhead didn’t yield a significant speedup, indicating that the better branch prediction in the P4 and later processors is already doing a good enough job of reducing the multiple function call overhead of JNI.

Although moving the bulk of GL invocations down to native on a P3 @ 500 didn’t yield a very significant improvement, I’m still pursuing this path, mainly due to the scenario I’m in:

Basically, with JSR-231 I have to synchronize two threads, and although a lack of synchronization can in general be considered dangerous, it works very well and is stable in JOGL 1.1.1. I have one thread which is a filler; it pipes new texture data into a direct ByteBuffer every frame. Another thread is the renderer, which actually sends this ByteBuffer to GL.

I must disclaim that this problem is very domain-specific, so don’t take my writing as sour grapes, but rather as experience.
In JOGL 1.1.1, at worst you can get some garbled data, which in itself is very unlikely, but the application is stable.

In JSR-231, due to the pointer significance (sorry if this sounds like a broken record by now), the app crashes nondeterministically. (Especially pronounced on my dual test machine.)

By adding synchronization to this process, I’m sure the stability problem could be eliminated, though at the cost of a good bit of performance. I’m mostly speaking from experience and haven’t actually taken the sync route in this case; what I did instead was move the required GL invocations down to the native layer.

This has removed the pointer significance problem and even yielded an improvement, albeit a slight one (not too many invocations, and not FP-intensive). Although the performance improvement was slight, that is only when dealing with a single stream; I have to process multiple streams, so the improvement adds up quickly. Overall, I’m happy with this development.

So basically the Java end of things handles all the management of textures, shaders, and other data … and establishes the GL context on the EDT; then I call into native code and do the actual scene invocations from there.

Anyway, a 15% JNI overhead, if it is correct, is IMO still quite significant. I guess part of me still feels that I should squeeze out as much as I can, since I’m targeting lowish-end hardware (CPU/GPU). (I’m developing on a 500 MHz P3 with a Radeon 9k.)

Good to know your statistics though.

How many OpenGL calls are you making per stream? I would have expected something like five: glVertexPointer, glNormalPointer, glTexCoordPointer, glDrawElements, and the like. I have a hard time believing that trading so few native method calls for one (are you calling down into native code to make your OpenGL calls, or do you have a loop down in your native code?) will yield a significant performance improvement.

Likewise for synchronization. It is possible to perform many, many monitor notifications per second, and if the app is structured properly then you could have a round-robin pool of buffers to fill so that your data streams down from your compute thread to your rendering thread without forcing the compute thread to explicitly rendezvous with the rendering thread (unless the rendering thread doesn’t keep up and you run out of available buffers in the compute thread). The Java2D OpenGL pipeline works like this to the best of my knowledge, and I have personally worked on Java-based 3D apps which did this (see [url=http://characters.www.media.mit.edu/groups/characters/papers/bfg.pdf]this paper[/url]) to achieve high frame rates on 1998-era JVMs and hardware. Basically my point is that you should avoid prematurely optimizing your application by making it unsafe (eliding necessary synchronization) and instead first concentrate on making it correct.
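As a rough sketch of that round-robin idea (buffer counts and names are hypothetical; this is not the Java2D pipeline or JOGL code): the filler thread only blocks when it runs out of free buffers, and the renderer only blocks when no filled buffer is ready.

[code]
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FramePool {
    private static final int POOL_SIZE = 3;
    private final BlockingQueue<ByteBuffer> free   = new ArrayBlockingQueue<ByteBuffer>(POOL_SIZE);
    private final BlockingQueue<ByteBuffer> filled = new ArrayBlockingQueue<ByteBuffer>(POOL_SIZE);

    public FramePool(int frameBytes) {
        for (int i = 0; i < POOL_SIZE; i++) {
            free.add(ByteBuffer.allocateDirect(frameBytes));
        }
    }

    // Called by the filler/compute thread once per frame of texture data.
    public void produce(byte[] textureData) throws InterruptedException {
        ByteBuffer buf = free.take();   // blocks only if the renderer falls behind
        buf.clear();
        buf.put(textureData);
        buf.flip();
        filled.put(buf);
    }

    // Called by the rendering thread; hand the buffer to the GL texture
    // upload, then recycle it with release().
    public ByteBuffer acquireFilled() throws InterruptedException {
        return filled.take();
    }

    public void release(ByteBuffer buf) throws InterruptedException {
        free.put(buf);
    }
}
[/code]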

I’d be interested to know how your project and its performance is going. Please post updates as your work continues.

How many OpenGL calls are you making per stream?

Worst-case scenario (using all 6 texture units on R200-class hardware), I’m making on the order of 70 GL calls per stream.
3 of the textures are planar YUV data; the other 3 textures are RGB gamma ramps. The shader does all the conversion.

Then I have to send some geometry, not to mention set up state (6 texture units), render, and tear down. So, if using gamma correction, that’s ~70 calls per stream, per frame.

Additionally, there is world setup, render, and teardown (shadows, etc.). So all in all, quite a few GL calls. I’ve only moved the stream-based GL invocations into the native world, though; the rest is still done from Java. It’s important to underline my dedication to doing as much as I can from Java, but sometimes I just have to do what I must.
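Roughly speaking, the per-stream work looks something like the sketch below, which is where the call count comes from. (This is a hypothetical reconstruction against the JSR-231 GL interface, not my actual code.)

[code]
import javax.media.opengl.GL;

public class StreamDraw {
    static void drawStream(GL gl, int[] yuvTex, int[] rampTex, int vbo, int vertexCount) {
        // Units 0-2: planar Y, U, V textures.
        for (int i = 0; i < 3; i++) {
            gl.glActiveTexture(GL.GL_TEXTURE0 + i);
            gl.glEnable(GL.GL_TEXTURE_2D);
            gl.glBindTexture(GL.GL_TEXTURE_2D, yuvTex[i]);
        }
        // Units 3-5: R, G, B gamma ramp textures.
        for (int i = 0; i < 3; i++) {
            gl.glActiveTexture(GL.GL_TEXTURE0 + 3 + i);
            gl.glEnable(GL.GL_TEXTURE_2D);
            gl.glBindTexture(GL.GL_TEXTURE_2D, rampTex[i]);
        }
        gl.glPixelStorei(GL.GL_UNPACK_ALIGNMENT, 1);

        // Geometry from a VBO.
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, vbo);
        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
        gl.glVertexPointer(3, GL.GL_FLOAT, 0, 0L);
        gl.glDrawArrays(GL.GL_TRIANGLE_STRIP, 0, vertexCount);
        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);

        // Teardown: disable the units again, leaving unit 0 active.
        for (int i = 5; i >= 0; i--) {
            gl.glActiveTexture(GL.GL_TEXTURE0 + i);
            gl.glDisable(GL.GL_TEXTURE_2D);
        }
    }
}
[/code]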

As for the circular buffer, yep, I’m using one. And as for synchronization, trust me, where it really counts I’m using all the proper threading techniques. I can skimp on the native buffers since the behaviour seems to be defined and has been thoroughly tested (at least wrt JOGL 1.1.1) over a period of time. As for premature optimization, I guess it could be considered that, although some parts are mature in terms of features and I’m doing optimizations on those parts in tandem with development of other elements of the system, which overall is quite diverse.

Also, I’ve made a simple test case for the JDesktopPane behaviour, if you want to take a look at it. I’m going to look at it this weekend, as it’s the only time I get to fix bugs. Fun fun.

Regards.

EDIT:

As for a loop in native code: yes, I have a version which does JAWT acquisition/release from native code. But so far I’ve found this method to be very unstable, so as of now I’m letting JOGL manage the GL context. I intend to do more research into this when time allows, but this I do consider to be a premature optimization, as overall the performance (with all the latest developments) is acceptable.

For some reason I seriously don’t believe you, at all. You must be abusing the GL API horribly to get anywhere near this kind of performance degradation through JNI. In fact I’m absolutely 100% certain of it, so perhaps you could find out where you’re abusing it and fix that?

Cas :)

What?

See if I care whether or not you believe me. Here I think I’m having an intelligent discussion, only to be accused of lying for (insert your reason here).

Probably because there’s really no way that the overhead of 70 method calls per frame can have that much significance.

Maybe if you were doing thousands of operations, then sure…

Somehow the premise got skewed here…

I’ve never said that the overhead was tremendous. However, it is my job to squeeze every ounce of performance out of the CPU (since the CPU is going to be doing a heck of a lot more than simply pushing verts to the GPU; in fact, one of the reasons I went with GL is to avoid the rendering hit), and since I’ve already ported a ton of native code to Java, for reasons which I won’t even begin to list, I’m going to stick with it, making appropriate adjustments where necessary. (It’s been a hybrid all along, heh; politics play no role in the development of this project.)

I think the earlier attempt, or should I say jab, was coarse; I couldn’t care less.
I’ve already said that my task is rather specific, and without being in a position to know what I must accomplish, simply throwing out baseless assumptions is kind of silly.

I joined these forums when I started out with JOGL, to report a bug. Forums like these shouldn’t have a barrier to entry.

Auf Wiedersehen

Getting back to the topic at hand, have you considered modeling the OpenGL state in your engine to avoid possibly-redundant native method calls to OpenGL? This might reduce the 70 method calls per stream to something smaller so that as your number of streams gets larger you have overall fewer calls to make. Additionally you might consider sorting your streams if you aren’t already so that similarly-enabled ones are rendered sequentially to take advantage of such an optimization.
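As a sketch of the idea (hypothetical wrapper names; only the GL methods themselves are from JSR-231): shadow the relevant state on the Java side and skip the native call when nothing actually changes.

[code]
import java.util.HashMap;
import java.util.Map;
import javax.media.opengl.GL;

public class StateCache {
    private final GL gl;
    private final Map<Integer, Boolean> enabled = new HashMap<Integer, Boolean>();
    private int boundTexture2D = -1;

    public StateCache(GL gl) {
        this.gl = gl;
    }

    // Only issue glEnable if the capability isn't already known to be on.
    public void enable(int cap) {
        if (!Boolean.TRUE.equals(enabled.get(cap))) {
            gl.glEnable(cap);
            enabled.put(cap, Boolean.TRUE);
        }
    }

    // Only issue glDisable if the capability isn't already known to be off.
    public void disable(int cap) {
        if (!Boolean.FALSE.equals(enabled.get(cap))) {
            gl.glDisable(cap);
            enabled.put(cap, Boolean.FALSE);
        }
    }

    // Skip redundant 2D texture binds (a real cache would track this per
    // texture unit).
    public void bindTexture2D(int id) {
        if (id != boundTexture2D) {
            gl.glBindTexture(GL.GL_TEXTURE_2D, id);
            boundTexture2D = id;
        }
    }
}
[/code]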

I’ve done some preliminary state management, such as enabling the textures backwards and coalescing calls, among other minor things. Good suggestion; I will look into it more thoroughly for sure.

The gamma-corrective op that runs on the R200 seems kind of bloated, but it’s not. There was an “option” of packing the RGB ramps into a single texture (saving some texture binds and state calls), but in order to do a dependent texture fetch on the Radeon 9k (which doesn’t support ARB fragment programs, only a semi-limited ATI proprietary shader, instruction-count-wise 8 per pass), I kept running out of instructions in the second pass to do the dependent texture fetch. (The first pass is maxed out with YUV & EQ.)

So that’s why I opted for using 6 textures: each of the top 3 textures contains one of the R, G, B ramps, while the lower 3 textures contain the YUV planes. I’ve also considered packing the YUV into an interleaved format, but that would induce a hit on the CPU since the data comes from the source in planar format, so those are the kinds of things I’ve been mucking with. (The CPU will be doing quite a bit of DCT, so no go.) (I’m also trying to minimize the number of contenders for the branch prediction table; I hear even on modern CPUs it’s a limited resource, so with all those other things looping everywhere, cache coherency can be a problem.)

I have a branch (in source control) which does most of this stuff using ARB programs, where you have the POW instruction, so gamma correction does not require a dependent fetch into separate ramps … and it’s so much nicer, I must say. (As well as other optimizations which I couldn’t pull off on such ancient HW, but supporting it is a requirement.)

I figured if the R200 has 6 texture units, why not use them. So with those 6 units, I must enable each, set up pixel alignment & unpack modes, then set up the VBO, send some geometry, and restore states.

All in all, I want to minimize the amount of time that gets spent in GL overall, but it’s inevitable (feature creep made me add features I wouldn’t otherwise have opted for myself), so I’m spinning my wheels looking for ways. But I’ll be doing some thorough profiling to iron out the remaining hotspots; so far only a few preliminary profiler passes have been done.

There was a good quote about time, or the lack thereof; I can’t remember it now. So much to do, so little time.

I also agree that something else is causing the performance problem. I would go to a profiler for this one to see what’s causing the delays, or strip out everything except the calls you think are causing problems and then test your FPS. My guess is that even on a PIII 500 you will be over 1000 FPS with just 70 calls and an ATI 9000. Using stencil shadows, for example, will cause a huge fill-rate hit and some nasty state changes (and probably CPU load, depending on how you did it). Depending on the textures and your bus speed and video RAM, texturing can also be a huge hit, and the texture filtering is another potential one. You also said elsewhere that you have one thread rendering into a byte buffer and another thread uploading it; if that happens every frame, it will be a huge, ugly, must-never-do-on-a-low-end-computer kind of performance hit. Etc., etc.

I think people are missing the point that this application makes 70 OpenGL calls per object, not per frame. If lots of objects are on the screen, then that factor of 70 can add up pretty quickly. I generally agree that JNI overhead should be pretty minimal, but it doesn’t sound like this is an unfounded complaint.