Future JNI optimisations

Hi all.
Does anyone around have any documents, or knowledge, of how JNI is going to evolve in the next few years so that its calls get lighter, if possible?

[quote]Hi all.
Does anyone around have any documents, or knowledge, of how JNI is going to evolve in the next few years so that its calls get lighter, if possible?
[/quote]
(1) No, there is definitely no documentation.
(2) This is to some degree a VM-specific thing.
(3) The syntax of JNI is designed to protect the VM’s state from errant C code and thus is not likely to change.
(4) Used correctly, for most things the overhead is minimal now. Native Direct Byte Buffers really solved the major known problem. What are you thinking is still an issue, and why?

The issue is the huge complexity and major PITA of dealing with “trivial” calls. What I expect Pepe’s after (just reading his mind here) is something like:

public native void someDLLMethod(int x, int y) library "blah";

which does a trivial mapping to the method someDLLMethod(int x, int y) in blah.dll/.so.

That’s about 90% of the hassle with JNI. The rest is fairly unavoidable although the API might conceivably be neatened up a little with a bunch of useful extra macros and shortcuts like throwing a named exception.
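For contrast, here’s roughly what that trivial binding costs today on the Java side alone (Blah and someDLLMethod are just the example names from above; the matching C stub, not shown, still has to be hand-written against jni.h for every platform):

```java
public class Blah {
    // This declaration still requires a hand-written C function named
    // Java_Blah_someDLLMethod, compiled against jni.h on each platform.
    // Generating that stub is exactly what the proposed
    //   public native void someDLLMethod(int x, int y) library "blah";
    // syntax would do for us.
    public static native void someDLLMethod(int x, int y);

    public static void main(String[] args) {
        try {
            System.loadLibrary("blah"); // looks for blah.dll / libblah.so
            someDLLMethod(1, 2);
        } catch (UnsatisfiedLinkError e) {
            // What you get whenever the native half is missing or mismatched.
            System.out.println("native link failed");
        }
    }
}
```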

Cas :slight_smile:

Hi Jeff. I missed your scrambling. :wink:

Oh, indeed I was pretty unclear. (Let’s shamelessly blame that on the late hour.)
Let me rephrase this…

Hem… toss toss… Does anyone have any insight into how HotSpot now handles JNI calls, and whether that is going to change in any direction (faster calls being the best direction, of course)?
(Leave aside the ‘coding for a specific VM is bad’ arguments; that is not the goal.)

To respond to your answers and question…
When saying “used correctly”, I recall that this used to mean “use as few as possible”. For JOGL, where a JNI call occurs for each GL method, this was not that pretty. JOGL is what I estimated as the worst case of JNI use, due to the gazillion+1 calls that have to occur in complex renders. (JOGL is an example; I think I could name others.)

I seem to remember vague fragments of discussions, some time ago, about how calls could be inlined with natively compiled code so that they would finally be cost-free. Is that real now? Did I dream it? mhhh…

The question has no real goal; I’m mostly checking the state of the art and resyncing with the latest news.

JOGL does not in fact map every call to a JNI call.

All data transfer is done through Native Direct Byte Buffers. The calls across the boundary are minimized, and the really expensive thing - moving data in and out of the VM - is almost non-existent.

AFAIK JOGL today is getting roughly the same practical speed as the equivalent OpenGL C bindings. If you have benchmarks to the contrary, we should get them to Travis and Ken to look into.

The issue isn’t how the call is made, per se. It’s what has to be done to prepare the VM before the call and deal with the results. As I mentioned, though, the really significant costs are moving data across the Java-heap/C-heap boundary, because that involves locking the Java data down and copying it.

Native Direct Byte Buffers solved that for us.
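The Java side of that pattern is just the standard java.nio API; a minimal sketch (the vertex data is made up):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect puts the storage outside the Java heap, so the
        // GC never moves it; native code can reach it through JNI's
        // GetDirectBufferAddress with no pinning and no copying.
        FloatBuffer verts = ByteBuffer
                .allocateDirect(3 * 4)           // 3 floats, 4 bytes each
                .order(ByteOrder.nativeOrder())  // match the platform's endianness
                .asFloatBuffer();
        verts.put(new float[] { 0.0f, 0.5f, 1.0f });
        System.out.println(verts.get(1));        // prints 0.5
    }
}
```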

I have some (old, JDK 1.3 timeframe) benchmarks on all of this in the book. As it’s all online, you might want to look at that.

[quote]Does anyone have any insight into how HotSpot now handles JNI calls, and whether that is going to change in any direction (faster calls being the best direction, of course)?
[/quote]
Yes I can tell you the grotty details if you like.
JNI calls are not going to get faster any time soon.

They can be made fast in some very specific situations, such as in embedded single-CPU devices with fixed application sets. Or if you can match the native compiler’s IR and the Java JIT’s IR (an extremely tough thing to do; IBM could do it; Microsoft is trying to do it with .Net; not many other institutions can even try).

Those are the ones I prefer, so please, yes. :slight_smile:

Cas: I agree that there’s really nothing we can do about that. Directly calling a native library would sure be nice to have, but… mhhh… I feel that making it too easy to link to native code would make people rely on it too much, for bad reasons (insert any general newbie misconceptions). The hassle is not that big, and I feel the change would in fact only displace it. Well, IMHO, of course.

[quote]JOGL does not in fact map every call to a JNI call.
[/quote]
Oh. That is not what I concluded when looking at the sources and at how GlueGen works. Can you say some more about this? It looks like I’ve been mistaken, then.

[quote]Directly calling a native library would sure be nice to have
[/quote]
I wrote, a few years ago, a library for that in assembler; it could even do command chaining.

Another thing it could do was create executable code in memory, stop the computer NOW to protect it against overheating, for example, load a native DLL and find a method in the DLL, and so on. I thought about adding a method that could comfortably, with a GUI, modify the NTFS table, or whatever sits at the start of the HD partition, directly, without any possibility of being restrained by the OS, but I didn’t need it much for current Java programs.

The problem that might arise is a difference in JNI calls between the server and client VMs, if it’s still there. (Of course if you distribute JOGL, you might distribute the server VM as well, so it’s not as big an issue.)
Actually they are under 48 CPU cycles, IIRC. Could someone try that benchmark on his own computer?

It might be interesting to see the disassembled JVM JNI call code, for example from JOGL.

BTW, what is difficult about the simple:

System.loadLibrary("someLibrary");
private native void someCall();

[quote]Yes I can tell you the grotty details if you like…
Those are the ones I prefer, so please, yes.
[/quote]
Ahhh… it’s a big story. I’ll take some shortcuts. You can ask for more details.

Background: it’s a desktop or server environment, so pre-emptive multi-threading is normal, not cooperative scheduling. Blocking native I/O. Precise & moving GC (a conservative non-moving GC allows for faster JNI calls). Many of these steps do not apply in an embedded system, or one with cooperative suspension, or single-threaded applications, or a single CPU, or where the native compiler is known & trusted, or where the native call promises not to block (or is at least statically known to block or not), etc.

Since a native call might block or otherwise take a long time, and since other threads are running and allocating, a GC must be allowed while the thread is in native code. A moving GC makes the mutators more efficient - but there’s no way to find & move the pointers held by the native code/compiler. So you don’t hand out any raw heap pointers; you hand out Handles. The native code is never allowed to touch a heap pointer, lest an ill-timed GC move things the native code is also touching, crashing the native code. Upon return from the native code, a GC may be in progress. The regular Java threads have been stopped for the GC, but the threads in native code have not - so they have to do a lock-like thing to assure that a GC is not in progress (or doesn’t start up). Basically, you must CAS on exit from the native call, or you have a race condition where a GC might think all Java threads have been stopped & start moving objects, while a thread returning from a native call (where GC is ok) into Java code (where GC is not) starts touching objects.

Also, you have to be able to find all the objects - including those passed as arguments on the stack and handlized. Generally this is done by maintaining a mapping between PCs and which registers hold oops - but HotSpot doesn’t know where the native compiler will put the JNI call’s PC. So we need the return PC (the return from the native code back into the wrapper code) jammed down before we act “as if” we’re in native code and allow a GC. Same for the stack pointer.

Finally, argument calling conventions between JIT’d code and native code are usually different - the JIT doesn’t need var-args support, or legacy calling-convention support, or a JNIEnv. Putting it together we get:

push a frame
store any objects passed in registers down (prior to handlizing)
if a sync method, lock
make a copy of the arguments:
add the JNIEnv argument
copy the rest from JIT convention to native convention (generally reg-reg moves on a RISC, stack-stack moves on X86)
wrap a handle around objects (requires a null-test/branch)
store the return-pc somewhere (generally thread-local storage)
maybe memory fence, in case GC is somewhat concurrent
store the sp nearby - which also allows a GC, and requires the objects & PC be coherent in memory
do the native call
reclaim the GC lock (generally a CAS) or block
if sync, unlock
if returning object, de-handlize
pop frame
return

Clear as mud, I hope?

Right, so - there’s absolutely no reasonable reason not to allow super-easy native library integration as I described above? I.e. direct integration with an existing, non-JNI DLL.

Cas :slight_smile:

[quote]Right, so - there’s absolutely no reasonable reason not to allow super-easy native library integration as I described above? I.e. direct integration with an existing, non-JNI DLL.

Cas :slight_smile:
[/quote]
To diverge some more, what would bother me with such a thing is how to manage the alignment and size of data structures. Also, how, for instance, would you handle the C int changing size between two different DLLs compiled by different compilers?
I’m sure there would be other ‘little things’ that would make it undoable.
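To illustrate the kind of bookkeeping I mean, here’s how you’d have to match a C struct layout by hand today (the struct and its offsets are invented for the example, and assume the usual 8-byte alignment for double):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StructLayout {
    // Hypothetical C struct on the other side:
    //   struct Point { int id; double x; double y; };
    // With 8-byte alignment for double, the compiler pads 4 bytes
    // after 'id' - exactly the compiler-specific detail the Java
    // side has to get right by hand.
    static final int ID = 0, X = 8, Y = 16, SIZE = 24;

    public static void main(String[] args) {
        ByteBuffer p = ByteBuffer.allocateDirect(SIZE)
                                 .order(ByteOrder.nativeOrder());
        p.putInt(ID, 7);
        p.putDouble(X, 1.5);
        p.putDouble(Y, -2.5);
        System.out.println(p.getDouble(Y)); // prints -2.5
    }
}
```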

Thanks for the details, cliffc. It gave me some useful insight of why it would not change soon.

There are no alignment issues; either the parameters match correctly, assuming that their types can be read from the DLL, or they don’t; if they don’t, or can’t be read, you get an UnsatisfiedLinkError - or perhaps we just shove the parameters on the stack and hope for the best. We must remember that by the point at which we are talking to native code we are already in crazy C land, and anything that goes wrong is really the responsibility of the developer to sort out.

Cas :slight_smile:

[quote] System.loadLibrary("someLibrary"); private native void someCall();
Right, so - there’s absolutely no reasonable reason not to allow super easy native library integration as I described above? ie. direct integration with an existing, non-JNI dll.
[/quote]
Right - you just did it. Since this call has no arguments, the wrapper is fairly cheap (please ignore imperfect use of CAS registers; it’s been a while since I looked at X86 CAS):
`
SUB SP,xx # push a frame, must have one for GC stack crawls
MOV EAX,FS:[18] # idiot sequence to get TLS
MOV [EAX+xx],0xpcbits # store the return-pc in TLS

# no fence needed for most PCs; the memory model is too strong

MOV [EAX+yy],ESP # store the sp nearby, allow GC
CALL native # do the native call
MOV EBX,FS:[18] # idiot sequence to get TLS
MOV EAX,ESP # Find old SP if no GC in progress
CMPXCHG [EBX+yy],0 # Slap down a NULL if re-entering Java code
JNE go_block_GC_in_progress
ADD SP,xx # pop frame
RET
`

You want less code than that? Then you have a tall order on your hands - you want the JVM to have intimate knowledge of some random entry point, to “trust me - it won’t block and I’ll put the PC here” - which bitter experience tells me is foolish.

I’ve debugged any number of JVM “bugs” which were bugs in 3rd party native code caused by the 3rd-party vendor not understanding the invariants used by the JVM. JNI goes a long way to making native code safe for the JVM.

It’s obvious he’s a typical PhD. Lots of abbreviations that he believes everyone must know.
So, a little education. Because there are game developers here, the most common meanings of these are:

PC - player character.
IR - infra-red (aka IR laser, or IR matrix-based homing AA missile)
CAS - that weird thing in memory. You know that option from the BIOS, CAS before RAS, right? (Of course someone might even have Close Air Support in mind, but it’s clearly not the case here.)

sp - looks like it means stack pointer.

~_^

Cas, why a reasonable reason? We might be satisfied with an unreasonable reason. You should raise your expectations of JVM developers. :slight_smile:

[quote]It’s obvious he’s a typical PhD. Lots of abbreviations that he believes everyone must know.
So, a little education. Because there are game developers here, the most common meanings of these are:

PC - player character.
IR - infra-red (aka IR laser, or IR matrix-based homing AA missile)
CAS - that weird thing in memory. You know that option from the BIOS, CAS before RAS, right? (Of course someone might even have Close Air Support in mind, but it’s clearly not the case here.)

sp - looks like it means stack pointer.
[/quote]
You’re probably being sarcastic, but just in case:

PC = Program Counter = a number the computer keeps so that it knows where in the code it is.
IR = Intermediate Representation = a halfway point between source and assembly. Most compiler optimizations are done on the IR.
CAS = Compare And Swap = an atomic operation (it will happen without external interference) that’s used to grab locks/mutexes/semaphores.
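In Java terms (illustrating the primitive itself, not HotSpot’s internals), compareAndSet from java.util.concurrent.atomic is the same operation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger lock = new AtomicInteger(0); // 0 = free, 1 = held
        // compareAndSet(expected, update) writes 1 only if the value is
        // still 0, atomically: no other thread can slip in between the
        // compare and the swap. This is how locks get grabbed.
        boolean first  = lock.compareAndSet(0, 1); // succeeds: was free
        boolean second = lock.compareAndSet(0, 1); // fails: already held
        System.out.println(first + " " + second);  // prints true false
    }
}
```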

Ask Ken in the JOGL forums and he’ll tell you all about it in much too much detail, I’m sure.

Ken was, in fact, the original proponent and implementor of Native Direct Byte Buffers and it was specifically to solve this problem.

[quote]PC = Program Counter = a hex number that a computer keeps so that it knows where in the code it is.
[/quote]
Thanx. I’m often using either EIP, or words “execution point”, for this. It’s handy especially if you have multiple of them.

[quote]Thanx. I’m often using either EIP, or words “execution point”, for this. It’s handy especially if you have multiple of them.
[/quote]
IIRC, EIP stands for extended instruction pointer - it’s just Intel’s specific register naming for the x(3)86. My understanding is that PC/program counter is the prevalent term in the industry. If you haven’t read it already, you might want to take a look at “Computer Organization and Design” (Patterson and Hennessy).

(http://www.amazon.com/exec/obidos/tg/detail/-/1558606041/ref=pd_sim_b_4/002-4303307-4332012?_encoding=UTF8&v=glance)

Even if you have a lot of industry experience, I think this book can help you solidify a lot of your understanding of computer architecture and give you a vocabulary you can share with others. (There’s also “Computer Architecture: A Quantitative Approach”, for more detailed study).

God bless,
-Toby Reyelts

Program counter implies a single register for a program.
However, there can be multiple execution points in a program, so program counter is a bad (misleading) name for this feature.
If you know of an e-book variant of Computer Organization and Design it might be interesting to look at, but I don’t think propagating a wrong, non-intuitive term is a reasonable action.

[quote]IIRC, EIP stands for extended instruction pointer - it’s just Intel’s specific register naming for the x(3)86.[/quote]
IIRC they named it RIP in the 64-bit architecture; they renamed eax to rax, ebx to rbx, and so on.
Rest In Pieces, execution point of a program at address xxxxxxxxxxxxxxxx; let’s hope the next programmer’s attempt will be better. :slight_smile:
BTW the prevailing term for the 32-bit x86 architecture is IA32. If AMD got the lead in 128-bit architectures, we might see AA128-type names.

[quote]If you know of an e-book variant of Computer Organization and Design it might be interesting to look at
[/quote]
Sorry - I’m not aware of one. If you live near me, I can loan you my copy. :slight_smile:

[quote]but I don’t think propagating a wrong, non-intuitive term is a reasonable action
[/quote]
LOL.

God bless,
-Toby