[quote=“nsigma,post:40,topic:55436”]
We did that in LWJGL 2. LWJGL 3 talks to native code exclusively with primitives and never calls NewDirectByteBuffer (or any other JNI function, except in callbacks). The buffer instances are constructed via .duplicate() and overriding the address/capacity fields with Unsafe (plus appropriate fallbacks when Unsafe is not available). Details here.
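For illustration, the duplicate()-and-patch approach described above can be sketched roughly like this. All class and method names here are mine, not LWJGL’s, and it relies on JDK internals (sun.misc.Unsafe, the java.nio.Buffer address/capacity fields), so treat it as a sketch under those assumptions:

```java
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

// Sketch: view an arbitrary native address through a ByteBuffer by
// duplicating a zero-capacity direct buffer and patching its
// address/capacity fields with Unsafe.
final class UnsafeBufferView {
    static final Unsafe UNSAFE;

    // Keep the prototype reachable; its duplicates carry no Cleaner, so a
    // patched duplicate can never accidentally free the target address.
    private static final ByteBuffer PROTOTYPE = ByteBuffer.allocateDirect(0);

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe)f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Native address of an existing direct buffer (Buffer.address field).
    static long addressOf(Buffer buffer) throws ReflectiveOperationException {
        return UNSAFE.getLong(buffer, UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("address")));
    }

    // A ByteBuffer viewing [address, address + capacity).
    static ByteBuffer wrap(long address, int capacity) throws ReflectiveOperationException {
        ByteBuffer buffer = PROTOTYPE.duplicate();
        UNSAFE.putLong(buffer, UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("address")), address);
        UNSAFE.putInt(buffer, UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("capacity")), capacity);
        buffer.clear(); // position = 0, limit = capacity
        return buffer.order(ByteOrder.nativeOrder());
    }
}
```

Nothing here frees memory; the caller owns the address, which is the whole point of the technique.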
There are some fundamental differences between those libraries concerning the use of third-party libraries and external tools. The JogAmp APIs let the developer decide what to do when a buffer is destroyed, and we don’t plan to support any alternative means of creating direct NIO buffers.
The first part is already done in JogAmp’s Ardor3D Continuation, but I prefer detecting this case even before the garbage collector runs. I think that I will go on using Sun internal APIs at least for Java 1.7 and Java 1.8. If something better appears in the standard Java API, we’ll use it in JogAmp. I remind you that some of our APIs must work in tons of environments (there is even at least one user who tries to run JOGL on the Wii and PS3); we can’t rely on a library that doesn’t work under Android, for example.
[quote=“Spasi,post:41,topic:55436”]
Something I noticed in LWJGL 2 was that when mapping a buffer, the returned ByteBuffer was recreated if either the VBO address or the VBO size had changed. Does this mean that the ByteBuffer can be reused indefinitely now?
[quote=“theagentd,post:43,topic:55436”]
Yes, the LWJGL 3 implementation doesn’t care whether the address/size has changed or not.
What, never?
Actually, I’m mainly thinking about situations where direct buffers are allocated outside of the library (JOGL) itself. In those cases, it makes no difference to JOGL where the buffer comes from - eg. I’m passing GStreamer video directly to JOGL textures using direct buffers created with JNA - it’s all just pointers!
It seems like various projects besides LWJGL are exploring alternative allocators - just saying that a tiny library that does just that, in a pluggable way, would seem to be useful.
If there’s no malloc / free to fall back to, you might have bigger problems?! Mind you, you could also just fall back to allocateDirect().
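Such a tiny pluggable allocator could be as small as a single interface. A hypothetical sketch (none of these names are from an existing library), with plain allocateDirect() as the default fallback:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical "tiny pluggable allocator" interface: libraries accept an
// implementation, applications plug in jemalloc/JNA/whatever they like.
interface DirectAllocator {
    ByteBuffer allocate(int capacity);

    void free(ByteBuffer buffer);

    // Default implementation: plain NIO; freeing is left to the GC.
    DirectAllocator DEFAULT = new DirectAllocator() {
        @Override
        public ByteBuffer allocate(int capacity) {
            return ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
        }

        @Override
        public void free(ByteBuffer buffer) {
            // no-op: DirectByteBuffer's Cleaner releases the memory on GC
        }
    };
}
```

A library method would then take a `DirectAllocator` parameter (or a configurable static default) instead of calling allocateDirect() itself.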
Interesting! My naive hope of ignoring Unsafe appears misguided. ;D Any benchmarks of the impacts?
I don’t know. We should talk about that to the other contributors. Personally, I don’t want to provide an alternative allocator within JogAmp, I’m against this option.
We do nothing to prevent developers from using any alternative allocator, as far as I know. As long as you don’t pass arrays to methods that prefer direct NIO buffers, JOGL won’t create any direct NIO buffers except in some utilities. There is almost nothing to plug in, then. You can already use JNA, Apache DirectMemory or another library with JOGL; what more do you expect? I could file an RFE about com.jogamp.common.nio.Buffers to allow passing a custom allocator here:
[quote=“nsigma,topic:55436”]
[quote=“Spasi,post:45,topic:55436”]
We did that in LWJGL 2. LWJGL 3 talks to native code exclusively with primitives and never calls NewDirectByteBuffer (or any other JNI function, except in callbacks). The buffer instances are constructed via .duplicate() and overriding the address/capacity fields with Unsafe (plus appropriate fallbacks when Unsafe is not available). Details here.
[/quote]
Interesting! My naive hope of ignoring Unsafe appears misguided. ;D Any benchmarks of the impacts?
[/quote]
It was not done for performance. LWJGL 3 has the explicit goal of having minimal native code, for two reasons:
- Attract more contributions from Java developers that have no C experience.
- Make the transition to Project Panama (JVM FFI) in Java 10 as painless as possible.
This design has the nice side-effect that the JVM is able to inline and optimize a lot of code that was previously hidden inside JNI functions. The JNI code now does nothing but call the native function. With Panama, the JVM will be able to inline all the way to the native function call, completely eliminating JNI overhead. I’m also hopeful that using Unsafe won’t be necessary by then.
Finally, this has allowed us to deduplicate JNI methods, resulting in important space savings in the native binaries. Details here.
A note on LWJGL 3’s jemalloc support: it’s optional. You can delete the native binary and LWJGL will work. It will simply fall back to the system’s malloc/free/etc. Which is what NIO/Unsafe uses, minus the VM housekeeping overhead.
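For reference, that system-malloc path which NIO/Unsafe sits on can be exercised directly. A minimal sketch using sun.misc.Unsafe’s allocateMemory/freeMemory (the class name is mine; these calls go straight to the system allocator with none of the VM housekeeping that allocateDirect adds):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: raw malloc/free via Unsafe, i.e. the fallback path described
// above. No bounds checks, no GC tracking - the caller must free.
final class RawMemory {
    static final Unsafe UNSAFE;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe)f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long malloc(long bytes) { return UNSAFE.allocateMemory(bytes); }

    static void free(long address) { UNSAFE.freeMemory(address); }
}
```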
noctarius, is my suggestion (putting the call to sun.misc.Cleaner.clean() into java.nio.DirectByteBuffer.free()) completely nonsensical?
Well, I think it’ll never happen. It’s still a specific use case, and you shouldn’t lead people to expect it has to be used. A method called free() always sounds like you actually have to free the buffer.
[quote=“nsigma,post:45,topic:55436”]
I took some time to test this and have verified that the JVM is able to eliminate ByteBuffer allocations via escape analysis. Simple benchmark:
public static void main(String[] args) {
    // warmup
    for ( int i = 0; i < 100; i++ ) {
        testImpl();
    }

    // bench
    long t = System.nanoTime();
    for ( int i = 0; i < 1000; i++ ) {
        testImpl();
    }
    t = System.nanoTime() - t;

    System.out.println("TIME: " + t / 1000 / 1000 + "ms");
}
Tests and results:
// Reference implementation using Unsafe:
private static void testUnsafe() {
    long target = nje_malloc(8);
    for ( int i = 0; i < 10000; i++ ) {
        long source = nje_malloc(8);
        memPutInt(source + 0, 0xDEADBEEF);
        memPutInt(source + 4, 0xCAFEBABE);
        memCopy(source, target, 8);
        nje_free(source);
    }
    nje_free(target);
}
/*
TIME: 668ms
Heap
PSYoungGen total 38400K, used 2665K [0x00000007d5d00000, 0x00000007d8780000, 0x0000000800000000)
eden space 33280K, 8% used [0x00000007d5d00000,0x00000007d5f9a6f0,0x00000007d7d80000)
from space 5120K, 0% used [0x00000007d8280000,0x00000007d8280000,0x00000007d8780000)
to space 5120K, 0% used [0x00000007d7d80000,0x00000007d7d80000,0x00000007d8280000)
ParOldGen total 86016K, used 0K [0x0000000781800000, 0x0000000786c00000, 0x00000007d5d00000)
object space 86016K, 0% used [0x0000000781800000,0x0000000781800000,0x0000000786c00000)
PSPermGen total 21504K, used 3082K [0x000000077c600000, 0x000000077db00000, 0x0000000781800000)
object space 21504K, 14% used [0x000000077c600000,0x000000077c902958,0x000000077db00000)
*/
// ByteBuffer implementation, using je_malloc for malloc/free
private static void testLWJGL() {
    ByteBuffer target = memAlloc(8);
    for ( int i = 0; i < 10000; i++ ) {
        ByteBuffer source = memAlloc(8);
        source.putInt(0xDEADBEEF);
        source.putInt(0xCAFEBABE);
        source.flip();
        target.put(source);
        target.flip();
        source.flip();
        memFree(source);
    }
    memFree(target);
}
// Results with default JVM arguments
/*
TIME: 693ms
Heap
PSYoungGen total 38400K, used 5330K [0x00000007d5d00000, 0x00000007d8780000, 0x0000000800000000)
eden space 33280K, 16% used [0x00000007d5d00000,0x00000007d62348c0,0x00000007d7d80000)
from space 5120K, 0% used [0x00000007d8280000,0x00000007d8280000,0x00000007d8780000)
to space 5120K, 0% used [0x00000007d7d80000,0x00000007d7d80000,0x00000007d8280000)
ParOldGen total 86016K, used 0K [0x0000000781800000, 0x0000000786c00000, 0x00000007d5d00000)
object space 86016K, 0% used [0x0000000781800000,0x0000000781800000,0x0000000786c00000)
PSPermGen total 21504K, used 3091K [0x000000077c600000, 0x000000077db00000, 0x0000000781800000)
object space 21504K, 14% used [0x000000077c600000,0x000000077c904e90,0x000000077db00000)
*/
// Results with -XX:-DoEscapeAnalysis
/*
[GC [PSYoungGen: 33280K->400K(38400K)] 33280K->408K(124416K), 0.0009441 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 33680K->368K(38400K)] 33688K->384K(124416K), 0.0008054 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 33648K->352K(38400K)] 33664K->376K(124416K), 0.0006958 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 33632K->368K(71680K)] 33656K->392K(157696K), 0.0007315 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 66928K->352K(71680K)] 66952K->376K(157696K), 0.0009498 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 66912K->384K(133632K)] 66936K->408K(219648K), 0.0007236 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 133504K->32K(128512K)] 133528K->352K(214528K), 0.0007481 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 128032K->32K(123904K)] 128352K->352K(209920K), 0.0003752 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [PSYoungGen: 123424K->32K(119808K)] 123744K->352K(205824K), 0.0004305 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
TIME: 808ms
Heap
PSYoungGen total 119808K, used 42952K [0x00000007d5d00000, 0x00000007de000000, 0x0000000800000000)
eden space 118784K, 36% used [0x00000007d5d00000,0x00000007d86ea2d8,0x00000007dd100000)
from space 1024K, 3% used [0x00000007dde00000,0x00000007dde08000,0x00000007ddf00000)
to space 1024K, 0% used [0x00000007ddf00000,0x00000007ddf00000,0x00000007de000000)
ParOldGen total 86016K, used 320K [0x0000000781800000, 0x0000000786c00000, 0x00000007d5d00000)
object space 86016K, 0% used [0x0000000781800000,0x0000000781850050,0x0000000786c00000)
PSPermGen total 21504K, used 3091K [0x000000077c600000, 0x000000077db00000, 0x0000000781800000)
object space 21504K, 14% used [0x000000077c600000,0x000000077c904e90,0x000000077db00000)
*/
This optimization is not possible if you pass/return ByteBuffer instances to/from JNI methods, or use ByteBuffer.allocateDirect. For example, the same test with allocateDirect:
// ByteBuffer implementation, using ByteBuffer.allocateDirect for malloc
private static void testJava() {
    ByteBuffer target = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder());
    for ( int i = 0; i < 10000; i++ ) {
        ByteBuffer source = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder());
        source.putInt(0xDEADBEEF);
        source.putInt(0xCAFEBABE);
        source.flip();
        target.put(source);
        target.flip();
        source.flip();
    }
}
/*
[GC [PSYoungGen: 33280K->5104K(38400K)] 33280K->28272K(124416K), 0.0620751 secs] [Times: user=0.19 sys=0.00, real=0.06 secs]
[GC [PSYoungGen: 38384K->5104K(71680K)] 61552K->51640K(157696K), 0.0356250 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
[GC [PSYoungGen: 71664K->5104K(71680K)] 118200K->108720K(175616K), 0.0715550 secs] [Times: user=0.23 sys=0.00, real=0.07 secs]
[Full GC [PSYoungGen: 5104K->0K(71680K)] [ParOldGen: 103616K->59651K(166912K)] 108720K->59651K(238592K) [PSPermGen: 2530K->2529K(21504K)], 0.2818225 secs] [Times: user=1.02 sys=0.00, real=0.28 secs]
[GC [PSYoungGen: 66560K->5120K(101888K)] 126211K->121827K(268800K), 0.0715050 secs] [Times: user=0.31 sys=0.00, real=0.07 secs]
[Full GC [PSYoungGen: 5120K->0K(101888K)] [ParOldGen: 116707K->37099K(185344K)] 121827K->37099K(287232K) [PSPermGen: 2529K->2529K(21504K)], 0.1767194 secs] [Times: user=0.61 sys=0.00, real=0.18 secs]
[GC [PSYoungGen: 96768K->5120K(138240K)] 133867K->128955K(323584K), 0.0938719 secs] [Times: user=0.36 sys=0.00, real=0.09 secs]
[Full GC [PSYoungGen: 5120K->0K(138240K)] [ParOldGen: 123835K->51468K(253952K)] 128955K->51468K(392192K) [PSPermGen: 2529K->2529K(21504K)], 0.2487228 secs] [Times: user=0.88 sys=0.00, real=0.25 secs]
[GC [PSYoungGen: 133120K->70624K(237056K)] 184588K->122092K(491008K), 0.1287830 secs] [Times: user=0.38 sys=0.03, real=0.13 secs]
[GC [PSYoungGen: 206816K->72256K(238592K)] 258284K->123724K(492544K), 0.1258122 secs] [Times: user=0.36 sys=0.05, real=0.13 secs]
[GC [PSYoungGen: 208448K->72256K(293376K)] 259916K->123724K(547328K), 0.1384435 secs] [Times: user=0.39 sys=0.03, real=0.14 secs]
[GC [PSYoungGen: 261696K->100480K(293888K)] 313164K->151948K(547840K), 0.1869928 secs] [Times: user=0.59 sys=0.00, real=0.19 secs]
[GC [PSYoungGen: 289920K->100480K(344576K)] 341388K->151948K(598528K), 0.1922449 secs] [Times: user=0.53 sys=0.03, real=0.19 secs]
[GC [PSYoungGen: 329344K->121376K(352256K)] 380812K->172844K(606208K), 0.2226970 secs] [Times: user=0.67 sys=0.00, real=0.22 secs]
TIME: 4700ms
Heap
PSYoungGen total 352256K, used 282188K [0x00000007d5d00000, 0x00000007f8200000, 0x0000000800000000)
eden space 228864K, 70% used [0x00000007d5d00000,0x00000007dfa0b370,0x00000007e3c80000)
from space 123392K, 98% used [0x00000007e3c80000,0x00000007eb308000,0x00000007eb500000)
to space 137728K, 0% used [0x00000007efb80000,0x00000007efb80000,0x00000007f8200000)
ParOldGen total 253952K, used 51468K [0x0000000781800000, 0x0000000791000000, 0x00000007d5d00000)
object space 253952K, 20% used [0x0000000781800000,0x0000000784a430a8,0x0000000791000000)
PSPermGen total 21504K, used 2536K [0x000000077c600000, 0x000000077db00000, 0x0000000781800000)
object space 21504K, 11% used [0x000000077c600000,0x000000077c87a2e0,0x000000077db00000)
*/
Also tried with Cleaner.clean(); it runs at about 2900ms, which is still about 4 times slower.
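For anyone wanting to reproduce the Cleaner.clean() variant: on JDK 9+ the internal-but-stable entry point for eagerly freeing a direct buffer is sun.misc.Unsafe.invokeCleaner (on JDK 8 the equivalent was ((sun.nio.ch.DirectBuffer)buf).cleaner().clean()). A rough sketch, with the method looked up reflectively (class name is mine, and this assumes a JDK 9+ runtime):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Sketch: free a direct buffer's native memory eagerly instead of waiting
// for GC. WARNING: any access to the buffer after this call is a
// use-after-free; invokeCleaner also rejects slices/duplicates, since only
// the original buffer owns a Cleaner.
final class EagerFree {
    static void free(ByteBuffer direct) throws ReflectiveOperationException {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe)f.get(null);

        Method invokeCleaner = Unsafe.class.getMethod("invokeCleaner", ByteBuffer.class);
        invokeCleaner.invoke(unsafe, direct); // releases the native memory now
    }
}
```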
Thanks @Spasi! I guess I was thinking more about the difference in time between your second (jemalloc) test using MemoryAccessorUnsafe vs MemoryAccessorJNI, though (which I think is roughly equivalent to targeting a specific VM vs targeting a generic VM?)
Doesn’t seem that different from Closeable to me. Some resources require explicit life-cycle management, and direct buffers should have been one of them!
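For what it’s worth, that Closeable-style lifecycle could look like this hypothetical wrapper (close() here only invalidates the handle; a real implementation would release the native memory exactly once):

```java
import java.nio.ByteBuffer;

// Sketch: give a direct buffer an explicit lifetime via try-with-resources.
final class ManagedBuffer implements AutoCloseable {
    private ByteBuffer buffer;

    ManagedBuffer(int capacity) {
        this.buffer = ByteBuffer.allocateDirect(capacity);
    }

    ByteBuffer get() {
        if (buffer == null) {
            throw new IllegalStateException("already closed");
        }
        return buffer;
    }

    @Override
    public void close() {
        buffer = null; // real impl: free the native memory here, once
    }
}
```

Usage would be the familiar `try (ManagedBuffer mb = new ManagedBuffer(64)) { ... }`, turning a GC-whim crash into a deterministic IllegalStateException.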
Looking at the actual #asm output produced might be even more enlightening. (Roquen?)
Cas
Offtopic: Lol, hm - where is the problem XD
With the exception that mishandling Closeables will cause memory leaks, whereas mishandling malloc/free causes native crashes and/or security issues.
The current code is already protected against double free (look for the word “paranoia” in the comments of the source code concerning direct NIO buffers, especially in the deallocator) and mishandling direct NIO buffers can still cause memory leaks (on the native heap).
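A double-free guard of that sort can be as simple as this illustrative sketch (not JogAmp’s actual code): record whether the allocation was already freed and turn the second free into a deterministic error instead of a native crash.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: "paranoia"-style double-free guard. compareAndSet ensures the
// native release runs at most once, even with concurrent callers.
final class GuardedAllocation {
    private final AtomicBoolean freed = new AtomicBoolean(false);

    void free() {
        if (!freed.compareAndSet(false, true)) {
            throw new IllegalStateException("double free detected");
        }
        // real impl: release the native memory here, exactly once
    }
}
```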
Not to mention segfaults. Protecting against native crashes and other issues within the VM shouldn’t be an issue - it’s not really much different from the current housekeeping. OTOH, once the direct buffer has been passed to native code, it’s already possible to trigger segfaults by mishandling it. I’d rather have crashes that reproduce easily than ones that happen at the whim of the garbage collector.
Oh! Since I had JITWatch out, I just compiled this:
private static void sum(float[] d, float[] a, float[] b)
{
    int len = d.length;
    for ( int i = 0; i < len; i++ ) {
        d[i] = a[i] + b[i];
    }
}
Annnddd…the top of the unrolled loop looks like:
L0001: movdqu xmm0,xmmword ptr ; Load 4 values from a
0x0000000002abc9b7: movdqu xmm1,xmmword ptr ; Load 4 values from b
0x0000000002abc9be: addps xmm1,xmm0 ; Yo! Add 4 values
and likewise for mul. So there’s at least some basic autovectorization in the current release build.
[quote=“Roquen,post:57,topic:55436”]
Thanks! It helped me identify a few issues with the current implementation:
- Using .slice() is more efficient than .duplicate(). (improves the reflection fallback)
- Using sun.reflect.FieldAccessor directly eliminates some overhead from java.lang.reflect.Field. (improves the reflection fallback)
- Making the JEmalloc instance (that holds the function pointers) final eliminates an indirection and implicit NPE check. (improves all)
- Using Unsafe.allocateInstance eliminates any overhead from slice()/duplicate(). (improves the Unsafe implementation)
With the above changes and after adding NPE checks to testUnsafe (the implicit NPE checks in testLWJGL cannot be removed), both tests get JIT compiled to identical code:
Unsafe: http://pastebin.java-gaming.org/520d9715f3b1a
LWJGL: http://pastebin.java-gaming.org/20d918f5b3a16
Also, JITWatch is awesome. I always wanted to try it and Roquen gave me an excuse. Use it!
[quote=“nsigma,post:51,topic:55436”]
That’s correct, MemoryAccessorJNI will work on any JVM. It’s also going to be slow, which is why there’s an appropriate warning if it ends up being used. How slow? As slow as NewDirectByteBuffer plus the overhead of an extra JNI method call. Why is NewDirectByteBuffer slow? Because it calls a package-private DirectByteBuffer constructor reflectively, and reflective constructor calls are horribly inefficient. That’s why the MemoryAccessorReflect fallback uses slice() and then sets the appropriate field values via reflection. It’s much faster, but it still requires a real object instance, and escape analysis can’t do anything to improve that.
The suggestions button is very useful. It tells you why some basic transforms are not being performed (e.g. method X is too large to be inlined).
EDIT: and another useful and common problem: THIS BRANCH IS RANDOM!! AHHHH!!!