Hmm.
int n = alignment - 1;
address = (address + n) & (~n);
vs
address = address & ~(alignment-1);
I count 4 vs 3 instructions. Might make sense to add that.
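To make the comparison concrete, here is a minimal Java sketch of both expressions. The class and method names (Align, alignUp, alignDown) are made up for illustration and are not LWJGL API. Both methods assume the alignment is a power of two.

```java
final class Align {
    private Align() {}

    // Rounds address up to the next multiple of alignment (power of two assumed).
    static long alignUp(long address, long alignment) {
        long n = alignment - 1;
        return (address + n) & ~n;
    }

    // Rounds address down to the previous multiple of alignment (power of two assumed).
    static long alignDown(long address, long alignment) {
        return address & ~(alignment - 1);
    }

    public static void main(String[] args) {
        System.out.println(alignUp(13, 8));   // 16
        System.out.println(alignDown(13, 8)); // 8
        System.out.println(alignUp(16, 8));   // 16 (already aligned, unchanged)
    }
}
```

The round-up form needs the extra add, which is where the one-instruction difference comes from.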
A new build is up (3.0.0 #48).
a) There’s a new Configuration.THREAD_LOCAL_SPACE option (or -Dorg.lwjgl.system.tls), you can use it to set the thread-local implementation. Supported values:
I don’t expect significant, if any, gains from this, but if you try all 3 in a real-world application and have performance figures to share, please post here.
b) The stack implementation has been rewritten. It now has documentation, uses correct terminology and grows “downwards” like a native stack. It is also static in size now; the previous implementation was broken. Use the Configuration.STACK_SIZE option (or -Dorg.lwjgl.system.stackSize) to set the default size. The value is in kilobytes and defaults to 32.
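For readers unfamiliar with the idea, here is a rough sketch of a fixed-size stack that grows downwards. This is not LWJGL's actual MemoryStack code; DownwardStack and all of its members are hypothetical, and it hands out offsets into a buffer rather than native addresses.

```java
import java.nio.ByteBuffer;

final class DownwardStack {
    private final ByteBuffer memory; // fixed-size backing storage, never resized
    private int pointer;             // offset of the most recent allocation

    DownwardStack(int sizeInKilobytes) {
        memory = ByteBuffer.allocateDirect(sizeInKilobytes * 1024);
        pointer = memory.capacity(); // an empty stack starts at the top
    }

    // Reserves size bytes below the current pointer and aligns the result
    // down to alignment (a power of two). Returns the offset of the block.
    int malloc(int alignment, int size) {
        if (size < 0) throw new IllegalArgumentException("negative size");
        int newPointer = (pointer - size) & ~(alignment - 1);
        if (newPointer < 0) throw new OutOfMemoryError("stack exhausted");
        pointer = newPointer;
        return newPointer;
    }

    int getPointer() { return pointer; }

    // Restoring a previously saved pointer frees everything allocated after it.
    void setPointer(int saved) { pointer = saved; }

    ByteBuffer memory() { return memory; }
}
```

With a 32 KB stack, malloc(8, 12) moves the pointer from 32768 to 32752, and a following malloc(16, 4) moves it to 32736. Since the stack grows towards lower offsets, aligning is a plain round-down.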
c) I wasted too much time on performance testing again, this time the problem was that struct allocations in tight loops were inexplicably less optimal than buffer allocations.
In both cases I verified that escape analysis eliminated object allocations, but timing and JITWatch showed worse code generation. The problem manifested even when not touching the allocated struct object at all. The problem went away if I used Unsafe.allocateInstance to create the struct object, which is what LWJGL does internally for buffers that are not allocated via ByteBuffer.allocateDirect. After a while I figured out that the problem was final fields in the struct class.
Having any final field in a class (or its super-classes) causes the JVM to emit a memory barrier after the constructor has run. To understand why and the implications of this, see All Fields are Final. Normally this is fine, it’s the desirable behavior in multi-threaded environments and has virtually no impact on ordered architectures (the barrier is a no-op on x86). But:
For example, in code like:
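class Data {
    int x;
    int y;
    final int foo; // never read in test(), but still triggers the barrier

    Data(int x, int y, int foo) {
        this.x = x;
        this.y = y;
        this.foo = foo;
    }
}

int test() {
    Data d = new Data(1, 2, 3);
    return d.x + d.y;
}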
You always pay the cost for that final “foo” field, even if you don’t use it at all.
Note that, in my tests, the barrier was not an actual CPU instruction (as I said, no-op on x86) but a compiler barrier. That is, all the loop unrolling and loop-invariant code motion that I usually see in code that uses buffers was not happening in similar code that used structs. Removing all final modifiers made all the optimizations kick in.
Luckily, there has been a lot of work on JMM-related issues in Java 9 (see Shipilev’s blog post for some examples). I was able to track down a post that mentions this issue, a first attempt to fix it and, finally, the actual fix that made it into Java 9. So, good news is that the unmodified struct code runs with top performance on recent Java 9 builds. But LWJGL has to run on Java 6+, so I’ll have to compromise and trade good Java code for performance. If the fix is back-ported to Java 8, I’ll restore the current code. Initial testing shows the workaround is a win on all Java versions before 9.
@Spasi: impressive bug-hunting
Let’s go one step further
pntr = (pntr >> n) << n; // unset n lowest bits
Well, if you precompute ~(alignment-1), then it’s only one instruction. Fairly sure if you inline the alignment with a constant argument (align(4) for example) then it’ll do that for us anyway, so I don’t think it’s a big deal. xd
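A quick sketch of why the shift form and the mask form agree, and what precomputing the mask looks like. MaskDemo and its members are hypothetical names, the alignment is assumed to be 2^n, and n is assumed to be in the range 0..63.

```java
final class MaskDemo {
    private MaskDemo() {}

    // Precomputed once for a fixed alignment of 16 bytes: ~(16 - 1).
    static final long MASK_16 = ~(16L - 1);

    // Clears the n lowest bits with two shifts.
    static long clearLowBitsShift(long pntr, int n) {
        return (pntr >> n) << n;
    }

    // Clears the n lowest bits with a mask computed on the fly.
    static long clearLowBitsMask(long pntr, int n) {
        return pntr & ~((1L << n) - 1);
    }

    // With the mask precomputed, aligning down is a single AND.
    static long alignDown16(long pntr) {
        return pntr & MASK_16;
    }

    public static void main(String[] args) {
        long p = 0x1237L;
        System.out.println(Long.toHexString(clearLowBitsShift(p, 4))); // 1230
        System.out.println(Long.toHexString(clearLowBitsMask(p, 4)));  // 1230
        System.out.println(Long.toHexString(alignDown16(p)));          // 1230
    }
}
```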
I’m at the Khronos Vulkan sessions, which are being streamed right now. Are there any concerns or questions I should put to the bigwigs here about present issues with LWJGL’s Vulkan support?
We have some fantastic sessions at Moscone, and we’re also holding additional sessions at SF Green Space, a 5-minute walk from Moscone. Take a break from the Moscone crowds and join us for API sessions, complimentary beverages, free reference cards...
Thanks Catharsis. I can’t think of anything though, the Vulkan bindings were straightforward and haven’t had any issues.
MULTI-GPU SUPPORT WHEN
Let’s just say I just got back from the event, and a great event it was… Small plug: I’m super stoked that the talk on Vulkan by Dan Baker of Oxide (1 of 5 Dans in the day), and the slide describing engine architecture, fits almost to a T how TyphonRT is structured… And I’m a little tipsy right now… I lingered, almost uncomfortably for myself as an introvert, and toward the seeming end I disappeared to show a demo of my video engine tech to the Kishonti folks, who had a wall wart to power my dead phone and were interested in what it takes to make Android / mobile GPUs sweat (yep, got that covered). I came back to the main reception room, was nearly the last one left and was invited to finish off the wine with a core Khronos member; so much to share and say… This was, and may be, my only contact with Khronos in 13 years of GL development and into the unknown future. Many things were discussed, including the difficulties of establishing communication channels to independent voices that fight the good fight despite lacking any direct connection; erm, you know who you are in this thread… :o
From the wine-fueled discussion, and other discussions I had with a core Nvidia driver developer, multi-GPU support is primary goal #1. Whether it is first exposed as an extension or as part of a ratified Vulkan 1.1 spec, it is up for imminent availability.
While the final glasses of wine were consumed I shared this thread with said core Khronos member, got him to bookmark it on his phone, and made it clear that LWJGL is the future of Vulkan for Java. I’ll be so bold as to state that in the after hours of the event, when beer started to be consumed and I imbibed as a fellow indie dev (TANSTAFB, as well), I nonetheless poked and prodded anyone from Google. From my understanding of the opinions shared (thanks!), there will not be an official Java / SDK binding for Vulkan in Android N, so for LWJGL it’s prime time to spring into support for Android as the de facto standard binding for cross-platform Vulkan support. I haven’t had time to review the initial Android N pre-release to confirm whether a full version of sun.misc.Unsafe is present, but that seemingly is the only barrier to LWJGL running away with being the official, and the only, binding to support Android.
Yep… I wished I could offer more support immediately to LWJGL. I gained enough insight today to know that I have to release my video engine effort via GLES 3.1 and no longer wait for Vulkan support; the main crux being extensions for video encode via Vulkan that are not on the present horizon and could be 1-2 years out easily. For the rest of yah though I’d be super stoked to see LWJGL run away with solid cross-platform support. 8) :point: I’m just the latter emoji pointing to y’all…
I was helping out Spasi on the LWJGL forums with some Travis configuration to build LWJGL on ARM for the Raspberry Pi (I am abcde there). While doing that I started playing around with getting a Travis build working with the Android NDK. If Spasi is interested in getting an LWJGL build for Vulkan on Android, I can help out and finish it off.
OK, so I went ahead and completed the Travis CI config for Android anyway. @spasi feel free to include this in the build when you have time. Like before, this is only for jemalloc, so it should be adaptable to the other projects.
.travis.yml
env:
language: c
compiler: arm-linux-androideabi-gcc
before_install:
- export PATH=$PATH:$HOME/.local/bin
- export ANDROID_NDK_VERSION=r11b
- export ANDROID_TOOLCHAIN=arm-linux-androideabi-4.9
- export NDK_PLATFORM=android-9
- echo $ANDROID_NDK_VERSION
- wget http://dl.google.com/android/repository/android-ndk-$ANDROID_NDK_VERSION-linux-x86_64.zip -O ndk.zip
- unzip -qq ./ndk.zip
- export ANDROID_NDK_HOME=`pwd`/android-ndk-$ANDROID_NDK_VERSION
- export PLATFORM_PREFIX=./android-ext/
- mkdir $PLATFORM_PREFIX
- $ANDROID_NDK_HOME/build/tools/make-standalone-toolchain.sh --toolchain=$ANDROID_TOOLCHAIN --platform=$NDK_PLATFORM --install-dir=$PLATFORM_PREFIX
- export PATH=$PLATFORM_PREFIX/bin:$PATH
script:
- ./autogen.sh --with-jemalloc-prefix=je_ --with-malloc-conf=purge:decay --host=arm-linux-androideabi
- make
- cd lib
- ls -alrt
- file libjemalloc.so.2
The configurable bits here are
- export ANDROID_NDK_VERSION=r11b
- export ANDROID_TOOLCHAIN=arm-linux-androideabi-4.9
- export NDK_PLATFORM=android-9
The NDK version is the latest available (the one with Vulkan in it). I chose a very low NDK_PLATFORM, but this can be changed.
The output from the Travis CI run I did is below
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/base.o src/base.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/bitmap.o src/bitmap.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/chunk.o src/chunk.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/chunk_dss.o src/chunk_dss.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/chunk_mmap.o src/chunk_mmap.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/ckh.o src/ckh.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/ctl.o src/ctl.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/extent.o src/extent.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/hash.o src/hash.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/huge.o src/huge.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/mb.o src/mb.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/mutex.o src/mutex.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/nstime.o src/nstime.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/pages.o src/pages.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/prng.o src/prng.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/prof.o src/prof.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/quarantine.o src/quarantine.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/rtree.o src/rtree.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/stats.o src/stats.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/tcache.o src/tcache.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/ticker.o src/ticker.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/tsd.o src/tsd.c
arm-linux-androideabi-gcc -std=gnu99 -Wall -Werror=declaration-after-statement -pipe -fvisibility=hidden -O3 -funroll-loops -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/util.o src/util.c
arm-linux-androideabi-ar crus lib/libjemalloc.a src/jemalloc.o src/arena.o src/atomic.o src/base.o src/bitmap.o src/chunk.o src/chunk_dss.o src/chunk_mmap.o src/ckh.o src/ctl.o src/extent.o src/hash.o src/huge.o src/mb.o src/mutex.o src/nstime.o src/pages.o src/prng.o src/prof.o src/quarantine.o src/rtree.o src/stats.o src/tcache.o src/ticker.o src/tsd.o src/util.o
arm-linux-androideabi-ar crus lib/libjemalloc_pic.a src/jemalloc.pic.o src/arena.pic.o src/atomic.pic.o src/base.pic.o src/bitmap.pic.o src/chunk.pic.o src/chunk_dss.pic.o src/chunk_mmap.pic.o src/ckh.pic.o src/ctl.pic.o src/extent.pic.o src/hash.pic.o src/huge.pic.o src/mb.pic.o src/mutex.pic.o src/nstime.pic.o src/pages.pic.o src/prng.pic.o src/prof.pic.o src/quarantine.pic.o src/rtree.pic.o src/stats.pic.o src/tcache.pic.o src/ticker.pic.o src/tsd.pic.o src/util.pic.o
The command "make" exited with 0.
$ cd lib
The command "cd lib" exited with 0.
$ ls -alrt
total 1204
drwxrwxr-x 13 travis travis 4096 Mar 17 23:44 ..
-rwxrwxr-x 1 travis travis 400964 Mar 17 23:44 libjemalloc.so.2
lrwxrwxrwx 1 travis travis 16 Mar 17 23:44 libjemalloc.so -> libjemalloc.so.2
-rw-rw-r-- 1 travis travis 412232 Mar 17 23:44 libjemalloc_pic.a
-rw-rw-r-- 1 travis travis 412136 Mar 17 23:44 libjemalloc.a
drwxrwxr-x 2 travis travis 94 Mar 17 23:44 .
The command "ls -alrt" exited with 0.
$ file libjemalloc.so.2
libjemalloc.so.2: ELF 32-bit LSB shared object, ARM, version 1 (SYSV), dynamically linked (uses shared libs), not stripped
The command "file libjemalloc.so.2" exited with 0.
Done. Your build exited with 0.
I am eagerly awaiting LWJGL on the Pi (headless?)
Cas
awaiting LWJGL on the Pi (headless?)
My guess is it requires Pi-specific APIs. Could you please point me to a working sample/tutorial that does headless OpenGL?
Heh, Google-fu:
[EDIT: This post contains information from the early days of the RasPi and may not be that relevant to its current software build. I don’t actually have anything to plug my RasPi into anymore…
Random Hacks – 27 Apr 12
Using OpenGL ES 2.0 on the Raspberry Pi without X windows.
https://jan.newmarch.name/LinuxSound/Diversions/RaspberryPiOpenGL/
Potentially more troublesome is mouse/keyboard support.
Cas
Wrote a very simple Java Agent that allows you to use LWJGL 3’s stack allocation without having to create stack frames manually and handle control-flow (returns and exceptions) separately by stackPush/stackPop yourself.
It automatically detects when you want to use stack allocation in a given method (looking for Struct.mallocStack/callocStack and MemoryStack.stackGet) and then provides the stack frame at the method start and frees it at the end of the method, properly handling every control flow (intermittent RETURNs in the method and exception throwing).
There are some caveats:
- you have to start the JVM with the VM argument “-javaagent:autostack.jar” (this is really annoying in my opinion, but unavoidable unless you want to have a custom build/compile step)
- it really is a 1:1 mapping between the lifecycle of Java methods/stack frames and MemoryStack stack frames, so if you decide to wrap/delegate stack allocation of structs to some other method that expect some stack to be setup by the caller which survives the callee, you’re out of luck
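For context, this is roughly the boilerplate the agent is meant to spare you from writing by hand. It is only a sketch of the push / try / finally / pop pattern, not the agent's actual output: FrameDemo and its nested Stack class are hypothetical stand-ins, not LWJGL's MemoryStack.

```java
final class FrameDemo {
    private FrameDemo() {}

    // Hypothetical stand-in for a thread-local stack: it only tracks
    // a pointer and the pointers saved for each open frame.
    static final class Stack {
        int pointer = 1024;
        private final int[] frames = new int[16];
        private int frameIndex;

        void push() { frames[frameIndex++] = pointer; } // open a frame
        void pop()  { pointer = frames[--frameIndex]; } // free the whole frame
        int malloc(int size) { pointer -= size; return pointer; }
    }

    static final Stack STACK = new Stack();

    // The manual version: every method that allocates on the stack
    // has to open a frame and close it on every exit path.
    static int manual(boolean fail) {
        STACK.push();
        try {
            int block = STACK.malloc(64);
            if (fail) throw new IllegalStateException("boom");
            return block;
        } finally {
            STACK.pop(); // runs on normal return and on exception
        }
    }
}
```

Whether manual returns or throws, the stack pointer is back at its previous value afterwards. That matches the 1:1 lifecycle in the second caveat: nothing allocated inside the method survives it.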
Hmm. Since this essentially modifies the bytecode before passing it to the JIT compiler, would it be possible to permanently preprocess a jar-file with the agent so that it doesn’t have to be done at runtime?
I am eagerly awaiting LWJGL on the Pi (headless?)
Any idea what the (longer term) benefit would be to having it headless?
Not having to boot into X11 with all its attendant guff and wastage (and delay).
As it happens right now I’m in the commercial embedded device world and having a Pi boot into a usable touchscreen GUI in less than 20 seconds from cold would be a massive win for me.
Cas
@princec - ah, good point! I was more wondering what the runtime performance differences might be between the OpenGL ES headless mode and the full OpenGL X11 one (having read recently that it’s now in public beta). I wasn’t thinking about boot time.
And of course, memory’s a bit tight on a Pi, you don’t want to waste any of it on X11 if you can help it
Cas
Hmm. Since this essentially modifies the bytecode before passing it to the JIT compiler, would it be possible to permanently preprocess a jar-file with the agent so that it doesn’t have to be done at runtime?
That’s what I meant with “you have to start the JVM with the VM argument “-javaagent:autostack.jar” (this is really annoying in my opinion, but unavoidable unless you want to have a custom build/compile step)”
I’ll also provide an ant task and a Maven plugin to do that in a custom build step.