LWJGL jemalloc bindings

The latest LWJGL build (3.0.0b #12) includes jemalloc bindings; jemalloc is a general-purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is widely used across the industry; read this post from Facebook Engineering for the technical details. It is heavily configurable/tunable and includes monitoring/profiling tools.

Why should you care? Mainly because [icode]ByteBuffer.allocateDirect[/icode] is expensive. The benefits of using jemalloc include:

  • It’s fast and cache-friendly.
  • It scales fantastically well with concurrent allocations.
  • You can allocate without zeroing out the new memory. This is great if you know that you’re going to write to the whole buffer right after allocating it.
  • It minimizes memory fragmentation.
  • It has advanced features like aligned allocations, efficient reallocations, multiple allocation arenas, thread-local caches, etc.

One drawback is that you have to explicitly free the allocated memory.
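
To make that concrete, here is a minimal sketch of an allocate/use/free cycle through the bindings. I’m assuming the class and method names exposed by the current API ([icode]org.lwjgl.system.jemalloc.JEmalloc[/icode] with je_malloc/je_calloc/je_free); check the javadoc of your build for the exact signatures.

[code]
import org.lwjgl.system.jemalloc.JEmalloc;

import java.nio.ByteBuffer;

public class JemallocExample {
    public static void main(String[] args) {
        // je_malloc returns uninitialized memory, je_calloc returns zeroed memory.
        ByteBuffer data   = JEmalloc.je_malloc(1024);
        ByteBuffer zeroed = JEmalloc.je_calloc(1, 1024);

        // Write the whole buffer right away; no zeroing cost was paid for it.
        for (int i = 0; i < data.capacity(); i++)
            data.put(i, (byte)i);

        // Unlike allocateDirect, nothing frees these buffers for you:
        // every allocation must be paired with an explicit je_free.
        JEmalloc.je_free(data);
        JEmalloc.je_free(zeroed);
    }
}
[/code]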

I wrote a (synthetic/unrealistic) benchmark to compare it with the JDK allocator. It basically does 10 million buffer allocations/deallocations per thread. Results (all numbers are ns per allocation on a 3.1GHz Sandy Bridge):

// 8 bytes per buffer, 1 thread
     je_malloc :  74ns // no zeroing
     je_calloc :  79ns // zeroing
allocateDirect : 439ns // 5x slower

// 1024 bytes per buffer, 1 thread
     je_malloc :  74ns
     je_calloc : 114ns
allocateDirect : 577ns // 5x slower

// 8 bytes per buffer, 4 threads
     je_malloc :  19ns // awesome scaling
     je_calloc :  21ns
allocateDirect : n/a // took forever and I killed it, process memory at 2.5GB+

// 1024 bytes per buffer, 4 threads
     je_malloc :  19ns
     je_calloc :  31ns
allocateDirect : OOM (direct buffer memory)

// Reduced the workload to 1/10th the allocations, 4 threads
allocateDirect(8)    : 556ns // slower than 1 thread
allocateDirect(1024) : 653ns // with -XX:MaxDirectMemorySize=3g, OOM without
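
For context, each benchmark thread does roughly the following (a reconstruction of the idea, not the actual benchmark source; the allocation size and iteration count are the parameters varied above):

[code]
import org.lwjgl.system.jemalloc.JEmalloc;

import java.nio.ByteBuffer;

public class AllocBench {
    // Allocate, touch, free, N times; returns average ns per allocation.
    static long run(int iterations, int size) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            ByteBuffer b = JEmalloc.je_malloc(size); // or je_calloc(1, size) / ByteBuffer.allocateDirect(size)
            b.put(0, (byte)1);   // touch the memory so the allocation isn't optimized away
            JEmalloc.je_free(b); // skipped for allocateDirect, which relies on the GC
        }
        return (System.nanoTime() - t0) / iterations;
    }

    public static void main(String[] args) {
        System.out.println(run(10_000_000, 8) + " ns per allocation");
    }
}
[/code]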

It would be interesting to see how it performs in a real application with lots of ByteBuffer allocations. Please post here if you try it out. I can’t provide more interesting data myself, because all my apps go to great lengths to reuse or eliminate ByteBuffer allocations. What’s very interesting about jemalloc is that it’s fast enough to use for temporary/“stack” allocations.
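
To illustrate what I mean by temporary allocations: a short-lived buffer inside a method, freed in a finally block the moment you’re done with it, instead of an allocateDirect’ed buffer that lingers until the GC gets around to it. The helper below is hypothetical, again assuming the [icode]JEmalloc[/icode] class from the current API:

[code]
import org.lwjgl.system.jemalloc.JEmalloc;

import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

public class TempAlloc {
    // Hypothetical helper: fills a temporary, jemalloc-backed buffer with a 4x4 matrix.
    static void uploadMatrix(float[] matrix) {
        ByteBuffer bytes = JEmalloc.je_malloc(16 * 4);
        try {
            FloatBuffer buffer = bytes.asFloatBuffer();
            buffer.put(matrix);
            buffer.flip();
            // ... hand the buffer to whatever API needs it ...
        } finally {
            JEmalloc.je_free(bytes); // freed immediately, no GC involved
        }
    }
}
[/code]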

Currently jemalloc comes as an extra dll/so/dylib in the LWJGL distribution. It’s quite a big library compared to other memory allocators, 105–226 KB depending on the OS/arch. If you aren’t interested in using it, just delete the binaries. LWJGL itself may make use of it internally, but I’ll do that conditionally, only if the binaries are available.

Forum admins: should I post LWJGL topics in the Engines, Libraries and Tools board? More often than not, they don’t have anything to do with OpenGL (and Vulkan is coming…). Feel free to move this one too, if you think it’s a good idea.