LWJGL jemalloc bindings

The latest LWJGL build (3.0.0b #12) includes jemalloc bindings. jemalloc is a general-purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is widely used across the industry; see the Facebook Engineering post on it for the technical details. It is also heavily configurable/tunable and includes monitoring/profiling tools.

Why should you care? Mainly because [icode]ByteBuffer.allocateDirect[/icode] is expensive. The benefits of using jemalloc include:

  • It’s fast and cache friendly.
  • It scales fantastically well with concurrent allocations.
  • You can allocate without zeroing out the new memory. This is great if you know that you’re going to write to the whole buffer, right after the allocation.
  • It minimizes memory fragmentation.
  • It has advanced features like aligned allocations, efficient reallocations, multiple allocation arenas, thread-local caches, etc.

One drawback is that you have to explicitly free the allocated memory.
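In other words, the discipline is manual, C-style memory management. A minimal sketch of the allocate/use/free pattern, using a hypothetical stand-in allocator so the snippet runs without LWJGL (with the real bindings you would call je_malloc/je_free, shown here only in comments):

```java
import java.nio.ByteBuffer;

public class ExplicitFree {
    // Hypothetical stand-in for an explicit allocator such as jemalloc.
    static ByteBuffer alloc(int size) {
        return ByteBuffer.allocateDirect(size); // real code: je_malloc(size)
    }

    static void free(ByteBuffer buffer) {
        // real code: je_free(buffer); here the GC reclaims it instead
    }

    public static void main(String[] args) {
        ByteBuffer buffer = alloc(1024);
        try {
            buffer.putInt(0, 42); // use the buffer
            System.out.println(buffer.getInt(0)); // prints 42
        } finally {
            free(buffer); // must always run, or the memory leaks
        }
    }
}
```

The try/finally ensures the free happens even if the code using the buffer throws.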

I wrote a (synthetic/unrealistic) benchmark to compare it with the JDK allocator. It basically does 10 million buffer allocations/deallocations per thread. Results (all numbers are ns per allocation on a 3.1GHz Sandy Bridge):

// 8 bytes per buffer, 1 thread
     je_malloc :  74ns // no zeroing
     je_calloc :  79ns // zeroing
allocateDirect : 439ns // 5x slower

// 1024 bytes per buffer, 1 thread
     je_malloc :  74ns
     je_calloc : 114ns
allocateDirect : 577ns // 5x slower

// 8 bytes per buffer, 4 threads
     je_malloc :  19ns // awesome scaling
     je_calloc :  21ns
allocateDirect : n/a // took forever and I killed it, process memory at 2.5GB+

// 1024 bytes per buffer, 4 threads
     je_malloc :  19ns
     je_calloc :  31ns
allocateDirect : OOM (direct buffer memory)

// Reduced the workload to 1/10th the allocations, 4 threads
allocateDirect(8)    : 556ns // slower than 1 thread
allocateDirect(1024) : 653ns // with -XX:MaxDirectMemorySize=3g, OOM without
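For reference, the shape of the benchmark loop, sketched here only for the allocateDirect case (the je_malloc/je_calloc variants need the LWJGL bindings on the classpath; the names and iteration counts below are illustrative, not the original harness):

```java
import java.nio.ByteBuffer;

public class AllocBench {
    // Illustrative allocation loop: time N allocate/release cycles.
    static long benchAllocateDirect(int size, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(size);
            buffer.put(0, (byte) 1); // touch the memory
            // no explicit free: the GC must reclaim it, which is the bottleneck
        }
        return (System.nanoTime() - start) / iterations; // ns per allocation
    }

    public static void main(String[] args) {
        System.out.println(benchAllocateDirect(8, 100_000) + " ns/alloc");
    }
}
```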

It would be interesting to see how it performs in a real application with lots of ByteBuffer allocations. Please post here if you try it out. I couldn’t provide more interesting data because all my apps go to great lengths to reuse or eliminate ByteBuffer allocations. What’s very interesting about jemalloc is that it’s fast enough to use for temporary/“stack” allocations.

Currently jemalloc comes as an extra dll/so/dylib in the LWJGL distributable. It’s quite a big library (compared to other memory allocators), 105kb - 226kb depending on the OS/arch. If you aren’t interested in using it, just delete the binaries. LWJGL itself may make use of it internally, but I’ll do it conditionally, only if the binaries are available.

Forum admins: should I post LWJGL topics in the Engines, Libraries and Tools board? More often than not, they don’t have anything to do with OpenGL (and Vulkan is coming…). Feel free to move this one too, if you think it’s a good idea.

Awesome! :slight_smile:
Could you give a short snippet (or a reference to some code) of how to use it?
Will the BufferUtils.createByteBuffer method make use of it?

[quote=“KaiHH,post:2,topic:55260”]
Sure, it’s quite simple:

import static org.lwjgl.system.jemalloc.JEmalloc.*;
// ...
ByteBuffer buffer = je_malloc(1024);
FloatBuffer bones = je_calloc(80, 4 * 3 * 4).asFloatBuffer(); // 80 mat4x3
// ...
je_free(bones);
je_free(buffer);

This is just the basic usage. There are many other (standard and non-standard) methods and of course the unsafe/long versions that LWJGL generates.

[quote=“KaiHH,post:2,topic:55260”]
Yes, I’ll work on that soon. It’s not trivial because I want to make it optional and using jemalloc requires explicit je_free calls to avoid leaking memory. Existing usages of BufferUtils do not have that requirement and will have to be adjusted accordingly.

Anyway, I haven’t given it much thought yet. Today I wasted all my time porting the LWJGL CI Travis scripts from the old workers to the new container workers. So fast, so awesome! :slight_smile:

Spasi

Hi, do you know when the javadoc will be up for this new set of bindings? I don’t immediately see a use case for this in a typical OpenGL scenario, because ideally the render cycle shouldn’t be creating ByteBuffers, and that is where the time-sensitive processing happens. But in principle this sounds like something that could shave off time from a performance perspective in other use cases.

I’d be interested to hear where you think this could be used in LWJGL itself.

[quote=“ziozio,post:4,topic:55260”]
It’s up now at http://javadoc.lwjgl.org/. The docs are uploaded at the same time as the build, but it takes a while for the index to update because of the CDN cache. I do not flush it for nightly builds.

[quote=“ziozio,post:4,topic:55260”]
BufferUtils would be the most useful candidate, because it’s used by everything else. But as I said, this can only be an opt-in. Other stuff that allocates:

  • Callbacks. This should be easy to handle, they already have a destroy method.
  • Structs. Some APIs use them a lot (Vulkan will too if it looks like Mantle) and it would be nice not having to worry about allocation overhead. This should be painless too. Structs currently can be used with either a ByteBuffer+static API or a typed struct instance (that wraps a ByteBuffer)+instance API. The struct class could be made Retainable and instances could allocate with jemalloc.
  • Functions that encode CharSequences allocate (e.g. glShaderSource). These are just temporary ByteBuffers, they could be allocated and freed with jemalloc.
  • Functions that decode Strings use the APIBuffer, which is an internal LWJGL class that provides temporary thread-local “stack” storage. It’s super fast (faster than jemalloc), but if the string size is too big, it “stretches” the “stack” and that memory is never reclaimed. We could use jemalloc here too, speed is not an issue with such functions.
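The thread-local “stack” idea behind APIBuffer can be sketched roughly like this (a hypothetical, much-simplified illustration, not LWJGL’s actual implementation):

```java
import java.nio.ByteBuffer;

public class ScratchStack {
    // Each thread gets its own scratch buffer, so no synchronization is needed.
    private static final ThreadLocal<ScratchStack> TLS =
        ThreadLocal.withInitial(() -> new ScratchStack(1024));

    private ByteBuffer buffer;
    private int pointer; // bump-allocation offset

    private ScratchStack(int capacity) {
        buffer = ByteBuffer.allocateDirect(capacity);
    }

    public static ScratchStack get() {
        ScratchStack s = TLS.get();
        s.pointer = 0; // reset on each use; real code would push/pop frames
        return s;
    }

    // Bump-allocate a slice; "stretch" the backing buffer if it is too small.
    public ByteBuffer allocate(int size) {
        if (pointer + size > buffer.capacity()) {
            ByteBuffer bigger = ByteBuffer.allocateDirect((pointer + size) * 2);
            buffer.position(0);
            bigger.put(buffer);
            buffer = bigger; // the grown buffer sticks around from now on
        }
        ByteBuffer dup = buffer.duplicate();
        dup.position(pointer);
        dup.limit(pointer + size);
        pointer += size;
        return dup.slice();
    }
}
```

Bump allocation is just a pointer increment, which is why this is faster than any general-purpose allocator; the cost is that the stretched memory is never reclaimed, exactly as described above.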

[quote=“KaiHH,post:2,topic:55260”]
The latest build (3.0.0b #19) includes an explicit memory management API in MemoryUtil. BufferUtils has not changed; it works as before, with GC-managed off-heap memory. Functions supported:

  • memAlloc (plus typed variants: memAllocShort, memAllocInt, memAllocLong, memAllocFloat, memAllocDouble, memAllocPointer)
  • memCalloc (plus the same typed variants)
  • memRealloc
  • memFree
  • memAlignedAlloc
  • memAlignedFree

These map to the corresponding jemalloc functions by default. If jemalloc is not available, the standard stdlib.h functions are used instead (memAlignedAlloc maps to _aligned_malloc on Windows and posix_memalign on Linux/OS X).
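Aligned allocation simply guarantees that the returned address is a multiple of the requested (power-of-two) alignment. The classic round-up computation such allocators rely on, shown as plain arithmetic:

```java
public class Align {
    // Round address up to the next multiple of alignment (power of two).
    static long alignUp(long address, long alignment) {
        return (address + alignment - 1) & ~(alignment - 1);
    }

    public static void main(String[] args) {
        System.out.println(alignUp(1005, 16)); // prints 1008
        System.out.println(alignUp(1024, 16)); // prints 1024 (already aligned)
    }
}
```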

LWJGL uses these functions internally, where it makes sense. Build #19 also includes important fixes and performance improvements.

Are the buffers freed when the Java process terminates, or should the application handle that itself when it, for example, closes unexpectedly?

All memory allocated within a process is freed when the process is terminated, by definition. You don’t need to do anything special when the application crashes.

Just like the JVM, jemalloc allocates virtual memory using platform-specific APIs (VirtualAlloc on Windows and mmap on Linux), so all the memory is managed by the operating system, which keeps track of every virtual memory page allocated to a process.
When the process dies, the operating system frees the allocations (i.e. the mappings from process to virtual memory pages), so that the memory can be used by other processes.
There are also circumstances in which a user-space application cannot react to being killed (via the SIGKILL signal, the equivalent of killing a process in the Windows Task Manager or via the “red button” in Eclipse’s Console View :slight_smile: ), so there would be no way to manually deallocate the memory. JVM shutdown hooks registered via Runtime.addShutdownHook(…) would not help here either, as those only execute on SIGHUP, SIGINT, SIGTERM and SIGQUIT.

What???
What happens to such memory then?

:wink:

As Spasi said, when a process is killed all memory it used is deallocated automatically.

The latest nightly build (3.0.0b #28) includes a debug allocator, enabled with [icode]-Dorg.lwjgl.util.DebugAllocator=true[/icode]. When enabled, it will report any memory leaks on JVM exit (including the stacktrace of where each leaked allocation occurred). Note that it tracks only allocations made with the explicit memory management API in MemoryUtil, mentioned above.

There is also an experimental API for reporting current memory usage (see the memReport methods in MemoryUtil).

Resurrecting this thread.

I’m writing a new batch system for rendering and need to store vertex data in buffers. I have a clever idea of how to do this, and it involves allocating “pages” of memory. At first I thought that I’d just cache the “page” ByteBuffers in a massive list, but I’ll be retrieving these pages in a concurrent manner, so I’d need to deal with thread safety. I realized that JEmalloc can handle this perfectly for me, but there’s a small quirk that bothers me a lot: JEmalloc generates a lot of garbage ByteBuffer objects that I can’t get rid of. Would it be possible to eliminate this?

Yes. There are unsafe versions of the jemalloc functions, as well as the allocator-agnostic API in MemoryUtil. Those work with raw pointer values (long) instead of wrapping them in ByteBuffer instances.

However, keep in mind that LWJGL creates ByteBuffer instances in such a way that the JVM can eliminate the allocations via escape analysis. Hot methods that allocate on entry and free on exit will almost always produce no garbage, and using the unsafe API won’t make a difference.

I will be allocating buffers with a lifetime over a single frame, so escape analysis won’t help me. I already did tests and confirmed that I get around 1MB of garbage per second. After posting this I did indeed find the native versions of JEmalloc and MemoryUtil and did some simple tests. Looks like those will work well for me, so I think I’ll roll with it. Thanks a lot!

The idea is to pass longs across inline boundaries and use ByteBuffers for safety/convenience inside. An inline boundary is anything that breaks inlining (and consequently, escape analysis), which includes: the caller is too big, the callee is too big, the callee is too deep, etc. This is application-specific and requires some analysis (use -XX:BCEATraceLevel=3), but in most cases a trivial refactoring is enough to fix problematic methods. Anyway, the point is that it doesn’t have to be a single method that does this, the code may call other methods (and pass ByteBuffer instances to them) and escape analysis can still work.

Here’s a simple example that shows what I mean. You want to go from:

void root() {
	ByteBuffer buffer = memAlloc(1024);

	methodA(buffer); // buffer instance escapes
	methodThatEventuallyCallsMethodB(buffer); // buffer instance escapes

	memFree(buffer);
}

// not inlineable because too big
void methodA(ByteBuffer buffer) {
	// do stuff
}

// not inlineable because too deep
void methodB(ByteBuffer buffer) {
	// do stuff
}

to:

void root() {
	int capacity = 1024;
	long address = nmemAlloc(capacity);

	methodA(address, capacity);
	methodThatEventuallyCallsMethodB(address, capacity);

	nmemFree(address);
}

// not inlineable because too big
void methodA(long address, int capacity) {
	ByteBuffer buffer = memByteBuffer(address, capacity); // escape analysis eliminates the buffer instance
	// do stuff
}

// not inlineable because too deep
void methodB(long address, int capacity) {
	ByteBuffer buffer = memByteBuffer(address, capacity); // escape analysis eliminates the buffer instance
	// do stuff
}

The root method could of course be some code that stores the allocated memory block in a data structure, for future reference across frames, etc. When testing this, keep in mind that you may get some garbage at first, until the methods are hot enough and the JIT kicks in.

Methods in MemoryUtil like memByteBuffer, as well as anything that instantiates Struct/StructBuffer classes in LWJGL 3, have been tuned to enable this. This work was done between the alpha and beta releases and has been tested extensively (e.g. with an experimental Java backend for NanoVG). The JVM does a fantastic job when you follow the rules and I think it’s the best we can have until we get value types in Java 10.