I seem to have found the problem, and it’s just a simple GC conundrum, but exacerbated by Sun’s incredibly (literally so) memory-hungry implementation of NIO. [note: I haven’t positively verified this yet, but am 98% sure]
What was happening was this:
[] User connects on a port, gets serviced by our server
[] Some clients read the response and fetch the extra resources they’re told to in parallel (this can be separately simulated by running an HTTP daemon on the server and pointing a standard web browser at it - most browsers are happy to open multiple connections to each webserver)
[] Sun’s NIO converts 100k of raw byte data, stored with zero overhead in byte-buffers, into 150MB of RAM usage (this is supposition - I haven’t positively proved that it’s the reads or writes that are doing this, but there is very little else to point the finger at; no other code has changed significantly, and there’s almost no other object creation going on)
[] Hence the test machine starts running very low on memory…
[] Inside our gameserver, some of the requests need to connect to a DB to get some data.
[] They connect to a cheap-n-nasty MySQL running on the test machine (an artifact of having a dedicated single machine to run performance tests on - but for this product there are plenty of customers who would probably deploy like this anyway)
[] MySQL / MySQL’s JDBC driver goes “gasp! not enough memory to service this request! Will hang for 30 seconds!”
[] This blocks the threads that are holding references to the NIO buffers allocated for sending that response, keeping those buffers in memory whilst MySQL/its JDBC driver shifts its butt around to timing out (see the sketch after this list)
[] Because we have parallel requests, more requests are coming in all the time, and so this gets worse
[] Even if you have an interactive shell (as we do) and start frantically invoking “System.gc()”, it won’t help you, because the threads that are “about to” release their references to the buffers are blocked on the 30-second timeout
[] The machine runs critically low on memory, and being Linux it’s a lottery as to whether:
[list]
[] …OS kills the server process
[] …OS runs away crying and crashes the machine
[] (IME, Linux really isn’t very good in these situations, and has an astonishing tendency to do the latter when it could/should have done the former)
[/list]
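To make the blocked-threads step concrete, here’s a stripped-down sketch of the pattern (class, method and query names are hypothetical, not our actual server code): the request thread builds its response buffers, then blocks inside JDBC, and for the whole of that ~30 seconds the buffers stay strongly reachable from its stack - so no amount of System.gc() can reclaim them.

[code]
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical request handler showing why explicit GC can't help:
// the response buffers are referenced from this stack frame for the
// entire duration of the (potentially 30-second) JDBC call.
public class RequestHandler implements Runnable
{
	private final SocketChannel client;
	private final Connection db; // MySQL running on the same test machine

	public RequestHandler( SocketChannel client, Connection db )
	{
		this.client = client;
		this.db = db;
	}

	public void run()
	{
		// Well under 100k of response data, held by this thread...
		ByteBuffer[] response = buildResponseBuffers();
		try
		{
			// ...but before writing it we need something from the DB. If MySQL
			// is starved of memory this call blocks for ~30 seconds, and for all
			// of that time 'response' remains strongly reachable: System.gc()
			// (or any other GC) cannot reclaim it.
			Statement stmt = db.createStatement();
			ResultSet rs = stmt.executeQuery( "SELECT ..." ); // placeholder query

			// Only once the DB call returns do we write and drop the buffers
			// (partial-write handling omitted for brevity).
			client.write( response );
			rs.close();
			stmt.close();
		}
		catch( Exception e )
		{
			e.printStackTrace();
		}
	}

	private ByteBuffer[] buildResponseBuffers()
	{
		// Stand-in for the real response assembly: a few buffers totalling <100k.
		return new ByteBuffer[] {
			ByteBuffer.allocate( 64 * 1024 ),
			ByteBuffer.allocate( 32 * 1024 ) };
	}
}
[/code]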
So, we now have the situation where Sun’s NIO (1.4.2_04) uses a MASSIVE amount of memory (> 100MB) to send a TINY amount of data (< 100k), and this can cause a vicious circle that very quickly cripples the machine.
There appears to be no memory leak, as far as I can tell (and I’ve spent a whole day building test suites - that’s another story, though: working around bugs in Apache’s JMeter :)).
The next step is to try and determine precisely which method calls are being so greedy - my suspicion is that it’s the gathering write, which I already know to be fatally broken. Perhaps, even, some kindly NIO engineer tried to fix my outstanding bugs on GW, and this greedy algo was their patch. That would be truly tragic if so :).
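As a starting point for that isolation, something along these lines should do (a rough sketch, not the actual test suite - the port, the buffer sizes and the assumption of a sink server listening on localhost are all made up): push the same ~100k through write(ByteBuffer) and through the gathering write(ByteBuffer[]) and see which one the memory disappears into.

[code]
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Rough probe for narrowing down which write path is greedy: send ~100k
// via a plain write(ByteBuffer) and via the gathering write(ByteBuffer[]),
// logging memory around each.
public class GreedyWriteProbe
{
	public static void main( String[] args ) throws Exception
	{
		// Assumes something is listening on this port and will swallow the data.
		SocketChannel channel = SocketChannel.open( new InetSocketAddress( "localhost", 9999 ) );

		ByteBuffer single = ByteBuffer.allocate( 100 * 1024 );
		ByteBuffer[] gathered = new ByteBuffer[ 10 ];
		for( int i = 0; i < gathered.length; i++ )
			gathered[ i ] = ByteBuffer.allocate( 10 * 1024 );

		logMemory( "before plain write" );
		while( single.hasRemaining() )
			channel.write( single );
		logMemory( "after plain write" );

		logMemory( "before gathering write" );
		long remaining = 10 * 10 * 1024;
		while( remaining > 0 )
			remaining -= channel.write( gathered );
		logMemory( "after gathering write" );

		channel.close();
	}

	private static void logMemory( String label )
	{
		Runtime rt = Runtime.getRuntime();
		long used = rt.totalMemory() - rt.freeMemory();
		System.out.println( label + ": heap used=" + used + ", heap total=" + rt.totalMemory() );
		// NB: this only sees the Java heap; any direct buffers the NIO
		// implementation allocates internally live outside it, so the
		// OS-level process size (top/ps) is the more telling number.
	}
}
[/code]

Running the same probe from a pool of threads (the way the real server hits it) should also show whether the cost is per-call or per-thread.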