I seem to have found the problem, and it’s just a simple GC conundrum, but exacerbated by Sun’s incredibly (literally so) memory-hungry implementation of NIO. [note: I haven’t positively verified this yet, but am 98% sure]
What was happening was this:
[] User connects on a port, gets serviced by our server
[] Some clients read the response and fetch the extra resources they’re told to in parallel (this can be separately simulated by running an HTTP daemon on the server and pointing a standard web browser at it - most browsers are happy to open multiple connections to each webserver)
[] Sun’s NIO converts 100k of raw byte data, stored with zero overhead in byte-buffers, into 150MB of RAM usage (this is supposition - I haven’t positively proved that it’s the reads or writes that are doing this, but there is very little else to point the finger at; no other code has changed significantly, and there’s almost no other object creation going on)
[] Hence the test machine starts running very low on memory…
[] Inside our gameserver, some of the requests need to connect to a DB to get some data.
[] They connect to a cheap-n-nasty MySQL running on the test machine (an artifact of having a dedicated single machine to run performance tests on - but for this product there are plenty of customers who would probably deploy like this anyway)
[] MySQL / MySQL’s JDBC driver goes “gasp! not enough memory to service this request! Will hang for 30 seconds!”
[] This blocks the threads that are holding references to the NIO buffers allocated for sending that response, keeping those buffers in memory whilst MySQL/its JDBC driver shifts its butt around to timing out (see the sketch after this list)
[] Because we have parallel requests, more requests are coming in all the time, and so this gets worse
[] Even if you have an interactive shell (as we do) and start frantically invoking “System.gc()”, it won’t help you, because the threads that are “about to” release their references to the buffers are blocked on the 30-second timeout
[] The machine runs critically low on memory, and being Linux it’s a lottery as to whether:
[list]
[] …OS kills the server process
[] …OS runs away crying and crashes the machine
[] (IME, Linux really isn’t very good in these situations, and has an astonishing tendency to do the latter when it could/should have done the former)
[/list]
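To make the blocked-threads step concrete, here’s a stripped-down sketch of the pattern (class, method and query names are hypothetical, not our actual server code): the request thread builds its response buffers, then blocks inside JDBC, and for the whole of that ~30 seconds the buffers stay strongly reachable from its stack - so no amount of System.gc() can reclaim them.

[code]
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical request handler showing why explicit GC can't help:
// the response buffers are referenced from this stack frame for the
// entire duration of the (potentially 30-second) JDBC call.
public class RequestHandler implements Runnable
{
	private final SocketChannel client;
	private final Connection db; // MySQL running on the same test machine

	public RequestHandler( SocketChannel client, Connection db )
	{
		this.client = client;
		this.db = db;
	}

	public void run()
	{
		// Well under 100k of response data, held by this thread...
		ByteBuffer[] response = buildResponseBuffers();
		try
		{
			// ...but before writing it we need something from the DB. If MySQL
			// is starved of memory this call blocks for ~30 seconds, and for all
			// of that time 'response' remains strongly reachable: System.gc()
			// (or any other GC) cannot reclaim it.
			Statement stmt = db.createStatement();
			ResultSet rs = stmt.executeQuery( "SELECT ..." ); // placeholder query

			// Only once the DB call returns do we write and drop the buffers
			// (partial-write handling omitted for brevity).
			client.write( response );
			rs.close();
			stmt.close();
		}
		catch( Exception e )
		{
			e.printStackTrace();
		}
	}

	private ByteBuffer[] buildResponseBuffers()
	{
		// Stand-in for the real response assembly: a few buffers totalling <100k.
		return new ByteBuffer[] {
			ByteBuffer.allocate( 64 * 1024 ),
			ByteBuffer.allocate( 32 * 1024 ) };
	}
}
[/code]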
So, we now have the situation where Sun’s NIO (1.4.2_04) uses a MASSIVE amount of memory (> 100MB) to send a TINY amount of data (< 100k), and this can cause a vicious circle that very quickly cripples the machine.
There appears to be no memory leak, as far as I can tell (and I’ve spent a whole day building test suites - that’s another story, though: working around bugs in Apache’s JMeter :)).
The next step is to try and determine precisely which method calls are being so greedy - my suspicion is that it’s the gathering write, which I already know to be fatally broken. Perhaps, even, some kindly NIO engineer tried to fix my outstanding bugs on GW, and this greedy algo was their patch. That would be truly tragic if so :).
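As a starting point for that isolation, something along these lines should do (a rough sketch, not the actual test suite - the port, the buffer sizes and the assumption of a sink server listening on localhost are all made up): push the same ~100k through write(ByteBuffer) and through the gathering write(ByteBuffer[]) and see which one the memory disappears into.

[code]
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Rough probe for narrowing down which write path is greedy: send ~100k
// via a plain write(ByteBuffer) and via the gathering write(ByteBuffer[]),
// logging memory around each.
public class GreedyWriteProbe
{
	public static void main( String[] args ) throws Exception
	{
		// Assumes something is listening on this port and will swallow the data.
		SocketChannel channel = SocketChannel.open( new InetSocketAddress( "localhost", 9999 ) );

		ByteBuffer single = ByteBuffer.allocate( 100 * 1024 );
		ByteBuffer[] gathered = new ByteBuffer[ 10 ];
		for( int i = 0; i < gathered.length; i++ )
			gathered[ i ] = ByteBuffer.allocate( 10 * 1024 );

		logMemory( "before plain write" );
		while( single.hasRemaining() )
			channel.write( single );
		logMemory( "after plain write" );

		logMemory( "before gathering write" );
		long remaining = 10 * 10 * 1024;
		while( remaining > 0 )
			remaining -= channel.write( gathered );
		logMemory( "after gathering write" );

		channel.close();
	}

	private static void logMemory( String label )
	{
		Runtime rt = Runtime.getRuntime();
		long used = rt.totalMemory() - rt.freeMemory();
		System.out.println( label + ": heap used=" + used + ", heap total=" + rt.totalMemory() );
		// NB: this only sees the Java heap; any direct buffers the NIO
		// implementation allocates internally live outside it, so the
		// OS-level process size (top/ps) is the more telling number.
	}
}
[/code]

Running the same probe from a pool of threads (the way the real server hits it) should also show whether the cost is per-call or per-thread.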