Extremely poor performance with NIO...

From the thread: http://www.JavaGaming.org/cgi-bin/JGNetForums/YaBB.cgi?board=Networking;action=display;num=1052749856;start=45#45

Arrgh. I think I’ve just encountered the same problem in a different server. I’ve got a semi-production server (which has been running a game OK for a month) that has been fine with three Selectors; performance has been good.

I just added a new processing module, and suddenly the performance on the Write Selector has gone through the floor and is busily tunnelling its way to Australia.

First response takes 3 seconds (!) to send (it’s only 63k, on a LAN!). If the request is repeated, the response takes 60 seconds (!), sending almost exactly the same data. If you try a third time, linux crashes - or sometimes you get a lot of disk thrashing, linux kills the JVM process without crashing [phew!], and you are left with 12 zombie JVM processes sharing 200 Mb of RAM - EACH process has to be force-killed (signal 9) individually. Note: normally in linux, killing one JVM process kills the lot, because the other processes are just the different threads within the JVM.

Normally, typical responses of the order of a few kb are transferred at LAN speeds (multiple Mbits, i.e. you’d expect perhaps 10 milliseconds for the transfer, not multiple seconds).

Profiling shows nothing - no CPU hogs, no memory hogs. This therefore seems to be a problem with the JDK 1.4.2 implementation of Selectors and/or ByteBuffers (it’s the only way I can think of that RAM and/or CPU usage could be taking place yet be invisible within the JVM).

Interestingly, none of my base code has changed - all my processing is abstracted away from the Selectors (I just add a new module that returns Strings to the lower layer, which sends them, etc.), so the only things that are different between when I had no problems and now are the length of the response and the actual bytes involved.

Interesting points:

  1. My SO_SNDBUF is currently approx 25k. Previous responses have typically been 10k-20k (although I have NOT measured this for all of them); it is possible that this is the first time I’m going over the size of the default send buffer.
  2. I’ve been making excessive use of the convenience method in java.nio.charset:

ByteBuffer bb = ((Charset) ch).encode( string );

…unfortunately, the nio team didn’t document what kind of BB that call generates (direct, non-direct, etc.?) or any performance considerations. This is one of the few “unknown quantities” in my code, so I’m treating it as potentially suspicious.
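
If anyone wants to see what that call actually hands back, something along these lines will tell you (just a quick sketch - the charset name and the string are placeholders, not what my server uses):

      // Needs: import java.nio.ByteBuffer; import java.nio.charset.Charset;
      Charset cs = Charset.forName( "ISO-8859-1" );        // placeholder charset
      ByteBuffer bb = cs.encode( "some response text" );   // placeholder data
      System.out.println( "direct=" + bb.isDirect()
                        + " capacity=" + bb.capacity()
                        + " remaining=" + bb.remaining() );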

Could you try experimenting with limiting your requests to slightly less than the SO_SNDBUF size? (Have a look at Socket.getSendBufferSize().) You can also change the size of SO_SNDBUF, but that may have further unpredictable effects, so I would suggest that you first just limit the amount of data you’re sending…
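
To be concrete, this is the kind of check I mean (just a sketch; key is the connection’s SelectionKey, and getSendBufferSize() can throw SocketException):

      // Needs: import java.nio.channels.SocketChannel; import java.net.SocketException;
      SocketChannel sc = (SocketChannel) key.channel();
      int sendBuf = sc.socket().getSendBufferSize();   // SO_SNDBUF, approx 25k here
      System.out.println( "SO_SNDBUF = " + sendBuf + " bytes" );

      // Optionally resize it (the OS may round or ignore the request):
      // sc.socket().setSendBufferSize( 64 * 1024 );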

If that has no effect, then are you making similar use of Charsets?

Using almost precisely the same ByteBuffers (generated by the same algorithm, filled with 63k of data, etc.):

  1. I altered the position() of each buffer, from the tail forwards, so that the gathering-write would only process the first 20k of data (well within the 25k of my send buffer) - roughly as in the sketch after this list.
  2. First request STILL takes 3 seconds. The amount of data has been reduced by a factor of 3; the only things that stay constant are the number of BBs and their contents.
  3. Successive requests now also take 3 seconds (variation of ± 20%).
  4. Re-enabling the full 63k of data, I noted that the individual BBs in the gathering write deliver their content VERY fast (too fast for a human to see), but there is an inter-BB delay of up to 5 seconds (demonstrated by opening a telnet client and watching the data appear in chunks exactly correlating to the contents of successive BBs).
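
(The sketch referred to in point 1: drain whole buffers from the tail forwards until at most ~20k is left to write. responseBuffers is the same ByteBuffer[] that goes into the gathering write.)

      // Drain whole buffers from the tail forwards until at most ~20k remains.
      final int CAP = 20 * 1024;
      long remaining = 0;
      for( int i = 0; i < responseBuffers.length; i++ )
            remaining += responseBuffers[i].remaining();

      for( int i = responseBuffers.length - 1; i >= 0 && remaining > CAP; i-- )
      {
            remaining -= responseBuffers[i].remaining();
            responseBuffers[i].position( responseBuffers[i].limit() ); // nothing left to send
      }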

The implications of this might be one or more of:

  1. GatheringByteChannel.write( ) is extremely inefficient for large numbers of BBs in the array - I’m using approximately 60 BBs. This should be documented if it is true and deliberate, but there is no mention of it in the JDK docs at the moment. However, it could also be a bug, due to the following point:

  2. Sun’s implementation of GatheringByteChannel in SocketChannel is very poor (in JDK 1.4.2). The GatheringByteChannel javadoc states that:

“Unless otherwise specified, a write operation will return only after writing all of the r requested bytes. Some types of channels, depending upon their state, may write only some of the bytes or possibly none at all. A socket channel in non-blocking mode, for example, cannot write any more bytes than are free in the socket’s output buffer.”

In practice, Sun’s implementation appears to only write one BB’s worth at a time. This can be very much less than what the socket’s output buffer could hold (in my current example, approximately 2% of the size!).

This may indirectly be the cause of the performance problems (perhaps the overhead of doing 60 select() operations in quick succession is causing the massive delays?).
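
For reference, the gathering write under discussion is nothing exotic - roughly the following (a sketch of what the test case does; responseBuffers is the array of ~60 ByteBuffers, key the connection’s SelectionKey):

      // One gathering write over the whole array; the docs suggest this should
      // push as much as the socket's output buffer has room for.
      SocketChannel sc = (SocketChannel) key.channel();
      long written = sc.write( responseBuffers, 0, responseBuffers.length );
      System.out.println( "gathering write pushed " + written + " bytes" );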

I wrote a simple test case that makes it easy to experiment with these behaviours.

Download it from:

http://www.grexengine.com/sections/people/adam/NioTester.class

…or the source, if you want it:

http://www.grexengine.com/sections/people/adam/NioTester.java

Just try running it (it’s only one class file) and it will print instructions.

I’d be interested to hear if other people’s experience differs from my own…particularly on non-linux platforms.

I find that with 1 or very few buffers (less than 5), performance is what I’d expect given I’m going from localhost to localhost. With 100 buffers, the performance drops to around 50kb/s, which is pathetic even for a 10Mbit LAN, let alone the local machine.

I’m still hoping I’ve made a stupid mistake here, but I wrote this test case from scratch and I’m still seeing this bad behaviour.

With the test case, I always get 16 buffers processed per WRITE operation, whether I use 100 buffers or 200. 16 is a suspiciously power-of-2 number, so perhaps it is some “preferred” number of buffers to send at once.

With a large number of buffers (e.g. 100) it still sends 16 buffers at once, even though that’s more data than my send buffer can handle (approximately 32k for a 25k buffer).
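
If anyone wants to check the buffers-per-write figure on their own platform, counting the fully-drained buffers after each gathering write is enough (a sketch, with sc and responseBuffers as in the earlier snippets):

      long written = sc.write( responseBuffers, 0, responseBuffers.length );
      int drained = 0;
      for( int i = 0; i < responseBuffers.length; i++ )
            if( !responseBuffers[i].hasRemaining() )
                  drained++;
      System.out.println( written + " bytes written, " + drained + " buffers fully drained" );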

I tried your program on Mac OS X 10.2.6 with Java 1.4.1 DP 102

10 buffers = 5.9 MB/s to 17.6 MB/s
100 buffers = 1654 B/s to 39.8 kB/s
200 buffers = anywhere from 170 B/s to 10kB/s

In general the first few parts would take the most time (up to 200 ms each) and the remaining parts were much faster (under 10 ms each)

win2k

10 buffers 1.1 MB/s
100 buffers 0.031 MB/s
1000 buffers 0.000904 MB/s

redhat 9:

10: 8.79 mb/s
100: 20.2 kb/s
200: 5.29 kb/s

hmmmm, maybe I did something wrong :)

1 - 0ms div/0
10 - 0ms div/0
100 - IOException: Vector write failed
1000 - IOException: Vector write failed

or maybe it’s cos I was using Win98 ;)

just did 2 more tests :P

16 - 0ms div/0
17 - IOException: Vector write failed

magic 16 again :o

[quote]hmmmm, maybe I did something wrong :)

1 - 0ms div/0
10 - 0ms div/0
100 - IOException: Vector write failed
1000 - IOException: Vector write failed
[/quote]
Yeah, sorry about the div/0. It was only a quick-n-dirty test case :).

It would be nice to know what that IOE means; it doesn’t seem to be matched by google (are you summarising a longer description?).

I’ve also now got more results from WinXP systems:
1, 10 = 0ms
100, 200 = approx 12Mb/s (which would be max LAN speed; 100Mbit LAN)
1000 = approx 40 MB/s

Those results are the same whether the telnet client runs on the local machine or on another one over the LAN. It is clear from the telnet clients that WinXP is NOT slowing down throughout the transfer (unlike linux/OS X and win2k). I can’t remember whether win98 even has real OS-level asynch I/O (windows completion ports) or not? I’m surprised that win2k is doing badly here.

Unfortunately, telnet clients are ultimately a poor choice for this test because they can’t reliably say much about the actual transfer speed (note: the testcase only measures the “apparent write-to-network-card speed” or similar).

My original problem is two-fold: both the apparent write speed and the actual transfer speed are extremely low. I want to check that WinXP (and 98?) aren’t just experiencing the same problem but reporting it differently :). So, here’s a (very simple) test client with source code that does NOT output what it gets back to the screen (System.out.println being a notorious bottleneck…).

NOTE: Server HAS changed slightly, so you need to download both class files (or source files, if you prefer)

http://www.grexengine.com/sections/people/adam/NioTester.class
http://www.grexengine.com/sections/people/adam/NioTesterClient.class

http://www.grexengine.com/sections/people/adam/NioTester.java
http://www.grexengine.com/sections/people/adam/NioTesterClient.java

As before, both programs come with their own instructions.

10 17.8MB/s
100 0.025MB/s
1000 0.000936MB/s

The thousand buffer test generated, according to the task manager, 250MB of garbage. So take that with a grain of salt since about 3/4 of the way through the test my cache started thrashing like mad.

Here’s the exact error msg; it looks like you are catching an exception, then rethrowing it.

Trying the new version now…

Got pretty much the same.

With 1 buffer :-

With 10 buffers :-

With 16 buffers :-

With 17+ buffers :-

The latency was variable, normally 50 or 60ms, occasionally 100-200, and once or twice it took 1000-2000.

[quote]Got pretty much the same.
[/quote]
Apparently completion ports (Windows’ only decent implementation of asynch I/O) are only available on NT, 2k, and XP; this may be 50% of your specific problem.

The message I’m printing out is merely Exception.toString(), so that’s the error message that your JVM is deciding to create. ISTR that windows has very low hard-coded limits on various aspects of its networking - I’m guessing that win98 is hard-coded to 16 buffers. There was a bug in JDK 1.4.0, IIRC, whereby no windows implementation could do more than 64 sockets at once, or something, because of one of these hard limits? But I’m almost totally ignorant of the implementation details of windows I/O, so don’t assume I’m right :).

ISTR also that these values can be changed - something like a registry setting or a system call to reconfigure the limit (just like the good old days of having to specify the maximum number of file handles at boot time in your autoexec.bat / config.sys ;) ).

[quote]10 17.8MB/s
100 0.025MB/s
1000 0.000936MB/s

The thousand buffer test generated, according to the task manager, 250MB of garbage. So take that with a grain of salt since about 3/4 of the way through the test my cache started thrashing like mad.
[/quote]
Thanks for that detail. That mirrors the worst of my experiences with linux, and it’s useful to know it happened on win2k too, although I’ve not yet managed to cause such excessive RAM usage with the simple test case. 1000 buffers use 125Mb according to linux. There must be a bug in my test case somewhere, because either way that sounds like an insane amount of RAM. There have been bugs in Sun’s JVMs for years where they fail to GC I/O object refs that have gone out of scope, so I’ve changed the server to explicitly null most references just before they go out of scope. This may or may not help (it makes no difference in linux).
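
Concretely, the change amounts to no more than this sort of thing at the end of handling each request (whether it actually helps the GC is an open question):

      // Drop the refs explicitly rather than waiting for them to fall out of scope.
      for( int i = 0; i < responseBuffers.length; i++ )
            responseBuffers[i] = null;
      responseBuffers = null;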

I’m afraid I’ve also spotted a bug in the server - it was artificially reporting much lower speeds than it should (for the speed calculation, I was using the most recent number of bytes written, rather than the total).

I’ve fixed this, and added a line to summarise the total number of bytes actually sent, so you can easily verify the speed yourself :).
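
For anyone who wants to sanity-check the numbers, the corrected calculation is just total bytes over total elapsed time - roughly this (a sketch with my own variable names; the Math.max() also avoids the div/0 that bit the Win98 tester):

      long startMillis = System.currentTimeMillis();
      long totalBytesWritten = 0;
      // ... every write() call adds its return value to totalBytesWritten ...
      long elapsedMillis = System.currentTimeMillis() - startMillis;
      double bytesPerSec = ( totalBytesWritten * 1000.0 ) / Math.max( 1, elapsedMillis );
      System.out.println( "BYTES TRANSFERRED: " + totalBytesWritten );
      System.out.println( "SPEED (estimated): " + (long) bytesPerSec + " bytes per second." );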

Re-testing on my machines, the speeds are somewhat better - but still hundreds of times slower than they should be.

Does anyone have any explanation for the behaviour we’re seeing here, other than the possibility of a bug in NIO?

I know nothing about NIO, but I have encountered input/output streams being the source of memory leaks before as well.
It wasn’t in J2SE though; it was J2ME MIDP on the Siemens mobiles. If you lost your reference to an input/output stream before you called close() on it, it never got gc()'ed :S

Unlikely to be the problem you’re encountering, but it was related… kinda ;) [ok, maybe not, but still something worth knowing if you ever branch into J2ME :D]

[quote]Does anyone have any explanation for the behaviour we’re seeing here, other than the possibility of a bug in NIO?
[/quote]
Why don’t you ask the guy who “wrote the book”? The 3rd section of his site is “Gotta Question”, and he was friendly enough to exchange a series of 4 emails with me as I explained and described my problem - and two of those emails were before I had bought the book.

Yep, I had the same with J2SE 1.3.x. I only noticed because I had an application that ran for weeks at a time, reading and writing to thousands of files an hour (checkpointing).

Thanks, leknor, that was a good idea. Ron sensibly suggested doing a direct comparison with a manual implementation that iterates over the buffers and writes each one individually.

I’ve tried this, and it now seems that you can achieve almost 100% of the expected speed this way, instead of using the (supposedly faster!) gathering write! I’ve updated the server (NioTester) again, and it now takes an extra argument: either 1, which works exactly as before, or 2, which does:


      long lastWrite = -1L;
      for( int k = 0; k < responseBuffers.length; k++ )
      {
            // Write each buffer individually instead of one gathering write.
            lastWrite = ((SocketChannel) key.channel()).write( responseBuffers[k] );
            //System.out.println( "    MODE2: Wrote "+lastWrite+" bytes to buffer["+k+"]" );

            // write() on a SocketChannel never returns a negative value,
            // so this break is only a safety net.
            if( lastWrite < 0 )
                  break;

            bytesWritten = bytesWritten + lastWrite;
      }

This is not perfect: I’m afraid it may distort the readiness notification, and the interleaving of writes to different clients (because instead of doing one write to all buffers, it does consecutive writes to each buffer).
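
For what it’s worth, a slightly more defensive variant of that loop (my sketch, not what the posted test case does) stops as soon as the socket’s send buffer fills up and waits for the next OP_WRITE notification, instead of calling write() on every remaining buffer:

      SocketChannel sc = (SocketChannel) key.channel();
      for( int k = 0; k < responseBuffers.length; k++ )
      {
            bytesWritten += sc.write( responseBuffers[k] );
            if( responseBuffers[k].hasRemaining() )
            {
                  // The send buffer is full - stop and resume on the next
                  // OP_WRITE readiness notification.
                  break;
            }
      }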

If you care to try this version, and compare the two different modes, I think you’ll find that mode 2 works quite nicely:

(for 1000 buffers, mode 2)

BYTES TRANSFERRED: 1,956,986
SPEED (estimated): 2,942,836 bytes per second.

BYTES TRANSFERRED: 1,953,989
SPEED (estimated): 2,594,938 bytes per second.

(for 1000 buffers, mode 1)

BYTES TRANSFERRED: 1,956,986
SPEED (estimated): 89,392 bytes per second.

BYTES TRANSFERRED: 1,952,990
SPEED (estimated): 114,430 bytes per second.

I’ve submitted a bug-report, and included a pointer to this forum/topic. From current experience (I just got a bug confirmed yesterday), it’s taking many weeks to get bugs processed, so it could easily be a month before it gets turned into an “official” bug.

In the meantime, I’m going to use the workaround in my production server, and see if it holds up when loaded.

blah blah,

How has the workaround held up so far?

I am thinking of putting in a bug report for the problem I was seeing with multiple clients, and so revisited your test.

What java version were you running? I saw a VERY significant difference when running the exact same numbers between java 1.4.1_01 and java 1.4.2b.

Just wondering if something may have changed between the versions. It runs much faster with 1.4.1_01.