Should you risk using NIO for hard-core networking?

Oops. Too much code to post. I did a quick ref for it at:

http://users.adelphia.net/~dfellars1/NIOSelectorCode.html

Sorry about that. Please check it out there.

Thanks again for any help.

That is correct.

Yes, I have run them on separate machines on different networks, on the same network, as well as on the same machine.

I have tried this with somewhat similar results.

I will have to get together again with my admin. They won't allow me to run any network monitoring myself, so I have to bug them to do it.

I originally had it designed this way, but was worried that my issue was MT, so I went to a single-selector approach. Since going to a single selector, I have actually seen it speed up, which implies there may have been an MT issue, although I still can't see where.

That is what I am hoping for as well.

Thanks

Well, two more things. Firstly, I have seen something like this before, but I can’t remember what the problem was, other than that someone had made a stupid (and subtle) mistake in the implementation…I’ll see if I can find a changelog for the app where I think it happened. IIRC, it was a buffer that was being abused (not cleared properly, or similar), or data that was not being fully read from buffers. It was surprising that the thing worked at all once you saw the mistake - but it did, just at 15% of the expected speed, and (only at very high load) dropping requests (which should have been impossible).

Having compiled your code, I get the following behaviour:

Start with 1 client and 1 server on the same machine:
2000 instances per client, 10 trips each, sending 1000 bytes.

The observed behaviour:

  • the sequential numbers (which I assume are the clients connecting; I haven’t read the source yet) go up, but pause at about 150, 250, 450, 500, for up to 10 seconds each (the points at which it pauses are clearly random).
  • once everything is connected, it all goes a bit slowly, and then I hit the “too many open sockets” limit (on Linux; I don’t have root access on that machine).

But the first time I changed ONLY the “2000 client instances” down to 500, ALL the pauses that occurred prior to 500 evaporated. This suggests there is at least one problem caused by having too many client instances running on one machine…

Sadly, repeated runs (without changing ANYTHING) show pauses at about 100, 300, etc.

One final thought - on some systems, Java’s select might be implemented using the inefficient OS primitive, which iterates across the ENTIRE table of socket descriptors rather than just those whose state has changed. However, given that you saw the same behaviour on Linux and Windows, this is probably irrelevant; I assume you are using Win2k or XP, both of which have reasonably good OS IO.

[quote]Why did this thread die?

I essentially cleaned up the sample code I was working with so that there is only one thread with one selector for all 3 operations: OP_CONNECT, OP_READ, and OP_WRITE. I know this is not the best approach, but I wanted to remove any chance of MT issues. The following code was adapted from a Sun developer example and modified to keep track of how long each operation takes.

The general flow is the following:

  1. The server starts up and listens for connecting sockets.
  2. Once the predetermined number of connections have connected to the server, the server sends 1 byte to every connection, telling them to send their payload. This is the initialization phase. After sending the byte, each connection's selection key interest ops are set to read.
  3. The clients, upon reading this one byte from the server, switch their interest ops to write and send their payload to the server. This is the clients' write phase. The clients then switch their interest ops back to read, ready for the response from the server.
  4. The server then reads the payloads from all the clients. This is the server's read phase.
  5. When all bytes for a client connection have been read, the server immediately switches that connection's selection key interest ops to write and writes back all the bytes it received (a simple echo). This is the server's write phase.
  6. The client then reads all the bytes sent back from the server. This is the client's read phase.
  7. On the server, once all bytes have been read for all client connections, the server repeats the initialization phase (#2 above) to run the whole process again, up to the predetermined number of trips.
[/quote]
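For anyone skimming, the quoted flow boils down to a single-threaded selector loop roughly like this - my own paraphrase, not the posted code; the variable names are made up, the buffer handling is simplified, and server-side OP_ACCEPT handling is left out:

    // One thread, one Selector, all three ops -- a paraphrase of the quoted design.
    while( running )
    {
        sel.select();                                  // block until at least one key is ready
        Iterator it = sel.selectedKeys().iterator();
        while( it.hasNext() )
        {
            SelectionKey key = (SelectionKey) it.next();
            it.remove();                               // selected keys must be removed by hand
            SocketChannel ch = (SocketChannel) key.channel();
            ByteBuffer buf = (ByteBuffer) key.attachment();

            if( key.isConnectable() )
            {
                ch.finishConnect();
                key.interestOps( SelectionKey.OP_READ );      // wait for the 1-byte "go" signal
            }
            else if( key.isReadable() )
            {
                if( ch.read( buf ) == -1 ) { key.cancel(); ch.close(); continue; }
                if( !buf.hasRemaining() )                     // full payload read: switch to write
                {
                    buf.flip();
                    key.interestOps( SelectionKey.OP_WRITE );
                }
            }
            else if( key.isWritable() )
            {
                ch.write( buf );
                if( !buf.hasRemaining() )                     // everything sent: back to reading
                {
                    buf.clear();
                    key.interestOps( SelectionKey.OP_READ );
                }
            }
        }
    }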
Ahem. We can also add:

1.5, 4.5: Server sleeps for 10 milliseconds if the number of keys returned from select == 0.


    nNumKeys = sel.selectNow();

    if( nNumKeys <= 0 )
    {
        Thread.sleep( 10 );    // sleeps 10 ms on every empty poll, even under load
    }

That’s a pretty likely culprit! I added a sys.out.print at that point, and found that the sleep is being called hundreds if not THOUSANDS of times.

Bug found? :) Perhaps…

Please forgive me for my stupidity. This is one of those “I could have sworn I changed that” moments. I don’t know if you remember one of my previous posts where I mentioned that when I performed a sleep after a selectNow() call it sped things up, which made you (blahblahblah) suggest an MT issue; that is why I made it all one thread, and I thought I had put it back to using the select() method by passing true to that method. But I obviously didn’t. Sorry. I have updated the code on my webserver and will re-run it with the new code to see if it speeds things up.
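For reference, the change is essentially the following (a rough sketch of what I mean rather than the exact updated code - sel and nNumKeys are the same variables as in the snippet quoted above):

    // Before: poll, then sleep even though nothing is wrong -- nothing was ready yet
    // nNumKeys = sel.selectNow();
    // if( nNumKeys <= 0 )
    // {
    //     Thread.sleep( 10 );
    // }

    // After: let the selector block until something is ready (or a timeout expires)
    nNumKeys = sel.select( 10 );       // blocking select with a 10 ms timeout
    if( nNumKeys == 0 )
    {
        continue;                      // nothing ready this time around the loop
    }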

Thanks, and again I am sorry for that.

[quote]Please forgive me for my stupidity.
[/quote]
No worries; it seems to be one of the big problems with NBIO server-development that it’s really easy to make subtly stupid mistakes that are never quite show-stoppers, and so they’re hard to discover. I’ve made similar mistakes in NBIO code a couple of times ;).

EDIT: I only say this because with NBIO it is particularly difficult to spot problems, and encapsulation etc. can be much more helpful than normal - I expect you’ll continue to run into hard-to-trace problems as you continue to modify your test app.

However, your monolithic structure (e.g. 150+ lines of code in your server’s run method!) leaves a lot to be desired; it would be much easier to understand and to scan your code for possible problems if you split it up more. I notice there are lots of methods, but often with only a few lines of code each; I’d suggest a method each for handling acceptable, readable, and writable keys - that would also help if you later decide to split those functions out (see the sketch below).

A split into separate classes would also help.
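Concretely, I mean something with this shape (names are purely illustrative, not from your code):

    // The run() loop shrinks to a dispatch...
    if( key.isAcceptable() )      handleAcceptable( key );   // accept + register the new connection
    else if( key.isReadable() )   handleReadable( key );     // pull the payload into its buffer
    else if( key.isWritable() )   handleWritable( key );     // push the echo back out

    // ...and each case becomes its own small method, e.g.
    private void handleReadable( SelectionKey key ) throws IOException
    {
        // read, record the timing, and flip interestOps to OP_WRITE once the payload is complete
    }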

Only because you’re having strange problems, I’d also suggest commenting out all the fancy, exotic stuff, e.g. setting socket receive buffer sizes, tcpNoDelay, etc. The NIO APIs are currently under-tested by Sun, and it’s been quite easy in 1.4.0 and even 1.4.1 to break them by doing anything exotic (for the 1.4.x series, Sun appeared to lack anyone on the NIO team who understood unit and system testing - basic stuff that should have been covered by automated unit tests went unfixed for both the .0 and .1 releases).
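i.e. for the moment I’d comment out anything along these lines (whatever the equivalents are in your code):

    // channel.socket().setReceiveBufferSize( 64 * 1024 );
    // channel.socket().setSendBufferSize( 64 * 1024 );
    // channel.socket().setTcpNoDelay( true );
    // channel.socket().setKeepAlive( true );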

I’m back.

I cleaned up the code to make it more legible, and it now uses the select() method instead of sleeping. I am still seeing slower-than-expected results with the sample code and would welcome any further feedback.

The code can be found at: http://users.adelphia.net/~dfellars1/NIOSelectorCode.html

Also, on a different NIO-related topic: I have 2 servers communicating with each other over a socket that remains open the entire time the servers are up. Until recently I was running the servers on the same network, but I have recently moved one of them to a different network and am now seeing the following:

The connection is getting dropped for whatever reason, which is to be expected. However, the client that writes to the dropped connection does not throw any exception when writing. It writes all the bytes to the channel and returns as if everything is OK. Then maybe a minute later I receive a read of -1 indicating the connection was dropped, but the written message was never received by the server.

What I would like is to detect that the connection has dropped before writing to it, so that I can reconnect and then write to the channel. Is there an easy way to detect a dropped connection? I am already using isOpen() and key.isValid(), which return true each time. I have also tried socket.setKeepAlive(true) with no success.

Should SocketChannel.isConnected() tell me whether the connection is still available to write to, or does it just indicate that the channel has completed its connection to the server?

I have implemented a pinging system to keep traffic going across the connection, but I don’t want to have to rely on this to keep the connection alive.

Thanks for any insight.

I may be seeing the same problem w.r.t. poor performance; I’ve started a separate thread for it:

http://www.java-gaming.org/cgi-bin/JGNetForums/YaBB.cgi?board=Networking;action=display;num=1058357118;start=0

In the meantime…

There have been MANY bugs in this particular part of the API, although AFAIAA most have been fixed now (but not necessarily all). Most of them were platform specific.

You might want to try a non-blocking read and check whether it returns -1 (it returns 0 if there are simply no bytes to read). My own code is a hodge-podge of different techniques for detecting dead connections. If that doesn’t work, let me know and I’ll try digging out the different things I’ve been using.
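The sort of check I mean is roughly this (a sketch only - channel and key are whatever you’re holding for that connection, and the channel is assumed to already be in non-blocking mode):

    // Probe the connection rather than trusting isOpen()/isValid(),
    // which only reflect the local state of the channel.
    ByteBuffer probe = ByteBuffer.allocate( 1 );
    int n = channel.read( probe );     // non-blocking: returns immediately
    if( n == -1 )
    {
        // The far end has closed (or the connection has died):
        // clean up and reconnect before attempting the write.
        key.cancel();
        channel.close();
        // ...reconnect logic here...
    }
    else if( n > 0 )
    {
        // Real data arrived -- hand it to the normal read path rather than dropping it.
        probe.flip();
        // ...process it...
    }
    // n == 0 just means there was nothing to read right now; it says nothing
    // about whether the connection is still alive.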

In a production server, I’ve got it working on the net without any problems with hanging connections - and this is despite the fact that people have been attacking it (and doing bad things, like breaking the protocol, disconnecting uncleanly, etc).

Also take a look at:

http://www.grexengine.com/sections/people/adam/adamsguidetonio.html

which I updated recently. It has some limited coverage of a few more bugs to do with this, e.g. “some versions of Java 1.4.x actually require you to register OP_ACCEPT as well as OP_READ or OP_WRITE, instead of READ or WRITE on its own”…although there’s still a lot of stuff I haven’t covered there yet (if you think of anything specific, jog me and I’ll try to dig out my notes and fill it in ;))

It’s also worth looking at the “known bugs list” for 1.4.2. There is at least one NIO-related bug.