SocketChannels die, I'm not invited to the funeral

Is this a design flaw in the NIO classes? I can’t find an API-way to guarantee being informed when a SocketChannel gets disconnected…the best I can do is to discard the whole point of NBIO and periodically read from every SC, ignoring the Selector!?!

There’s no “OP_DISCONNECT” to listen for, and it seems that most of my SC’s just drop silently, without triggering an OP_READ event (which I would have expected).

I thought this was a bug from 1.4.0 / .1 that had been fixed by now, but I’m getting this with 1.4.2

The funny thing is that you won’t realise it’s happening, and it’s never been a problem for me before (because I didn’t care). But now I have an app where I have OTHER code that is dependent on being notified when an SC disappears. I store info for each SC as it is accepted, which I then need to supplement with data like “total time connected” - which I can’t find out unless I get close/disconnect notification.

I’m currently seeing behavior synonymous (sp?) with C libraries I’ve worked with before. The selector reports an FD (channel) as being ready to read (OP_READ) when it’s been closed. Reading from that channel fails.

I’ve tried stuff on XP and Linux with the same results. What platform are you seeing these problems on?

Kev

PS. I’m working on 1.4.1_02-b06
PPS. Great topic title… we need more like this :)

[quote]I’m currently seeing behavior synonymous (sp?) with C libraries I’ve worked with before. The selector reports an FD (channel) as being ready to read (OP_READ) when it’s been closed. Reading from that channel fails.

I’ve tried stuff on XP and Linux with the same results. What platform are you seeing these problems on?

PS. I’m working on 1.4.1_02-b06
[/quote]
Yeah, my thoughts exactly - I’d coded in the expectation an OP_READ would come through on SC death. However, IIRC there has been at least one bug for this not happening as expected.

Unfortunately, with the NIO docs in the dreadful state they are, there is no “official” answer on this. Hard to tell if it’s a bug or a feature ;).

I’m running on linux, and using 1.4.2 (there were some other bugs in NIO that got fixed for 1.4.2 that were causing me problems, so I upgraded ASAP).

I’ll see if I can find a 1.4.1 install to try this on, it could certainly be a regression (as I said, I’ve never needed to monitor this before :(). I’ll also see if I can reproduce under XP. Hmm. Thanks for the thoughts.

[quote]I’m currently seeing behavior synonymous (sp?) with C libraries I’ve worked with before. The selector reports an FD (channel) as being ready to read (OP_READ) when it’s been closed. Reading from that channel fails.

What platform are you seeing these problems on?
[/quote]
Ha! Confounded again by the idiot who (un-)documented NIO. It’s reporting the dropped connection via OP_WRITE for me!

Yay. Unfortunately, that means that my ACCEPT and READ Selectors get no notification at all.

I’m really pleased that between 1.4.1 and 1.4.2 they apparently changed the linux implementation to go from READ to WRITE. That’s gonna hurt a lot of current code :). Obviously, it’s perfectly fair, because they never explicitly documented which way around it was going to be, so it’s OUR fault for trying to use it </joyful sarcasm>

Actually, I suspect it’s dependent upon what underlying NBIO you have available on your OS - and linux can have any of about 5 different NBIO libraries at the moment, depending on what you installed.

I would guess that this is more “kernel dependent” than it is “JDK version dependent”…I’m running 2.4.18. You?

Isn’t it enough to just check whether read(buffer) returns -1, which would mean the end of stream?

[quote]Isn’t it enough to just check whether read(buffer) returns -1, which would mean the end of stream?
[/quote]
Ahem. How, exactly, would I do that? (think about it: you couldn’t write source code that would actually be executed in an NBIO situation).

Bear in mind that the way NIO works (with Selectors) is that it’s event-driven. If no event fires (to let you know you can perform a read), you cannot perform a read (well, you can call the read() method, but see below…).

You could use a select( long ), and “read and be damned”, but in legitimate situations, you would:

A. Lose the performance boost you got from using event-semantics

B. Potentially block permanently on every read that WASN’T a dropped connection, thereby making your server useless. (this is under-specified by the NIO API - it is not even guaranteed that an NB channel will not block even when the SelectionKey says it won’t. However, the wording leads me to believe that the only reason it is not guaranteed to be accurate is because you (the programmer) may invalidate it yourself, by doing the read - and NIO doesn’t automatically update the status if you do.)

Well, in my nice higher-level API I use read() to determine the end of stream on a SocketChannel and it works every time… Evidently the channel is reported as readable once more after it has delivered the last message to the buffer, and so it is able to deliver the -1 that means end of stream. I tested this by connecting and disconnecting several times in different combinations (writing to the server and disconnecting as fast as possible, etc.) and it worked every time.
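For reference, the pattern described here can be sketched roughly as follows. This is a minimal sketch written against a modern JDK (try-with-resources, which 1.4 did not have), not the OverConn code: the class name `DisconnectDemo`, the `ConnInfo` attachment and the loopback setup are all made up for illustration. It only shows the intended handling when an OP_READ event does arrive for a closed connection; it says nothing about the cases in this thread where no event is delivered at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical demo class; names and structure are illustrative only.
class DisconnectDemo {

    /** Per-connection info attached to the key at accept time. */
    static final class ConnInfo {
        final long acceptedAtMillis = System.currentTimeMillis();
    }

    /**
     * Accepts one loopback client that closes cleanly, and returns true if the
     * selector reported OP_READ and read() signalled end of stream in time.
     */
    static boolean detectsCleanClose(long timeoutMillis) {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.socket().bind(new InetSocketAddress("127.0.0.1", 0));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            // Blocking client: connect, then close straight away (orderly shutdown).
            int port = server.socket().getLocalPort();
            SocketChannel client = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
            client.close();

            ByteBuffer buf = ByteBuffer.allocate(1024);
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                selector.select(100);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel sc = server.accept();
                        if (sc == null) continue;
                        sc.configureBlocking(false);
                        sc.register(selector, SelectionKey.OP_READ, new ConnInfo());
                    } else if (key.isReadable()) {
                        SocketChannel sc = (SocketChannel) key.channel();
                        int n;
                        try {
                            buf.clear();
                            n = sc.read(buf);
                        } catch (IOException reset) {
                            n = -1; // treat a reset like end of stream
                        }
                        if (n == -1) {
                            ConnInfo info = (ConnInfo) key.attachment();
                            long connectedFor = System.currentTimeMillis() - info.acceptedAtMillis;
                            System.out.println("closed after " + connectedFor + " ms");
                            key.cancel();
                            sc.close();
                            return true;
                        }
                    }
                }
            }
            return false; // no disconnect event arrived before the deadline
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("clean close detected: " + detectsCleanClose(5000));
    }
}
```

Attaching the per-connection record to the SelectionKey is one way to have the accept time at hand when the close is finally noticed, which is what a “total time connected” figure needs.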

[quote]Well, in my nice higher-level API I use read() to determine the end of stream on a SocketChannel and it works every time… Evidently the channel is reported as readable once more after it has delivered the last message to the buffer, and so it is able to deliver the -1 that means end of stream. I tested this by connecting and disconnecting several times in different combinations (writing to the server and disconnecting as fast as possible, etc.) and it worked every time.
[/quote]
I’m confused here. That sounds exactly like kevglass described - he gets an OP_READ event when the connection is dropped. To recap, my problem is that I was hoping to get one, but I do not. I think that basically what you’re saying is that it works for you the same way it works for kevglass.

The problem here is that without an event from the select() to indicate that the connection has been dropped, there is no way of detecting it has been dropped!

My followup post was to say that on a Selector with OP_WRITE registered, I did actually get an OP_WRITE notification when the channel was dropped. If this is consistent (note: NONE of this is documented / officially specified in the API), then I could try registering for OP_WRITE on my read-only Selector.

In fact, this sounds just like the bug report I remember from Sun - if you do NOT register OP_WRITE, you do NOT get the OP_READ notification on a closed connection, but if you ONLY register OP_WRITE, you get an OP_WRITE notification instead! (but I thought I read that as one of the “fixed” bugs for 1.4.x. Possibly this has regressed with 1.4.2?)

Heh, if you want I will upload my source and you can take a look.

[quote]Heh, if you want I will upload my source and you can take a look.
[/quote]
OK, but first it’d be easier if you explained what event on the Selector you are reacting to when you do your read. And if you’re not reacting to any event at all, how are you doing non-blocking I/O without Selectors? Otherwise, I’m not going to have a clue what your code is doing :(.

I am checking key.isReadable(), and then reading into a buffer… As I said, this has worked every time so far :)

Anyhow, here is a page with my api and the javadoc:

SOURCE: http://www.naturalgamer.com/OverConn/OverConn.zip

JavaDoc: http://www.naturalgamer.com/OverConn/Javadoc/

So you’re registering for OP_READ and OP_WRITE for every channel.

An interesting (and slightly odd and worrying) point is that you have a selector for every connection/channel pair. From what I understand (not sure), part of the point of selectors is to remove the need for a blocking thread per connection. By having a selector (which you block on) for every connection you’re not getting the true benefit… that being said, you’re apparently receiving a -1 every time a channel closes.

For what it’s worth, I’m using one selector for all my channels, meaning I only need one thread. I only get a -1 on the channel when a TCP socket closes if the other end was killed (as opposed to closed cleanly).

BlahBlahEtc - I did some more looking into it, and I don’t reliably get -1 on the TCP channel when it closes. Only if the other end is killed. I’ve added a keep-alive to my TCP channel, which means I get a close within 3 seconds, but it’s not really good enough.

Kev

PS. Thanks for the source

No no… if you look at the OverConn class, it is made to be the ONLY Thread, and when you need a connection, you call .connect(), which returns an OverChannel and registers it with the selector. Thus, you only use one selector for all the channels, as long as you only have one OverConn Thread running.

Some simple app that needs to connect to 5 irc servers at once would look like this…


OverConn overConn = new OverConn(50, 5000);
overConn.start(); // Started Thread

OverChannel client1 = overConn.connect(OverChannel.TCP, "wineasy.se.quakenet.org", 6667, "ISO-8859-1");

OverChannel client2 = overConn.connect(OverChannel.TCP, "wineasy.se.quakenet.org", 6667, "ISO-8859-1");

OverChannel client3 = overConn.connect(OverChannel.TCP, "wineasy.se.quakenet.org", 6667, "ISO-8859-1");

OverChannel client4 = overConn.connect(OverChannel.TCP, "wineasy.se.quakenet.org", 6667, "ISO-8859-1");

OverChannel client5 = overConn.connect(OverChannel.TCP, "wineasy.se.quakenet.org", 6667, "ISO-8859-1");


That would leave 5 connections running in that one OverConn thread… Following the multiplexing paradigm.

Ah, sorry, I only had a quick look over it and must have missed that. I think the naming confused me a bit. I thought OverConn would be a connection object, one created for each connection… fair play, I didn’t get it. My apologies.

Kev

np, heh… cool… while you were writing that, I edited my message to include the code, so while I saved the modification, your message appeared =P

The naming is chosen so that it is clear it’s only an abstraction of the “real” java.nio channels.

Also, something I need to fix: a channel should be registered for OP_WRITE only when it has something to send.

This also raises another question: is it wise to un-register a channel from a selector in order to register it again with new options? This would be nice for channels that are idle for long times, but channels that need to send a lot would be re-registered a lot, and this could hurt performance and perhaps introduce bugs?
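One option that avoids un-registering altogether is to change the interest set on the existing SelectionKey with `interestOps(int)`, which is part of the standard API. A minimal sketch follows; `InterestOpsDemo`, `enableWrite` and `disableWrite` are hypothetical names used only for illustration, and the channel is never connected because the demo looks at the interest set only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper class; not part of OverConn or of java.nio.
class InterestOpsDemo {

    /** Start asking for OP_WRITE, e.g. when data has been queued for sending. */
    static void enableWrite(SelectionKey key) {
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
    }

    /** Stop asking for OP_WRITE, e.g. once the send queue is empty again. */
    static void disableWrite(SelectionKey key) {
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
    }

    /**
     * Registers a fresh, unconnected channel for OP_READ and toggles OP_WRITE.
     * Returns the interest set at each step: initial, after enable, after disable.
     */
    static int[] toggleOnFreshChannel() {
        try (Selector selector = Selector.open();
             SocketChannel sc = SocketChannel.open()) {
            sc.configureBlocking(false);
            SelectionKey key = sc.register(selector, SelectionKey.OP_READ);
            int initial = key.interestOps();
            enableWrite(key);
            int enabled = key.interestOps();
            disableWrite(key);
            int disabled = key.interestOps();
            return new int[] {initial, enabled, disabled};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] ops = toggleOnFreshChannel();
        System.out.println(ops[0] + " -> " + ops[1] + " -> " + ops[2]);
    }
}
```

The key and its attachment stay in place throughout, so there is no cancel/register cycle. If the interest set is changed from a thread other than the one blocked in select(), a `selector.wakeup()` is typically needed for the change to be picked up promptly. How cheap this is compared with re-registering is not specified by the API and may vary between implementations.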

[quote]This also raises another question: is it wise to un-register a channel from a selector in order to register it again with new options? This would be nice for channels that are idle for long times, but channels that need to send a lot would be re-registered a lot, and this could hurt performance and perhaps introduce bugs?
[/quote]
C/C++ non-blocking I/O libraries vary wildly in the answer to your question. It’s entirely implementation dependent. So, for Java, it needs to either have an API call which does something akin to “getCapabilities” (or something that indicates what the Selector is good at etc), or for Sun to mandate how any given implementation should work.

Either way, it ought to be documented.

I would suggest you submit a bug report to Sun on this. If they get enough people telling them they need to document the NIO APIs, they might actually do it.

I’m pretty desperate here. I’ve found several scenarios in which disconnect is NEVER reported, no matter what you do. I also fear that a production server where we’re running Sun’s 1.4.2 linux is doing some form of processing that’s O(n) or worse in the number of things registered with the Selector. After a few days it’s maxing out the CPU whilst doing nothing but ordinary selects, and I’m afraid that it’s all those dead connections that are causing the problem. The only way we can get our application to work at the moment is for someone to login and quit java each morning, and restart the server. This is beyond ridiculous - this is pure farce.

How have you implemented your keep-alive, and does it work for you in all situations? On some ports, I can change the protocol and force heartbeat/keepalive, but other ports have to be HTTP - which AFAICS makes keepalive impossible ???

Argghhh.

I’ve got no HTTP connections to worry about, but for what it’s worth, assuming you are just downloading a bunch of data down the HTTP connection, you could just force a disconnect if you don’t receive data for a while. Hideous, but I suppose it might work.
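A rough sketch of that idle-timeout idea, assuming each key carries a last-activity timestamp as its attachment. `IdleReaper`, `Activity`, `touch` and `closeIdle` are hypothetical names invented for this example, and the sweep is meant to run on the selector thread between select() calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; none of these names come from the thread or from OverConn.
class IdleReaper {

    /** Attachment recording when data last arrived on a channel. */
    static final class Activity {
        long lastSeenMillis;
        Activity(long now) { lastSeenMillis = now; }
    }

    /** Call from the OP_READ handler whenever read() returns more than 0 bytes. */
    static void touch(SelectionKey key, long now) {
        ((Activity) key.attachment()).lastSeenMillis = now;
    }

    static boolean isIdle(long lastSeenMillis, long now, long timeoutMillis) {
        return now - lastSeenMillis > timeoutMillis;
    }

    /** Closes every registered channel that has been silent too long; returns how many. */
    static int closeIdle(Selector selector, long now, long timeoutMillis) {
        // Collect first, then close, so the key set is not touched mid-iteration.
        List<SelectionKey> idle = new ArrayList<>();
        for (SelectionKey key : selector.keys()) {
            Object att = key.attachment();
            if (key.isValid() && att instanceof Activity
                    && isIdle(((Activity) att).lastSeenMillis, now, timeoutMillis)) {
                idle.add(key);
            }
        }
        for (SelectionKey key : idle) {
            key.cancel();
            try {
                key.channel().close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing a dead channel fails.
            }
        }
        return idle.size();
    }

    /** Self-check: one stale and one recently active channel; only the stale one should go. */
    static boolean demo() {
        try (Selector selector = Selector.open();
             SocketChannel stale = SocketChannel.open();
             SocketChannel fresh = SocketChannel.open()) {
            stale.configureBlocking(false);
            fresh.configureBlocking(false);
            stale.register(selector, SelectionKey.OP_READ, new Activity(0L));
            SelectionKey freshKey = fresh.register(selector, SelectionKey.OP_READ, new Activity(0L));
            touch(freshKey, 9_000L);
            int closed = closeIdle(selector, 10_000L, 5_000L);
            return closed == 1 && !stale.isOpen() && fresh.isOpen();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("idle sweep behaved as expected: " + demo());
    }
}
```

Calling `select(timeout)` rather than a bare `select()` gives the loop a chance to run the sweep even when no events fire, which is exactly the situation with silently dropped connections.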

Kev

[quote]I’ve got no HTTP connections to worry about, but for what it’s worth, assuming you are just downloading a bunch of data down the HTTP connection, you could just force a disconnect if you don’t receive data for a while. Hideous, but I suppose it might work.
[/quote]
Chuckle. I’m desperate enough to give it a try :). Whilst reading this it’s also occurred to me that the CPU problems always occur on the server/app that is also running an SSLServerSocket. A light bulb above my head is starting to glimmer…I’d assumed that the SSLServerSocket was a mature implementation that wouldn’t spontaneously after a matter of hours start hammering the CPU for no apparent reason. Especially when there is ZERO activity on any of the ports.

I would still wager that the problem is from nio, but it gives me another avenue to try. I’ve got five or six things now. Looks like it’s going to be a loooong night :-/