Bug in writing to a socket?!

Hello,
I’ve run into a strange problem. I have a multiplayer game server that communicates with clients through standard sockets (I’m not using NIO, for several reasons). The setup is pretty simple: I have 2 threads; the first one waits for new connections, the second handles everything else.

The server works very well: it runs for days or weeks and handles thousands of connections without a problem. BUT then suddenly, out of the blue, the server froze. No exception, nothing. Interestingly, it was still possible to make a connection, but not to communicate. That means the “connector” thread was still running, but the “worker” thread had frozen. Naturally, I killed the server, restarted it with remote debugging enabled, and waited.
After several days, the same thing happened again. I started a remote debugging session to find out where the problem was, and:

  1. “connector” thread was working, as expected
  2. “worker” thread was frozen at OutputStream::write(byte[], int, int) >:(

How could this happen? As far as I know, writing to a socket is non-blocking; it doesn’t even throw an exception.

After several weeks of experiments I found out that it doesn’t stay frozen forever; it continues after some time (not sure how long, but it must be approx. 10 minutes).

All I can guess is that there is a bug either in Java or in the operating system. I’m running 1.4.2_06-b03 on Fedora Core 3.
Has anybody experienced something similar? Any suggestions?

Writing to an old IO socket is blocking. It is possible that client1 stopped responding and is blocking the write. That means your server stops working and you can’t write to any other clients. That is why you’ll need a read thread and a write thread for each client. That way they don’t block each other.
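A minimal sketch of the per-client write thread suggested above, assuming Java 8+ (the original server runs 1.4.2, where `java.util.concurrent` is not available). The class name `ClientWriter`, the method `offer`, and the queue size are hypothetical, not anything from the original server.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical per-client writer: one thread and one outgoing queue per connection. */
final class ClientWriter implements Runnable {
    private final OutputStream out;
    // Bounded, so a client that never drains its queue cannot exhaust memory.
    private final BlockingQueue<byte[]> outgoing = new LinkedBlockingQueue<>(1024);

    ClientWriter(OutputStream out) {
        this.out = out;
    }

    /**
     * Called by the game (worker) thread. Never blocks.
     * Returns false if the queue is full, i.e. the client is not keeping up.
     */
    boolean offer(byte[] message) {
        return outgoing.offer(message);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] message = outgoing.take();
                // This write may still block, but it only stalls this client's thread.
                out.write(message, 0, message.length);
                out.flush();
            }
        } catch (IOException | InterruptedException e) {
            // Connection broken or writer shut down: let the thread end.
        }
    }
}
```

The worker thread would call `offer(...)` instead of writing to the socket directly; when `offer` returns false, it can decide to drop that client rather than wait for it.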

Sounds a bit like an MT deadlock issue. Maybe the hold time is the same as the system-dependent TCP timeout?

[quote]Writing to an old IO socket is blocking
[/quote]
I believe this is just a common myth. Only reading is blocking, and even that can easily be prevented using the available() method. Writing seems to delegate the operation to some lower layer and return immediately.

[quote]It is possible that client1 stopped responding and blocks the write
[/quote]
This happens with every connection. I don’t have a disconnect/logout feature :slight_smile: The user just closes the browser and the connection breaks (the client is an applet). The server seems to be able to detect it. It runs for several days and handles hundreds, sometimes thousands, of connections without a hiccup before it happens. But I once also saw it happen 10 minutes after starting the server, so it is not determined by the number of previous connections. In general it’s quite rare; it happens once every several days.

[quote]Sounds a bit like an MT deadlock issue
[/quote]
If it is an MT issue, then it doesn’t originate in my code. All communication happens in a single thread.

[quote]Maybe the hold time is the same as the system-dependent TCP timeout?
[/quote]
This is quite possible. I’m not sure what the system TCP timeout is on my computer. Does anyone know how it can be checked and changed on Linux?
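On Linux, one setting worth checking is `net.ipv4.tcp_retries2`, exposed as `/proc/sys/net/ipv4/tcp_retries2`. It controls how many times the kernel retransmits unacknowledged data on an established connection before giving up, and with exponential backoff the total can add up to many minutes. Whether this is what produces the delay seen here is an assumption, not something the thread confirms. A small sketch for reading it, assuming Java 8+; the class name `TcpRetries` is made up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that reads a single-integer sysctl file such as tcp_retries2. */
final class TcpRetries {
    static final Path TCP_RETRIES2 = Paths.get("/proc/sys/net/ipv4/tcp_retries2");

    /** Returns the integer stored in the file, or -1 if it is missing or unparsable. */
    static int read(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return Integer.parseInt(text);
        } catch (IOException | NumberFormatException e) {
            return -1; // not Linux, no permission, or unexpected content
        }
    }

    public static void main(String[] args) {
        int retries = read(TCP_RETRIES2);
        System.out.println(retries < 0
                ? "tcp_retries2 is not readable on this system"
                : "tcp_retries2 = " + retries);
    }
}
```

The value can be changed as root with `sysctl -w net.ipv4.tcp_retries2=N`; lowering it makes the kernel give up on a dead peer sooner, but it applies to every TCP connection on the machine.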

I’m afraid that if I don’t resolve this somehow, I will have to try NIO on the server side, but I’m not very keen on doing that, as it will require quite some effort and will probably also introduce several new problems.

I believe write() will block IF the write buffer in the OS is full. Maybe that will trigger some thoughts.
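The full-buffer case can be reproduced locally. Below is a rough sketch, assuming Java 8+ and a loopback connection: the "client" connects and then never reads, so its receive buffer and the server's send buffer both fill up, after which `write()` stops returning. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

final class BlockingWriteDemo {
    /**
     * Writes to a peer that never reads and samples the byte counter twice.
     * Returns true if the writer is still alive and made no progress between samples.
     */
    static boolean writeStallsWhenPeerStopsReading(long sampleMillis) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket silentClient = new Socket(loopback, server.getLocalPort());
             Socket serverSide = server.accept()) {

            AtomicLong written = new AtomicLong();
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = serverSide.getOutputStream();
                    while (true) {
                        out.write(chunk, 0, chunk.length); // blocks once the buffers are full
                        written.addAndGet(chunk.length);
                    }
                } catch (IOException e) {
                    // Expected once the socket is closed from the other thread.
                }
            });
            writer.setDaemon(true);
            writer.start();

            Thread.sleep(sampleMillis);          // give the buffers time to fill
            long first = written.get();
            Thread.sleep(sampleMillis);
            long second = written.get();
            boolean stalled = writer.isAlive() && first == second;

            serverSide.close();                  // releases the blocked write()
            writer.join(5000);
            return stalled;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("write() stalled: " + writeStallsWhenPeerStopsReading(1000));
    }
}
```

With a healthy client the buffers drain and the write returns quickly, which would explain why the problem shows up so rarely: it needs a peer that has stopped reading (or vanished) while the server still has enough data queued to fill the buffers.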

[quote]I believe this is just common myth. Only reading is blocking, and even this can be easily prevented using available() method. Writing seems to delegate the operation to some lower layer and exit immediately.
[/quote]

* blahblahblahh humbly submits that he feels what YOU are saying is the myth, that he has used old I/O for many years, that it always was blocking, but lots of people got away with broken code for years with good luck.

But since the stuff people were doing that they justified by “old I/O is nonblocking” was stuff they shouldn’t have been doing anyway, I generally just kept my mouth shut on that issue and pointed out the more obvious problems instead. I haven’t had time to properly check the status of old I/O from 5 years ago :(.

I will also point out that old I/O is FUBAR when it comes to connections, timeouts, and disconnects. Back when I was using old I/O in production systems (pre-NIO), I found some lethal bugs in the I/O design of Java that make it impossible to use old I/O for a serious server. Stuff like:

  • a connection can hang the thread doing the I/O
  • …in such a way that it is IMPOSSIBLE in java to terminate OR interrupt the thread

…so you could easily get threads that were stuck and could not be terminated by Java in any way, short of running an exec( “kill -s 9 [pid]” ).

That may, indeed, be part of what you’re seeing here - I have very stressed memories of the problems it caused us at the time :(.

[quote]Maybe the hold time is the same as the system-dependent TCP timeout?
[/quote]
I wrote some code that measures the length of this “delay”. For some reason it is always exactly 1080 seconds (18 minutes). So I guess you are right: it probably has something to do with the system TCP timeout. I’m not sure whether it is a bug or a feature, but it is very annoying in any case :frowning:
Maybe I could create a “guard” thread that notices when the main thread freezes and closes the socket that caused it. I’m not sure whether that would help (i.e. whether the write stops blocking immediately).
I really have no idea what the cause of all this is; it happens so infrequently :frowning:

Sadly, it is the one and only issue I have with old Java IO. If I don’t fix it somehow, I’ll have to rewrite the whole server from scratch (using NIO), since it is currently too tightly coupled to old IO :frowning:
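The “guard” thread idea could look roughly like the sketch below. It relies on the documented behaviour of `Socket.close()`, which makes a thread blocked in I/O on that socket throw a `SocketException`. Everything else is an assumption: Java 8+ syntax, the class name `WriteGuard`, its methods, and the timeout values are all made up, and the worker would have to route its writes through `guardedWrite`.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical watchdog: closes a socket whose write has been in progress too long. */
final class WriteGuard {
    private final long timeoutMillis;
    private Socket current;   // socket currently being written to, or null
    private long startedAt;   // when that write began

    WriteGuard(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    private synchronized void begin(Socket socket) {
        current = socket;
        startedAt = System.currentTimeMillis();
    }

    private synchronized void end() {
        current = null;
    }

    /** Worker side: write, recording start and end so the guard can see a stall. */
    void guardedWrite(Socket socket, byte[] data) throws IOException {
        begin(socket);
        try {
            socket.getOutputStream().write(data, 0, data.length);
        } finally {
            end();
        }
    }

    /** Guard side: closes the socket if the current write overran. Returns true if it did. */
    synchronized boolean check() {
        if (current != null && System.currentTimeMillis() - startedAt > timeoutMillis) {
            try {
                current.close(); // the blocked write() then fails with a SocketException
            } catch (IOException ignored) {
                // Nothing useful to do if close itself fails.
            }
            current = null;
            return true;
        }
        return false;
    }

    /** Starts a daemon thread that calls check() every pollMillis. */
    Thread startGuardThread(long pollMillis) {
        Thread guard = new Thread(() -> {
            try {
                while (true) {
                    check();
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException e) {
                // Guard shut down.
            }
        }, "write-guard");
        guard.setDaemon(true);
        guard.start();
        return guard;
    }
}
```

The worker then sees an `IOException` from `guardedWrite` and can drop that client like any other broken connection, so the existing single-threaded design stays mostly intact.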