Problem losing data

erikd · March 4, 2008, 2:35pm

Hi there,

I’ve been fighting a problem for a few days now and I’m about out of ideas.

In short, this is the story:
I’m writing a client for a server, which uses plain TCP sockets to connect. Data is received asynchronously, in a separate thread.
It works great, but every now and then (it happens very rarely), some data sent by the server to the client seems to get lost, leaving the client in a state where it just loses track of the protocol.

One possible way the data could get screwed up is if 2 threads try to read from the socket simultaneously, so I’ve been very careful that this can’t happen and I’m pretty sure it doesn’t.
I’ve asked the server guys to put a trace on what they’re sending back to the client, and that seems to be okay, even when the client lost some data.

Have any of you experienced something like this? Any ideas?
Since I can’t post source code (I’m not allowed to), I’m basically just trying to get some fresh ideas to be able to track down the problem.

Cheers,
Erik

kevglass · March 4, 2008, 2:39pm

NIO? Some buffer based read method sigs are slightly unintuitive.

Kev

erikd · March 4, 2008, 2:56pm

No NIO, just plain old java.net/java.io

Riven · March 4, 2008, 7:37pm

If it happens VERY rarely, it can be corrupted TCP packets that ‘luckily’ are mangled in exactly such a way that their checksum is correct again. I read somewhere on this forum that someone implemented an MD5-like protocol, and found 7 corrupt TCP packets over a period of a few years.
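For illustration, here is a minimal sketch of an application-level integrity check of that kind. It is not the protocol mentioned above (its details aren't given); it simply assumes each message carries a 16-byte MD5 digest of its payload as a trailer. The class and method names (`DigestCheck`, `seal`, `open`) are made up for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper: appends an MD5 digest to a payload and verifies it on receipt.
class DigestCheck {
    static final int DIGEST_LENGTH = 16; // MD5 produces 16 bytes

    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Sender side: payload followed by its digest.
    static byte[] seal(byte[] payload) {
        byte[] digest = md5().digest(payload);
        byte[] sealed = Arrays.copyOf(payload, payload.length + DIGEST_LENGTH);
        System.arraycopy(digest, 0, sealed, payload.length, DIGEST_LENGTH);
        return sealed;
    }

    // Receiver side: recompute the digest and fail loudly on a mismatch.
    static byte[] open(byte[] sealed) {
        if (sealed.length < DIGEST_LENGTH) {
            throw new IllegalArgumentException("message shorter than its digest");
        }
        int payloadLength = sealed.length - DIGEST_LENGTH;
        byte[] payload = Arrays.copyOfRange(sealed, 0, payloadLength);
        byte[] received = Arrays.copyOfRange(sealed, payloadLength, sealed.length);
        if (!MessageDigest.isEqual(received, md5().digest(payload))) {
            throw new IllegalStateException("digest mismatch: payload changed in transit");
        }
        return payload;
    }
}
```

This only helps when you control both ends of the protocol, and it detects corruption rather than repairing it.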

If it happens like every hour or so, it must be in the application layer. I’ve used Sockets for years, and I have applications that keep running for months, continuously doing I/O - when they crash, I know it’s always always my fault.

If you want to be sure your InputStream/OutputStream gets accessed from only 1 thread ever (and the code using it is too large to quickly analyze) you can make an InputStreamWrapper that checks Thread.currentThread() against some specified Thread, to quickly identify the offender. It’s also good to simply make another stream wrapper that writes everything sent/received to a file, to analyze later. (You can re-run the communication by reading the file back into your protocol handler.)
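A minimal sketch of such a thread-checking wrapper, assuming the owning thread is known when the stream is wrapped. The class name `ThreadCheckingInputStream` is made up for this example; it is built on the standard `java.io.FilterInputStream`.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: fails fast when any thread other than the owner touches the stream.
class ThreadCheckingInputStream extends FilterInputStream {
    private final Thread owner;

    ThreadCheckingInputStream(InputStream in, Thread owner) {
        super(in);
        this.owner = owner;
    }

    private void check() {
        Thread current = Thread.currentThread();
        if (current != owner) {
            throw new IllegalStateException(
                "stream accessed by '" + current.getName()
                + "', expected '" + owner.getName() + "'");
        }
    }

    @Override
    public int read() throws IOException {
        check();
        return super.read();
    }

    // FilterInputStream.read(byte[]) delegates here, so it is covered as well.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        check();
        return super.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        check();
        return super.skip(n);
    }
}
```

The stack trace of the thrown exception points straight at the offending call site. If ownership legitimately changes (say, after a logon phase), wrap the raw stream again with the new owner at that point.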

If that doesn’t show any problems, it’s simply in the protocol handling (or corrupt memory - not uncommon). I bet it’s a binary protocol, so differences in character sets on both systems wouldn’t get you into problems, and as you’re using IO, as opposed to NIO, you can be fairly sure you master the API, so you can rule out that one too.

In short, without seeing any code, I’m pretty sure the problem is a bug in your code (which means: fixable!)

erikd · March 5, 2008, 10:12am

Thanks Riven.

I only hope it’s my code and not the server

I’ve tried your suggestion, and it confirmed that the streams from the socket are not touched by multiple threads at once. There is some thread switching going on, but only during the logon procedure (the communication is synchronous during logon, and switches to asynchronous after successfully logging on. After having logged on, there’s no thread switching anymore).

The protocol is indeed binary, and although there can be text inside exchanged messages, the problem is on the binary level.
Sometimes the problem doesn’t happen for days, sometimes it happens 10 times a day.

To be honest, I’m still not completely ruling out the server and maybe the trace they sent me lied :).
I think I’m going to do some tracing on the network level, outside of the client app…

CommanderKeith · March 5, 2008, 12:49pm

It could be that you’re not flushing streams. When I found similar problems I wrote the number of bytes I should expect at the beginning of the message, and if that wasn’t what I received, I stopped the program right there by throwing a runtime error.
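A minimal sketch of that length-prefix idea, assuming you control the wire format on both ends (which is not the case for a third-party server). The class name, the 4-byte big-endian length field, and the `MAX_LENGTH` bound are all choices made for this example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical framing helper: every message is a 4-byte length followed by the payload.
class LengthPrefixedFraming {
    static final int MAX_LENGTH = 1 << 20; // assumed sanity limit, 1 MiB

    static void writeMessage(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
        out.flush(); // without this, buffered bytes may sit on the sender indefinitely
    }

    static byte[] readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > MAX_LENGTH) {
            // A nonsensical length means the reader has lost track of the protocol.
            throw new IllegalStateException("implausible message length: " + length);
        }
        byte[] payload = new byte[length];
        // readFully blocks until all bytes arrive and throws EOFException if the stream ends early.
        in.readFully(payload);
        return payload;
    }
}
```

Using `readFully` rather than a single `read(byte[])` matters here: a plain `read` may return fewer bytes than requested even though the rest is still on its way.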

erikd · March 5, 2008, 2:32pm

Could be. In which case it would be a bug in the server, which I don’t have the sources of unfortunately…
I guess I’ll have to do a network trace to make sure it’s not in my code.

Anyway, thanks all for the replies.

Riven · March 5, 2008, 11:22pm

Don’t waste your time on that one. Java network IO is not buggy!

I don’t really know what you mean by “the trace they sent me”, but if it isn’t already so, could you ask the server-guys to have an OutputStream subclass that writes to multiple OutputStreams… like:

newSocketOutput = new MultiOutputStream(socket.getOutputStream(), new FileOutputStream(dst));

Clientside, build a special InputStream… like:

newSocketInput = new SnifferInputStream(socket.getInputStream(), new FileOutputStream(dst)); // yes, a FileOutputStream: the sniffer writes out what it reads
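The two classes named above are not part of the JDK. A minimal sketch of what they might look like follows; this is an assumption about their behaviour based on the description in this post, not the actual classes.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical: copies every byte that is read into a second stream.
class SnifferInputStream extends FilterInputStream {
    private final OutputStream copy;

    SnifferInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n); // record only the bytes actually delivered
        }
        return n;
    }

    // mark/reset would record bytes twice, so it is switched off in this sketch.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            copy.close();
        }
    }
}

// Hypothetical: forwards every write to all target streams, in order.
class MultiOutputStream extends OutputStream {
    private final OutputStream[] targets;

    MultiOutputStream(OutputStream... targets) {
        this.targets = targets.clone();
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream t : targets) {
            t.write(b);
        }
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        for (OutputStream t : targets) {
            t.write(buf, off, len);
        }
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream t : targets) {
            t.flush();
        }
    }

    // Simplification: if one close() throws, the remaining targets stay open.
    @Override
    public void close() throws IOException {
        for (OutputStream t : targets) {
            t.close();
        }
    }
}
```

Note that bytes discarded with `skip()` are not recorded by this `SnifferInputStream`, so the capture only matches the wire if the application reads everything.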

Then wait for the ‘problem’, request the file from the server, and compare it with the file written by the client. They should be identical, always.
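Comparing the two capture files can be sketched as follows. The helper name `CaptureCompare.firstDifference` is made up for this example; it reports the offset of the first differing byte, which is usually the most useful starting point when digging through a binary protocol.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for comparing a server-side capture with a client-side capture.
class CaptureCompare {
    // Returns the offset of the first difference, or -1 if both streams are identical.
    // If one stream is a strict prefix of the other, the shorter stream's length is returned.
    static long firstDifference(InputStream a, InputStream b) throws IOException {
        long offset = 0;
        while (true) {
            int x = a.read();
            int y = b.read();
            if (x != y) {
                return offset;
            }
            if (x == -1) {
                return -1; // both streams ended at the same point
            }
            offset++;
        }
    }
}
```

For real files, wrap each `FileInputStream` in a `BufferedInputStream` so the byte-by-byte loop stays fast. A client capture that is merely shorter than the server capture may just mean the client had not read the tail yet when the files were taken.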

BTW: Are you using in.skip() without handling the return-value properly? (Write an inputstream-wrapper that blows when… blah blah blah, you get the point :))
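For reference, `InputStream.skip(n)` is allowed to skip fewer than `n` bytes, possibly zero, so its return value has to be checked in a loop. A minimal sketch of a helper that does this; the names `StreamUtil` and `skipFully` are made up for this example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: skips exactly n bytes or throws.
class StreamUtil {
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() returned 0: that can mean "nothing skipped right now" or end of stream.
            // A single read() tells the two apart.
            if (in.read() == -1) {
                throw new EOFException("stream ended with " + remaining + " bytes left to skip");
            }
            remaining--; // the read() consumed one of the bytes we wanted to skip
        }
    }
}
```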

As you have probably figured, I have a dozen InputStream/OutputStream subclasses lying around, which speed up my network debugging significantly. ;D

erikd · March 6, 2008, 11:18am

[quote]Don’t waste your time on that one. Java network IO is not buggy!
[/quote]
Oh, I’m sure Java network I/O is not buggy.
The thing is, I’m now becoming pretty sure it’s the server, while the server guys say it must be the client (and my manager seems to lean towards the server guys), so I need some hard evidence from outside my client app, using an external packet sniffer or something.

The low-level traces I make now are done in a way similar to your suggestion.
The server side is C++, being developed by an external company, and I’m not sure how and where they create their traces. I can’t even configure the server, I only have an IP address & port number to test with + some out-of-date and incomplete documentation… :-\

[quote]BTW: Are you using in.skip() without handling the return-value properly?
[/quote]
No

[quote]As you have probably figured, I have a dozen InputStream/OutputStream subclasses lying around, which speed up my network debugging significantly
[/quote]
Heh, well it seems like a handy approach.

erikd · March 10, 2008, 11:40am

FWIW, the problem turned out to be in the server after all… :-
I had to convince everybody with Wireshark traces, but at least everybody agrees now.

Riven · March 10, 2008, 9:32pm

They should be grateful

blahblahblahh · April 9, 2008, 10:44am

Define “lost”.

Does the data leave the server’s NIC?
Does the data get to the client’s NIC?

JAW · May 9, 2008, 9:49am

First, I would log everything as it gets received. Can you already tell whether the data never arrives at the most basic read method? When using TCP, I am very certain that everything arrives correctly. So I would be very sure that it is a programming error somewhere in the data handling of the application.

So I would log what arrives and log what the application gets and try to find the place where it’s lost.

The problem is, finding such bugs is extremely hard when you cannot reproduce the error systematically. So you really have to log a lot, hope the error happens, and then search the logs.

-JAW