[SOLVED] Bytes And Java Strings

Hi.

I’ve recently been working a lot with java.nio, mainly Channels, SelectionKeys and ByteBuffers.

I’m having difficulties with reading data through channels. (Everything sent and received through Channels is in plain bytes.) I can send and receive just fine, and my echo server works without problems. But like I said, I can’t interpret the data that is being sent in a useful way.

What I’ve been doing is running test servers on localhost and connecting to them through telnet. The data I receive from telnet comes up as either gibberish or what seems to be nothing at all. For debugging I’ve been trying to echo the received data to the console with System.out.println(), but nothing sensible comes up in the slightest. I’ve tried appending strings, reading asCharBuffer(), etc.; the only thing remotely close to an alphabetical letter I’ve gotten is the ‘?’ character, and those only appear after CR.

Plain text is typed into telnet and nothing shows up on the console no matter what - why is this??

Hard to say anything without code samples.
Especially because NIO is involved. ;D

I think the problem is incorrect encoding/decoding of strings.
So, this possibly can help you: http://stackoverflow.com/questions/1252468/java-converting-string-to-and-from-bytebuffer-and-associated-problems.
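
In short: pick one charset (UTF-8 is the safe choice) and use it explicitly on both ends, instead of relying on defaults. A minimal sketch of the round trip (names are mine):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class EncodeDecodeExample {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");

        // encode: String -> ByteBuffer, ready to write to a channel
        ByteBuffer bytes = utf8.encode("hello");

        // decode: ByteBuffer -> String, after reading from a channel
        CharBuffer chars = utf8.decode(bytes);
        System.out.println(chars.toString()); // prints "hello"
    }
}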

I see, thanks for highlighting character encoding. The SO page you linked sent me to an interesting article by Joel Spolsky (co-founder of SO) called “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”

I was in the middle of reading Java NIO by Ron Hitchens and stopped just after covering channels, keys and buffers to try out the stuff I had read.

This was all fine except for the character encoding issues I had. I see now that character sets are covered in the book in the following chapter. :L Which I’m now going to read in mild embarrassment. I’ve been quite ignorant of the importance of how bytes are interpreted, and of the decades-long struggle and ingenuity that has gone into figuring out the whole mess.

The article really opened my eyes quite a bit, so thanks again :P

If you’re only using the characters 0-9, a-z, A-Z, it should (almost) always work, since those map to the same byte values in ASCII, UTF-8, Latin-1 and most other common encodings. If you still get odd characters, you simply have a bug in your networking code (byte transfer). No encoding will save you from that :slight_smile:

btw. A CharBuffer always uses 16-bit UTF-16 encoding (Java’s internal char representation).
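
That’s also why asCharBuffer() shows gibberish: it doesn’t decode anything, it just reinterprets pairs of raw bytes as UTF-16 chars. A quick illustration (my own example):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class AsCharBufferPitfall {
    public static void main(String[] args) {
        // "hi" as two ASCII/UTF-8 bytes
        ByteBuffer buf = ByteBuffer.wrap(new byte[] { 0x68, 0x69 });

        // reinterprets the two bytes as ONE 16-bit char (0x6869),
        // which is a CJK character - not "hi"
        System.out.println(buf.asCharBuffer());

        // actually decodes the bytes to text: prints "hi"
        System.out.println(Charset.forName("UTF-8").decode(buf));
    }
}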

I fixed it by simply decoding the received data as UTF-8:

import java.nio.CharBuffer;
import java.nio.charset.Charset;

protected static Charset cs = Charset.forName("UTF-8");

// ...

// decode the bytes read from the channel into characters
CharBuffer buf = cs.decode(buffer);
message += buf;

I checked my system’s default charset with Charset.defaultCharset().name() and it spat out “windows-1252”.

I’m thinking that might’ve been the issue.
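
Note for anyone else hitting this: any String/byte conversion that doesn’t name a charset silently uses the platform default, so the same code can behave differently on different machines. A small sketch of the pitfall (the string is just an example):

import java.nio.charset.Charset;

public class DefaultCharsetPitfall {
    public static void main(String[] args) throws Exception {
        // what the JVM falls back to when no charset is given
        System.out.println(Charset.defaultCharset().name()); // e.g. "windows-1252"

        byte[] utf8Bytes = "héllo".getBytes("UTF-8");

        // decodes with the platform default: mangles 'é' unless the default is UTF-8
        System.out.println(new String(utf8Bytes));

        // naming the charset explicitly round-trips correctly everywhere
        System.out.println(new String(utf8Bytes, "UTF-8"));
    }
}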

I recommend you just stick with UTF-8:


// write the String: length prefix first, then the UTF-8 bytes
byte[] b = myString.getBytes("UTF-8"); // may throw the checked UnsupportedEncodingException
byteBuffer.putInt(b.length);
byteBuffer.put(b);


// read the String
int len = byteBuffer.getInt();

// make sure len is not a ridiculous number that could crash your application ;)

byte[] b = new byte[len];
byteBuffer.get(b);
String s = new String(b, "UTF-8");

EDIT: heh, didn’t see that last message. It’s best to avoid Charset since it’s slow. See code above.

I think the core issue was that channels are inconsistent in how many bytes a single read() actually returns, and that the decoder just wraps the bytes it is fed and keeps track of them for you.
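
One caveat I ran into while reading up on this: the one-shot cs.decode(buffer) does not carry decoder state between calls, so a multi-byte character split across two reads can still get mangled. The usual pattern is a reusable CharsetDecoder plus compact(), roughly like this (channel is assumed to be a connected, blocking SocketChannel; buffer sizes are arbitrary):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer in = ByteBuffer.allocate(1024);
CharBuffer out = CharBuffer.allocate(1024);
StringBuilder message = new StringBuilder();

while (channel.read(in) > 0) {
    in.flip();
    // endOfInput = false: an incomplete multi-byte sequence at the
    // end of 'in' is left in the buffer instead of becoming '?'
    decoder.decode(in, out, false);
    out.flip();
    message.append(out);
    out.clear();
    in.compact(); // keep the leftover bytes for the next read
}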

I’ll post my simple echo server in java.nio here for those interested to look it over.

http://www.java-gaming.org/?action=pastebin&id=62

What if the non-blocking channel.read() didn’t deliver all the bytes that you’re attempting to read with len?

And in your reading code, where exactly do you read the data into the buffer? o.O
Am I blind, or do I only see you making a String out of an empty, freshly allocated byte array of length len?

Oh whoops, I missed that line. I added it in the original post :stuck_out_tongue:

And that’s where you make sure you read all the bytes first. In my networking “library”, I send an integer containing the length of the ByteBuffer that was sent. In my read method, I make sure I read at least that many bytes before returning the data.
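
The read side looks something like this (a sketch assuming a blocking channel; readFully is just a name I made up for the helper):

import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// keep reading until the buffer is full; a single read() may return fewer bytes
static void readFully(ReadableByteChannel channel, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
        if (channel.read(buf) == -1) {
            throw new EOFException("channel closed before the full message arrived");
        }
    }
}

// usage: read the 4-byte length prefix, then exactly that many payload bytes
ByteBuffer lenBuf = ByteBuffer.allocate(4);
readFully(channel, lenBuf);
lenBuf.flip();
int len = lenBuf.getInt();
// (sanity-check len here!)
ByteBuffer payload = ByteBuffer.allocate(len);
readFully(channel, payload);
payload.flip();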

I see, that’s quite nifty.

Does it go against concurrency if you read the bytes stuck inside a loop until you reach the len amount, or is it better to wait for the next round-about and keep your key selected? (I.e. keeping it in the Selector’s selectedKeys() set - see the sketch at the end of this post.)

But what about protocols that don’t adhere to the “first int is the length of the payload” convention? Or is this a non-issue in general, since you’ll always know what you’re going to get?

And why is the Charset class slow compared to String’s getBytes(String charsetName) method?
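
Regarding the loop-vs-selector question, the selector-based alternative I have in mind looks roughly like this (key comes from the select loop; expectedLength and handleMessage are placeholders):

import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// inside the select loop, when key.isReadable():
SocketChannel ch = (SocketChannel) key.channel();
ByteBuffer buf = (ByteBuffer) key.attachment(); // per-connection buffer, attached on accept

ch.read(buf); // non-blocking: takes whatever bytes are available right now

if (buf.position() >= expectedLength) {
    // the whole message has arrived
    buf.flip();
    handleMessage(buf);
    buf.clear();
}
// otherwise just return: the key stays registered for OP_READ,
// so the selector fires again when more bytes arrive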

It’s indeed a non-issue, because the protocol in use must be known ahead of time.

It’s not. Both have the same implementation.