... so never use UTF-8 encoding for binary stuff

I rather often abuse UTF-8 to encode binary data so I can pass it into a text-based API.

Today, after years (!!), was the first time I got caught by a non-reversible UTF-8 round trip.


         byte[] original = ....;
         String encoded = new String(original, "UTF-8"); // byte sequences that aren't well-formed UTF-8 are replaced while decoding
         byte[] decoded = encoded.getBytes("UTF-8");     // so the original bytes can't be recovered

         Arrays.equals(original, decoded); // false!

Gotta rewrite some stuff… shame on me!
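For what it's worth, here is a minimal, self-contained version of the failure (it uses StandardCharsets, so it assumes a Java 7+ runtime): any byte sequence that isn't well-formed UTF-8 is replaced with U+FFFD during decoding, so the information is already gone before getBytes is ever called.

        import java.nio.charset.StandardCharsets;
        import java.util.Arrays;

        public class Utf8RoundTrip {
            public static void main(String[] args) {
                // 0xFF can never appear in well-formed UTF-8, so the decoder
                // substitutes U+FFFD (the replacement character) for it.
                byte[] original = { (byte) 0xFF, 0x41 };
                String encoded = new String(original, StandardCharsets.UTF_8); // "\uFFFDA"
                byte[] decoded = encoded.getBytes(StandardCharsets.UTF_8);     // EF BF BD 41

                System.out.println(Arrays.toString(original)); // [-1, 65]
                System.out.println(Arrays.toString(decoded));  // [-17, -65, -67, 65]
                System.out.println(Arrays.equals(original, decoded)); // false
            }
        }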

Don’t know why this stuff isn’t already in the JRE, but Base64 encoding works for me when I’m ramming binary data into java.util.prefs.

It is there, in rt.jar, but not supported:

sun.misc.BASE64Encoder
sun.misc.BASE64Decoder
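Since Java 8 there is a supported alternative, java.util.Base64. A sketch of round-tripping binary data through java.util.prefs with it (the class and key names here are made up for the example):

        import java.util.Arrays;
        import java.util.Base64;
        import java.util.prefs.Preferences;

        public class PrefsBase64 {
            public static void main(String[] args) {
                byte[] original = { 0, (byte) 0xFF, 42, (byte) 0x80 };

                // Base64 maps arbitrary bytes onto a plain ASCII alphabet,
                // so the round trip through a String-only API is exact.
                Preferences prefs = Preferences.userNodeForPackage(PrefsBase64.class);
                prefs.put("blob", Base64.getEncoder().encodeToString(original));

                byte[] restored = Base64.getDecoder().decode(prefs.get("blob", ""));
                System.out.println(Arrays.equals(original, restored)); // true
            }
        }

(Preferences also has putByteArray/getByteArray, which handle the String conversion for you.)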

Presumably the cause of your problem is that ‘byte[] original’ contains a string encoded using modified UTF-8, rather than UTF-8? (Caused by improper use of dos.writeUTF elsewhere in your app.)

Though if that’s the case I’m surprised you hadn’t encountered a problem sooner; it’s unusual for binary data to contain no zeros!
Though perhaps the UTF-8 decoder used by the String constructor is silently accepting an overlong encoding for zero, and you’ve only been caught out now because your data contains bytes corresponding to one of the UTF-16 surrogate values (which modified UTF-8 also encodes non-standardly, as three-byte surrogate sequences).

If that’s the case, the UTF-8 decoder used by Java is being very naughty, as accepting overlong encodings means it fails to meet the current Unicode conformance requirements!
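For reference, the “modified UTF-8” being referred to is what DataOutputStream.writeUTF produces: U+0000 comes out as the overlong two-byte sequence C0 80 rather than a single zero byte, and a strict UTF-8 decoder is required to reject exactly that kind of sequence. A small sketch:

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class ModifiedUtf8Demo {
            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                DataOutputStream dos = new DataOutputStream(bos);

                // writeUTF prefixes a two-byte length, then writes "modified UTF-8":
                // the NUL character becomes the overlong sequence C0 80 instead of a single 00 byte.
                dos.writeUTF("\0");

                for (byte b : bos.toByteArray()) {
                    System.out.printf("%02X ", b & 0xFF); // prints: 00 02 C0 80
                }
            }
        }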

I was always ‘serializing’ data that was more or less textual, but binary in the end, like what you get from DataOutputStream when your protocol is mainly string-based.

Today it simply went berserk, due to the need to write binary into a text SQL column: ObjectOutputStream -> UTF-8 -> ObjectInputStream.
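One way to make that pipeline lossless is to Base64 the serialized bytes before they go into the text column. A sketch, with the column stood in for by a plain String and Java 8’s java.util.Base64 assumed:

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.ObjectInputStream;
        import java.io.ObjectOutputStream;
        import java.util.Base64;

        public class SerializeToTextColumn {
            public static void main(String[] args) throws IOException, ClassNotFoundException {
                // Serialize any Serializable object to raw bytes.
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                    oos.writeObject(new int[] { 1, 2, 3 });
                }

                // Base64, not UTF-8, turns those bytes into text without loss.
                String textColumn = Base64.getEncoder().encodeToString(bos.toByteArray());

                // ... later: read the column back and deserialize.
                byte[] bytes = Base64.getDecoder().decode(textColumn);
                try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                    int[] restored = (int[]) ois.readObject();
                    System.out.println(restored.length); // 3
                }
            }
        }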

Why not use ISO-8859-1? That maps all 256 byte values to characters, so it’s a lot more suitable.
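ISO-8859-1 does round-trip, because it maps every byte value 0x00 to 0xFF one-to-one onto the code points U+0000 to U+00FF, so decoding can neither fail nor lose information. A quick check:

        import java.nio.charset.StandardCharsets;
        import java.util.Arrays;

        public class Latin1RoundTrip {
            public static void main(String[] args) {
                byte[] original = new byte[256];
                for (int i = 0; i < 256; i++) {
                    original[i] = (byte) i; // every possible byte value
                }

                // ISO-8859-1 maps byte 0xNN to code point U+00NN, so the
                // String -> byte[] round trip is exact for arbitrary binary data.
                String encoded = new String(original, StandardCharsets.ISO_8859_1);
                byte[] decoded = encoded.getBytes(StandardCharsets.ISO_8859_1);

                System.out.println(Arrays.equals(original, decoded)); // true
            }
        }

The resulting strings can contain control characters and other unprintable junk, though, which some text-based APIs and columns won’t tolerate; Base64 stays within plain printable ASCII, which is why it’s usually the safer choice.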