Best way of embedding binary data in source?

Not sure if you meant the tool I made to embed binary data into a user-defined class attribute. Such an option is moot with pack200, as it does not deal with unknown class attributes, so you are better off just leaving the data as another file in the pack200-compressed jar.

Although it does have a facility for adding them, which may be handy for build scripts. I haven’t looked into it, but I noticed it in the output of

pack200 --help

Yes, you are correct: you are able to define how to handle your attribute. However, I really doubt there will be a net gain (i.e. reduced size) compared to using a separate file in a jar that is then compressed into the pack200 format, because you will need to transfer the definition of the attribute as well.

I would love to be proven wrong though :slight_smile:

So how do you access the attribute?

Presumably like usual: use Class#getResourceAsStream(classFileName), then search through the bytes for your unique identifier indicating the start of the attribute.
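A rough sketch of that scan, assuming a hypothetical 4-byte marker written before the payload (the marker value and helper names here are made up, not from the tool above):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class AttributeScan {
    // Hypothetical marker assumed to have been written before the payload.
    static final byte[] MARKER = {(byte) 0xCA, (byte) 0xFE, 0x42, 0x42};

    // Return the index just past the marker, or -1 if it is absent.
    static int indexAfterMarker(byte[] all) {
        outer:
        for (int i = 0; i + MARKER.length <= all.length; i++) {
            for (int j = 0; j < MARKER.length; j++) {
                if (all[i + j] != MARKER[j]) continue outer;
            }
            return i + MARKER.length;
        }
        return -1;
    }

    // Read the raw bytes of a class file via getResourceAsStream.
    static byte[] classBytes(Class<?> c) throws Exception {
        String name = "/" + c.getName().replace('.', '/') + ".class";
        try (InputStream in = c.getResourceAsStream(name)) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }
}
```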

There’s a trade-off here between 100% packing by using binary attribute data vs. a 4 to 5 (or what have you) encoding in Strings, based on the constant size for retrieving the data.

I was planning at some point to sit down and actually try to compute the different cut-off values. Unfortunately, it won’t fully capture the possible additional optimizations that could be done on the data extraction process.

Random thinking out loud thoughts on the matter.

Since the following seems to always be true:


byte[] data = new byte[65536 * 100];
Random rand = new Random();
rand.nextBytes(data);
assert Arrays.equals(data, new String(data).getBytes());

OK, so as per the API this may not work, but it works fine on everything I tested it on.

So the problem of storing binary data in a string seems to be encoding to and from characters. What about “padding” a string to the correct length in the .java file and putting the binary data directly into the class file? I have hacked “enter product key” things this way, using only the strings program and vim!

It’s worth noting here that pack200 supports a number of string encodings, and if you give it the flags for best effort, it seems to try different ones to produce optimal output for the statistics of the string you give it. In particular, if your data roughly increases, I think it will do difference encoding for you, so you need not bother and can save yourself the code to undo it.

That (usually) works on the same computer, but try shipping those bytes to another computer. The conversion is done with a locale-dependent character encoding, so it might be MacRoman on OS X (certainly used to be); UTF-8 on my Linux box; ISO-8859-1 on my old Linux box; etc.

The tool code I posted above basically shoves whatever data you want into the CP entry. The verifier will reject the class if the string is not a valid (modified) UTF-8 encoding, so a raw 0x00 byte will cause the class to be rejected.
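For the curious, DataOutputStream.writeUTF emits the same “modified UTF-8” that CONSTANT_Utf8 entries use, so you can see the 0x00 restriction directly: U+0000 is written as the two-byte pair 0xC0 0x80, never a raw zero byte (the helper name here is made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class ModifiedUtf8Demo {
    static String hex(String s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        // writeUTF emits a 2-byte length prefix followed by the chars in
        // the same modified UTF-8 used by CONSTANT_Utf8 entries.
        dos.writeUTF(s);
        StringBuilder sb = new StringBuilder();
        for (byte b : bos.toByteArray()) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // U+0000 becomes the pair c0 80, never a raw 00 byte.
        System.out.println(hex("\u0000A")); // prints: 00 03 c0 80 41
    }
}
```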

[quote]That (usually) works on the same computer, but try shipping those bytes to another computer. The conversion is done with a locale-dependent character encoding, so it might be MacRoman on OS X (certainly used to be); UTF-8 on my Linux box; ISO-8859-1 on my old Linux box; etc.
[/quote]
I don’t think the “encoding”, beyond the byte packing, changes the bytes. So UTF-8 does the 1-, 2- or 3-byte thing with restrictions on what can be encoded in each, as stated above. I thought the encoding was more about which char gets decoded to which “letter”. For one, I have never heard of MacRoman; that sounds like a char-to-font thing.

Also, I have done this over the network to other machines and had no problems. But some machines may default to UTF-16 or something, so I should use the versions that specify an encoding.
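For the record, the versions that specify an encoding do make this safe: ISO-8859-1 maps every byte value 0x00–0xFF to the char with the same code point, so the round trip is lossless regardless of the platform default. A minimal sketch (class and method names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;

public class RoundTrip {
    // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same
    // code point, so encode/decode is lossless on every platform.
    static boolean roundTrips(byte[] data) {
        String s = new String(data, StandardCharsets.ISO_8859_1);
        return Arrays.equals(data, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        new Random().nextBytes(data);
        System.out.println(roundTrips(data)); // prints: true
    }
}
```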

You’re getting encoding and charset the wrong way round. (It doesn’t help that the Java class names are wrong!)

The charset (mapping between values of the char datatype and “letters”) used by Java is Unicode. The String methods which convert to and from bytes and don’t take a Charset (that is, encoding) argument use the default encoding, which varies by platform and locale.

import java.nio.charset.Charset;

public class EncDemo {
        public static void main(String[] args) {
                System.out.println(Charset.defaultCharset());
                byte[] bytes = "\u00fe".getBytes();
                for (byte b : bytes) {
                        System.out.print(Integer.toHexString((b >> 4) & 0xf));
                        System.out.print(Integer.toHexString(b & 0xf));
                        System.out.print(" ");
                }
                System.out.println();
        }
}

A quick test on an OS X (10.5.8) box gives:

$ java EncDemo
MacRoman
3f 

whereas my Kubuntu box gives:

$ java EncDemo
UTF-8
c3 be 

but if I change the locale:

$ LANG=en_US java EncDemo
US-ASCII
3f 

Mmm, thanks… But I do hate it when I learn something that says some deployed code is wrong (but it’s working!). Guess I got lucky and the encoding is set somewhere in the app.

So that leaves us with strings and some unpacking logic, I guess. I did some tests last night; strings do seem to pack rather tightly into the archives. Random data does not expand the pack200.gz by much more than the data itself (24 extra bytes from 1024). However, there are some illegal values, so that’s not really quite 1024 random bytes.

Maybe you’re just using values in ASCII, which are the same in most encodings which aren’t designed for locales which use a non-Latin alphabet. I’ve been badly bitten before, to the extent that I’ve added a sanity check string to my main data file which will break if it gets mis-transformed (e.g. saved as UTF-8 and loaded as ISO-8859-1) at any step.

Well after reading up on Pack200 and constant pools and everything else that isn’t real work (Molten salt reactors for the win!) I have found a pretty easy way to get the data into the jar/pack200 file.

Just use a resource file!

Really, I get a 2-byte overhead including a file named “run” (i.e. a name already in the global constant pool) of zero length. I also get pretty good compression with non-zero-length files. And the best gain is that the decoding logic is minimal compared to strings.

Turns out this would be the same overhead in a pack200 file as an attribute that is just stored, i.e. it’s very close to just appending raw bytes to the class. I can’t see any other method getting close, really.
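A minimal sketch of the resource-file approach (the resource name “run” is from the post above; the class and method names here are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResourceData {
    // Load an embedded resource (e.g. "run") from the jar as raw bytes.
    public static byte[] load(String name) throws IOException {
        try (InputStream in = ResourceData.class.getResourceAsStream("/" + name)) {
            if (in == null) throw new IOException("missing resource: " + name);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }
}
```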

Yup, that was my conclusion when I investigated converting the “embed binary data as a class attribute” trick for use in pack200… don’t bother, for the reasons you stated above.

Are you going to show us what it is? Sounds to me like a getClass() call, a Class.getResourceAsStream(String) call, and an InputStream.read() call at minimum, which doesn’t seem to compare particularly favourably with String.charAt(int).

It’s an input stream and ClassLoader.getSys… I don’t remember what the test came up with. But since to get any char from a string you must encode it, most bytes expand to 2 bytes with the high bit(s) set (some to 3 bytes). This seems to be bad news for gzip, and pack200 does not deal with these well (it is optimised for strings that are class names, for obvious reasons).

Overall it’s a pretty big difference in my case. 10 sprites (line art) now expand the archive by just 30 bytes rather than 100, and the addition of ClassLoader and InputStream is worth it (about 30 bytes IIRC). The loops to put the data into data structures are still the larger part of it.

A related but also separate question:

Currently I have my level data stored something like this:


wwwwwwwwww
w s  w  ew
w    w   w
w    w   w
w        w
wwwwwwwwww

That goes into an external txt file, sans extension and with a one-character file name. Now, the question is: would it perhaps be a better idea to put the string directly into the Java source like this:


String levelData = "wwwwwwwwww\nw s  w  ew\nw    w   w\nw    w   w\nw        w\nwwwwwwwwww";
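Either way, the unpacking is tiny. A sketch of turning that string into a grid (the class and method names are made up):

```java
public class LevelParser {
    // Split the newline-separated level string into a char grid,
    // one row per line, one cell per character.
    static char[][] parse(String levelData) {
        String[] rows = levelData.split("\n");
        char[][] grid = new char[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            grid[i] = rows[i].toCharArray();
        }
        return grid;
    }
}
```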

And is there an even better way to do it?

Well, in this case all the characters you use do in fact take only 1 byte in class-file UTF-8. Also, you could use the statistics of the English language to pick characters, since more common ones use fewer bits with Huffman coding (i.e. use ‘e’ rather than the rare ‘w’). So I would think a string could work pretty well in this case, and you save having the ClassLoader.get… method and InputStream classes in the constant pool as well.

Cool, I’ll have to look up Huffman coding and then just choose letters that work better. I only need like 5 characters, after all.