Just use the most common letters first. Since a class file seems to be somewhat dominated by Strings in the constant pool, this should be about as good as it gets. You don’t need to understand Huffman encoding, other than that more frequent characters take fewer bits.
All right, thanks. That oughta do 'er. I’ll just use vowels and common consonants.
I don’t think you’re passing the right options to pack200. Tell it to take its time and it should try out various different string encoding mechanisms and pick the one which works best for the statistics of your long string.
@Demonpants, ditch the \n and either hard-code the level width or encode it as the first character. You might find that using characters which are close to each other allows pack200 to do well with delta-encoding. And I would endorse using \u0000 for a common character because that’s a very common byte in pack200 files, and should give you good entropy even if it isn’t delta-encoded.
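For what it's worth, here is a rough sketch of the "width as the first character" idea. The class name `LevelDecoder` and the row-major tile layout are assumptions made up for illustration, not anything from the actual game:

```java
class LevelDecoder {
    // Hypothetical layout: char 0 holds the row width, the remaining chars
    // are the tiles in row-major order, with no '\n' separators.
    static char[][] decode(String packed) {
        int width = packed.charAt(0);
        int tiles = packed.length() - 1;
        if (width == 0 || tiles % width != 0) {
            throw new IllegalArgumentException("tile count is not a multiple of the width");
        }
        int height = tiles / width;
        char[][] level = new char[height][width];
        for (int i = 0; i < tiles; i++) {
            // i / width is the row, i % width is the column
            level[i / width][i % width] = packed.charAt(i + 1);
        }
        return level;
    }

    public static void main(String[] args) {
        // A 3-wide, 2-high level: rows "abc" and "bca"
        String packed = String.valueOf((char) 3) + "abcbca";
        char[][] level = decode(packed);
        for (char[] row : level) {
            System.out.println(new String(row));
        }
    }
}
```

Hard-coding the width instead just means replacing `packed.charAt(0)` with a constant and starting the loop at index 0.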
But in modified UTF8 \u0000 is encoded as 0xC080, not 0x00?
Unless 0xC080 is the byte sequence that you are saying is very common in pack200 files?
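A quick way to check the modified UTF-8 claim is `DataOutputStream.writeUTF`, which writes a two-byte length followed by the same modified UTF-8 encoding that class files use for string constants. The class name below is just a placeholder:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Returns the writeUTF output: 2-byte big-endian length, then modified UTF-8.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Encode a string holding a single NUL character
        for (byte b : encode(String.valueOf((char) 0))) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: 00 02 C0 80
    }
}
```

So a NUL in a raw (unpacked) class file costs two bytes, C0 80, rather than a single 00.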
All right, thanks guys. I’ll pre-program the width and height and take out the \n.
Can someone perhaps list the 10 most common characters to use in Pack200, or should I just use \u0000 and the most common letters of the alphabet?
@pjt33
I have tried many options with both pack200 and 7zip. I even get slightly better performance than Riven's tool with these 2 tools (but kzip/bjflate beat it by a bit). Also, in strings \u0000 is not a common item, at least in the class files (after a pack200) I have checked. And all strings are encoded with modified UTF-8, which is not altered much by pack200 except to globalize the constant pool and to make some effort with common prefixes.
At any rate, I am using fewer bytes now and can forget about encoding/decoding tricks and don’t have to unpack bytes from chars.
That’s true. I’d forgotten that. I’ll have to look at the pack200 spec again to see whether it mentions handling of NUL characters.
[quote=http://wiki.eclipse.org/index.php/Pack200]Pack200 reduces the size of a JAR file by:
- Storing internal data structures.
[/quote]
Any idea what this is talking about?
Does Pack200 do some magic on arrays defined in class files?
It would make sense that it did - as this is one of the most inefficient structures in a Java class file.
If that’s the case, then simply leaving your data as arrays in the class file may turn out to be the most efficient solution! :-X
Pack200 stores the internal data in a way that makes life easy for GZip (deflate) and reduces redundancy where it can (works really well with lots of classes). Unfortunately quite a few things still end up producing byte code. It can store byte code well so that deflate will compress it well, but you still end up with a lot of constants in the constant pool.
I did try this with arrays and it does not work well at all.
Remember that the notation
int[] data={1,2,3,4};
is really syntax sugar, not something that exists in byte code… i.e. it’s translated as stated in the posts above.
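To make that concrete, here is roughly what the initializer turns into, written back out as Java. This is a sketch of the equivalent source, not actual compiler output, and the class and method names are placeholders:

```java
import java.util.Arrays;

class ArrayInitDemo {
    // The short form as you would write it in source
    static int[] sugar() {
        int[] data = {1, 2, 3, 4};
        return data;
    }

    // Roughly what the compiler generates: allocate the array, then one
    // store per element. In byte code each store costs several
    // instructions (dup, push index, push value, iastore).
    static int[] expanded() {
        int[] data = new int[4];
        data[0] = 1;
        data[1] = 2;
        data[2] = 3;
        data[3] = 4;
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(sugar(), expanded())); // prints: true
    }
}
```

That per-element cost is why a big array literal bloats the class file compared with the same data packed into one String constant.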
Ok. An excerpt from the pack200 spec:
[quote]Each value in the band cp_Utf8_chars is a 16-bit number expressing a Java character. This band contains the characters of all small suffixes, in order. For each successive string, cp_Utf8_chars contains an additional run of values encoding the characters of its small suffix, if any. Therefore, the total length of this band is the sum of all values in the cp_Utf8_suffix band.
Whenever a small suffix length for a constant pool entry is zero, the string has no small suffix, but a big suffix instead. The length of each big suffix is given by an element of the cp_Utf8_big_suffix band. (Therefore, the length of this band is precisely the count of zero values in the cp_Utf8_suffix band.) Each big suffix is transmitted as a separate band of 16-bit character values, one band element per character. There is one such band per big suffix. These bands immediately follow the cp_Utf8_big_suffix band, and are collectively called the cp_Utf8_big_chars bands. Although normally data of the same type are collected into a single band, these strings are placed in separate bands so that they may be independently encoded. These strings typically encode arrays of binary data, rather than true Java characters.
[/quote]
So it doesn’t use the modified UTF-8 at all. Instead it uses the true char values and an appropriate encoding. With suitable options (the effort flag) it should try a lot of different encodings (there are about 100 supported) to find the best one. I don’t know whether this requires -E100.