Best way of embedding binary data in source?

I know it is possible to embed data into class files or the jar, but for reasons I won't go into I want to do it in the source. I have tried:

Integer/Byte/Char/Short arrays:
int[] data = new int[]{1, 2, 3, 4, 5, 6};

These are all terrible - each individual element seems to have some overhead associated, over and above the storage requirements.

String literals, e.g.:
String data = "\u0001\u0002\u0003\u0004";

This works well and has just the fixed overhead for the String object. However it is subject to Unicode restrictions, and some characters are illegal.

Or would it be better to just use:
String data = "01234567ABCDEF0102";
and hope pack200 can work with the redundancy?
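A hex string like that needs a small runtime decoder; a minimal sketch (the class and method names here are just illustrative):

```java
public class HexDecode {
    // Decode a hex-digit String back into bytes at runtime.
    // Two source characters per byte, but both are printable ASCII,
    // so each costs only one byte in the class file's constant pool.
    public static byte[] decode(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
        }
        return out;
    }
}
```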

Any ideas welcome.

All the integer types (except long) are stored in 32 bits regardless of the variable type. To minimise the footprint, pack your data so it uses all 32 bits and use a static initialiser. I put mine in the class data and mark it private.
e.g. private int[] data = {0x12345678, 0xFEDCBA98, 0x55AA55AA};

Adding an extra 4K worth of integers (1024 ints) to a private static final array of integers increases the class file size by over 12k.
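Unpacking the packed ints back into bytes at runtime is cheap; a minimal sketch, assuming big-endian packing (names hypothetical):

```java
public class Unpack {
    private static final int[] data = { 0x12345678, 0xFEDCBA98, 0x55AA55AA };

    // Split each 32-bit word back into its four bytes, big-endian.
    public static byte[] unpack(int[] packed) {
        byte[] out = new byte[packed.length * 4];
        for (int i = 0; i < packed.length; i++) {
            out[i * 4]     = (byte) (packed[i] >>> 24);
            out[i * 4 + 1] = (byte) (packed[i] >>> 16);
            out[i * 4 + 2] = (byte) (packed[i] >>> 8);
            out[i * 4 + 3] = (byte)  packed[i];
        }
        return out;
    }
}
```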

Steer well clear of arrays at all costs.

int[] blah = new int[]{0,1,2,3,4,5,6,7,8,9};

Is equivalent to:

int[] blah = new int[10];
{
blah[0] = 0;
blah[1] = 1;
blah[2] = 2;
blah[3] = 3;
blah[4] = 4;
blah[5] = 5;
blah[6] = 6;
blah[7] = 7;
blah[8] = 8;
blah[9] = 9;
}

Depending on the size of the array, that’s 2, 3 or even 4 additional bytes per element on top of the actual data being stored!

If you look at the bytecode for these array structures, you’ll notice that each number in the array is added in the static initializer block. That, plus the fact that each integer outside the byte range is added to the constant pool, means that the structure:

static final int[] x = { 1024 };

adds 5 bytes to the constant pool for the 1024 value, about 3 bytes for storing the 1024 value into the array, plus the overhead of creating the array. (I don’t have the classfile spec in front of me, so I’m not sure the byte counts for pushing the value into the array are exactly right.)

Alternate options are to insert data into the classfile as an attribute block, or to load the data from a file in the deployed zip. I have a tool that I worked on which adds a binary data blob to a class file, and updates the class file to reference the correct position and size of the added data.

Actually, strings might be the way to go; you probably just need to handle certain characters (such as surrogates) differently.

For instance the Java compiler doesn’t like "\u0022" to appear in a string; it treats it just like an un-escaped ".

EDIT: Just realized this is in the 4k section, so would most certainly not be a viable solution. It might be useful for > 4k projects though.

You could store the data as Base64 encoded string. The array:

[1,2,3,4,5,6,7,8,9,10]

would be represented as:

AQIDBAUGBwgJCg==

You would need to edit the array outside the source and then encode it before inserting in the source. You then need to implement a function in your code to decode it, so there is a performance overhead.

private static byte[] data = Base64.decode("AQIDBAUGBwgJCg==");

I have used this library in the past for encoding/decoding Base64 which works well:
http://iharder.sourceforge.net/current/java/base64/
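For what it's worth, anyone reading this on Java 8 or later can skip the external library: java.util.Base64 is in the public API. A quick sketch of the round trip:

```java
import java.util.Base64;

public class B64 {
    public static void main(String[] args) {
        // Decode the embedded literal back into the original bytes
        byte[] data = Base64.getDecoder().decode("AQIDBAUGBwgJCg==");
        // ... and re-encode, e.g. when regenerating the source literal
        String literal = Base64.getEncoder().encodeToString(data);
        System.out.println(data.length + " bytes: " + literal);
    }
}
```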

At the moment I’m tending to use strings, but only storing one byte per character. For true binary data (not that I really have that) this means the UTF-8 encoding overhead is larger than if you used the full range of char, but it avoids the problems with invalid characters.
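A minimal sketch of that one-byte-per-char scheme (names are illustrative):

```java
public class OneBytePerChar {
    // Each char holds one byte value 0..255. Values 0x01..0x7F cost one
    // byte in the class file's modified UTF-8 pool; 0x00 and 0x80..0xFF
    // cost two. No char is ever an invalid code point or a surrogate.
    public static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) sb.append((char) (b & 0xFF));
        return sb.toString();
    }

    public static byte[] decode(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < out.length; i++) out[i] = (byte) s.charAt(i);
        return out;
    }
}
```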

I threw this together if anyone wants to muck with it. It creates a classfile like this:


public class D {
  public static final String D = "....";
}

where "...." is the passed-in byte array. The assumption here is that an optimizer will be used to inline the String.

The "major" issue:

The UTF-8 data chunk must be valid, so the byte array must be converted into/out of UTF-8
for use - or use some other encoding method. (See JVMS 4.4.7, The CONSTANT_Utf8_info Structure, for details.)


import java.io.*;
import java.lang.reflect.*;

/**
 * Create a class as:
 * <code>
 * public class D {
 *   public static final String D = "....";
 * }
 * </code>
 * where "...." contains the specified byte array.
 */

public class MakeData
{
  private static final int[] prefix = 
  {
    0xcafebabe, 0x00000032, 0x000e0a00, 0x03000b07,
    0x00040700, 0x0c010001, 0x44010012, 0x4c6a6176,
    0x612f6c61, 0x6e672f53, 0x7472696e, 0x673b0100,
    0x0d436f6e, 0x7374616e, 0x7456616c, 0x75650800,
    0x0d010006, 0x3c696e69, 0x743e0100, 0x03282956,
    0x01000443, 0x6f64650c, 0x00080009, 0x0100106a,
    0x6176612f, 0x6c616e67, 0x2f4f626a, 0x65637401 
  };

  private static final int[] postfix = 
  {
    0x00210002, 0x00030000, 0x00010019, 0x00040005,
    0x00010006, 0x00000002, 0x00070001, 0x00010008,
    0x00090001, 0x000a0000, 0x00110001, 0x00010000,
    0x00052ab7, 0x0001b100
  };


  public static void createClassfile(byte[] data)
  {
    byte[] cf = create(data);
    
    try {
      FileOutputStream out = new FileOutputStream("D.class");
      out.write(cf);
      out.close();
    }
    catch(FileNotFoundException e) {
      System.err.println("error: failed to open.");
    }
    catch (IOException e) {
      System.err.println("error: write failed.");
    }
    
  }

  
  /**
   *  Creates the classfile bytes.
   */
  public static byte[] create(byte[] data)
  {
    byte[] dst = new byte[data.length+168+2+5];
    int    j   = 0;
    
    for(int i = 0; i<prefix.length; i++) {
      int c    = prefix[i];
      dst[j  ] = (byte)(c >>> 24);
      dst[j+1] = (byte)(c >>> 16);
      dst[j+2] = (byte)(c >>>  8);
      dst[j+3] = (byte)(c       );
      j += 4;
    }

    // The length of the UTF8 chunk
    dst[j++] = (byte)(data.length >> 8);
    dst[j++] = (byte)(data.length);
    
    // Shove the data into the UTF8 chunk
    for(int i = 0; i<data.length; i++) {
      dst[j++] = data[i];
    }
    
    for(int i = 0; i<postfix.length; i++) {
      int c    = postfix[i];
      dst[j  ] = (byte)(c >>> 24);
      dst[j+1] = (byte)(c >>> 16);
      dst[j+2] = (byte)(c >>>  8);
      dst[j+3] = (byte)(c       );
      j += 4;
    }
    
    return dst;
  }

  
  // ALL TESTING BELOW HERE
  
  static class TestingLoader extends ClassLoader
  {
    byte[] data;
    
    public TestingLoader(byte[] data)
    {
      this.data = data;
    }
    
    @Override
    public Class<?> findClass(String name) 
    {
      byte[] b = loadClassData(name);
      return defineClass(name, b, 0, b.length);
    }

    private byte[] loadClassData(String name) 
    {
      return data;
    }
  }

  
  public static void test(byte[] cf, byte[] src)
  {
    try {
      ClassLoader cl = new TestingLoader(cf);
      Class<?>    d  = cl.loadClass("D");
      String      f  = (String)d.getField("D").get(null);
      byte[]      out = new byte[f.length()];

      f.getBytes(0,f.length(),out,0);

      if (out.length == src.length) {
        for(int i = 0; i < out.length; i++) {
          if (src[i] == out[i])
            continue;
          System.err.println("broken");
        }
        return;
      }
      System.err.println("broken");
    }
    catch (Exception e)
    {
      System.err.println("problem");
      e.printStackTrace();
    }
  }
  
  public static void main(String[] args)
  {
    int n = 0x7e;
    byte[] src = new byte[n];
    
    for(int i = 0; i<n; i++)
      src[i] = (byte)(i+1);
    
    byte[] cf = create(src);
    
    createClassfile(src);
    
    test(cf,src);
    
    System.exit(0);
  }
}

Has anyone confirmed that pack200 makes the optimisation of embedding data inside the class file redundant?

This has me thinking about optimal bytes. If you encode byte data directly into a UTF-8 string, you could encode it in your class file like so:


String BLOB = "" + (char) 0x20 + (char) 0x00 + (char) 0x22 + (char) 0xff;
...
byte[] data = BLOB.getBytes("ISO-8859-1");

(Someone would need to find out if the encoding is required here; I believe it is). However, any byte value > 127 will cause the UTF-8 string to grow more than storing 1 byte for that character (give or take).

To reduce this extra data usage, you could encode the data in essentially Base127. The decoding looks something like this:


        int maskA = 0xfe;
        int maskB = 0x1;
        int bits = 6;
        int spos = 0;
        for (int i = 0; i < EXPECTED_DECODED_DATA_SIZE; i++) {
            int c = EMBEDDED_DATA.charAt(spos++);
            DECODED_DATA[i] = (byte) ((c << (7 - bits)) & maskA |
                ((EMBEDDED_DATA.charAt(spos) >> (bits)) & maskB));
            if (--bits < 0) {
                bits = 6;
                maskA = 0xfe;
                maskB = 0x1;
                spos++;
            } else {
                maskA <<= 1;
                maskA &= 0xff;
                maskB <<= 1;
                maskB++;
            }
        }

Unless you have a really big data blob with lots of expanded UTF-8 characters, I’m imagining that straight-up String embedding would be smaller than including a decoding script like this (which has the added negative of including a new method call “charAt” as well).
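For reference, the matching encoder can be sketched as a plain 7-bit big-endian bit-packer; I believe this produces the bit layout the decode loop above expects, but treat it as a sketch (names hypothetical):

```java
public class Base127 {
    // Pack the input bytes as a big-endian bitstream, seven bits per
    // output char, so every char stays in 0x00..0x7F (one byte in UTF-8).
    public static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int acc = 0, nbits = 0;
        for (byte b : data) {
            acc = (acc << 8) | (b & 0xFF); // at most 14 bits held at once
            nbits += 8;
            while (nbits >= 7) {
                sb.append((char) ((acc >>> (nbits - 7)) & 0x7F));
                nbits -= 7;
            }
        }
        if (nbits > 0) // left-align any trailing bits into a final char
            sb.append((char) ((acc << (7 - nbits)) & 0x7F));
        return sb.toString();
    }
}
```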

If you go the approach for reading data that was added to the bytecode itself, there’s extra code overhead for retrieving that data as well.

YMMV, but it looks to me like String-encoded data may be the smallest approach.

Encoding 8-bit data:

[0x01,0x7F] = 1 byte

0 and [0x80,0xFF] = 2 bytes : 110:nnnnn, 10:nnnnnn

The two byte format (for 8-bit number: abcdefgh) -> 110000ab 10cdefgh (so zero is 11000000 10000000)

I’d expect this to compress relatively well (four 10-bit patterns).
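The cost of that scheme is easy to compute; a small sketch (hypothetical helper):

```java
public class Utf8Cost {
    // Modified UTF-8 size of 8-bit data stored one byte per char:
    // 0x01..0x7F take one byte; 0x00 and 0x80..0xFF take two
    // (the 110000ab 10cdefgh form described above).
    public static int encodedLength(byte[] data) {
        int len = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            len += (v >= 0x01 && v <= 0x7F) ? 1 : 2;
        }
        return len;
    }
}
```

For uniformly random bytes this averages just over 1.5 bytes per byte.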

Actually, why not use chars and UTF-16 encode the data? Java doesn’t seem to care about invalid surrogate pairs; the only special case is for quote marks, and that doesn’t have any effect on decoding. I think this is much more efficient than UTF-8.


// assume bytes is even in length
public static String encode(byte[] bytes){
	StringBuilder sb = new StringBuilder();
	for(int i = 0; i < bytes.length/2; i++){
		char c = (char)((bytes[i*2] & 0xFF) | ((bytes[i*2 + 1] & 0xFF) << 8));
		// \uXXXX escapes are translated before the lexer runs, so any
		// char that would become ", \, CR or LF must use a normal escape
		switch (c) {
			case '"':  sb.append("\\\""); break;
			case '\\': sb.append("\\\\"); break;
			case '\r': sb.append("\\r");  break;
			case '\n': sb.append("\\n");  break;
			default:
				String val = Integer.toHexString(c);
				int pad = 4 - val.length();
				for(int j = 0; j < pad; j++){
					val = "0" + val;
				}
				sb.append("\\u" + val);
		}
	}
	return "String data = \"" + sb.toString() + "\";";
}


public static byte[] decode(String str){
	byte[] bytes = new byte[str.length()*2];
	for(int i = 0; i < str.length(); i++){
		char c = str.charAt(i);
		bytes[i*2] = (byte)(c & 0xFF);
		bytes[i*2+1] = (byte)(c >> 8);
	}
	return bytes;
}

Here is what it looks like:
32 random bytes (hex) acb84fdf136f6c9c57486880104488abd9cd18d612ecf7c7cf33afd313ddc6
encoded:
String data = "\ub8ac\u0d4f\u130f\u6c6f\u579c\u6848\u1080\u8844\ud9ab\u18cd\u12d6\uf7ec\ucfc7\uaf33\u13d3\uc6dd";

You can put whatever String literal you like in the sourcecode, but when it’s written to the constants pool in the binary .class file it will be stored using modified UTF-8.
Consequently you’ll end up with it using 1, 2 or 3 bytes per character (or more, if you hit a surrogate pair!) - and given your example of using both the upper & lower 8 bits of the char, most will require 3+ bytes per input char.
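The per-char cost in modified UTF-8 can be sketched like this (JVMS §4.4.7; the helper name is mine):

```java
public class ModUtf8 {
    // Bytes one char occupies in the class file's modified UTF-8
    // constant pool. Random 16-bit chars mostly land in the
    // 0x0800..0xFFFF range, hence three bytes each.
    public static int bytesForChar(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1;
        if (c <= 0x07FF) return 2; // includes '\u0000', stored as two bytes
        return 3;
    }
}
```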

I think the best* suggestion so far is to Base64 encode it.

  1. It’ll expand your data by a third (4 output characters for every 3 input bytes) BUT it’ll compress very well.
  2. Decoding it is very cheap (in terms of code size)

*unless someone confirms that using pack200 in fact renders this entire discussion pointless.

Is there a Base64 implementation in the public Java 1.5 API? I know that Sun has at least 2 implementations in their distribution, but they aren’t public.

On another note, perhaps a simple 16-bit character encoding scheme, such as 4 bits + 8 bits per character? I’d need to go over the UTF-8 encoding scheme, but in the worst case it would get you the same ratio as MIME encoding (3/4), with a smaller decoding overhead cost, which is my biggest concern.

A classfile is composed of (roughly) three parts: header, constant pool and attributes, and the verifier ensures that a given classfile is valid at load-time. Since you can’t push any data into the header, this leaves the CP and attributes. For the CP, the only choice is strings, which are always encoded in UTF-8 (as Abuse stated). The only way I can think of to shove data into an attribute is via an annotation. Shoving a byte array into an annotation is bloated, so I don’t think this is an option.

The piece of code that I posted above is intended as a build tool which generates a classfile (of a class named "D" with a single public String named "D"). This is to separate binary data from source during the development cycle.

All it does is shove the raw data provided into the CP entry of String "D", so it must be validly encoded. It’s up to the user to provide a valid encoding, of which direct UTF-8 is one option.

Directly encoding 8-bit data as UTF8 requires (on average) 1.5 bytes per byte, but as I stated above, the two byte encoding has exactly 4 10-bit prefixes and should compress well. So, for the UTF8 example, the decode source is:


  byte[] data = D.D.getBytes("UTF8");

WRT Pack200: Pack200 is a front-end transform of one or more classes. By this I mean that it reorganizes the raw data into a form which is likely to compress better with a standard entropy compressor. The transforms applied to UTF8 entries really target member signatures and fully-qualified names. The pseudo-code directly from the spec (asserts deleted for length):


  int cursor     = 0;
  int big_cursor = 0;

  for (int i = 1; i < cp_Utf8_count; i++) {
    String thisString = cp_Utf8[i];

    int prefix = (i == 1)? 0: cp_Utf8_prefix[i-2];
    int suffix        = thisString.length() - prefix;
    String prevString = cp_Utf8[i-1];
    String prevPrefix = prevString.substring(0, prefix);
    String thisPrefix = thisString.substring(0, prefix);

    int small_suffix = cp_Utf8_suffix[i-1];
    char[] suffix_chars;
    int offset;

    if (small_suffix != 0) {
      suffix_chars = cp_Utf8_chars;
      offset       = cursor;
      cursor      += suffix;
    } else {
      suffix_chars  = cp_Utf8_big_chars[big_cursor];
      offset        = 0;
      big_cursor   += 1;
    }
    String thisSuffix = thisString.substring(prefix);
    String theseChars = new String(suffix_chars, offset, suffix);
  }

So, not of any interest for data encoded as a String.

I’ve recoded Falcon4k to convert 5 chars into one 4-byte integer by coding each character as 32 + 0…94. This is slightly more efficient than using Base64 and has saved 224 bytes from my original jar, which used an integer array initialisation approach. However for the pack.gz version the conversion only saved 61 bytes.
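A sketch of that base-95 scheme (95^5 ≈ 7.7e9 > 2^32, so five printable chars cover any int; the class and method names are mine, not Falcon4k's):

```java
public class Base95 {
    // Each char is 32 + a base-95 digit, i.e. the printable range
    // 0x20..0x7E. Note that '"' (34) and '\\' (92) still need
    // escaping when the result is pasted into a string literal.
    public static String encode(int value) {
        long v = value & 0xFFFFFFFFL; // treat the int as unsigned
        char[] out = new char[5];
        for (int i = 4; i >= 0; i--) {
            out[i] = (char) (32 + (int) (v % 95));
            v /= 95;
        }
        return new String(out);
    }

    public static int decode(String s) {
        long v = 0;
        for (int i = 0; i < 5; i++) v = v * 95 + (s.charAt(i) - 32);
        return (int) v;
    }
}
```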

Actually, it will skip “blah[0] = 0;”. :smiley:

Funny you should pick up on that; I initially had the values 1-based, but changed it to 0-based just before posting 'cos it looked ugly :-*

Didn’t JBanes have some tools to handle embedding in magical ways at one point?

Kev