Creating a chunk-based custom binary file format

I wanted to try learning how to create a custom chunk-based binary file. I wrote a small tree structure that implements chunk hierarchy in my documentation:

As I started to work on this, I noticed that I can’t come up with a good way to write chunks into byte arrays, then output the byte arrays using DataOutputStream.

I do know that chunk-based binary files contain chunk headers that work like the index of a book. You skip parts of the file until you reach the chunk you want, and read in the data from there.

Can the experts teach me how to write chunk-based binary files using DataOutputStream? Thanks in advance.

You’re going to have to be more specific; what is the problem?

All I can suggest is that you look at the file format specification for a popular chunk-based format. I’d suggest PNG (section 4.7 onwards); it’s straightforward and well documented.
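As a rough illustration of that kind of layout, here is a minimal sketch of a PNG-style chunk writer: a 4-byte payload length, a 4-byte type, the payload, then a CRC-32 over the type and payload. The class name, the "HEAD" type and the payload contents are made up for the example, and everything is written to an in-memory stream so it runs without touching the disk. The payload is built in its own byte array first, which is one way to know the chunk size before the header is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class PngStyleChunkWriter {
    /** Writes one chunk: payload length, type, payload, CRC-32 over type + payload. */
    static void writeChunk(DataOutputStream out, int type, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.writeInt(type);
        out.write(payload);

        CRC32 crc = new CRC32();
        crc.update(new byte[] {
            (byte) (type >>> 24), (byte) (type >>> 16), (byte) (type >>> 8), (byte) type
        });
        crc.update(payload);
        out.writeInt((int) crc.getValue());
    }

    /** Builds a one-chunk "file" in memory (hypothetical contents). */
    static byte[] buildDemoFile() throws IOException {
        // Serialize the payload into its own buffer so its length is known up front.
        ByteArrayOutputStream payloadBytes = new ByteArrayOutputStream();
        DataOutputStream payload = new DataOutputStream(payloadBytes);
        payload.writeInt(1);        // hypothetical format version
        payload.writeUTF("demo");   // 2-byte length prefix + 4 bytes of text
        payload.flush();

        ByteArrayOutputStream fileBytes = new ByteArrayOutputStream();
        DataOutputStream file = new DataOutputStream(fileBytes);
        writeChunk(file, 0x48454144, payloadBytes.toByteArray()); // type bytes spell "HEAD"
        file.flush();
        return fileBytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(buildDemoFile().length + " bytes written");
    }
}
```

To write to a real file, the same `writeChunk` call should work with a `DataOutputStream` wrapped around a `FileOutputStream`.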

I created 2 classes, Chunk (superclass) and Header (subclass). Both of these are used to test out the creation of a custom chunk-based binary file. My plan is to just write the Header into the created file, and then load from the created file. This is to test and see if reading chunks and writing chunks are working or not.

Chunk:


package saving;

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Chunk {
	public static final long HEADER_SIGNATURE = 0x4845414445520000L; //ASCII bytes for "HEADER" followed by two zero (NUL) bytes
	public static final int SIGNATURE_COUNT = 1;
	
	protected long signature; 	//8 bytes of readable tag label.
	protected int size;			//Size of chunk.
	protected long[] chunkSignatureList; //Array of signatures that the file has saved.
	
	public Chunk() {
		this.signature = 0L;
		this.size = 0;
		this.chunkSignatureList = new long[SIGNATURE_COUNT]; //This number is the total number of signatures.
	}
	
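	public Chunk(Chunk chunk) {
		this.signature = chunk.getSignature();
		this.size = chunk.getSize();
		// Copy the signature list too; otherwise it would be left null in the copy.
		long[] list = chunk.getChunkSignatureList();
		this.chunkSignatureList = (list == null) ? new long[SIGNATURE_COUNT] : list.clone();
	}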
	
	public void read(DataInputStream input) {
		try {
			this.signature = input.readLong();
			this.size = input.readInt();
		}
		catch (IOException e) {
			throw new RuntimeException("Error in chunk reading the input", e);
		}
	}
	
	public void write(DataOutputStream output) {
		try {
			output.writeLong(signature);
			output.writeInt(size);
		}
		catch (IOException e) {
			throw new RuntimeException("Error in chunk writing the output", e);
		}
		
	}
	
	public long[] getChunkSignatureList() {
		return chunkSignatureList;
	}
	
	public long getSignature() {
		return signature;
	}
	
	public void setSignature(long sig) {
		this.signature = sig;
	}
	
	public int getSize() {
		return size;
	}
	
	public void setSize(int size) {
		this.size = size;
	}
	
	public void setChunkSignatureList(long[] list) {
		this.chunkSignatureList = list;
	}
	
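	public static Chunk convert(Chunk chunk) {
		// Only cast when the object really is a Header; a plain Chunk that merely
		// carries the header signature would otherwise cause a ClassCastException.
		if (chunk.getSignature() == HEADER_SIGNATURE && chunk instanceof Header)
			return new Header((Header) chunk);
		return null;
	}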
}


Header:


package saving;

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Header extends Chunk {
	private long fileTypeName;		//8 bytes containing the name of the program that reads this file.
	private int fileFormat;			//1 byte containing a period. 3 bytes containing the format name.
	
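	public Header() {
		super();
		this.signature = HEADER_SIGNATURE;
		// Assumption: size counts the payload only (8-byte fileTypeName + 4-byte fileFormat),
		// so a reader can skip this chunk after reading the signature and size.
		this.size = 8 + 4;
		this.fileTypeName = 0L;
		this.fileFormat = 0;
	}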
	
	public Header(Header header) {
		super(header);
		this.signature = HEADER_SIGNATURE;
		this.fileTypeName = header.getFileTypeName();
		this.fileFormat = header.getFileFormat();
		this.chunkSignatureList = header.getChunkSignatureList();
	}
	
	@Override
	public void read(DataInputStream input) {
		super.read(input);
		try {
			this.fileTypeName = input.readLong();
			this.fileFormat = input.readInt();
		}
		catch (IOException e) {
			throw new RuntimeException("Can't read header.", e);
		}
	}
	
	@Override
	public void write(DataOutputStream output) {
		super.write(output);
		try {
			output.writeLong(this.fileTypeName);
			output.writeInt(this.fileFormat);
		}
		catch (IOException e) {
			throw new RuntimeException("Can't write header.", e);
		}
	}
	
	public long getFileTypeName() {
		return this.fileTypeName;
	}
	
	public int getFileFormat() {
		return this.fileFormat;
	}
	
	public void setFileTypeName(long name) {
		this.fileTypeName = name;
	}
	
	public void setFileFormat(int format) {
		this.fileFormat = format;
	}
}

  • When writing a file, where should the chunks be created before they are written out?
  • What needs to be done before reading it back in?
  • I have heard of the Serializable interface; is it a good fit here?

I wouldn’t go through the hassle of creating a custom binary format. Just use XStream and save XML or JSON directly from your data model: http://xstream.codehaus.org/tutorial.html

Maybe each chunk contains the tag of the current chunk, the tag of the next chunk, and the size of the current chunk. The DataInputStream should read the tags of the current chunk, check to see if it matches the signature that is required. If not, go read in the next tag, then skip the current chunk via the given size of the current chunk, and then repeat by matching the required signature with the next tag that was last read in.

The only problem is that I don’t know how to tell the DataInputStream to jump back and forth between the chunks, checking the tags each time until the signature is found, and then read in that chunk. In C or C++, given a pointer into a byte array, one can just move the pointer back X bytes (or reset it to the first element of the array) and repeat the steps to check the tag.
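A stream can only move forward, but if every chunk header carries its payload size, skipping forward over unwanted chunks is often enough. Here is a minimal sketch of that idea, assuming a simplified layout of a 4-byte tag followed by a 4-byte payload size (not the 8-byte signature used in the Chunk class above). The class name, tags and demo data are hypothetical, and the "file" is an in-memory byte array so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

class ChunkScanSketch {
    /** Returns the payload of the first chunk with the wanted tag, or null if there is none. */
    static byte[] findChunk(DataInputStream in, int wantedTag) throws IOException {
        while (true) {
            int tag;
            try {
                tag = in.readInt();
            } catch (EOFException endOfFile) {
                return null; // no more chunk headers to read
            }
            int size = in.readInt();
            if (size < 0) {
                throw new IOException("Negative chunk size: " + size);
            }
            if (tag == wantedTag) {
                byte[] payload = new byte[size];
                in.readFully(payload); // readFully blocks until the whole payload is read
                return payload;
            }
            // Not the chunk we want: skip its payload. skipBytes may skip less than asked.
            int remaining = size;
            while (remaining > 0) {
                int skipped = in.skipBytes(remaining);
                if (skipped <= 0) {
                    throw new EOFException("Chunk is truncated");
                }
                remaining -= skipped;
            }
        }
    }

    /** Builds two small chunks in memory (hypothetical tags and data). */
    static byte[] demoFile() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(0x41414141); out.writeInt(3); out.write(new byte[] {1, 2, 3});
        out.writeInt(0x42424242); out.writeInt(2); out.write(new byte[] {9, 8});
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(demoFile()));
        byte[] payload = findChunk(in, 0x42424242);
        System.out.println(payload == null ? "not found" : payload.length + " payload bytes");
    }
}
```

Going backwards still means reopening the stream (or using mark/reset on a buffered stream), so this only covers the forward-scanning case.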

Even though it really is a hassle, it is something that I need to conquer in the near future. It is more about learning to do this without relying on libraries.

My experience is that it’s more hassle to create a so-called human readable format using a library (or not) than a simple custom binary format.

To OP: I’d say there are two types of file formats: interchange and custom. The former lets multiple people write programs that can handle the data they know (or simply care) about and ignore the rest. The latter is only used by one program and any associated tools. Why are you writing an interchange format?

It’s the first thing that came to my mind when I started planning out the file format. I didn’t know much about it, other than that it looks nice in documentation.

I actually tried to figure out how to load that sort of “interchange” file format with the following code. I haven’t tested it or done anything with it other than following along with my “logic”.


	// Needs imports: java.io.BufferedInputStream, java.io.DataInputStream, java.io.EOFException,
	// java.io.File, java.io.FileInputStream, java.io.FileNotFoundException, java.io.IOException,
	// java.util.ArrayList

	public static void load(Game game, String filename) {
		File load = new File(filename);
		if (!load.isFile()) {
			return;
		}
		ArrayList<byte[]> buffers = new ArrayList<byte[]>();
		// try-with-resources closes the stream even when reading fails.
		try (DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(load)))) {
			int chunkLoadedCount = 0;
			//TODO: Modify this number so that it can load even more chunks.
			while (chunkLoadedCount < 1) {
				//Read each "chunk" header and take action depending on the "chunk's" signature.
				int sig = input.readInt();
				int size = input.readInt();
				if (size < 0) {
					throw new IOException("Negative chunk size: " + size);
				}

				//TODO: Create a new signature tag. There won't be any "chunks" structures.
				if (sig != 0xA1A1A1A1) {
					//skipBytes() may skip fewer bytes than requested, so loop until the chunk is skipped.
					int remaining = size;
					while (remaining > 0) {
						int skipped = input.skipBytes(remaining);
						if (skipped <= 0) {
							throw new EOFException("Chunk is truncated.");
						}
						remaining -= skipped;
					}
				}
				else {
					byte[] buffer = new byte[size];
					//readFully() keeps reading until the buffer is full; read() may return early.
					input.readFully(buffer);
					buffers.add(buffer);
					chunkLoadedCount++;
				}
			}
		}
		catch (FileNotFoundException e) {
			throw new RuntimeException("Something with the file not being found.", e);
		}
		catch (EOFException e) {
			//Reached the end of the file before every wanted chunk was found; keep what was loaded.
		}
		catch (IOException e) {
			throw new RuntimeException("Something with the file not being read correctly.", e);
		}
		handleLoadedBuffers(game, buffers);
	}

  1. To Roquen, what would you suggest I do? I’d love to hear some recommendations on this.
  2. To everyone, would the code above work for the file format I had planned before Roquen’s post?

Screw the file format!

Replace the lot with an embedded HSQLDB engine using cached tables. One database per “world”. That way the bulk of your file format is managed by HSQLDB, and also the indexing and access.

Cas :slight_smile:

Is there any specific reason why you want to use DataInputStream over RandomAccessFile? RandomAccessFile has a seek() method, which moves the file pointer to any position you want.
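A minimal sketch of what that could look like, assuming each chunk is a 4-byte tag, a 4-byte payload size and then the payload. The file is scanned once to record where each payload starts, and seek() is then used to jump to chunks in any order. The class name, tags and values are made up for illustration, and a temporary file stands in for the save file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

class RandomAccessChunkSketch {
    /** Scans the file once and records the offset at which each chunk's payload starts. */
    static Map<Integer, Long> indexChunks(RandomAccessFile file) throws IOException {
        Map<Integer, Long> payloadOffsets = new LinkedHashMap<>();
        file.seek(0);
        // Keep going while a full 8-byte chunk header still fits in the file.
        while (file.getFilePointer() + 8 <= file.length()) {
            int tag = file.readInt();
            int size = file.readInt();
            if (size < 0) {
                throw new IOException("Negative chunk size: " + size);
            }
            payloadOffsets.put(tag, file.getFilePointer());
            file.seek(file.getFilePointer() + size); // jump over the payload
        }
        return payloadOffsets;
    }

    /** Writes two chunks to a temporary file, then reads them back out of order. */
    static int[] demo() throws IOException {
        File temp = File.createTempFile("chunks", ".bin");
        temp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(temp, "rw")) {
            file.writeInt(0x41414141); file.writeInt(4); file.writeInt(111);
            file.writeInt(0x42424242); file.writeInt(4); file.writeInt(222);

            Map<Integer, Long> index = indexChunks(file);
            file.seek(index.get(0x42424242)); // jump forward to the second chunk
            int second = file.readInt();
            file.seek(index.get(0x41414141)); // jump back to the first chunk
            int first = file.readInt();
            return new int[] {first, second};
        }
    }

    public static void main(String[] args) throws IOException {
        int[] values = demo();
        System.out.println(values[0] + " " + values[1]);
    }
}
```

RandomAccessFile has the same readInt()/readLong()/readFully() methods as DataInputStream, so existing chunk-reading code should need few changes. It does no buffering of its own, though, so many small reads may be slower than with a BufferedInputStream.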

Do I really have to embed an SQL database engine in my game? That feels like overkill.

Really? Oooh~~! I didn’t know that.

I’m a DB idiot so the comment of Cas might as well be written in <fill in the name of some dialect that hasn’t been spoken in a thousand years>.

One potential advantage of interchange-style formats can be evolution of the file format. So not version N vs. N+1, but you’re at N and working on what will be in N+1. I don’t find that to be the case myself, but they do have some merit here.

One possible style, simply adapting what you already have, would be something like this (written in-post, probably doesn’t even compile).


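  import java.io.BufferedInputStream;
  import java.io.DataInputStream;
  import java.io.File;
  import java.io.FileInputStream;

  public class SaveGame {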
    DataInputStream src;
    int version;
    Game game;
    
    // for sanity checking and debugging
    void checkTag(int expected, String error) throws Exception
    {
      int tag = src.readInt();
      
      if (tag == expected)
        return;
      
      throw new Exception("whatever" + error); // whatever exception type
    }
    
    public static final int HEADER = 0x00000000; // whatever
    
    public static void read(Game game, String filename) throws Exception
    {
      File load = new File(filename);
      SaveGame sg = new SaveGame();
      
      if (load.isFile()) {
        sg.src     = new DataInputStream(new BufferedInputStream(new FileInputStream(load)));
        sg.checkTag(HEADER, "whatever");
        sg.version = sg.src.readInt();
        sg.game    = game;
        Game.read(sg);
      }
    }
    
    // mock classes
    
    public static class Game {
      public static void read(SaveGame sg) throws Exception
      {
        Player.read(sg);
        Entity.read(sg);
        // whatever else in version 1
        
        if (sg.version > 1) {
         // read in all extra stuff in version 2, etc
        }
      }
    }
    
    public static class Player {
      private static final int TAG = 0x00000000; // whatever
      
      public static void read(SaveGame sg) throws Exception
      {
        sg.checkTag(TAG, "player");
        // read in all stuff in version 1
        
        if (sg.version > 1) {
          // read in all extra stuff in version 2, etc
        }
      }
    }
    
    
    public static class Entity {
      private static final int TAG = 0x00000000; // whatever
      
      public static void read(SaveGame sg) throws Exception
      {
        sg.checkTag(TAG, "entity");
        // read in all stuff in version 1
        
        if (sg.version > 1) {
          // read in all extra stuff in version 2, etc
        }
      }
    }
  }

I don’t want to push this since the OP explicitly wants a custom binary format, but I don’t get your point. What’s the hassle in serializing a self-contained data model to any format using a library like XStream? Sure, you need to define your data model, the types used, values and references with care, but how is that more of a hassle than defining a binary format?

The reasons to do it are:

  1. It works
  2. It does what you want
  3. It takes care of the binary format for you
  4. It provides an upgrade path for your metadata (DDL - alter table)
  5. It adds a bunch of handy extra stuff like ACID - much more agreeable than trashing your data after an exception in the middle of a write, no?
  6. The jar is small
  7. It’s easy to use

Cas :slight_smile:

@cylab: As always, stuff like this is use-case dependent. I very rarely have a one-to-one mapping of memory representation vs. storage. Simple examples: data elements which I want to be lossless, like floating point values, might require extra work on my part to be lossless in a “human-readable” format. For others I’m happy with (or even prefer) lossy storage.

@Roquen:

So, interchange formats are like formats that evolve over time? Seems applicable for an in-dev project, right?

Hm…

rubs moustache and chin heavily

Could consider it for next project. :smiley: