Compressing chunk of tile data?

theagentd · June 23, 2017, 5:34pm

I need to compress a 32x32 chunk of tile data. The data consists of 1 byte for the terrain type, and 1 short for a detail tile (second layer essentially). The terrain tiles are very continuous, while most detail tiles are 0 (null). What would a good compression method?

Here’s my current idea:

Reorder data into z-curve order or Hilbert curve order to improve spatial coherency
Run length encode data using a special repeat byte/short value (say Byte/Short.MAX_VALUE) followed by number of repetitions, like this:
baaaaab —> ba*5b
Compress data using a precomputed Hilbert code. The idea is to precompute several Hilbert trees from statistical data from a lot of worlds based on the two most common tiles in the chunk, so there’ll be one tree for each biome essentially. The chunk encodes a byte for which tree to use, then the encoded data. The RLE symbol is encoded in the tree, but the RLE count could be encoded using a different huffman code tree specifically for RLE counts.

My question is: Would this beat the DEFLATE compression I’m currently using? Is it even worth trying to implement this? I feel like DEFLATE should have problems compressing small chunks of data like this as it tries to learn about the data as it goes. Also, would it make sense to use delta encoding here?

KaiHH · June 23, 2017, 6:08pm

Make it a contest here!

Provide 4-5 example “byte arrays” of your tile chunk data and maybe a simple test fixture that calls a user-implemented compress(InputStream in, OutputStream out) and prints the compression ratio and then calls the other user-implemented decompress(InputStream is, OutputStream out) to validate with the initial data.

theagentd · June 23, 2017, 6:43pm

That’s actually a pretty good idea! I’m currently in Italy on vacation though, but I’ll post a proper challenge when I get back!

KaiHH · June 26, 2017, 8:16pm

I’ve thought about compression a bit more. Delta encoding together with a hand-crafted LZ77 variant (really not that hard to do) is probably a good option for you, under the following assumptions:

within a “scanline” or a Morton/Hilbert path there are long runs of constant data (-> run-length encoding would shine here)
if there is variance it changes with more or less constant frequency so the first order derivative would be more or less constant (-> delta encoding would shine here which in turn makes LZ77 perform well, too) (to handle positive as well as negative derivatives you can zigzag-encode the deltas to get all numbers positive again)
if there are not many “runs” of the same tile type (or first order derivative) there might be recurring patterns (which again would be great for LZ77)

These are also the assumptions being made by the PNG compression algorithm, which is basically delta encoding plus deflate (LZ77 + Huffman).

Now you can tune a custom LZ77 implementation to be more suited for run-length encoding as opposed to pattern matching by altering the coding scheme used for the <offset, length> pairs and literals, both in how many bits to allocate for each of that information and especially how you entropy-encode these bits.
If you expect there to be lots of runs of the same data (tile numbers or derivative) and not so much patterns found in some distances to each another, then first use the popular LZ77 scheme which uses lengths longer than the offset to indicate repetition and therefore enables runlength encoding, and then Huffman-encode the offset using for example 2 bits for the “offset = 1” case (i.e. directly following the current encoder position in the LZ77 lookahead buffer) and a 3-bit extra-bits code with 2-3 extra bits for the offset up to 13 bytes behind in the LZ77 window. Then probably a 1-bit extra-bit code with 2-3 extra bits for the length of the run.

For example, using this scheme:
`

Encoding of literal

‘0’ + <-- first 0 bit indicates “now comes a literal byte” (like LZSS does)

Encoding of offsets (with always a first 1 bit to indicate “now comes <offset, length> and not literal”)

1 = ‘10’ <-- we expect large single-byte runs so keep it the most common and shortest case
2-5 = ‘110’ + 2 extra bits
6-13 = ‘111’ + 3 extra bits <-- we don’t want to find recurring patterns that far behind

Encoding of lengths

3-6 = ‘0’ + 2 extra bits <-- here is where we can tweak things, too…
7-14 = ‘1’ + 3 extra bits <-- …if we expect longer runs to be more common, we just swap the intervals
`
This is to be efficient to represent moderately long runs of data very close to the current encoding position, but rather inefficient for patterns that are further behind in the LZ77 window (while hopefully still being more efficient than not caring about patterns at all and just implementing run-length coding).
Essentially, what we want here as well is to avoid having to encode a (canonical) Huffman tree in the data as well, since the 32x32 tiles data is rather very short in total.

Next, I would encode the tile type/number completely separately from the “second layer” which you mentioned. If the latter is more constant we can make use of that by first encoding the tile type/number of all 32x32 tiles which we expect to have a higher entropy than the second layer, and then encode the second layer of all 32x32 tiles with an encoding scheme that favors low-entropy long runs even more, probably by just using run-length coding.

But we won’t know what works unless there is some example data.

KaiHH · June 27, 2017, 11:54pm

I pushed a small and working implementation of an arithmetic coder, which you can use as a basis for a final coder.
Arithmetic coding is superior to Huffman when your probabilities are not powers of 1/2, which is almost always the case.
It uses 64-bit integers to represent symbols of any alphabet (so you can even use delta with zigzag encoding and encode those the [0-510] range), and you can plug a probability distribution model which directly reflects your <tile type, second layer, …> values to implement different probabilities for the two values. See the “RoundRobinModel” class.

https://github.com/httpdigest/compress

ddyer · June 28, 2017, 12:12am

If you just use a standard compression algorithm such as LZW, you’ll come very close
to that you can achieve by hand-crafting something of your own. Maybe even better.
And you’ll have more time to work on your game instead of reinventing the wheel.

Spasi · June 28, 2017, 7:16am

On the topic of compression, but not necessarily applicable to this use-case, I’ve been meaning to add a few different implementations to LWJGL. Specifically, I’m interested in compression algorithms that make unusual space/speed trade-offs. For example, Zstandard and LZFSE. Suggestions for other algorithms are welcome. I’m eagerly looking for an rANS-based implementation, like cbloom’s Kraken, that is open source.