new serialization classes

I’ve had a binary serialization project called Kryo for a while now. Kryo has the idea of a “serializer” which, for a specific class, knows how to go from an object to bytes and vice versa. Kryo also has a way to register serializers for a type. Using these two ideas, serializers can be written to serialize any object graph in various ways. E.g., FieldSerializer uses reflection to do serialization the same way you would if you were hand-writing code using DataInputStream/DataOutputStream.

Kryo originally started as a contribution to the JGN library and was later rewritten for KryoNet. Both are NIO networking libs, so Kryo uses ByteBuffer for reading and writing. This causes problems when trying to do serialization with streams. To read from a stream, the data has to be copied into a ByteBuffer, at which point you aren’t really streaming any more. To compound the problem, ByteBuffers do not grow, so you must allocate a large enough buffer beforehand. I would like to rewrite Kryo and remedy these problems (as well as a few other unrelated issues).

Most of the time you don’t need random access for serialization and a stream works just fine. Occasionally you need it, though, e.g. to write a length and then some data. If an object’s field is compressed with deflate, you might want to compress the field’s data, write out the number of compressed bytes, then write out the compressed bytes. One way to do this is to skip a few bytes and come back later to fill in the length. This isn’t possible with a stream, so some sort of buffering is required. Another way is to serialize and compress the field value into a separate buffer and then copy it over. A copy isn’t ideal, and neither is keeping many (potentially very large) scratch buffers.
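To make the second workaround concrete, here is a sketch using the standard java.util.zip.Deflater (the class name and method here are made up for illustration): the field’s bytes are deflated into a scratch buffer first, so the compressed length is known and can be written before the data, stream-friendly but at the cost of an extra buffer and copy.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;

public class LengthPrefixedDeflate {
	/** Deflates fieldBytes into a scratch buffer, then emits length + compressed bytes. */
	public static byte[] write (byte[] fieldBytes) {
		try {
			// Compress into a scratch buffer first, so the length is known up front.
			Deflater deflater = new Deflater();
			deflater.setInput(fieldBytes);
			deflater.finish();
			ByteArrayOutputStream scratch = new ByteArrayOutputStream();
			byte[] chunk = new byte[256];
			while (!deflater.finished()) {
				int n = deflater.deflate(chunk);
				scratch.write(chunk, 0, n);
			}
			deflater.end();
			byte[] compressed = scratch.toByteArray();

			// Now the length can be written before the data, no back-patching needed.
			ByteArrayOutputStream bytes = new ByteArrayOutputStream();
			DataOutputStream out = new DataOutputStream(bytes);
			out.writeInt(compressed.length);
			out.write(compressed);
			return bytes.toByteArray();
		} catch (IOException ex) {
			throw new RuntimeException(ex); // Cannot happen with in-memory streams.
		}
	}
}
```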

I came up with an idea for a hybrid stream/buffer monster. A class called WriteBuffer has methods similar to DataOutputStream and writes bytes to a byte array. When the byte array is filled, it is processed (e.g., written to an OutputStream) and then reused. You can set any number of “marks”, write some bytes, and later jump back to a mark. A mark prevents the byte array from being processed and reused; instead, if the byte array fills up while a mark is set, an additional byte array is allocated. Once the marks are released, the filled byte arrays are processed and reused. This ArrayList&lt;byte[]&gt; mechanism for growing the buffer without a copy is used in PyroNet and Netty; WriteBuffer just adds processing and marks to the idea.
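A minimal sketch of the retain-versus-flush mechanism described above (hypothetical names and API, not the actual Kryo 2 classes; “processing” here just appends to an in-memory sink, and jumping back to a mark is omitted for brevity):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;

public class MarkedWriteBuffer {
	private final ArrayList<byte[]> filled = new ArrayList<byte[]>(); // Retained while marked.
	private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // Stand-in for any sink.
	private byte[] current;
	private int position;
	private int marks; // Number of outstanding marks.

	public MarkedWriteBuffer (int bufferSize) {
		current = new byte[bufferSize];
	}

	public void writeByte (int b) {
		if (position == current.length) {
			if (marks > 0) {
				filled.add(current); // Keep the array, a mark still needs it.
				current = new byte[current.length];
			} else
				sink.write(current, 0, position); // No marks: process and reuse.
			position = 0;
		}
		current[position++] = (byte)b;
	}

	public void mark () {
		marks++;
	}

	public void release () {
		if (--marks == 0) { // Last mark gone: process the retained arrays.
			for (byte[] array : filled)
				sink.write(array, 0, array.length);
			filled.clear();
		}
	}

	public byte[] flush () { // Drain everything for inspection.
		sink.write(current, 0, position);
		position = 0;
		return sink.toByteArray();
	}
}
```

With no marks set, a single byte array is reused forever; with a mark set, full arrays pile up in the list until release(), so no bytes are lost and no copy is made to grow the buffer.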

WriteBuffer itself doesn’t actually process the byte arrays; it just keeps allocating more byte arrays as they are filled. OutputStreamBuffer extends WriteBuffer and writes the byte arrays to an OutputStream as they are filled. If no marks are ever set, a single byte array is continuously reused and OutputStreamBuffer acts like a BufferedOutputStream. NIOWriteBuffer extends WriteBuffer and writes the byte arrays to a ByteBuffer (a better name is welcome… ByteBufferBuffer? maybe rename WriteBuffer?).

Another class called ReadBuffer is similar to DataInputStream and reads from a byte array. When the byte array has been completely read, it is refilled and reused for more reading. As with WriteBuffer, marks can be set; while a mark is set, instead of the byte array being refilled and reused, a new byte array is allocatedated and filled. This allows you to jump back and read the same bytes again.
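For comparison, the single-mark case is what java.io.BufferedInputStream already provides via mark()/reset(); ReadBuffer would generalize this to any number of marks. A quick demonstration of the standard behavior (the demo method name is made up):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class MarkResetDemo {
	/** Reads a byte, reads past it, then resets to the mark and reads the same byte again. */
	public static int[] readTwice () {
		try {
			byte[] data = {10, 20, 30, 40};
			BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
			in.mark(16); // Remember this position; valid while at most 16 bytes are read.
			int first = in.read(); // 10
			in.read(); // 20
			in.reset(); // Jump back to the mark.
			int again = in.read(); // 10 again
			return new int[] {first, again};
		} catch (IOException ex) {
			throw new RuntimeException(ex); // Cannot happen with in-memory streams.
		}
	}
}
```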

ReadBuffer itself uses a single byte array as a data source. InputStreamBuffer extends ReadBuffer and fills the byte arrays from an InputStream. NIOReadBuffer extends ReadBuffer and fills the byte arrays from a ByteBuffer.

I’ve written some code for the above, you can check it out here:
http://code.google.com/p/kryo/source/browse/#svn%2Ftrunk%2Fsrc%2Fcom%2Fesotericsoftware%2Fkryo2
All the files in that package are the new stuff, files in other packages are the old Kryo stuff. No javadocs yet!

Do you guys have any feedback on this buffer/stream idea? Maybe there are alternate/better solutions? I want to make sure I’m headed in a decent direction before going off and building lots of stuff on top of this foundation.

The problem of writing ‘packets’ of undetermined length has long been solved: chunked encoding.

You basically prepend a ‘length’ integer to every write you do. A length of 0 means you have reached the end of the ‘packet’.

So instead of this:


mark()
write(byte[35])
write(byte[1235])
write(byte[783])
write(byte[234])
back().writeInt(2287).forward()

You’d do:


writeInt(35)
write(byte[35])
writeInt(1235)
write(byte[1235])
writeInt(783)
write(byte[783])
writeInt(234)
write(byte[234])
writeInt(0)

There is certainly some overhead, but it allows for unbuffered streaming (both reading and writing) of packets of unknown length.

Needless to say, you can easily nest this encoding.
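The scheme can be sketched with the standard Data streams (the class and method names here are made up for illustration): every chunk is prefixed with its length, a zero length terminates the packet, and neither side ever needs to buffer or back-patch a total length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Chunked {
	public static void writeChunk (DataOutputStream out, byte[] chunk) throws IOException {
		out.writeInt(chunk.length); // Length prefix comes before the data.
		out.write(chunk);
	}

	public static void endPacket (DataOutputStream out) throws IOException {
		out.writeInt(0); // Zero length marks the end of the packet.
	}

	public static byte[] readPacket (DataInputStream in) throws IOException {
		ByteArrayOutputStream packet = new ByteArrayOutputStream();
		int length;
		while ((length = in.readInt()) != 0) { // Read chunks until the zero terminator.
			byte[] chunk = new byte[length];
			in.readFully(chunk);
			packet.write(chunk, 0, chunk.length);
		}
		return packet.toByteArray();
	}

	/** Round-trip demo: writes two chunks plus the terminator, reads the packet back. */
	public static byte[] demo () {
		try {
			ByteArrayOutputStream bytes = new ByteArrayOutputStream();
			DataOutputStream out = new DataOutputStream(bytes);
			writeChunk(out, new byte[] {1, 2, 3});
			writeChunk(out, new byte[] {4, 5});
			endPacket(out);
			return readPacket(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
		} catch (IOException ex) {
			throw new RuntimeException(ex); // Cannot happen with in-memory streams.
		}
	}
}
```

Nesting falls out naturally: a chunk’s payload can itself be a chunked packet, read with the same loop.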

Thanks Riven! You’re right, I can use chunked encoding where I need to write a length first and use a stream everywhere else. Currently I have KryoByteArrayInputStream/KryoByteArrayOutputStream, which are similar to DataInputStream/DataOutputStream but have more serialization methods and operate on a byte array rather than wrapping another stream. I extend these to make KryoInputStream/KryoOutputStream and KryoByteBufferInputStream/KryoByteBufferOutputStream, which fill or empty the byte[] to another stream or ByteBuffer as needed. Buffering the data in the byte[] makes it fast.