GZip stream oddness

Howdy

I’ve been using the GZIP input and output streams and have noticed that there are occasionally some bytes remaining in the underlying stream after the desired data has been read. For example:


byte[] data = /*some bytes generated by my code*/;

System.out.println( data.length + " source bytes" );

ByteArrayOutputStream bo = new ByteArrayOutputStream();
GZIPOutputStream gzo = new GZIPOutputStream( bo );

gzo.write( data );

gzo.flush();
gzo.close();

byte[] zipped = bo.toByteArray();

ByteArrayInputStream bi = new ByteArrayInputStream( zipped );
GZIPInputStream gzi = new GZIPInputStream( bi );

byte[] unzipped = new byte[ data.length ];

int read = 0;
int count = 0;
do
{
	count += read;
	read = gzi.read( unzipped, count, unzipped.length - count );
}
while( read > 0 );

System.out.println( "unzipped equals source data? " + Arrays.equals( data, unzipped ) );
System.out.print( bi.available() + " bytes left in bi [ " );

while( bi.available() > 0 )
{
	System.out.print( ( byte ) bi.read() + ", " );
}
System.out.println( " ]" );

gives the output


50293 source bytes
unzipped equals source data? true
6 bytes left in bi [ 36, -76, 117, -60, 0, 0,  ]

This behaviour is triggered by the input data (it’s reproducible but rare) and I think should be regarded as a bug - intermittently leaving unexpected bytes in the stream is very naughty.

So, am I doing something wrong, or is this behaviour documented somewhere?

I bet the last call to read() in this case has a ‘len’ argument of 0, and read() never returns -1. Try changing the while condition: read() is allowed to return 0 if len is 0.

I don’t know the spec of the GZ format, but maybe it allows padding data?

It is odd behaviour, but you’re not reading the full stream, so anything may be left in there, even if it wouldn’t yield any more bytes.

Yep, that was it. So, the code to fully read the stream now looks like



int read = 0;
int count = 0;
do
{
	count += read;
	read = gzi.read( unzipped, count, unzipped.length - count );
}
while( count < unzipped.length );

// we've got our data, now we have to finish the gzip stream
while( gzi.available() != 0 )
{
	gzi.read();
}

Kind of ugly, and means you have to treat gzip streams differently from others. Is there a better way? Reading a byte at a time would work, but I imagine it could be inefficient on some stream types.
Enforcing that len always be 1 or more results in an index out of bounds, and read never gets to -1 in the first loop.
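One pattern that keeps the exact-size buffer while still reaching EOF is to make a single extra read() call once the buffer is full: that call returns -1 and lets the stream consume its trailer. A sketch, not from the thread — readFully is a made-up helper name:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ReadFullyDemo {
    // Hypothetical helper: fill buf exactly, then make one extra read()
    // so the stream hits EOF and the gzip trailer is consumed.
    public static void readFully(InputStream in, byte[] buf) throws IOException {
        int count = 0;
        while (count < buf.length) {
            int read = in.read(buf, count, buf.length - count);
            if (read == -1) {
                throw new IOException("stream ended after " + count + " bytes");
            }
            count += read;
        }
        if (in.read() != -1) {
            throw new IOException("more data than expected");
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[50293];
        new java.util.Random(0).nextBytes(data);

        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        GZIPOutputStream gzo = new GZIPOutputStream(bo);
        gzo.write(data);
        gzo.close();

        ByteArrayInputStream bi = new ByteArrayInputStream(bo.toByteArray());
        GZIPInputStream gzi = new GZIPInputStream(bi);
        byte[] unzipped = new byte[data.length];
        readFully(gzi, unzipped);

        System.out.println("unzipped equals source data? " + Arrays.equals(data, unzipped));
        System.out.println(bi.available() + " bytes left in bi");
    }
}
```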

Hm… I don’t see a problem…?

Just read your data into a ByteArrayOutputStream(expectedSize), until EOF.

Call baos.toByteArray().

done.


public static void transfer(InputStream in, OutputStream out) { ... }

GZIPInputStream in = ...;
ByteArrayOutputStream out = new ByteArrayOutputStream(expectedSize);
transfer(in, out);
byte[] ungz = out.toByteArray();
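A sketch of what that transfer() helper might look like — the name and signature come from the post above, the implementation is an assumption. Reading until read() returns -1 means the gzip trailer gets consumed as a side effect:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TransferDemo {
    // Copy everything from in to out until EOF.
    public static void transfer(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    // Compress and decompress via transfer(), no exact-size buffer needed.
    public static byte[] roundTrip(byte[] data) throws IOException {
        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        GZIPOutputStream gzo = new GZIPOutputStream(bo);
        gzo.write(data);
        gzo.close();

        GZIPInputStream gzi = new GZIPInputStream(new ByteArrayInputStream(bo.toByteArray()));
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        transfer(gzi, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[50293];
        new java.util.Random(42).nextBytes(data);
        System.out.println(Arrays.equals(data, roundTrip(data)));  // true
    }
}
```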

GZip: 10-byte header (magic + compression method + flags + timestamp), optional extra fields (original file name + whatever), deflated data, 8-byte footer (CRC-32 + uncompressed length).
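That layout is easy to verify against java.util.zip’s own output — a quick sketch that checks the magic bytes and decodes the little-endian footer:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

public class GzipLayout {
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        GZIPOutputStream gzo = new GZIPOutputStream(bo);
        gzo.write(data);
        gzo.close();
        return bo.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello gzip".getBytes(StandardCharsets.US_ASCII);
        byte[] gz = gzip(data);

        // 10-byte header: magic 0x1f 0x8b, method 8 (deflate), flags,
        // 4-byte mtime, extra flags, OS byte
        System.out.printf("magic: %02x %02x, method: %d%n", gz[0], gz[1], gz[2]);

        // 8-byte footer, little-endian: CRC-32 of the uncompressed data,
        // then the uncompressed length modulo 2^32
        ByteBuffer footer = ByteBuffer.wrap(gz, gz.length - 8, 8).order(ByteOrder.LITTLE_ENDIAN);
        long crc = footer.getInt() & 0xffffffffL;
        long isize = footer.getInt() & 0xffffffffL;

        CRC32 check = new CRC32();
        check.update(data);
        System.out.println("crc matches? " + (crc == check.getValue()));  // true
        System.out.println("isize matches? " + (isize == data.length));   // true
    }
}
```

The six leftover bytes in the original post are simply the tail of this footer that the reader never consumed.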

You could try InflaterInputStream and DeflaterOutputStream.
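Those produce and consume zlib-wrapped deflate data without the gzip header and footer. A minimal round trip, as a sketch of what that would look like:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class DeflateDemo {
    public static byte[] roundTrip(byte[] data) throws IOException {
        // compress: zlib-wrapped deflate, no gzip header/footer
        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        DeflaterOutputStream def = new DeflaterOutputStream(bo);
        def.write(data);
        def.close();

        // decompress, reading until EOF
        InflaterInputStream inf = new InflaterInputStream(new ByteArrayInputStream(bo.toByteArray()));
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        byte[] buf = new byte[4096];
        int n;
        while ((n = inf.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[50293];
        new java.util.Random(1).nextBytes(data);
        System.out.println(Arrays.equals(data, roundTrip(data)));  // true
    }
}
```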

I reckon the root problem is the difference between the semantics of the InputStream and InflaterInputStream available() methods.
InputStream.available() returns an estimate of the number of bytes that can be read without blocking, while InflaterInputStream.available() returns 1 if the end of the stream has not yet been reached, even if there are no more bytes to be had.
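That 1-until-EOF, 0-after behaviour is what the InflaterInputStream javadoc documents, and it’s easy to see in a small sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class AvailableDemo {
    // Returns { gzi.available() before EOF, gzi.available() after EOF }.
    public static int[] availabilities() throws IOException {
        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        GZIPOutputStream gzo = new GZIPOutputStream(bo);
        gzo.write(new byte[]{1, 2, 3, 4, 5});
        gzo.close();

        GZIPInputStream gzi = new GZIPInputStream(new ByteArrayInputStream(bo.toByteArray()));
        int before = gzi.available();            // 1: EOF not yet seen
        while (gzi.read() != -1) { /* drain */ }
        int after = gzi.available();             // 0: EOF reached
        return new int[]{before, after};
    }

    public static void main(String[] args) throws IOException {
        int[] a = availabilities();
        System.out.println("before EOF: " + a[0] + ", after EOF: " + a[1]);
    }
}
```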

When I ran into this I was deserialising an object graph that includes Other People’s Code (read: probably broken), so I thought I’d do a check with available() and see if there were bytes left over after deserialisation was apparently complete - which would indicate that somewhere bytes were being written but not read. This worked fine with normal streams, but would throw occasional errors with GZIP streams thanks to this difference.
At any rate, it all works now and serialisation sanity is being more thoroughly checked elsewhere.

Cheers all!