Converting floats/doubles to 10/11/16/N bit floats

GPUs often use floats smaller than 32 bits to avoid spending a full 4 bytes per color channel. There are a number of common formats on GPUs, with 16-bit floats being the most common, but 10-bit and 11-bit formats are fairly common too. See this page for more info: https://www.opengl.org/wiki/Small_Float_Formats
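
To get a feel for what these formats can represent, here is a small sketch that computes the range of an IEEE-style format from its exponent and mantissa bit counts. The class and method names are made up for illustration and are not part of the code further down; it assumes the usual bias of 2^(exponentBits-1) - 1.

```java
public class SmallFloatRanges {

    // Exponent bias of an IEEE-style format with the given number of exponent bits.
    public static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    // Largest finite value: all mantissa bits set, largest non-special exponent.
    public static double maxFinite(int exponentBits, int mantissaBits) {
        double significand = 2.0 - Math.scalb(1.0, -mantissaBits);
        return Math.scalb(significand, bias(exponentBits));
    }

    // Smallest positive normal value: exponent field 1, mantissa 0.
    public static double minNormal(int exponentBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits));
    }

    // Smallest positive denormal value: exponent field 0, mantissa 1.
    public static double minDenormal(int exponentBits, int mantissaBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits) - mantissaBits);
    }

    public static void main(String[] args) {
        System.out.println("16-bit max: " + maxFinite(5, 10)); // 65504.0
        System.out.println("11-bit max: " + maxFinite(5, 6));  // 65024.0
        System.out.println("10-bit max: " + maxFinite(5, 5));  // 64512.0
        System.out.println("16-bit smallest denormal: " + minDenormal(5, 10));
    }
}
```

All three GPU formats share the same 5-bit exponent, so they cover nearly the same range and differ mainly in precision.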

There’s no native support for <32-bit floats in Java, but it can be really useful to be able to work with smaller float values. Here are some use case examples:

  • You can save a lot of space by storing vertex attributes as 16-bit floats, especially normals and other attributes that don’t need full 32-bit precision.
  • You can create 16-bit float texture data, or even R11F_G11F_B10F texture data, offline and save it to a file without needing an OpenGL context.
  • You can avoid some wasted memory bandwidth by reading back a 16-bit float texture in its native format and doing the unpacking on the CPU, although the driver may be faster at converting to 32-bit than my code…
  • You can generally save memory when writing binary float data to files, since you can choose exactly how many bits to use for the exponent and the mantissa, and whether you need a sign bit at all.
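
As an example of the R11F_G11F_B10F case, here is a rough sketch of packing three already-converted values into a single 32-bit texel. It assumes the GL_UNSIGNED_INT_10F_11F_11F_REV layout (red in the lowest 11 bits, then green, then blue in the top 10 bits); the class and method names are hypothetical, and the inputs would come from something like the floatToF11/floatToF10 methods in the code at the bottom.

```java
public class PackedFloatTexel {

    // Packs an 11-bit red, 11-bit green and 10-bit blue float into one int.
    // Assumed layout: red in bits 0-10, green in bits 11-21, blue in bits 22-31.
    public static int pack(int r11, int g11, int b10) {
        return (r11 & 0x7FF) | ((g11 & 0x7FF) << 11) | ((b10 & 0x3FF) << 22);
    }

    // Extract the individual channels again (still as raw small-float bits).
    public static int red(int texel)   { return texel & 0x7FF; }
    public static int green(int texel) { return (texel >>> 11) & 0x7FF; }
    public static int blue(int texel)  { return (texel >>> 22) & 0x3FF; }

    public static void main(String[] args) {
        // 1.0 has exponent field 15 and mantissa 0 in both formats:
        // 15 << 6 = 0x3C0 with 6 mantissa bits, 15 << 5 = 0x1E0 with 5 mantissa bits.
        int white = pack(0x3C0, 0x3C0, 0x1E0);
        System.out.printf("0x%08X%n", white); // prints 0x781E03C0
    }
}
```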

Storytime, the code is at the bottom =P
I first wrote a function to convert a 32-bit float to a 16-bit float and back again using the Wikipedia specification, but then I realized that there are other float formats out there, so I decided to rework it a bit. I instead made two generic converter functions: one takes a double value and converts it to a format with a given number of exponent and mantissa bits, with the sign being optional, and the other converts such a value back to a double. This also allowed me to test the system by using my functions to convert from 64-bit floats to 32-bit floats and comparing that to a simple cast. So now I have a generic pair of functions that can handle any total number of bits <= 32, with a varying mantissa and exponent size for whatever needs you have.
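
For reference, the format that those parameters describe can be decoded straight from its definition. The sketch below is not the converter from this post, just a slow, easy-to-check decoder (with made-up names) that may be handy for cross-checking results. It assumes the IEEE-style layout: sign bit on top if present, then the exponent, then the mantissa.

```java
public class SmallFloatDecode {

    // Decodes an IEEE-style small float directly from the definition of the format.
    public static double decode(long bits, boolean hasSign, int exponentBits, int mantissaBits) {
        int bias = (1 << (exponentBits - 1)) - 1;
        long maxExponent = (1L << exponentBits) - 1;

        long m = bits & ((1L << mantissaBits) - 1);
        long e = (bits >>> mantissaBits) & maxExponent;
        boolean negative = hasSign && ((bits >>> (exponentBits + mantissaBits)) & 1) != 0;

        double magnitude;
        if (e == maxExponent) {
            // All exponent bits set: infinity if the mantissa is zero, NaN otherwise.
            magnitude = (m == 0) ? Double.POSITIVE_INFINITY : Double.NaN;
        } else if (e == 0) {
            // Denormal (or zero): no implicit leading 1, exponent fixed at 1 - bias.
            magnitude = Math.scalb((double) m, 1 - bias - mantissaBits);
        } else {
            // Normal: implicit leading 1 in front of the mantissa bits.
            long significand = (1L << mantissaBits) | m;
            magnitude = Math.scalb((double) significand, (int) e - bias - mantissaBits);
        }
        return negative ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x3C00, true, 5, 10)); // 1.0
        System.out.println(decode(0xC000, true, 5, 10)); // -2.0
        System.out.println(decode(0x7BFF, true, 5, 10)); // 65504.0
        System.out.println(decode(0x3C0, false, 5, 6));  // 1.0 as an unsigned 11-bit float
    }
}
```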

Features

  • Denormals handled correctly for all bit counts.
  • Infinity/NaN preserved.
  • Clamps negative values to zero if the output value has no sign.
  • Values too big for the small format are rounded to infinity.
  • Values too small for the small format are rounded to 0.
  • Positive/negative zeroes preserved.
  • No dependencies.
  • Static functions for everything.
  • Shortcut methods for halves, 11-bit and 10-bit floats.
  • Good performance (~50-100 million conversions per second).

Accuracy test
From my tests, converting doubles to 32-bit floats (and back again) using my conversion functions gives results 100% identical to a simple double-to-float cast in Java (and back again). The test consisted of converting 18 253 611 008 random 64-bit patterns, interpreted as doubles, to floats and back again, and every result matched the cast. This should mean that the conversion is just as accurate for 16-bit values, but that is harder to test.
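
A test along these lines could be structured roughly as follows. This is only a sketch of the idea, not the actual test that was run: the class name, the `countMismatches` helper and the fixed seed are assumptions, and the encoder is passed in as a function so the block stands on its own.

```java
import java.util.SplittableRandom;
import java.util.function.DoubleToLongFunction;

public class CastComparison {

    // Counts how many random 64-bit patterns, read as doubles, the given
    // double -> 32-bit float encoder converts differently from a plain cast.
    public static long countMismatches(DoubleToLongFunction encoder, long samples, long seed) {
        SplittableRandom random = new SplittableRandom(seed);
        long mismatches = 0;
        for (long i = 0; i < samples; i++) {
            double d = Double.longBitsToDouble(random.nextLong());
            int expected = Float.floatToRawIntBits((float) d);
            int actual = (int) encoder.applyAsLong(d);
            if (Double.isNaN(d)) {
                // NaN payloads may legitimately differ; only require that the result is a NaN.
                if (!Float.isNaN(Float.intBitsToFloat(actual))) {
                    mismatches++;
                }
            } else if (expected != actual) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Stand-in encoder: the cast itself, so this run reports 0 mismatches.
        DoubleToLongFunction viaCast = d -> Float.floatToRawIntBits((float) d);
        System.out.println(countMismatches(viaCast, 1_000_000, 42L));
    }
}
```

To test the converter from this post, the stand-in would be replaced by `d -> FloatConversion.doubleToSmallFloat(d, true, 8, 23)`. Note that random bit patterns almost never land exactly halfway between two floats, so it is probably worth adding a few hand-picked tie cases as well.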

Comments and suggestions are welcome.


public class FloatConversion {

	private static final int DOUBLE_EXPONENT_BITS = 11;
	private static final long DOUBLE_EXPONENT_MASK = (1L << DOUBLE_EXPONENT_BITS) - 1;
	private static final long DOUBLE_EXPONENT_BIAS = 1023;
	
	private static final long DOUBLE_MANTISSA_MASK = (1L << 52) - 1;
	
	public static long doubleToSmallFloat(double d, boolean hasSign, int exponentBits, int mantissaBits){
		
		long bits = Double.doubleToRawLongBits(d);
		
		long s = -(bits >>> 63);
		long e = ((bits >>> 52) & DOUBLE_EXPONENT_MASK) - DOUBLE_EXPONENT_BIAS;
		long m = bits & DOUBLE_MANTISSA_MASK;
		int exponentBias = (1 << (exponentBits-1)) - 1;
		
		if(!hasSign && d < 0){
			//Clamp negative numbers (including -infinity) to 0 when we don't have an output sign.
			//NaN never enters this branch since (NaN < 0) is false. It falls through to the
			//NaN/infinity case below and is stored without a sign. -0.0 also falls through
			//and ends up as 0.
			return 0;
		}
		
		long sign = s;
		long exponent = 0;
		long mantissa = 0;
		
		if(e <= -exponentBias){

			double abs = Double.longBitsToDouble(bits & 0x7FFFFFFFFFFFFFFFL);
			
			//Value is too small for a normal number, so calculate the nearest denormal value.
			exponent = 0;
			
			int denormalExponent = exponentBias + mantissaBits - 1;
			double multiplier = Double.longBitsToDouble((denormalExponent + DOUBLE_EXPONENT_BIAS) << 52);
			
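			//Math.rint rounds to nearest, with ties going to the even value. If this rounds up to
			//2^mantissaBits, the final addition carries into the exponent, giving the smallest normal value.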
			mantissa = (long)Math.rint(abs * multiplier);
			
		}else if(e <= exponentBias){
			
			//A value in the normal range of this format. We can convert the exponent and mantissa 
			//directly by changing the exponent bias and dropping the extra mantissa bits (with correct
			//rounding to minimize the error).
			
			exponent = e + exponentBias;
			
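			int shift = 52 - mantissaBits;
			long mantissaBase = m >> shift;
			//Round to nearest, ties to even, which is the rule a double->float cast uses.
			//(Just adding bit (shift-1) would round exact ties upwards instead.)
			long rounding = 0;
			if(shift > 0){
				long half = 1L << (shift-1);
				long remainder = m & ((1L << shift) - 1);
				if(remainder > half || (remainder == half && (mantissaBase & 1) != 0)){
					rounding = 1;
				}
			}
			mantissa = mantissaBase + rounding;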

			//If rounding overflows the mantissa (to 1024 for a half float), the exponent has to be
			//incremented. For the largest exponent this gives infinity (exponent 31, mantissa 0 for
			//a half float). Through a stroke of luck, the code below is not actually needed, because
			//the result is assembled with additions and the mantissa overflow carries into the
			//exponent bits, but it's here for clarity.
			//exponent += mantissa >> mantissaBits;
			//mantissa &= (1L << mantissaBits) - 1;
			
		}else{
			
			//We have 3 cases here:
			// 1. e == 1024 and mantissa != 0 ---> NaN
			// 2. e == 1024 and mantissa == 0 ---> Infinity
			// 3. value is too big for the small float ---> Infinity
			//So, if the value isn't NaN we want infinity.
			exponent = (1 << exponentBits) - 1;
			if(e == 1024 && m != 0){
				mantissa = 1; //NaN
			}else{
				mantissa = 0; //infinity
			}
		}
		
		if(hasSign){
			return (sign << (mantissaBits + exponentBits)) + (exponent << mantissaBits) + mantissa;
		}else{
			return (exponent << mantissaBits) + mantissa;
		}
		
	}
	
	public static double smallFloatToDouble(long f, boolean hasSign, int exponentBits, int mantissaBits){

		int exponentBias = (1 << (exponentBits-1)) - 1;

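		//Mask the sign down to a single bit, so the input may or may not be sign-extended
		//(a negative short passed in as a long has all of its upper bits set).
		long s = hasSign ? (f >>> (exponentBits + mantissaBits)) & 1 : 0;
		long e = ((f >>> mantissaBits) & ((1L << exponentBits) - 1)) - exponentBias;
		long m = f & ((1L << mantissaBits) - 1);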

		long sign = s;
		long exponent = 0;
		long mantissa = 0;

		if(e <= -exponentBias){
			
			//We have a float denormal value. Cheat a bit with the calculation...

			int denormalExponent = exponentBias + mantissaBits - 1;
			double multiplier = Double.longBitsToDouble((DOUBLE_EXPONENT_BIAS - denormalExponent) << 52);
			
			return (1 - (sign << 1)) * (m * multiplier);

		}else if(e <= exponentBias){
			
			//We have a normal value that can be directly converted by just changing the exponent
			//bias and shifting the mantissa.
			
			exponent = e + DOUBLE_EXPONENT_BIAS;
			int shift = 52 - mantissaBits;
			mantissa = m << shift;
		}else{
			
			//We either have infinity or NaN, depending on if the mantissa is zero or non-zero.
			exponent = 2047;
			if(m == 0){
				mantissa = 0; //infinity
			}else{
				mantissa = 1; //NaN
			}
		}
		
		return Double.longBitsToDouble(((sign << 63) | (exponent << 52) | mantissa));
	}
	
	//Half floats
	
	public static short floatToHalf(float f){
		return (short) doubleToSmallFloat(f, true, 5, 10);
	}
	
	public static float halfToFloat(short h){
		return (float) smallFloatToDouble(h, true, 5, 10);
	}
	
	public static short doubleToHalf(double d){
		return (short) doubleToSmallFloat(d, true, 5, 10);
	}
	
	public static double halfToDouble(short h){
		return smallFloatToDouble(h, true, 5, 10);
	}
	
	
	//OpenGL 11-bit floats
	
	public static short floatToF11(float f){
		return (short) doubleToSmallFloat(f, false, 5, 6);
	}
	
	public static float f11ToFloat(short f){
		return (float) smallFloatToDouble(f, false, 5, 6);
	}
	
	public static short doubleToF11(double f){
		return (short) doubleToSmallFloat(f, false, 5, 6);
	}
	
	public static double f11ToDouble(short f){
		return smallFloatToDouble(f, false, 5, 6);
	}
	
	
	//OpenGL 10-bit floats.
	
	public static short floatToF10(float f){
		return (short) doubleToSmallFloat(f, false, 5, 5);
	}
	
	public static float f10ToFloat(short f){
		return (float) smallFloatToDouble(f, false, 5, 5);
	}
	
	public static short doubleToF10(double f){
		return (short) doubleToSmallFloat(f, false, 5, 5);
	}
	
	public static double f10ToDouble(short f){
		return smallFloatToDouble(f, false, 5, 5);
	}
}

Thanks for sharing :)

Just look at all that bollocks which will become mercifully obsolete in just a couple of years’ time :)

It’s great code, but just a reminder to me of just how pointlessly annoying programming around hardware limitations can be.

Cas :)

Just a quick question: did you learn all of this at university? I haven’t gotten to university yet, so I wouldn’t know. Where did you learn all of this? Lol I’m getting desperate :P

princec: memory footprint.

We did have a lecture or two on how floating point numbers work at uni, but I just looked up the specifications of the different formats on Wikipedia.

Yeah, the point here is to halve the bandwidth and memory usage.

Indeed that would be the point… my take on it is just to go “meh” and wait for the hardware to catch up so we don’t have to worry about pages and pages of this sort of thing in our codebases, which, when you get right down to it, are just horrible incomprehensible hacks to work around hardware deficiencies. It’s nice and all but I do look forward to a time when none of this is necessary.

Cas :)

@Cas: the hardware will never catch up, because the hardware will never be fast enough.

Even in hardware 20 years from now, when we have realtime photon-mapping, memory bandwidth will be a bottleneck.
Any way or form to halve your data size is bound to yield performance gains.

The only out here is some unforeseen engineering miracle.

This paper has an old graph of the gap: http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_speed.pdf

Related: https://fgiesen.wordpress.com/2016/08/07/why-do-cpus-have-multiple-cache-levels/