GPUs often use floats smaller than 32 bits to avoid spending a full 4 bytes per color channel. There are a number of such formats: 16-bit floats are the most common, but 10-bit and 11-bit formats are fairly common too. See this page for more info: https://www.opengl.org/wiki/Small_Float_Formats

There’s no native support for <32-bit floats in Java, but it can be really useful to be able to work with smaller float values. Here are some use case examples:

- You can store vertex attributes as 16-bit floats to save a lot of space, especially normals and other attributes that don’t need full 32-bit precision.
- You can create 16-bit float texture data, or even R11F_G11F_B10F texture data, offline and save it to a file without needing an OpenGL context.
- You can avoid some wasted memory bandwidth by reading back a 16-bit float texture in its native format and doing the unpacking on the CPU, although the driver may be faster at converting to 32-bit than my code…
- You can generally save memory when writing binary float data to files, as you can choose exactly how many bits to use for the exponent and the mantissa, and even whether you need a sign bit at all.
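As a rough illustration of the R11F_G11F_B10F case, here is a minimal sketch of packing three already-converted fields into one 32-bit word. The class and method names are made up for this example, and the layout (red in the low 11 bits, green in the next 11, blue in the top 10) is my understanding of the packed format, so check it against the OpenGL spec before relying on it.

```java
// Hypothetical helper, not part of the class posted below.
final class PackedFloatLayout {

    // Packs an 11-bit red field, an 11-bit green field and a 10-bit blue field.
    // Assumed layout: bits 0-10 = red, bits 11-21 = green, bits 22-31 = blue.
    static int pack(int r11, int g11, int b10) {
        return ((b10 & 0x3FF) << 22) | ((g11 & 0x7FF) << 11) | (r11 & 0x7FF);
    }

    static int red(int packed) {
        return packed & 0x7FF;
    }

    static int green(int packed) {
        return (packed >>> 11) & 0x7FF;
    }

    // Unsigned shift, since the blue field occupies the top bit of the int.
    static int blue(int packed) {
        return (packed >>> 22) & 0x3FF;
    }
}
```

The field values would come from something like the 11-bit and 10-bit shortcut methods. For example, 1.0 is exponent field 15 with a zero mantissa, which is 0x3C0 with 6 mantissa bits and 0x1E0 with 5 mantissa bits.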

Storytime, the code is at the bottom =P

I first wrote functions to convert a 32-bit float to a 16-bit float and back again using the Wikipedia specification, but then I realized that there are other float formats out there, so I decided to rework it a bit. I instead made two generic converter functions: one takes in a double value and converts it to a format with a given number of exponent and mantissa bits, with the sign bit being optional, and the other converts back. This also allowed me to test the system by using my functions to convert from 64-bit floats to 32-bit floats and comparing that to a simple cast. So now I have generic functions that can handle any total bit count <=32, with a varying size mantissa and exponent for whatever needs you have.
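To get a feel for what a given choice of exponent and mantissa bits buys you, here is a small sketch that computes the value range of a format from its bit counts. It assumes an IEEE-style layout (bias of 2^(exponentBits-1) - 1, top exponent value reserved for infinity/NaN, denormals at exponent 0); the class and method names are just for illustration.

```java
// Hypothetical helper, not part of the class posted below.
final class SmallFloatRanges {

    // Exponent bias of an IEEE-style format: 2^(exponentBits - 1) - 1.
    static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    // Largest finite value: (2 - 2^-mantissaBits) * 2^bias.
    static double maxFinite(int exponentBits, int mantissaBits) {
        return (2.0 - Math.scalb(1.0, -mantissaBits)) * Math.scalb(1.0, bias(exponentBits));
    }

    // Smallest positive normal value: 2^(1 - bias).
    static double minNormal(int exponentBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits));
    }

    // Smallest positive denormal value: 2^(1 - bias - mantissaBits).
    static double minDenormal(int exponentBits, int mantissaBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits) - mantissaBits);
    }
}
```

With 5 exponent bits this gives a maximum of 65504 for halfs (10 mantissa bits), 65024 for 11-bit floats (6 mantissa bits) and 64512 for 10-bit floats (5 mantissa bits). Plugging in 8 and 23 reproduces `Float.MAX_VALUE`, `Float.MIN_NORMAL` and `Float.MIN_VALUE`.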

Features

- Denormals handled correctly for all bit counts.
- Infinity/NaN preserved.
- Clamps negative values to zero if the output value has no sign.
- Values too big for the small format are rounded to infinity.
- Values too small for the small format are rounded to 0.
- Positive/negative zeroes preserved.
- No dependencies.
- Static functions for everything.
- Shortcut methods for halves, 11-bit and 10-bit floats.
- Good performance (~50-100 million conversions per second).
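The denormal handling is the least obvious item in the list above, so here is a minimal sketch of the idea, specialized to halfs. A half denormal is mantissa * 2^-24, so scaling the absolute value by 2^24 and rounding gives the mantissa directly. The names are hypothetical, and the sketch assumes the input is already known to be below the smallest normal half (2^-14).

```java
// Hypothetical helper, not part of the class posted below.
final class HalfDenormalSketch {

    // 2^(bias + mantissaBits - 1) = 2^(15 + 10 - 1) = 2^24 for halfs.
    static final double SCALE = Math.scalb(1.0, 24);

    // Returns the half bit pattern for a non-negative value below 2^-14.
    // Math.rint rounds to the nearest integer, with ties going to the even one.
    static int denormalBits(double abs) {
        return (int) Math.rint(abs * SCALE);
    }
}
```

Values below half of the smallest denormal come out as 0, and a value that rounds up to 1024 (0x400) lands exactly on the bit pattern of the smallest normal half, because the mantissa carry spills into the exponent field.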

Accuracy test

From my tests, converting doubles to 32-bit floats using my conversion function (and back again) gives results 100% identical to a simple double->float cast in Java (and back again). The test consisted of converting 18 253 611 008 random double bit patterns to floats and back again. This should mean that the conversion is accurate for 16-bit values as well, but this is harder to test.
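A test like this could be structured roughly as follows. This is only a sketch, not the harness I actually ran: the class name, the method names and the `DoubleToIntFunction` parameter are made up for the example. The candidate would be something like `d -> (int) doubleToSmallFloat(d, true, 8, 23)`. Note that random bit patterns will almost never hit an exact rounding tie (on the order of 1 in 2^29 samples or less), so it makes sense to add a few explicit tie cases as well.

```java
import java.util.Random;
import java.util.function.DoubleToIntFunction;

// Hypothetical test harness, not part of the class posted below.
final class CastComparison {

    // Values exactly halfway between two neighbouring floats near 1.0.
    static final double[] TIE_CASES = {
        1.0 + Math.scalb(1.0, -24), // should round down to 1.0f (even mantissa)
        1.0 + Math.scalb(3.0, -24), // should round up to 1.0f + 2^-22 (even mantissa)
    };

    // Reference result: Java's own narrowing cast (round to nearest, ties to even).
    static int castBits(double d) {
        return Float.floatToIntBits((float) d);
    }

    // Two bit patterns match if they are identical or both encode NaN.
    static boolean sameFloat(int a, int b) {
        return a == b
            || (Float.isNaN(Float.intBitsToFloat(a)) && Float.isNaN(Float.intBitsToFloat(b)));
    }

    // Counts how often the candidate disagrees with the cast on random bit patterns.
    static long countMismatches(DoubleToIntFunction candidate, long samples, long seed) {
        Random random = new Random(seed);
        long mismatches = 0;
        for (long i = 0; i < samples; i++) {
            double d = Double.longBitsToDouble(random.nextLong());
            if (!sameFloat(candidate.applyAsInt(d), castBits(d))) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // Counts how often the candidate disagrees with the cast on the explicit tie cases.
    static long countTieMismatches(DoubleToIntFunction candidate) {
        long mismatches = 0;
        for (double d : TIE_CASES) {
            if (!sameFloat(candidate.applyAsInt(d), castBits(d))) {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```

The same structure works for smaller formats if you have a trusted reference converter to compare against.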

Comments and suggestions are welcome.

```
public class FloatConversion {

    private static final int DOUBLE_EXPONENT_BITS = 11;
    private static final long DOUBLE_EXPONENT_MASK = (1L << DOUBLE_EXPONENT_BITS) - 1;
    private static final long DOUBLE_EXPONENT_BIAS = 1023;
    private static final long DOUBLE_MANTISSA_MASK = (1L << 52) - 1;

    //Converts a double to a small float with the given layout. The result is stored in the
    //lowest (sign + exponentBits + mantissaBits) bits of the returned long, all other bits are 0.
    public static long doubleToSmallFloat(double d, boolean hasSign, int exponentBits, int mantissaBits){
        long bits = Double.doubleToRawLongBits(d);
        long sign = bits >>> 63; //0 or 1
        long e = ((bits >>> 52) & DOUBLE_EXPONENT_MASK) - DOUBLE_EXPONENT_BIAS;
        long m = bits & DOUBLE_MANTISSA_MASK;
        int exponentBias = (1 << (exponentBits-1)) - 1;

        if(!hasSign && d < 0){
            //Negative values (including -Infinity) are clamped to 0 when we don't have an output sign.
            //NaN never ends up here since (NaN < 0) is false; it is handled by the last case below.
            return 0;
        }

        long exponent = 0;
        long mantissa = 0;
        if(e <= -exponentBias){
            //Value is too small for a normal value, calculate the closest denormal value.
            double abs = Double.longBitsToDouble(bits & 0x7FFFFFFFFFFFFFFFL);
            exponent = 0;
            int denormalExponent = exponentBias + mantissaBits - 1;
            double multiplier = Double.longBitsToDouble((denormalExponent + DOUBLE_EXPONENT_BIAS) << 52);
            //Round to nearest, ties to even
            mantissa = (long)Math.rint(abs * multiplier);
        }else if(e <= exponentBias){
            //A value in the normal range of this format. We can convert the exponent and mantissa
            //directly by changing the exponent bias and dropping the extra mantissa bits (with correct
            //rounding to minimize the error).
            exponent = e + exponentBias;
            int shift = 52 - mantissaBits;
            mantissa = m >>> shift;
            //Round to nearest, ties to even, to match what a double-->float cast does.
            long dropped = m & ((1L << shift) - 1);
            long half = (1L << shift) >>> 1;
            if(dropped > half || (dropped == half && half != 0 && (mantissa & 1) != 0)){
                mantissa++;
            }
            //If rounding overflows the mantissa, we want to bump the exponent by one (possibly all
            //the way up to infinity). No extra code is needed for that: the fields are combined
            //with + below, so the overflowing mantissa bit carries into the exponent bits.
        }else{
            //We have 3 cases here:
            // 1. exponent = 1024 and mantissa != 0 ---> NaN
            // 2. exponent = 1024 and mantissa == 0 ---> Infinity
            // 3. value is too big for a small float ---> Infinity
            //So, if the value isn't NaN we want infinity.
            exponent = (1L << exponentBits) - 1;
            if(e == 1024 && m != 0){
                mantissa = 1; //NaN
            }else{
                mantissa = 0; //infinity
            }
        }

        if(hasSign){
            return (sign << (mantissaBits + exponentBits)) + (exponent << mantissaBits) + mantissa;
        }else{
            return (exponent << mantissaBits) + mantissa;
        }
    }

    //Converts a small float with the given layout back to a double. Bits above the format's
    //highest bit are ignored, so sign-extended inputs (e.g. a negative short) work too.
    public static double smallFloatToDouble(long f, boolean hasSign, int exponentBits, int mantissaBits){
        int exponentBias = (1 << (exponentBits-1)) - 1;
        long sign = hasSign ? (f >>> (exponentBits + mantissaBits)) & 1 : 0; //0 or 1
        long e = ((f >>> mantissaBits) & ((1L << exponentBits) - 1)) - exponentBias;
        long m = f & ((1L << mantissaBits) - 1);

        long exponent = 0;
        long mantissa = 0;
        if(e <= -exponentBias){
            //We have a denormal value (or zero). Cheat a bit with the calculation...
            int denormalExponent = exponentBias + mantissaBits - 1;
            double multiplier = Double.longBitsToDouble((DOUBLE_EXPONENT_BIAS - denormalExponent) << 52);
            return (1 - (sign << 1)) * (m * multiplier);
        }else if(e <= exponentBias){
            //We have a normal value that can be directly converted by just changing the exponent
            //bias and shifting the mantissa.
            exponent = e + DOUBLE_EXPONENT_BIAS;
            int shift = 52 - mantissaBits;
            mantissa = m << shift;
        }else{
            //We either have infinity or NaN, depending on if the mantissa is zero or non-zero.
            exponent = 2047;
            if(m == 0){
                mantissa = 0; //infinity
            }else{
                mantissa = 1; //NaN
            }
        }
        return Double.longBitsToDouble((sign << 63) | (exponent << 52) | mantissa);
    }

    //Half floats (1 sign bit, 5 exponent bits, 10 mantissa bits)
    public static short floatToHalf(float f){
        return (short) doubleToSmallFloat(f, true, 5, 10);
    }

    public static float halfToFloat(short h){
        return (float) smallFloatToDouble(h, true, 5, 10);
    }

    public static short doubleToHalf(double d){
        return (short) doubleToSmallFloat(d, true, 5, 10);
    }

    public static double halfToDouble(short h){
        return smallFloatToDouble(h, true, 5, 10);
    }

    //OpenGL 11-bit floats (no sign bit, 5 exponent bits, 6 mantissa bits)
    public static short floatToF11(float f){
        return (short) doubleToSmallFloat(f, false, 5, 6);
    }

    public static float f11ToFloat(short f){
        return (float) smallFloatToDouble(f, false, 5, 6);
    }

    public static short doubleToF11(double f){
        return (short) doubleToSmallFloat(f, false, 5, 6);
    }

    public static double f11ToDouble(short f){
        return smallFloatToDouble(f, false, 5, 6);
    }

    //OpenGL 10-bit floats (no sign bit, 5 exponent bits, 5 mantissa bits)
    public static short floatToF10(float f){
        return (short) doubleToSmallFloat(f, false, 5, 5);
    }

    public static float f10ToFloat(short f){
        return (float) smallFloatToDouble(f, false, 5, 5);
    }

    public static short doubleToF10(double f){
        return (short) doubleToSmallFloat(f, false, 5, 5);
    }

    public static double f10ToDouble(short f){
        return smallFloatToDouble(f, false, 5, 5);
    }
}
```