On magnitude and precision of floating-point values (2)

In the last article we saw the principles of the floating-point representation of IEEE 754.

We saw the motivation for having floating-point values as opposed to fixed-point values (such as integers), which was to deal with both very big (in magnitude) values as well as very small ones.

Also, we saw what magnitude and precision mean in that context, with magnitude being how “far away from zero” the value is in absolute terms, and precision being how many different values we can represent around a given order of magnitude.

We also saw that we can “move the decimal point around” without losing precision, in order to change the magnitude of our floating-point value.
Having done that, we have effectively just multiplied or divided our floating-point value by some power of the base.
This is what the exponent in the IEEE 754 representation does: it takes the significant digits (i.e. the mantissa) and multiplies them by the base raised to the power of the desired exponent, in order to arrive at the number we want to represent.
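
To make that a bit more concrete in code, here is a small Java sketch (the class name is just made up for this illustration, and it only handles normal values, ignoring zero, subnormals, infinities and NaN) that pulls the sign, exponent and mantissa out of a 32-bit float and rebuilds the value as mantissa times a power of the base (which is 2 here):

public class FloatAnatomy {
    public static void main(String[] args) {
        float value  = 6.25f;                          // 1.5625 * 2^2
        int bits     = Float.floatToIntBits(value);
        int sign     = bits >>> 31;                    // 1 sign bit
        int exponent = ((bits >>> 23) & 0xFF) - 127;   // 8 exponent bits, bias 127
        int mantissa = (bits & 0x7FFFFF) | 0x800000;   // 23 stored bits plus the implicit leading 1
        // The mantissa is an integer scaled by 2^-23, so scale it back by (exponent - 23):
        float rebuilt = Math.scalb((float) mantissa, exponent - 23);
        System.out.println("sign=" + sign + " mantissa=" + mantissa + " exponent=" + exponent);
        System.out.println(rebuilt);                   // prints 6.25 again
    }
}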

Window of Magnitudes

To make things clearer here, we can use a scale to show that, in effect, we have a “window” of possible values at any order of magnitude that we can represent using floating-point values. And we can control the location of that window by using the exponent:


//-----------------~-------------------~-----------------------
//  ^    ^    ^    ~    ^    ^    ^    ~    ^     ^     ^
// 1E2  2E2  3E2   ~   1E9  2E9  3E9   ~   1E16  2E16  3E16 ...

(I kinda misused the code block here to get a monospace font)

This scale shows some scalar values. The magnitude of the values on this scale increases to the right.
At certain points on that scale you see some values pinpointed using their scientific notation.
And you can also make out, kind of, three sections on that scale.
In the left section we have some values at the order of magnitude of hundreds (100, 200 and 300).
In the middle section we have values on the order of a billion (that number with 9 zeroes ;).
And on the right, even bigger values with more zeroes than I know the name of.
So, the three sections are not equally spaced, but that serves us just right here to get to the point.
Between the three sections there is a ‘~’ sign denoting that there are many numbers in between that are not shown here.

Now, each of the given numbers in each of the three sections on that scale can be represented using IEEE 754.
So, it should be clear now that order of magnitude does not really matter (of course up to a point, where numbers become reeeeaally big - or equally really small towards zero).

And in each section you can also represent a large number of values in between the pinpointed ones.
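
Just as a tiny Java illustration (the values and the class name are picked arbitrarily for this sketch), here is one number from each section and one from in between the pinpointed ones, all perfectly representable as floats:

public class WindowValues {
    public static void main(String[] args) {
        float hundreds  = 2e2f;       // left section
        float inBetween = 2.3456e9f;  // between 2E9 and 3E9, in the middle section
        float reallyBig = 3e16f;      // right section
        System.out.println(hundreds + " " + inBetween + " " + reallyBig);
    }
}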

The thing now is that you cannot meaningfully combine numbers from all three, or even just two, sections of that scale in a single floating-point computation, like an addition or subtraction.
We can say that each section more or less defines a “window” of floating-point values that we can perform arithmetic on while expecting reasonable results.

Or, to put it another way: any arithmetic operation we perform on floating-point values must result in a number that is expressible/representable as a floating-point entity.
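
Here is a quick Java sketch of exactly that situation (just an illustration of the claim; the class name is made up): mixing a value from the right section with one from the left section simply has no effect, because the exact sum is not representable and gets rounded straight back.

public class MixingWindows {
    public static void main(String[] args) {
        float big   = 2e16f;   // right section of the scale
        float small = 1e2f;    // left section of the scale
        // The exact sum (big + 100) has no float representation of its own,
        // so the addition rounds back to 'big' unchanged:
        System.out.println(big + small == big);   // prints true
    }
}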

Spread of Values

So, why are some numbers not representable as floating-point entities?

The reason is again “precision”.

With a fixed number of significant mantissa bits, the distance between any two adjacent numbers that can be represented using IEEE 754 floating-point depends on the magnitude of the values.

If a value is big in magnitude, like the ones in the right section of the above scale, you can only represent differences that are about 7 decimal orders of magnitude smaller (for 32-bit floats), but nothing smaller than that.
So, if your value is 2E16 then you cannot add to it or subtract from it a simple 1E2 (i.e. 100).

This we have come to know as the “ulp” of floating-point values.
The unit in the last place for 2E16 in IEEE 754 32-bit is about 2.15E9.
So an ulp of 2E16 lands us somewhere in the second section of our above scale, and we cannot even resolve the steps of that full section here, because we cannot add 1E9 to 2E16; it’s too small.
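
In Java you can query the ulp directly via Math.ulp, so we can check these numbers (a small sketch for illustration; the class name is made up):

public class UlpAt2E16 {
    public static void main(String[] args) {
        float big = 2e16f;
        System.out.println(Math.ulp(big));       // about 2.15E9: that is how far apart neighbouring floats are here
        // 1E9 is less than half an ulp at this magnitude, so it simply vanishes:
        System.out.println(big + 1e9f == big);   // prints true
    }
}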

So, as the magnitude of your floating-point values increases, all the different values you can still represent at that order of magnitude “spread” apart and become more and more distant from each other.
Hence, the ULP of any value (i.e. the distance between two adjacent representable values at that order of magnitude) is roughly proportional to the magnitude of the value.
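
Again in Java, asking for the ulp at one value from each section of the scale above shows that spreading nicely (again just an illustration, with a made-up class name):

public class SpreadingUlps {
    public static void main(String[] args) {
        System.out.println(Math.ulp(1e2f));    // roughly 7.6E-6 (left section)
        System.out.println(Math.ulp(1e9f));    // 64.0           (middle section)
        System.out.println(Math.ulp(1e16f));   // roughly 1.07E9 (right section)
    }
}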

Floating-point Arithmetic

Precisely because we cannot represent all numbers as floating-point values, the IEEE 754 standard defines rules for arithmetic operations, specifying how operations such as addition, subtraction and multiplication must behave to yield the “best possible result.”
It does this in terms of the mentioned ULP: most operations, such as addition and multiplication, must operate in such a way that the returned IEEE 754-representable value lies no further than 0.5 ULP from the exact mathematically correct result.
This is why adding a small value to a big value will not give an exact result but rather “the best possible” result, and also why chaining many arithmetic operations can cause the final result to diverge significantly from the mathematically correct value.
On some processors (such as GPUs) there are special fused arithmetic operations to combat this issue, such as fma (fused multiply-add).
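
Java exposes such a fused operation as Math.fma (available since Java 9). Here is a sketch of what the single rounding buys you (illustration only; the class name is made up): with a fused multiply-add you can even recover the rounding error that a plain multiplication throws away.

public class FusedMultiplyAdd {
    public static void main(String[] args) {
        float a = 0.1f, b = 0.1f;
        float prod = a * b;                        // a*b, rounded once to the nearest float
        // fma evaluates a*b + c exactly and rounds only the final result;
        // with c = -prod it yields exactly the error the plain multiplication introduced:
        float roundingError = Math.fma(a, b, -prod);
        System.out.println("a*b rounded to : " + prod);
        System.out.println("rounding error : " + roundingError);  // small, but not zero
    }
}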

Actually, I wanted to get into all that planetary noise thing right now, but that will come in the next article. :wink:

…To be continued…

O Riven, please add an appreciate button to articles as well; I can’t wait any longer to appreciate this article. Nice work Kai, I’m learning a lot from this.

It might be easier for some to understand if you mention that a result which is ensured to be within 0.5 ulp can also be called ‘properly rounded’.

Take addition. If we have two fp values ‘a’ and ‘b’, then the fp result of a+b is F(a+b), where ‘a+b’ is logically performed as if in infinite precision and the function F rounds the result to an fp value. By the spec the rounding function ‘F’ can be configured. In Java you’re stuck with round-to-nearest.
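
A tiny Java illustration of that model (just a sketch; the class name is made up):

public class RoundingFunction {
    public static void main(String[] args) {
        float a = 0.1f, b = 0.2f;
        // Conceptually the two float values are added exactly (giving roughly 0.300000004...)
        // and then F rounds that exact sum to the nearest representable float:
        float sum = a + b;
        System.out.println(sum);
    }
}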

IMHO talking in decimal makes things more complicated. The target audience is programmers so it’s reasonable to expect they’re comfortable in binary (or hex).

Thank you, Sri, for your appreciation!
Personally, for me it suffices when you just tell me by public comment, as you did, or better just by PM.
Riven, it would be great if the editor field when editing the article after posting it were the same as when first writing the article. Currently, post-editing an article only offers a simple text editor without those buttons.