Floating point from scratch: Hard Mode

A deep dive into the complexities of floating-point arithmetic, documenting a developer's journey to master the IEEE 754 standard from scratch.
I have a confession to make: floating point scares me.
Half a decade ago I decided that I was going to implement some floating point arithmetic. Back then it seemed approachable enough; after all, floating point numbers are ubiquitous. How hard can it really be? My experience until that point had been that, given enough time and effort spent bashing my brain against a problem, I can generally figure things out.
This is how I faced the most complete technical defeat of my existence. Through this utter annihilation emerged my present fear of floating point.
After half a decade I decided it was time for a rematch, time to face my dragons!
But this time I would not simply aim for a surface-level understanding; this time I would aim to deeply grasp the floating point representation.
When setting out on this crusade, I believed that there were only 3 types of people who truly understood floating point:
- The people writing the spec
- The math PhDs working on the floating point representation
- The people building the floating point hardware
Welcome to round 2!
Chapter 1: Descent into madness
Looking back on it, one of the main reasons behind my past defeat was that I mistook my ability to use floating point for a marker of understanding, and believed that this freed me from the need to invest time in studying it, as if I was going to pick it up along the way.
So it’s now time to put the computer aside and spend 10 days in the company of paper (you remember, the white stuff).
How floating point works
I am assuming that readers already have some surface level knowledge of what floating point is, so I will spare you the basic intro.
Let me just set a few definitions. In the context of this discussion, normal floating point numbers will be defined as:
$$ (-1)^{S} \times 2^{E-b} \times (1 + T \cdot 2^{1-p}) $$
With the values of (S), (E) and (T) being the values stored in the floating point fields:
- (S) sign bit
- (E) biased exponent
- (T) trailing significand field
The size of these fields, as well as the values of (b) (exponent bias) and (p) (precision) depend on the floating point format.
E.g., for IEEE 754 single precision (`float32_t`) we have:
- (b = 127)
- (p = 24)
Resulting in:
$$ (-1)^{S} \times 2^{E-127} \times (1 + T \cdot 2^{-23}) $$
In this discussion we will be calling:
- sign, the (S) sign bit
- exponent, the value stored in the biased exponent field (E)
- significand/mantissa, the value stored in the (T) field
The term mantissa isn’t pedantically correct: since this isn’t a logarithmic representation, it should really be called a significand. But my fellow programmers in the audience will appreciate that, since the sign has already claimed the (s) name in our single-letter naming of structure elements, we have no choice but to yield and call this (m) for mantissa. I will be using the terms mantissa and significand interchangeably in this article.
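To make the single precision formula above concrete, here is a minimal Python sketch that pulls the (S), (E) and (T) fields out of a value and plugs them back into the formula. The helper names `float32_fields` and `normal_value` are made up for this illustration, and the sketch only handles normal numbers.

```python
import struct

BIAS = 127       # b for single precision
PRECISION = 24   # p for single precision


def float32_fields(x):
    """Return the (S, E, T) fields of x once encoded as a 32-bit float."""
    # Pack as a little-endian float32, then reread the same bytes as an integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased
    trailing = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, trailing


def normal_value(sign, exponent, trailing):
    """Evaluate (-1)^S * 2^(E - b) * (1 + T * 2^(1 - p)) for a normal number."""
    return ((-1) ** sign
            * 2.0 ** (exponent - BIAS)
            * (1 + trailing * 2.0 ** (1 - PRECISION)))


s, e, t = float32_fields(-6.25)
print(s, e, t)                # 1 129 4718592
print(normal_value(s, e, t))  # -6.25
```

Here (-6.25 = -1.5625 \times 2^2), so the biased exponent is (2 + 127 = 129) and the trailing field holds (0.5625 \times 2^{23}).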
What you never wanted to know
We are not actually interested in floating point in the abstract, but rather what we commonly refer to as “float” in our programs.
In the world of all the possible floating point types, these are the vanillas, except in this world, everyone also wants vanilla all the time!
This float format is canonized by the IEEE in the IEEE 754 specification. Inside this holy grail, the expected behavior is outlined in excruciating detail, making it possible for users to expect the same behavior for the same floating point operations on different platforms. A cornerstone of making float ~~portable~~.
Also, this is where hell starts!
+0/-0
Let us commence our descent slowly.
As the most astute readers might have already noticed (kudos) looking at the representation format, we have a real sign bit. This implies that we actually have 2 representations for zero: (+0.0) and (-0.0).
Now where things get fun is that we have rules around which zero to use. For example, let us consider how we would determine the equality between two floating point numbers, say `X == Y`?
To do this comparison we would generally re-use the adder and do `X - Y`, then check that all the result’s bits are 0. The problem is that (-0.0) is written with a 1 in its sign bit.
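So we have rules around when the result should be (+0.0) or (-0.0), and subtracting two equal floating point numbers is one such case: when the difference of two operands with the same sign is exactly zero, the result is (+0.0) in every rounding mode except round toward negative, where it is (-0.0).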
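The two zeros can be poked at from Python, as a rough sketch. Python floats are double precision, so the hypothetical `float32_bits` helper packs the value down to 32 bits first, and the result shown for `x - x` assumes the default round-to-nearest mode that Python runs under.

```python
import math
import struct


def float32_bits(x):
    """Return the 32-bit pattern of x encoded as a single precision float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits


# Same magnitude, different sign bit.
print(hex(float32_bits(0.0)))   # 0x0
print(hex(float32_bits(-0.0)))  # 0x80000000

# The bit patterns differ, yet the two zeros must compare equal.
print(0.0 == -0.0)              # True

# Subtracting two equal numbers gives +0.0 under round-to-nearest.
x = 1.5
diff = x - x
print(math.copysign(1.0, diff))  # 1.0, so the zero is positive
```

This is why an equality check cannot simply test that every bit of `X - Y` is 0: the sign bit has to be ignored when the result is a zero.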
NaN
`NaN` stands for Not a Number.
For all of you that thought we were talking about numbers, this is the point at which you start understanding the difference between a number and a representation format.
So let’s start with the fun bit: there are actually different types of `NaN`s:
- quiet `NaN`s (`qNaN`s), which you would typically encounter from your bad math.
- signaling `NaN`s (`sNaN`s), the ones bad math doesn’t produce, and also the ones that scream at you by signaling an invalid operation exception whenever they appear as operands. Most people won’t encounter these.
So, what do I mean by “`qNaN`s are used to indicate when the result of an arithmetic operation cannot be represented”?
Here are a few examples for clarification:
- (\sqrt{-1.0}) results in a `qNaN`, as (\sqrt{-1.0} = i), and (i) is an imaginary number that cannot be represented without the use of complex notation.
- (\frac{0.0}{0.0}) would also result in a `qNaN` because: what are you doing?
- (+\infty - \infty) would also result in a `qNaN` because (\pm\infty) are actually limits, not numbers. And subtracting a limit from another limit (+\infty - \infty) just doesn’t make sense.
Want to know another fun fact about `qNaN`s?
They are contagious.
Arithmetic operations with a `qNaN` as an operand will result in a `qNaN`.
Think about it: what result should you give for an operation whose result can’t be represented?
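A quick way to watch the contagion is the Python sketch below. One caveat: Python raises an exception for `0.0 / 0.0` and for `math.sqrt(-1.0)` instead of handing back a `NaN`, so this sketch produces its `NaN` from (+\infty - \infty).

```python
import math

inf = float("inf")

# inf - inf has no representable result, so it yields a quiet NaN.
nan = inf - inf
print(math.isnan(nan))  # True

# Any arithmetic operation with a NaN operand gives a NaN back.
results = [nan + 1.0, nan - nan, nan * 0.0, nan / 2.0, 3.0 - nan]
print(all(math.isnan(r) for r in results))  # True
```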
In memory, `NaN`s are represented with all the exponent bits set to (1) and with at least one of the significand bits set. You can then differentiate different `NaN`s based on which significand bit(s) are set, the encoding of which is left to the discretion of the implementer.
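The encoding rule above translates into a two-line check on the raw bits. This is only a sketch for the 32-bit layout, and `is_nan_pattern` is a name made up for the example. It deliberately does not try to tell quiet from signaling `NaN`s.

```python
import struct


def is_nan_pattern(bits):
    """True if a 32-bit pattern has all exponent bits set and a nonzero trailing field."""
    exponent = (bits >> 23) & 0xFF
    trailing = bits & 0x7FFFFF
    return exponent == 0xFF and trailing != 0


# Encode a NaN produced by inf - inf as a float32 and inspect its bits.
# The exact pattern (sign, payload) depends on the platform, so only the
# exponent and trailing fields are checked here.
(nan_bits,) = struct.unpack("<I", struct.pack("<f", float("inf") - float("inf")))
print(is_nan_pattern(nan_bits))    # True

print(is_nan_pattern(0x7FC00000))  # True: exponent all ones, one trailing bit set
print(is_nan_pattern(0x7F800001))  # True: a different trailing bit set
print(is_nan_pattern(0x7F800000))  # False: trailing field is all zeros
print(is_nan_pattern(0x3F800000))  # False: this is 1.0
```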
Infinities
So we have already started introducing these with the `NaN`s, but the floating point representation has room for two infinity notations: one for (+\infty) and its mirror (-\infty). These are not numbers: infinity is not a number, it’s a limit!
In compliance with IEEE 754, infinities can be used in arithmetic operations, used as inputs for boolean operations, and produced as the result of a calculation.
In memory, infinities have their exponent bits set to all (1)s, and to differentiate them from `NaN`s their significand bits are all (0)s.
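As a small sanity check of the encoding and of the three uses listed above, here is a Python sketch. `float32_bits` is again a hypothetical helper, and the overflow example relies on Python floats being double precision.

```python
import math
import struct


def float32_bits(x):
    """Return the 32-bit pattern of x encoded as a single precision float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits


inf = float("inf")

# Exponent bits all ones, trailing field all zeros, sign bit picks the infinity.
print(hex(float32_bits(inf)))    # 0x7f800000
print(hex(float32_bits(-inf)))   # 0xff800000

# Used as an operand in arithmetic.
print(inf + 1.0 == inf)          # True

# Used as an input to a comparison.
print(1.0 < inf, -inf < 1.0)     # True True

# Produced as the result of a calculation that overflows.
print(math.isinf(1e308 * 10.0))  # True
```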
Denormal
Let’s put infinities and `NaN`s on the side for a minute and get back to talking just about numbers.
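In the introduction I defined a normal floating point number as:
$$ (-1)^{S} \times 2^{E-b} \times (1 + T \cdot 2^{1-p}) $$
A more common way of writing this is:
$$ (-1)^{s} \times 2^{e} \times m $$
Where (e = E - b) is the unbiased exponent and (m) is a number represented by a string of the form (d_0 . d_1 d_2 ... d_{p-1}) that is (p) digits long (with (p) the precision, i.e. the number of bits in the significand + 1).
For example (1.5) would be written as:
$$ 1.5 = (-1)^{0} \times 2^{0} \times 1.1_2 $$
and (3) as (2 × 1.5):
$$ 3 = (-1)^{0} \times 2^{1} \times 1.1_2 $$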
In our normal floating point representation, the (1) in ((1 + T · 2^{1−p})) is our (d_0) and is always set to (d_0 = 1).
Now, the funny thing is that our significand field actually only has (p-1) bits, and (d_0) is an inferred bit: we call it the hidden bit.
Seems simple enough? Could something finally be simple about floating point?!
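The (1.5) and (3) examples are easy to check against the stored bits. The Python sketch below uses a hypothetical `float32_fields` helper. Both values should carry the same 23 stored bits, with only the exponent differing, and the leading (1) should appear nowhere in the trailing field.

```python
import struct


def float32_fields(x):
    """Return the (S, E, T) fields of x once encoded as a 32-bit float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


for value in (1.5, 3.0):
    sign, exponent, trailing = float32_fields(value)
    # Print the 23 stored bits d_1 ... d_{p-1}; d_0 = 1 is implied, not stored.
    print(value, exponent - 127, format(trailing, "023b"))

# 1.5 0 10000000000000000000000
# 3.0 1 10000000000000000000000
```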
Don’t worry: floating point isn’t going to let you down like this, because we have another category of numbers!
They have an implicit hidden bit set to (d_0 = 0) and are called subnormal numbers (or denormal numbers). Yay 🥳
These are used to encode the smallest representable floating point numbers, and were the most controversial part of the IEEE 754 spec during its elaboration.
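To see where subnormals sit, here is a Python sketch for single precision. It assumes the usual subnormal encoding, which the text above does not spell out: the biased exponent field is (0), the exponent is fixed at (1 - b = -126), and the hidden bit is (0), giving ((-1)^{S} \times 2^{1-b} \times (0 + T \cdot 2^{1-p})). The helper names are made up for the example.

```python
import struct

BIAS = 127       # b for single precision
PRECISION = 24   # p for single precision


def float32_from_bits(bits):
    """Reinterpret a 32-bit pattern as a single precision float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits))
    return x


def subnormal_value(sign, trailing):
    """Evaluate (-1)^S * 2^(1 - b) * (0 + T * 2^(1 - p)), assumed for E == 0."""
    return ((-1) ** sign
            * 2.0 ** (1 - BIAS)
            * (trailing * 2.0 ** (1 - PRECISION)))


smallest_subnormal = float32_from_bits(0x00000001)  # E = 0, T = 1
largest_subnormal = float32_from_bits(0x007FFFFF)   # E = 0, T all ones
smallest_normal = float32_from_bits(0x00800000)     # E = 1, T = 0

print(smallest_subnormal == subnormal_value(0, 1))        # True
print(smallest_subnormal == 2.0 ** -149)                  # True
print(largest_subnormal == subnormal_value(0, 0x7FFFFF))  # True
print(0.0 < largest_subnormal < smallest_normal)          # True
```

The subnormals fill the gap between (0) and the smallest normal number (2^{-126}) in evenly spaced steps of (2^{-149}).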
They are also a giant pain in the ass to implement, so
Source: Hacker News












