I presume most people reading this have used floating-point numbers at *some* point, often not intentionally.

I’m also fairly sure a good number who *have* encountered them did so after trying to discover why the result of a **simple computation is incorrect**. E.G.

```
0.1 + 0.2
// result: 0.30000000000000004
// Which I'm fairly sure
// should be 0.3 ...
```

The problem is that without understanding *what* a floating-point number is, you will often find yourself frustrated when the result you expected is ** ^{ever-so-slightly}** different.

My goal here is to clarify **what** a floating-point number is, **why** we use them, and **how** they work.

## Why do we even need floating point 🤔?

It’s not a *bold* statement to say that computers need to store numbers. We store these numbers in binary on our phones, laptops, fridges etc.

I hope that most people reading this are familiar with binary numbers in some form; if not, consider reading this blog post by Linda Vivah.

But what about decimal? Fractions, π, **real numbers**?

For *any useful* computation, we need computers to be able to represent the following:

- Very
^{veryveryvery}**SMALL**numbers, - Very
_{veryveryvery}**BIG**numbers, - Everything in-between!

Starting with the very^{veryveryvery} *small* numbers to take us in the right direction; how do we store them?

Well, thats simple. We store those using the equivalent to decimal representation in binary…

### Binary fractions

An example should help here. Let’s choose a random binary fraction: `101011.101`

This is very similar to how decimal numbers work. The only difference is that we have base `2`

instead of base `10`

. If you chuck the above in a binary converter of your choice, you’ll find that it is **correct!**

**So how might we store these binary fractions?**

Let’s say we allocate one byte (8 bits) to store our binary fraction: `00000000`

. We then must choose a place to put our *binary separator* so that we can have the fractional part of our binary number.

Let’s try smack-bang in the middle!

`0000.0000`

What’s the biggest number we can represent with this?

`1111.1111 = 15.9375`

That’s… not very impressive. Neither is the smallest number we can represent, `0.00625`

.

There is **a lot** of wasted storage space here alongside a poor range of possible numbers. The issue is that choosing any place to put our *point* would leave us with *either* **fractional precision** *or* a **larger integer** range — not both.

If we could just *move* the fractional *point* around as we needed, we could get so much more out of our limited storage. If only the *point* could *float* around as we needed it, a *floating point* if you will…

(I’m sorry, I had to 😅.)

## So what is floating point?

Floating point is exactly that, a *floating ( fractional) point* number that gives us the ability to change the relative size of our number.

So how do we mathematically represent a number in such a way that we,

- store the significant digits of the number we want to represent
**(E.G. the 12 in 0.00000012)**; - know where to put the fractional point in relation to the significant digits
**(E.G. all the 0’s in 0.00000012)**?

To do this, let’s time travel (for some) back to secondary school…

### Standard Form (Scientific Notation)

Anyone remember mathsisfun? I somehow feel old right now but either way, this is from their website:

You can do the *exact* same thing with binary! Instead of $7∗1_{2}=700$, we can write

$1010111100∗_{0}=700$$=10101111∗_{2}=700$

Which is equivalent to $175∗4=700$. This is a great way to represent a number based on the *significant digits* and the *placement of the fractional point* relative to said digits.

That’s it! **Floating point is a binary standard-form representation of numbers.**

If we want to formalise the representation a little, we need to account for positive *and* negative numbers. To do this, we also add a sign to our number by multiplying by $±1$:

$(sign)∗(significant digits)∗(base_{(some power)}$

And back to the example given by mathsisfun…

$700=(1)∗(7)∗(10_{2}$$=(1)∗(10101111)∗(2_{2}$

If you are reading other literature, you’ll find the representation will look something like…

$(−1_{s}∗c∗b_{e}$

Where $s$ is the sign bit, $c$ is the significand/mantissa, $b$ is the base, and $e$ is the exponent.

## So why am I getting weird errors 😡?

So, we know what floating point is now. But why can’t I do something as simple as add two numbers together without the risk of error?

Well the problem lies partially with the computer, but mostly with *mathematics itself*.

### Recurring Fractions

Recurring fractions are an interesting problem in number representation systems.

Let’s choose any fraction $yx $. If $y$ has a *prime factor* that isn’t also a *factor of the base*, it will be a recurring fraction.

This is why numbers like $1/21$ can’t be represented in a finite number of digits; $21$ has $7$ and $3$ as prime factors, neither of which are a factor of $10$.

Let’s work through an example in decimal.

#### Decimal

Say you want to add the numbers $1/3$ and $2/3$. We all know the answer is ~~42~~ $1$, but if we are working in decimal, this isn’t as obvious.

This is because $1/3=0.333333333333...$

It isn’t a number that can be represented as a *finite* number of digits in decimal. As we can’t store infinite digits, we store an *approximation accurate to 10 places*.

The calculation becomes…

$0.3333333333+0.6666666666=0.999999999$

Which is definitely not $1$. It’s *real* close, but it’s not $1$.

The finite nature in which we can store numbers doesn’t mesh well with the inevitable fact that we *can’t easily represent all numbers in a finite number of digits*.

#### Binary

This exact same problem occurs in binary, **except it’s even worse!** The reason for this is that $2$ has one less prime factor than $10$, namely $5$.

Because of this, recurring fractions happen more commonly in base $2$.

An example of this is $0.1$:

In decimal, that’s easy. In binary however… `0.00011001100110011...`

, it’s another recurring fraction!

So trying to perform `0.1 + 0.2`

becomes

```
0.0001100110011
+ 0.0011001100110
= 0.0100110011001
= 0.299926758
```

Now before we got something similar to `0.30000000004`

, this is because of things like *rounding modes* which I won’t go into here (but will do so in a future post). The same principle is causing this issue.

This number of errors introduced by this fact are lessened by introducing *rounding*.

### Precision

The other main issue comes in the form of *precision*.

We only have a *certain number of bits* dedicated to the significant digits of our floating point number.

#### Decimal

As an example, consider we have three *decimal* values to store our significant digits.

If we compare $333$ and $1/3∗1_{3}$, we would find that in our system they are the *exact same*.

This is because we only have three values of precision to store the significant digits of our number, and that involves truncating the end off of our recurring fraction.

In an extreme example, adding $1$ to $1∗1_{3}$ will result in $1∗1_{3}$, the number *hasn’t changed*. This is because you need *four* significant digits to represent $1001$.

#### Binary

This exact same issue occurs in binary with very^{veryveryvery} *small* and very_{veryveryvery} *big* numbers. In a future post I will be talking more about the limits of floating point.

For completeness, consider the previous example in binary, where we now have 3 *bits* to represent our significant digits.

By using base $2$ instead, `1 * 2^3`

, adding `1`

to our number will result in no change. This is because to represent `1001`

(now equivalent to $9$ in decimal) requires 4 bits, the *less significant* binary digits are *lost in translation*.

There is no solution here, the limits of precision are defined by the amount we can store. To get around this, use a larger floating point data type.

**E.G.**move from a*single*-precision floating-point number to a*double*-precision floating-point number.

## Summary

### TLDR

To bring it all together, floating-point numbers are a representation of binary values akin to *standard-form* or *scientific notation*.

It allows us to store **BIG** and ** _{small}** numbers with precision.

To do this we move the *fractional point* relative to the significant digits of the number to make that number bigger or smaller at a rate proportional to the base being used.

Most of the **errors** associated with floating point come in the form of representing *recurring fractions* in a *finite* number of bits. Rounding modes help to reduce these errors.

### Thanks 🥰!

**Thank you so much for reading!** I hope this was helpful to those that needed a little refresher on floating point and also to those that are new to this concept.

I will be making one or two updates to this post explaining the in-depths of the IEEE 754-2008 Floating Point Standard, so if you have questions like:

- “What are the biggest and smallest numbers I can use in floating point?”
- “What do financial institutions do about these errors?
- “Can we use other bases?”
- “How do we
*actually*perform floating-point arithmetic?”

Then feel free to follow to see an update! You can also follow me on twitter @tim_cb_roderick for updates. If you have any questions please feel free to leave a comment below.

from:https://timroderick.com/floating-point-introduction/