Category Archives: 算法

What’s Up With Floating Point?

I presume most people reading this have used floating-point numbers at some point, often not intentionally.

I’m also fairly sure a good number who have encountered them did so after trying to discover why the result of a simple computation is incorrect. E.G.

0.1 + 0.2
// result: 0.30000000000000004
// Which I'm fairly sure
// should be 0.3 ...

The problem is that without understanding what a floating-point number is, you will often find yourself frustrated when the result you expected is ever-so-slightly different.

My goal here is to clarify what a floating-point number is, why we use them, and how they work.

Why do we even need floating point 🤔?

It’s not a bold statement to say that computers need to store numbers. We store these numbers in binary on our phones, laptops, fridges etc.

I hope that most people reading this are familiar with binary numbers in some form; if not, consider reading this blog post by Linda Vivah.

But what about decimal? Fractions, π, real numbers?

For any useful computation, we need computers to be able to represent the following:

  • Veryveryveryvery SMALL numbers,
  • Veryveryveryvery BIG numbers,
  • Everything in-between!

Starting with the veryveryveryvery small numbers to take us in the right direction; how do we store them?

Well, thats simple. We store those using the equivalent to decimal representation in binary…

Binary fractions

An example should help here. Let’s choose a random binary fraction: 101011.101

 

This is very similar to how decimal numbers work. The only difference is that we have base 2 instead of base 10. If you chuck the above in a binary converter of your choice, you’ll find that it is correct!

So how might we store these binary fractions?

Let’s say we allocate one byte (8 bits) to store our binary fraction: 00000000. We then must choose a place to put our binary separator so that we can have the fractional part of our binary number.

Let’s try smack-bang in the middle!

0000.0000

What’s the biggest number we can represent with this?

1111.1111 = 15.9375

That’s… not very impressive. Neither is the smallest number we can represent, 0.00625.

There is a lot of wasted storage space here alongside a poor range of possible numbers. The issue is that choosing any place to put our point would leave us with either fractional precision or a larger integer range — not both.

If we could just move the fractional point around as we needed, we could get so much more out of our limited storage. If only the point could float around as we needed it, a floating point if you will…

(I’m sorry, I had to 😅.)

So what is floating point?

Floating point is exactly that, a floating (fractional) point number that gives us the ability to change the relative size of our number.

So how do we mathematically represent a number in such a way that we,

  1. store the significant digits of the number we want to represent (E.G. the 12 in 0.00000012);
  2. know where to put the fractional point in relation to the significant digits (E.G. all the 0’s in 0.00000012)?

To do this, let’s time travel (for some) back to secondary school…

Standard Form (Scientific Notation)

Anyone remember mathsisfun? I somehow feel old right now but either way, this is from their website:

 

You can do the exact same thing with binary! Instead of 7*10^2 = 700, we can write

1010111100 * 2^0 = 700= 10101111 * 2^2 = 700

Which is equivalent to 175 * 4 = 700. This is a great way to represent a number based on the significant digits and the placement of the fractional point relative to said digits.

That’s it! Floating point is a binary standard-form representation of numbers.

If we want to formalise the representation a little, we need to account for positive and negative numbers. To do this, we also add a sign to our number by multiplying by \pm 1:

(\text{sign}) * (\text{significant digits}) * (\text{base})^{(\text{some power})}

And back to the example given by mathsisfun…

700 = (1) * (7) * (10)^{2}= (1) * (10101111) * (2)^{2}

If you are reading other literature, you’ll find the representation will look something like…

(-1)^s * c * b^{e}

Where s is the sign bit, c is the significand/mantissa, b is the base, and e is the exponent.

So why am I getting weird errors 😡?

So, we know what floating point is now. But why can’t I do something as simple as add two numbers together without the risk of error?

Well the problem lies partially with the computer, but mostly with mathematics itself.

Recurring Fractions

Recurring fractions are an interesting problem in number representation systems.

Let’s choose any fraction \frac{x}{y}. If y has a prime factor that isn’t also a factor of the base, it will be a recurring fraction.

This is why numbers like 1/21 can’t be represented in a finite number of digits; 21 has 7 and 3 as prime factors, neither of which are a factor of 10.

Let’s work through an example in decimal.

Decimal

Say you want to add the numbers 1/3 and 2/3. We all know the answer is 42 1, but if we are working in decimal, this isn’t as obvious.

This is because 1/3 = 0.333333333333…

It isn’t a number that can be represented as a finite number of digits in decimal. As we can’t store infinite digits, we store an approximation accurate to 10 places.

The calculation becomes…

0.3333333333 + 0.6666666666 = 0.999999999

Which is definitely not 1. It’s real close, but it’s not 1.

The finite nature in which we can store numbers doesn’t mesh well with the inevitable fact that we can’t easily represent all numbers in a finite number of digits.

Binary

This exact same problem occurs in binary, except it’s even worse! The reason for this is that 2 has one less prime factor than 10, namely 5.

Because of this, recurring fractions happen more commonly in base 2.

An example of this is 0.1:

In decimal, that’s easy. In binary however… 0.00011001100110011..., it’s another recurring fraction!

So trying to perform 0.1 + 0.2 becomes

  0.0001100110011
+ 0.0011001100110
= 0.0100110011001
= 0.299926758

Now before we got something similar to 0.30000000004, this is because of things like rounding modes which I won’t go into here (but will do so in a future post). The same principle is causing this issue.

This number of errors introduced by this fact are lessened by introducing rounding.

Precision

The other main issue comes in the form of precision.

We only have a certain number of bits dedicated to the significant digits of our floating point number.

Decimal

As an example, consider we have three decimal values to store our significant digits.

If we compare 333 and 1/3 * 10^3, we would find that in our system they are the exact same.

This is because we only have three values of precision to store the significant digits of our number, and that involves truncating the end off of our recurring fraction.

In an extreme example, adding 1 to 1 * 10^3 will result in 1 * 10^3, the number hasn’t changed. This is because you need four significant digits to represent 1001.

Binary

This exact same issue occurs in binary with veryveryveryvery small and veryveryveryvery big numbers. In a future post I will be talking more about the limits of floating point.

For completeness, consider the previous example in binary, where we now have 3 bits to represent our significant digits.

By using base 2 instead, 1 * 2^3, adding 1 to our number will result in no change. This is because to represent 1001 (now equivalent to 9 in decimal) requires 4 bits, the less significant binary digits are lost in translation.

There is no solution here, the limits of precision are defined by the amount we can store. To get around this, use a larger floating point data type.

  • E.G. move from a single-precision floating-point number to a double-precision floating-point number.

Summary

TLDR

To bring it all together, floating-point numbers are a representation of binary values akin to standard-form or scientific notation.

It allows us to store BIG and small numbers with precision.

To do this we move the fractional point relative to the significant digits of the number to make that number bigger or smaller at a rate proportional to the base being used.

Most of the errors associated with floating point come in the form of representing recurring fractions in a finite number of bits. Rounding modes help to reduce these errors.

Thanks 🥰!

Thank you so much for reading! I hope this was helpful to those that needed a little refresher on floating point and also to those that are new to this concept.

I will be making one or two updates to this post explaining the in-depths of the IEEE 754-2008 Floating Point Standard, so if you have questions like:

  • “What are the biggest and smallest numbers I can use in floating point?”
  • “What do financial institutions do about these errors?
  • “Can we use other bases?”
  • “How do we actually perform floating-point arithmetic?”

Then feel free to follow to see an update! You can also follow me on twitter @tim_cb_roderick for updates. If you have any questions please feel free to leave a comment below.

from:https://timroderick.com/floating-point-introduction/