## Section 16.4 Floating Point Format

The most important concept in this section is that *floating point* numbers are *not* *real numbers*. Real numbers include the continuum of all numbers from \(-\infty\) to \(+\infty\text{.}\) You already understand that computers are finite, so there is certainly a limit on the largest values that can be represented. But the problem is much worse than simply a size limit.

As you will see in this section, floating point numbers comprise a very small subset of real numbers. There are significant gaps between adjacent floating point numbers. These gaps can produce the following types of errors:

- Rounding
The number cannot be exactly represented in floating point.

- Absorption
A very small number gets lost when adding it to a large one.

- Cancellation
Subtraction of two numbers differing only in their few least significant digits gives a result that has few significant digits.

To make matters worse, these errors can occur in intermediate results, where they are very difficult to debug.
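The three error types can be observed directly. The following is a minimal sketch that assumes Python's built-in `float` is an IEEE 754 64-bit binary format, which is true on essentially all current platforms but is not something defined in this section. The constants are chosen only for illustration.

```python
# Rounding: 0.1, 0.2, and 0.3 have no exact binary representation,
# so the rounded sum differs from the rounded literal 0.3.
rounding_exact = (0.1 + 0.2 == 0.3)

# Absorption: near 1e16 adjacent floats are 2 apart, so adding 1.0 is lost.
big = 1.0e16
absorbed = (big + 1.0 == big)

# Cancellation: the subtraction exposes the rounding error that was
# already made when the small value was added to 1.0.
small = 1.0e-15
recovered = (1.0 + small) - 1.0
relative_error = abs(recovered - small) / small

print(rounding_exact)   # False
print(absorbed)         # True
print(relative_error)   # roughly 0.11
```

In the cancellation case the result keeps only about one correct significant digit, even though both operands were stored to about sixteen.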

The idea behind floating point formats is to think of numbers written in scientific notation. This notation requires two numbers to completely specify a value: a significand and an exponent. To review, a decimal number is written in scientific notation as a significand times ten raised to an exponent. For example,

\begin{align}
1{,}024 &= 1.024 \times 10^{3}\tag{16.4.1}\\
-0.000089372 &= -8.9372 \times 10^{-5}\tag{16.4.2}
\end{align}

Notice that the number is *normalized* such that only one digit appears to the left of the decimal point. The exponent of \(10\) is adjusted accordingly.

If we agree that each number is normalized and that we are working in base \(10\text{,}\) then each floating point number is completely specified by three items:

- The significand.

- The exponent.

- The sign.

That is, in the above examples:

- \(1024\text{,}\) \(3\text{,}\) and \(+\) represent \(1.024 \times 10^{3}\text{.}\) (The ‘\(+\)’ is understood.)

- \(89372\text{,}\) \(-5\text{,}\) and \(-\) represent \(-8.9372 \times 10^{-5}\text{.}\)
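As an illustration, the three items can be extracted from a decimal string with Python's standard `decimal` module. This is only a sketch; the function name `decompose` is a hypothetical helper introduced here, not part of any format described in this book.

```python
from decimal import Decimal

def decompose(text):
    """Split a decimal string into (sign, significand digits, exponent).

    The exponent is the power of ten that applies when the decimal point
    is placed after the first significand digit (normalized form).
    """
    value = Decimal(text)
    sign_bit, digits, _ = value.as_tuple()
    sign = "-" if sign_bit else "+"
    significand = "".join(str(d) for d in digits)
    # adjusted() returns the exponent of the leading digit.
    return sign, significand, value.adjusted()

print(decompose("1024"))           # ('+', '1024', 3)
print(decompose("-0.000089372"))   # ('-', '89372', -5)
```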

The advantage of using a floating point format is that, for a given number of digits, we can represent a larger range of values. To illustrate this, consider a four-digit, unsigned decimal system. The range of integers that could be represented is

\begin{gather*} 0 \le N \le 9999 \end{gather*}

Now, let us allocate two digits for the significand and two for the exponent. We will restrict our scheme to unsigned numbers, but we will allow negative exponents. So we will need to use one of the exponent digits to store the sign of the exponent. We will use \(\binary{0}\) for positive and \(\binary{1}\) for negative. For example, \(3.9 \times 10^{-4}\) would be stored:

| significand, first digit | significand, second digit | exponent sign | exponent digit |
|---|---|---|---|
| \(3\) | \(9\) | \(1\) | \(4\) |

where each box holds one decimal digit.

Some other examples are:

- \(1000\) represents \(1.0 \times 10^{0}\)

- \(3702\) represents \(3.7 \times 10^{2}\)

- \(9316\) represents \(9.3 \times 10^{-6}\)

Our normalization scheme requires that there be a single non-zero digit to the left of the decimal point. We should also allow the special case of \(0.0\text{:}\)

- \(0000\) represents \(0.0\)
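The four-digit storage scheme can be sketched in a few lines of Python. The function names `encode` and `decode` are hypothetical, and the digit order (two significand digits, then the exponent sign digit, then the exponent digit) is inferred from the examples above. `Decimal` is used so that the decoded values are exact decimal quantities.

```python
from decimal import Decimal

def encode(first_digit, second_digit, exponent):
    """Pack a normalized two-digit significand and an exponent (-9..9)."""
    sign_digit = "1" if exponent < 0 else "0"   # 1 marks a negative exponent
    return f"{first_digit}{second_digit}{sign_digit}{abs(exponent)}"

def decode(code):
    """Return the exact value that a four-digit code represents."""
    if code == "0000":                  # the special case for 0.0
        return Decimal(0)
    significand = int(code[:2])         # "37" stands for 3.7
    exponent = int(code[3])
    if code[2] == "1":
        exponent = -exponent
    # Shift by (exponent - 1) because the significand has one fraction digit.
    return Decimal(significand).scaleb(exponent - 1)

print(encode(3, 9, -4))   # 3914
print(decode("3702") == Decimal("370"))         # True
print(decode("9316") == Decimal("0.0000093"))   # True
```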

A little thought shows that this scheme allows numbers in the range

\begin{gather*} 1.0 \times 10^{-9}\le N \le 9.9 \times 10^{9} \end{gather*}

That is, we have increased the range of possible values by a factor of \(10^{14}\text{!}\) However, it is important to realize that in both storage schemes, the integer and the floating point, we have exactly the same number of four-digit patterns, \(10^{4}\text{,}\) so the floating point scheme cannot represent any more distinct values than the integer scheme.
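The bounds of this range can be checked by brute force. The sketch below enumerates every normalized, non-zero value of the four-digit scheme (significands \(1.0\) through \(9.9\text{,}\) exponents \(-9\) through \(9\)); the helper name `all_normalized_values` is hypothetical.

```python
from decimal import Decimal

def all_normalized_values():
    """List every non-zero value the four-digit scheme can hold."""
    values = []
    for exponent in range(-9, 10):
        for significand in range(10, 100):      # 10 stands for 1.0, 99 for 9.9
            values.append(Decimal(significand).scaleb(exponent - 1))
    return values

values = all_normalized_values()
smallest = min(values)
largest = max(values)
print(smallest == Decimal("1.0E-9"))   # True
print(largest == Decimal("9.9E+9"))    # True
```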

Although floating point formats can provide a much greater range of numbers, the distance between any two adjacent numbers depends upon the value of the exponent. Thus, floating point is generally *less* accurate than an integer representation, especially for large numbers.

To see how this works, Figure 16.4.1 shows a plot of floating point values, using our current 4-digit scheme, in the range \(9.0 \times 10^{-1}\le N \le 2.0 \times 10^{0}\text{.}\)

Notice that the larger numbers are further apart than the smaller ones.
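The change in spacing can also be computed. This sketch, which uses a hypothetical helper `values_in_range`, lists the representable values in the plotted range and the gaps between neighbors.

```python
from decimal import Decimal

def values_in_range(low, high):
    """Sorted values of the four-digit scheme that lie in [low, high]."""
    found = set()
    for exponent in range(-9, 10):
        for significand in range(10, 100):
            value = Decimal(significand).scaleb(exponent - 1)
            if low <= value <= high:
                found.add(value)
    return sorted(found)

values = values_in_range(Decimal("0.9"), Decimal("2.0"))
gaps = [b - a for a, b in zip(values, values[1:])]

# Below 1.0 neighbors are 0.01 apart; from 1.0 upward they are 0.1 apart.
print(sorted(set(gaps)))
```

Each time the exponent increases by one, the gap between adjacent values grows by a factor of ten.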

Let us pick some numbers from this range and perform some addition.

- \(9111\) represents \(9.1 \times 10^{-1}\)

- \(9311\) represents \(9.3 \times 10^{-1}\)

If we add these values, we get \(0.91 + 0.93 = 1.84\text{.}\) Now we need to round off our “on-paper” addition in order to fit this result into our current floating point scheme:

- \(1800\) represents \(1.8 \times 10^{0}\)

However,

- \(9411\) represents \(9.4 \times 10^{-1}\)

- \(9311\) represents \(9.3 \times 10^{-1}\)

and adding these values, we get \(0.94 + 0.93 = 1.87\text{.}\) Rounding off, we get:

- \(1900\) represents \(1.9 \times 10^{0}\)

So we see that starting with two values expressed to the nearest \(\sfrac{1}{100}\text{,}\) their sum is accurate only to the nearest \(\sfrac{1}{10}\text{.}\)
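These two additions can be reproduced with Python's `decimal` module by limiting the precision to two significant digits. This is a sketch of the rounding step only: the names `TOY` and `toy_add` are hypothetical, and round-half-even is an assumed rounding rule (neither sum above is a tie, so the choice does not affect these results).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that keeps two significant digits, like our four-digit scheme.
TOY = Context(prec=2, rounding=ROUND_HALF_EVEN)

def toy_add(a, b):
    """Add two decimal strings, rounding the sum to two significant digits."""
    return TOY.add(Decimal(a), Decimal(b))

first = toy_add("0.91", "0.93")    # exact sum is 1.84
second = toy_add("0.94", "0.93")   # exact sum is 1.87

print(first, Decimal("1.84") - first)     # 1.8 0.04
print(second, Decimal("1.87") - second)   # 1.9 -0.03
```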

To compare this with fixed point arithmetic, we could use the same four digits to store \(0.93\) as in Figure 16.4.2.

It is clear that this storage scheme allows us to perform both additions, \((0.91 + 0.93)\) and \((0.94 + 0.93)\text{,}\) and store the results exactly.
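A fixed point version can be sketched by storing each value as an integer count of hundredths. The layout assumed here, two integer digits followed by two fractional digits, is an assumption made for this example, and the helper names are hypothetical.

```python
def to_hundredths(text):
    """Convert a string such as '0.93' to an integer number of hundredths."""
    whole, _, fraction = text.partition(".")
    fraction = (fraction + "00")[:2]    # pad or trim to two fractional digits
    return int(whole) * 100 + int(fraction)

def as_digits(hundredths):
    """Show the four stored digits; the point sits after the second digit."""
    return f"{hundredths:04d}"

sum_a = to_hundredths("0.91") + to_hundredths("0.93")
sum_b = to_hundredths("0.94") + to_hundredths("0.93")

print(as_digits(sum_a))   # 0184, read as 01.84
print(as_digits(sum_b))   # 0187, read as 01.87
```

Both sums are stored exactly because the addition is ordinary integer addition; the price is that this layout can only hold values from \(0.00\) to \(99.99\text{.}\)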

These *round off errors* must be taken into account when performing floating point arithmetic. In particular, the errors can occur in intermediate results when doing even moderately complex computations, where they are very difficult to detect.