Specific floating point formats involve trade-offs between resolution, round-off errors, size, and range. The most commonly used formats are those defined by the IEEE 754 standard. They range in size from four to sixteen bytes. The most common sizes used in C/C++ are floats (4 bytes) and doubles (8 bytes). The ARM supports both sizes.

In the IEEE 754 4-byte format, one bit is used for the sign, eight for the exponent, and twenty-three for the significand. The IEEE 754 8-byte format specifies one bit for the sign, eleven for the exponent, and fifty-two for the significand.
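The field widths of the 4-byte format can be seen directly by copying a float into a 32-bit integer and masking off each field. The following C code is a minimal sketch, assuming that `float` is the 4-byte IEEE 754 format on the target machine. The names `float_bits`, `float_fields`, and `split_float` are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 4-byte float. */
struct float_fields {
    uint32_t sign;        /* 1 bit */
    uint32_t exponent;    /* 8 bits, exactly as stored */
    uint32_t significand; /* 23 bits, exactly as stored */
};

/* Copy the bytes of x into an integer so the bits can be examined. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    assert(sizeof x == sizeof bits); /* assumption: float is 32 bits */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Separate the sign, exponent, and significand fields with shifts and masks. */
struct float_fields split_float(float x)
{
    uint32_t bits = float_bits(x);
    struct float_fields fields;

    fields.sign = bits >> 31;               /* bit 31 */
    fields.exponent = (bits >> 23) & 0xFFu; /* bits 30 to 23 */
    fields.significand = bits & 0x7FFFFFu;  /* bits 22 to 0 */
    return fields;
}
```

For example, `split_float(-0.15625f)` gives a sign of 1, an exponent field holding 124, and a significand field holding `0x200000`. How the exponent field encodes the exponent is described later in this section.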

The bit patterns for floats and doubles are arranged as shown in Figure 16.4.1.

In this section we describe the 4-byte format in order to save ourselves (hand) computation effort. The goal is to get a feel for the limitations of floating point formats. The normalized form of a floating point number in binary is:

\begin{gather}
N = (-1)^{s} \times 1.f \times 2^{e}\label{eq-ieee}\tag{16.4.1}
\end{gather}

where \(s\) is the sign bit, \(f\) is the 23-bit fractional part of the significand, and \(e\) is the exponent, which is stored in an 8-bit field.

As in decimal, the exponent is adjusted such that there is only one non-zero digit to the left of the binary point. In binary, though, this digit is always one. Since it is always one, it need not be stored. Only the fractional part of the normalized value needs to be stored as the significand. This adds one bit to the significance of the fractional part. The integer part (one) that is not stored is sometimes called the hidden bit.
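As a small illustration (the value \(6.5\) is chosen here only for convenience), normalizing in binary gives:

\begin{gather*}
6.5_{10} = 110.1_{2} = 1.101_{2} \times 2^{2}
\end{gather*}

so \(s = 0\text{,}\) \(e = 2\text{,}\) and the fraction \(f\) is \(101\) followed by twenty zeros. The leading one to the left of the binary point is the hidden bit and is not stored.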

The sign bit, \(s\text{,}\) refers to the number. Another scheme is used to represent the sign of the exponent, \(e\text{.}\) Your first thought is probably to use two's complement. However, the IEEE format was developed in the 1970s, when floating point computations took a lot of CPU time. Many algorithms depend upon only the comparison of two numbers, and the computer scientists of the day realized that a format that allowed integer comparison instructions to be used would result in faster execution times. So they decided to store a biased exponent as an unsigned integer. The bias is approximately one-half the range allocated for the exponent: for a \(k\)-bit exponent field it is \(2^{k-1} - 1\text{.}\) In the case of an 8-bit exponent, the bias amount is \(127\text{,}\) so the value stored in the exponent field is \(e + 127\text{.}\)
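The effect of the bias can be sketched in C. The code below assumes that `float` is the 4-byte IEEE 754 format. The names `float_bits`, `decode_normalized`, and `less_than_by_bits` are invented for this illustration. `decode_normalized` rebuilds a normalized value from its stored fields by removing the bias and restoring the hidden bit. `less_than_by_bits` compares two floats using only an unsigned integer comparison, which is meant here only for non-negative values that are not NaN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXPONENT_BIAS 127 /* bias for the 8-bit exponent field */

/* Copy the bytes of x into an integer so the bits can be examined. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    assert(sizeof x == sizeof bits); /* assumption: float is 32 bits */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Rebuild a normalized value, (-1)^s * 1.f * 2^e, from its bit pattern.
   The special cases (exponent field all zeros or all ones) are not handled. */
double decode_normalized(uint32_t bits)
{
    uint32_t s = bits >> 31;
    int e = (int)((bits >> 23) & 0xFFu) - EXPONENT_BIAS; /* remove the bias */
    uint32_t f = bits & 0x7FFFFFu;
    double value = 1.0 + (double)f / 8388608.0; /* hidden bit + f / 2^23 */
    int i;

    for (i = 0; i < e; i++) /* multiply by 2^e when e is positive */
        value *= 2.0;
    for (i = 0; i > e; i--) /* divide by 2^(-e) when e is negative */
        value /= 2.0;
    return s ? -value : value;
}

/* Compare two non-negative, non-NaN floats with an integer comparison. */
int less_than_by_bits(float a, float b)
{
    return float_bits(a) < float_bits(b);
}
```

For instance, \(0.5\) has an exponent of \(-1\) and \(2.0\) has an exponent of \(1\text{,}\) so their exponent fields hold 126 and 128. The integer comparison therefore orders the two values correctly, which would not be the case if the exponent field held a two's complement number.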

The hidden bit scheme presents a problem: there is no way to represent zero. To address this issue, the IEEE 754 standard has several special cases:

Zero Value

Shown by setting all the bits in the exponent and significand to zero. Notice that this allows for \(-0.0\) and \(+0.0\text{,}\) although \((-0.0 == +0.0)\) computes to true.

Denormalized Value

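Shown by setting all the bits in the exponent to zero, with a non-zero significand. In this case there is no hidden bit: the digit to the left of the binary point is zero, and the exponent is fixed at \(-126\text{.}\) Zero can be thought of as a special case of denormalized in which the significand is also zero.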

Infinity

Shown by setting all the bits in the exponent to one and all the bits in the significand to zero. The sign bit allows for \(-\infty\) and \(+\infty\text{.}\)

NaN

Shown by setting all the bits in the exponent to one, and the significand to non-zero. This is used when the result of an operation is undefined. For example, \(\pm{}\text{nonzero} \div 0.0\) yields infinity, but \(\pm{}0.0 \div \pm{}0.0\) yields NaN.
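The special cases above can be told apart by examining only the exponent and significand fields. The following C code is a minimal sketch, assuming that `float` is the 4-byte IEEE 754 format. The enumeration `float_kind` and the functions `float_bits` and `classify_bits` are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the cases described in this section. */
enum float_kind {
    KIND_ZERO,
    KIND_DENORMALIZED,
    KIND_NORMALIZED,
    KIND_INFINITY,
    KIND_NAN
};

/* Copy the bytes of x into an integer so the bits can be examined. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    assert(sizeof x == sizeof bits); /* assumption: float is 32 bits */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Classify a 32-bit pattern. The sign bit plays no part in the decision. */
enum float_kind classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t significand = bits & 0x7FFFFFu;

    if (exponent == 0)     /* exponent field is all zeros */
        return significand == 0 ? KIND_ZERO : KIND_DENORMALIZED;
    if (exponent == 0xFFu) /* exponent field is all ones */
        return significand == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMALIZED;
}
```

For example, `classify_bits(float_bits(-0.0f))` and `classify_bits(float_bits(0.0f))` both give `KIND_ZERO`, even though the two bit patterns differ in the sign bit.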

Example 16.4.2

Show how \(97.8125\) is stored in 32-bit IEEE 754 binary format.