14 Fractional Numbers

Chapter 14
Fractional Numbers

So far in this book we have used only integers for numerical values. In this chapter you will see two methods for storing fractional values — ﬁxed point and ﬂoating point. Storing numbers in ﬁxed point format requires that the programmer keep track of the location of the binary point¹ within the bits allocated for storage. In the ﬂoating point format, the number is essentially written in scientiﬁc format and both the signiﬁcand² and exponent are stored.

14.1 Fractions in Binary

Before discussing the storage formats, we need to think about how fractional values are stored in binary. The concept is quite simple. We can extend Equation 2.6:

N = dn−1 × 2n−1 + dn−2 × 2n−2 + ...+d1 × 21 + d0 × 20 + d−1 ×2− 1 + d−2 × 2−2 +...

(14.1)

For example,

123.687510 = 1111011.10112

because

d₋₁ × 2⁻¹	= 1× 0.5
d₋₂ × 2⁻²	= 0× 0.25
d₋₃ × 2⁻³	= 1× 0.125
d₋₄ × 2⁻⁴	= 1× 0.0625

and thus

0.1011₂	= 0.5₁₀ + 0.125₁₀ + 0.0625₁₀
	= 0.6875₁₀	(14.2)

See Exercise 14-1 for an algorithm to convert decimal fractions to binary. We assume that you can convert the integral part and that Equation 14.1 is suﬃcient for converting from binary to decimal.

Although any integer can be represented as a sum of powers of two, an exact representation of fractional values in binary is limited to sums of inverse powers of two. For example, consider an 8-bit representation of the fractional value 0.9. From

0.11100110₂	= 0.89843750₁₀
0.11100111₂	= 0.90234375₁₀

we can see that

0.111001102 < 0.910 < 0.111001112

(14.3)

In fact,

----- 0.910 = 0.1110011002

(14.4)

where 1100 means that this bit pattern repeats forever.

Rounding oﬀ fractional values in binary is very simple. If the next bit to the right is one, add one to the bit position where rounding oﬀ. In the above example, we are rounding oﬀ to eight bits. The ninth bit to the right of the binary point is zero, so we do not add one in the eighth bit position. Thus, we use

0.910 = 0.1110 01102

(14.5)

which gives a round oﬀ error of

0.9₁₀ −0.11100110₂	= 0.9₁₀ − 0.8984375₁₀
	= 0.0015625₁₀	(14.6)

We note here that two’s complement also works correctly for storing negative fractional values. You are asked to show this in Exercise 14-2.

14.2 Fixed Point ints

In a ﬁxed point format, the storage area is divided into the integral part and the fractional part. The programmer must keep track of where the binary point is located. For example, we may decide to divide a 32-bit int in half and use the high-order 16 bits for the integral part and the low-order 16 bits for the fractional part.

My bank provides me with printed deposit slips that use this method. There are seven boxes for numerals. There is also a decimal point printed just to the left of the rightmost two boxes. Note that the decimal point does not occupy a box. That is, there are no digits allocated for the decimal point. So the bank assumes up to ﬁve decimal digits for the “dollars” part (rather optimistic), and the rightmost two decimal digits represent the “cents” part. The bank’s printing tells me how they have allocated the digits, but it is my responsibility to keep track of the location of the decimal point when ﬁlling in the digits.

One advantage of a ﬁxed point format is that integer instructions can be used for arithmetic computations. Of course, the programmer must be very careful to keep track of which bits are allocated for the integral part and which for the fractional part. And the range of possible values is restricted by the number of bits.

An example of using ints for ﬁxed point addition is shown in Listing 14.1.

1/*
2 * rulerAdd.c
3 * Adds two ruler measurements, to nearest 1/16th inch.
4 * Bob Plantz - 18 June 2009
5 */
6#include <stdio.h>
7
8int main(void)
9{
10    int x, y, fraction_part, sum;
11
12    printf("Enter first measurement, inches: ");
13    scanf("%d", &x);
14    x = x << 4;         /* shift to integral part of variable */
15    printf("                     sixteenths: ");
16    scanf("%d", &fraction_part);
17    x = x | (0xf & fraction_part);  /* add in fractional part */
18
19    printf("Enter second measurement, inches: ");
20    scanf("%d", &y);
21    y = y << 4;         /* shift to integral part of variable */
22    printf("                      sixteenths: ");
23    scanf("%d", &fraction_part);
24    y = y | (0xf & fraction_part);  /* add in fractional part */
25
26    sum = x + y;
27    printf("Their sum is %d and %d/16 inches\n",
28          (sum >> 4), (sum & 0xf));
29
30    return 0;
31}

Listing 14.1: Fixed point addition. The high-order 28 bits are used for the integral part, the low-order 4 for the fractional part.)

The numbers are input to the nearest 1/16th inch, so the programmer has allocated four bits for the fractional part. This leaves 28 bits for the integral part. After the integral part is read, the stored number must be shifted four bit positions to the left to put it in the high-order 28 bits. Then the fractional part (in number of sixteenths) is added into the low-order four bits with a simple bit-wise or operation. Printing the answer also requires some bit shifting and some masking to ﬁlter out the fractional part.

This is clearly a contrived example. A program using floats would work just as well and be somewhat easier to write. However, the program in Listing 14.1 uses integer instructions, which execute faster than ﬂoating point. The hardware issues have become less signiﬁcant in recent times. Modern CPUs use various parallelization schemes such that a mix of ﬂoating point and integer instructions may actually execute faster than only integer instructions. Fixed point arithmetic is often used in embedded applications where the CPU is small and may not have ﬂoating point capabilities.

14.3 Floating Point Format

The most important concept in this section is that ﬂoating point numbers are not real numbers.³ Real numbers include the continuum of all numbers from −∞ to +∞. You already understand that computers are ﬁnite, so there is certainly a limit on the largest values that can be represented. But the problem is much worse than simply a size limit.

As you will see in this section, ﬂoating point numbers comprise a very small subset of real numbers. There are signiﬁcant gaps between adjacent ﬂoating point numbers. These gaps can produce the following types of errors:

Rounding — the number cannot be exactly represented in ﬂoating point.
Absorption — a very small number gets lost when adding it to a large one.
Cancellation — a very small number gets lost when subtracting it from a large one.

To make matters worse, these errors can occur in intermediate results, where they are very diﬃcult to debug.

The idea behind ﬂoating point formats is to think of numbers written in scientiﬁc format. This notation requires two numbers to completely specify a value — a signiﬁcand and an exponent. To review, a decimal number is written in scientiﬁc notation as a signiﬁcand times ten raised to an exponent. For example,

1,024	= 1.024 × 10³	(14.7)
− 0.000089372	= −8.9372 × 10⁻⁵	(14.8)

Notice that the number is normalized such that only one digit appears to the left of the decimal point. The exponent of 10 is adjusted accordingly.

If we agree that each number is normalized and that we are working in base 10, then each ﬂoating point number is completely speciﬁed by three items:

The signiﬁcand.
The exponent.
The sign.

That is, in the above examples

1024, 3, and + represent 1.024 × 10³ (The “+” is understood.)
89372, -5, and - represent 8.9372 × 10⁻⁵

The advantage of using a ﬂoating point format is that, for a given number of digits, we can represent a larger range of values. To illustrate this, consider a four-digit, unsigned decimal system. The range of integers that could be represented is

0 ≤ N ≤ 9999

Now, let’s allocate two digits for the signiﬁcand and two for the exponent. We will restrict our scheme to unsigned numbers, but we will allow negative exponents. So we will need to use one of the exponent digits to store the sign of the exponent. We will use 0 for positive and 1 for negative. For example, 3.9 × 10⁻⁴ would be stored

PICT

where each box holds one decimal digit. Some other examples are:

1000	represents	1.0 × 10⁰
3702	represents	3.7 × 10²
9316	represents	9.3 × 10⁻⁶

Our normalization scheme requires that there be a single non-zero digit to the left of the decimal point. We should also allow the special case of 0.0:

0000	represents	0.0

A little thought shows that this scheme allows numbers in the range

1.0× 10−9 ≤ N ≤ 9.9× 109

That is, we have increased the range of possible values by a factor of 10¹⁴! However, it is important to realize that in both storage schemes, the integer and the ﬂoating point, we have exactly the same number of possible values — 10⁴.

Although ﬂoating point formats can provide a much greater range of numbers, the distance between any two adjacent numbers depends upon the value of the exponent. Thus, ﬂoating point is generally less accurate than an integer representation, especially for large numbers.

To see how this works, let’s look at a plot of numbers (using our current scheme) in the range

9.0× 10−1 ≤ N ≤ 2.0× 100

PICT

Notice that the larger numbers are further apart than the smaller ones. (See Exercise 14-7 after you read Section 14.4.)

Let us pick some numbers from this range and perform some addition.

9111	represents	9.1 × 10⁻¹
9311	represents	9.3 × 10⁻¹

If we add these values, we get 0.91 + 0.93 = 1.84. Now we need to round oﬀ our “paper” addition in order to ﬁt this result into our current ﬂoating point scheme:

1800	represents	1.8 × 10⁰

On the other hand,

9411	represents	9.4 × 10⁻¹
9311	represents	9.3 × 10⁻¹

and adding these values, we get 0.94 + 0.93 = 1.87. Rounding oﬀ, we get:

1900	represents	1.9 × 10⁰

So we see that starting with two values expressed to the nearest 1/100th, their sum is accurate only to the nearest 1/10.

To compare this with ﬁxed point arithmetic, we could use the same four digits to store 0.93 this way

PICT

It is clear that this storage scheme allows us to perform both additions (0.91 + 0.93 and 0.94 + 0.93) and store the results exactly.

These round oﬀ errors must be taken into account when performing ﬂoating point arithmetic. In particular, the errors can occur in intermediate results when doing even moderately complex computations, where they are very diﬃcult to detect.

14.4 IEEE 754

Speciﬁc ﬂoating point formats involve trade-oﬀs between re4solution, round oﬀ errors, size, and range. The most commonly used formats are the IEEE 754.⁴ They range in size from four to sixteen bytes. The most common sizes used in C/C++ are floats (4 bytes) and doubles (8 bytes). The x86 processor performs ﬂoating point computations using an extended 10-byte format. The results are rounded to 4-byte mode if the programmer uses the float data type or 8-byte mode for the double data type.

In the IEEE 754 4-byte format, one bit is used for the sign, eight for the exponent, and twenty-three for the signiﬁcand. The IEEE 754 8-byte format speciﬁes one bit for the sign, eleven for the exponent, and ﬁfty-two for the signiﬁcand.

In this section we describe the 4-byte format in order to save ourselves (hand) computation eﬀort. The goal is to get a feel for the limitations of ﬂoating point formats. The normalized form of the number in binary is given by Equation 14.9.

N = (− 1)s × 1.f × 2e

(14.9)

where:	s is the sign bit
	f is the 23-bit fractional part
	e is the 8-bit exponent

The bit patterns for floats and doubles are arranged as shown in Figure 14.1.

PICT

Figure 14.1: IEEE 754 bit patterns. (a) Float. (b) Double.

As in decimal, the exponent is adjusted such that there is only one non-zero digit to the left of the binary point. In binary, though, this digit is always one. Since it is always one, it need not be stored. Only the fractional part of the normalized value needs to be stored as the signiﬁcand. This adds one bit to the signiﬁcance of the fractional part. The integer part (one) that is not stored is sometimes called the hidden bit.

The sign bit, s, refers to the number. Another mechanism is used to represent the sign of the exponent, e. Your ﬁrst thought is probably to use two’s complement. However, the IEEE format was developed in the 1970s, when ﬂoating point computations took a lot of CPU time. Many algorithms depend upon only the comparison of two numbers, and the computer scientists of the day realized that a format that allowed integer comparison instructions would result in faster execution times. So they decided to store a biased exponent as an unsigned int. The amount of the bias is one-half the range allocated for the exponent. In the case of an 8-bit exponent, the bias amount is 127.

Example 14-a ____________________________________________________________________________________________________

Show how 97.8125 is stored in 32-bit IEEE 754 binary format.

Solution:

First, convert the number to binary.

97.8125₁₀	= 1100001.1101₂
	= (−1)⁰ ×1100001.1101× 2⁰

Adjust the exponent to obtain the normalized form.

(− 1)0 × 1100001.1101 × 20 = (− 1)0 × 1.1000011101× 26

Compute s, e+127, and f.

s	= 0
e + 127	= 6 + 127
	= 133
	= 10000101₂
f	= 10000111010000000000000

Finally, use Figure 14.1 to place the bit patterns. (Remember that the hidden bit is not stored; it is understood to be there.)

97.8125	is stored as	0 10000101 10000111010000000000000₂
	or	42c3a000₁₆

_______________________________________________________________________________□

Example 14-b ____________________________________________________________________________________________________

Using IEEE 754 32-bit format, what decimal number does the bit pattern 3e400000₁₆ represent?

Solution:

First, convert the hexadecimal to binary, using spaces suggested by Figure 14.1.

3e400000 = 0 01111100 10000000000000000000000 16 2

Now compute the values of s, e, and f.

s	= 0
e + 127	= 01111100₂
	= 124₁₀
e	= −3₁₀
f	= 10000000000000000000000

Finally, plug these values into Equation 14.9. (Remember to add the hidden bit.)

(−1)⁰ ×1.100…00× 2⁻³	= (−1)⁰ ×0.0011× 2⁰
	= 0.1875₁₀

__________________________________________________________________________________________________________________________________________________________□

Example 14-c ____________________________________________________________________________________________________

Using IEEE 754 32-bit format, what decimal number would the bit pattern 00000000₁₆ represent? (The speciﬁcation states that it is an exception to the format and is deﬁned to represent 0.0. This example provides some motivation for this exception.)

Solution:

The conversion to binary is trivial. Computing the values of s, e, and f.

s	= 0
e + 127	= 00000000₂
e	= −127₁₀
f	= 00000000000000000000000

Finally, plug these values into Equation 14.9. (Remember to add the hidden bit.)

(− 1)0 × 1.00...00× 2−127 = very small number

__________________________________________________________________________________□

This last example illustrates a problem with the hidden bit — there is no way to represent zero. To address this issue, the IEEE 754 standard has several special cases.

Zero — all the bits in the exponent and signiﬁcand are zero. Notice that this allows for -0.0 and +0.0, although (-0.0 == +0.0) computes to true.
Denormalized — all the bits in the exponent are zero. In this case there is no hidden bit. Zero can be thought of as a special case of denormalized.
Inﬁnity — all the bits in the exponent are one, and all the bits in the signiﬁcand are zero. The sign bit allows for −∞ and +∞.
NaN — all the bits in the exponent are one, and the signiﬁcand is non-zero. This is used when the results of an operation are undeﬁned. For example, ±nonzero ÷ 0.0 yields inﬁnity, but ±0.0 ÷ ±0.0 yields NaN.

14.5 Floating Point Hardware

Until the introduction of the Intel 486DX in4 April 1989, the x87 Floating Point Unit was on a separate chip. It is now included on the CPU chip although it uses a somewhat diﬀerent execution architecture than the Integer Unit in the CPU.

In 1997 Intel added MMX™(Multimedia Extensions) to their processors, which includes instructions that process multiple data values with a single instruction (SIMD). Operations on single data items are called scalar operations, while those on multiple data items in parallel are called vector operations. Vector operations are useful for many multi-media and scientiﬁc applications. In this book we will discuss only scalar ﬂoating point operations. Originally, MMX only performed integer computations. But in 1998 AMD added the 3DNow!™extension to MMX, which includes ﬂoating point instructions. Intel soon followed suit.

Intel then introduced SSE (Streaming SIMD Extension) on the Pentium III in 1999. Several versions have evolved over the years — SSE, SSE2, SSE3, and SSE4A — as of this writing. There are instructions for performing both integer and ﬂoating point operations on both scalar and vector values.

The x86-64 architecture includes three sets of instructions for working with ﬂoating point values:

SSE2 instructions operate on 32-bit or 64-bit values.⁵ Four 32-bit values or two 64-bit values can be processed simultaneously.
x87 Floating Point Unit instructions operate on 80-bit values.
3DNow! instructions operate on two 32-bit values.

All three ﬂoating point instruction sets include a wide variety of instructions to perform the following operations:

Move data from memory to a register, from a register to memory, and from a register to another register.
Convert data from integer to ﬂoating point, and from ﬂoating point to integer formats.
Perform the usual add, subtract, multiply, and divide arithmetic operations. They also provide square root instructions.
Compare two values.
Perform the usual and, or, and xor logical operations.

In addition, the x87 includes instructions for transcendental functions — sine, cosine, tangent, and arc tangent, and logarithm functions.

We will not cover all the instructions in this book. The following subsections provide an introduction to how each of the three sets of instructions is used. See the manuals [2] – [6] and [14] – [18] for details.

14.5.1 SSE2 Floating Point

Most of the SSE2 instructions operate on multiple data items simultaneously — single instruction, multiple data (SIMD). There are SSE2 instructions for both integer and ﬂoating point operations. Integer instructions operate on up to sixteen 8-bit, eight 16-bit, four 32-bit, two 64-bit, or one 128-bit integers at a time. Vector ﬂoating point instructions operate on all four 32-bit or both 64-bit ﬂoats in a register simultaneously. Each data item is treated independently. These instructions are useful for algorithms that do things like process arrays. One SSE2 instruction can operate on several array elements in parallel, resulting in considerable speed gains. Such algorithms are common in multi-media and scientiﬁc applications.

In this book we will only consider some of the scalar ﬂoating-point instructions, which operate on only single data items. The SSE2 instructions are the preferred ﬂoating-point implementation in 64-bit mode. These instructions operate on either 32-bit (single-precision) or 64-bit (double-precision) values. The scalar instructions operate on only the low-order portion of the 128-bit xmm registers, with the high-order 64 or 96 bits remaining unchanged.

SSE includes a 32-bit MXCSR register that has ﬂags for handling ﬂoating-point arithmetic errors. These ﬂags are shown in Table 14.1.

bits	mnemonic	meaning	default

31 – 18	–	reserved
17	MM	Misaligned Exception Mask	0
16	–	reserved
15	FZ	Flush-toZero for Masked Underﬂow	0
14 – 13	RC	Floating-Point Rounding Control	00
12	PM	Precision Exception Mask	1
11	UM	Underﬂow Exception Mask	1
10	OM	Overﬂow Exception Mask	1
9	ZM	Zero-Divide Exception Mask	1
8	DM	Denormalized-Operand Exception Mask	1
7	IM	Invalid-Operation Exception Mask	1
6	DAZ	Denormals Are Zero	0
5	PE	Precision Exception	0
4	UE	Underﬂow Exception 4	0
3	OE	Overﬂow Exception	0
2	ZE	Zero-Divide Exception	1
1	DE	Denormalized-Operand Exce4ption	0
0	IE	Invalid-Operation Exception	0

Table 14.1: MXCSR status register.

SSE instructions that perform arithmetic operations and the SSE compare instructions also aﬀect the status ﬂags in the rflags register. Thus the regular conditional jump instructions (Section 10.1.2, page 676) are used to control program ﬂow based on ﬂoating-point computations.

The instruction mnemonics used by the gnu assembler are mostly the same as given in the manuals, [2] – [6] and [14] – [18]. Since they are quite descriptive with respect to operand sizes, a size letter is not appended to the mnemonic, except when one of the operands is in memory and the size is ambiguous. Of course, the operand order used by the gnu assembler is still reversed compared to the manufacturers’ manuals, and the register names are preﬁxed with the “%” character.

A very important set of instructions provided for working with ﬂoating point values are those to convert between integer and ﬂoating point formats. The scalar conversion SSE2 instructions are shown in Table 14.2.

mnemonic	source	destination	meaning

cvtsd2si	xmm reg.	32-bit integer	convert scalar 64-bit
	or mem.	reg.	ﬂoat to 32-bit integer

cvtsd2ss	xmm reg.	xmm reg. or	convert scalar 64-bit
		mem.	ﬂoat to 32-bit ﬂoat

cvtsi2sd	integer reg.	xmm reg.	convert 32-bit integer
	or mem.		to scalar 64-bit ﬂoat

cvtsi2sdq	integer reg.	xmm reg.	convert 64-bit integer
	or mem.		to scalar 64-bit ﬂoat

cvtsi2ss	integer reg.	xmm reg.	convert 32-bit integer
	or mem.		to scalar 32-bit ﬂoat

cvtsi2ssq	integer reg.	xmm reg.	convert 64-bit integer
	or mem.		to scalar 32-bit ﬂoat

cvtss2sd	xmm reg.	another xmm reg.	convert scalar 32-bit
		or mem.	ﬂoat to 64-bit ﬂoat

cvtss2si	xmm reg.	32-bit integer	convert scalar 32-bit
	or mem.	reg.	ﬂoat to 32-bit integer

cvtss2siq	xmm reg.	64-bit integer	convert scalar 32-bit
	or mem.	reg.	ﬂoat to 64-bit integer

Table 14.2: SSE scalar ﬂoating point conversion instructions. Source and destination xmm registers must be diﬀerent. The low-order portion of the xmm register is used. When reading from or writing to memory, the “q” suﬃx is used to designate a 64-bit value.

Data movement and arithmetic instructions must distinguish between scalar and vector operations on values in the xmm registers. The low-order portion of the register is used for scalar operations. Vector operations are performed on multiple data values packed into a single register. See Table 14.3 for a sampling of SSE2 data movement and arithmetic instructions.

mnemonic	source	destination	meaning

addps	xmm reg. or mem.	xmm reg.	add packed 32-bit ﬂoats

addpd	xmm reg. or mem.	xmm reg.	add packed 64-bit ﬂoats

addss	xmm reg. or mem.	xmm reg.	add scalar 32-bit ﬂoats

addsd	xmm reg. or mem.	xmm reg.	add scalar 64-bit ﬂoats

divps	xmm reg. or mem.	xmm reg.	divide packed 32-bit ﬂoats

divpd	xmm reg. or mem.	xmm reg.	divide packed 64-bit ﬂoats

divss	xmm reg. or mem.	xmm reg.	divide scalar 32-bit ﬂoats

divsd	xmm reg. or mem.	xmm reg.	divide scalar 64-bit ﬂoats

movss	xmm reg. or mem.	xmm reg.	move scalar 32-bit ﬂoat

movss	xmm reg.	xmm reg. or mem.	move scalar 32-bit ﬂoat

movsd	xmm reg. or mem.	xmm reg.	move scalar 64-bit ﬂoat

movsd	xmm reg.	xmm reg. or mem.	move scalar 64-bit ﬂoat

mulps	xmm reg. or mem.	xmm reg.	multiply packed 32-bit ﬂoats

mulpd	xmm reg. or mem.	xmm reg.	multiply packed 64-bit ﬂoats

mulss	xmm reg. or mem.	xmm reg.	multiply scalar 32-bit ﬂoats

mulsd	xmm reg. or mem.	xmm reg.	multiply scalar 64-bit ﬂoats

subps	xmm reg. or mem.	xmm reg.	subtract packed 32-bit ﬂoats

subpd	xmm reg. or mem.	xmm reg.	subtract packed 64-bit ﬂoats

subss	xmm reg. or mem.	xmm reg.	subtract scalar 32-bit ﬂoats

subsd	xmm reg. or mem.	xmm reg.	subtract scalar 64-bit ﬂoats

Table 14.3: Some SSE ﬂoating point arithmetic and data movement instructions. Source and destination xmm registers must be diﬀerent. Scalar instructions use the low-order portion of the xmm registers.

Notice that the code for the basic operation is followed by a “p” or “s” for “packed” or “scalar.” This character is then followed by a “d” or “s” for “double” (64-bit) or “single” (32-bit) data item.

We will use the program in Listing 14.2 to illustrate a few ﬂoating point operations.

1/*
2 * frac2float.c
3 * Converts fraction to floating point.
4 * Bob Plantz - 18 June 2009
5 */
6
7#include <stdio.h>
8
9int main(void)
10{
11    int x, y;
12    double z;
13
14    printf("Enter two integers: ");
15    scanf("%i %i", &x, &y);
16    z = (double)x / y;
17    printf("%i / %i = %lf\n", x, y, z);
18    return 0;
19 }

Listing 14.2: Converting a fraction to a ﬂoat.

Compiling this program in 64-bit mode produced the assembly language in Listing 14.3.

1        .file  "frac2float.c"
2        .section      .rodata
3.LC0:
4        .string"Enter two integers: "
5.LC1:
6        .string"%i %i"
7.LC2:
8        .string"%i / %i = %lf\n"
9        .text
10        .globlmain
11        .type  main, @function
12main:
13        pushq  %rbp
14        movq  %rsp, %rbp
15        subq  $32, %rsp
16        movl  $.LC0, %edi
17        movl  $0, %eax
18        call  printf
19        leaq  -12(%rbp), %rdx     # address of y
20        leaq  -16(%rbp), %rax     # address of x
21        movq  %rax, %rsi
22        movl  $.LC1, %edi
23        movl  $0, %eax            # no xmm arguments
24        call  __isoc99_scanf
25        movl  -16(%rbp), %eax     # load x
26        cvtsi2sd      %eax, %xmm0 # convert to double
27        movl  -12(%rbp), %eax     # load y
28        cvtsi2sd      %eax, %xmm1 # convert to double
29        divsd  %xmm1, %xmm0        # (double)x / (double)y
30        movsd  %xmm0, -8(%rbp)     # store z
31        movl  -12(%rbp), %edx     # load y
32        movl  -16(%rbp), %ecx     # load x
33        movq  -8(%rbp), %rax
34        movq  %rax, -24(%rbp)
35        movsd  -24(%rbp), %xmm0    # load z
36        movl  %ecx, %esi          # x to correct register
37        movl  $.LC2, %edi
38        movl  $1, %eax            # one xmm argument (in xmm0)
39        call  printf
40        movl  $0, %eax
41        leave
42        ret
43        .size  main, .-main
44        .ident"GCC: (Ubuntu/Linaro 4.7.0-7ubuntu3) 4.7.0"
45        .section      .note.GNU-stack,"",@progbits

Listing 14.3: Converting a fraction to a ﬂoat (gcc assembly language, 64-bit).

Before the division is performed, both integers must be converted to ﬂoating point. This takes place on lines 25 – 28:

25        movl  -16(%rbp), %eax     # load x
26        cvtsi2sd      %eax, %xmm0 # convert to double
27        movl  -12(%rbp), %eax     # load y
28        cvtsi2sd      %eax, %xmm1 # convert to double

The cvtsi2sd instruction on lines 26 and 28 converts a signed integer to a scalar double-precision ﬂoating point value. The signed integer can be either 32 or 64 bits and can be located in a general purpose register or in memory. The double-precision ﬂoat will be stored in the low-order 64 bits of the speciﬁed xmmn register. The high-order 64 bits to the xmmn register are not changed.

The division on line 29 leaves the result in the low-order 64 bits of xmm0, which is then stored in z:

29 divsd %xmm1, %xmm0 # (double)x / (double)y
30 movsd %xmm0, -8(%rbp) # store z

The movapd instruction moves the entire 128 bits, and the movsd instruction moves only the low-order 64 bits.

The ﬂoating point arguments are passed in the registers xmm0, xmm1, …, xmm15 in left-to-right order. So the value of z is loaded into the (xmm0) register for passing to the printf function, and the number of ﬂoating point values passed to it must be stored in eax:

35        movsd  -24(%rbp), %xmm0    # load z
36        movl  %ecx, %esi          # x to correct register
37        movl  $.LC2, %edi
38        movl  $1, %eax            # one xmm argument (in xmm0)
39        call  printf

14.5.2 x87 Floating Point Unit

The x87 FPU has eight 80-bit data registers, its own status register, and its own stack pointer. Floating point values are stored in the ﬂoating point data registers in an extended format.:

bit 79 is the sign bit: 0 for positive, 1 for negative.
bits 78 – 64 are for the exponent: 2’s complement, biased by 16383.
bit 63 is the integer: 1 for normalized.
bits 62 – 0 are for the fraction.

So there are 64 bits (63 – 0) for the signiﬁcand. Since there is no hidden bit in the extended format, one of these bits, bit 63, is required for the integer part.

Example 14-d ____________________________________________________________________________________________________

Show how 97.8125 is stored in 80-bit extended IEEE 754 binary format.

Solution:

First, convert the number to binary.

97.8125₁₀	= 1100001.1101₂
	(−1)⁰ ×1100001.1101× 2⁰

Adjust the exponent to obtain the normalized form.

(− 1)0 × 1100001.1101 × 20 = (− 1)0 × 1.1000011101× 26

Compute s, e+16383.

s	= 0
e + 16383	= 6 + 16383
	= 16389
	= 100000000000101₂

Filling in the bit patterns as speciﬁed above:

97.8125	is stored as	0 100000000000101 1 1000011101000000000000…0₂
	or	4005c3a0000000000000₁₆

Compare this with the 32-bit format in Example 14-a above.

__________________________________________________________________________________□

The 16-bit Floating Point Unit Status Word register shows the results of ﬂoating point operations. The meaning of each bit is shown in Table 14.4.

bit number	mnemonic	meaning
0	IE	invalid operation
1	DE	denormalized operation
2	ZE	zero divide
3	OE	overﬂow
4	UE	underﬂow
5	PE	precision
6	SF	stack fault
7	ES	error summary status
8	C0	condition code 0
9	C1	condition code 1
10	C2	condition code 2
11 – 13	TOP	top of stack
14	C3	condition code 3
15	B	FPU busy

Table 14.4: x87 Status Word.

Figure 14.2 shows a pictorial representation of the x87 ﬂoating point registers. The absolute locations are named fpr0, fpr1,…,fpr7 in this ﬁgure.

PICT

Figure 14.2: x87 ﬂoating point register stack. The fpri represent the absolute locations. The st(j) are the stack names, which are used by the instructions. In this example the top of the stack is at fpr3, as shown in bits 13 – 11 of the x87 status register.

The ﬂoating point registers are accessed by program instructions as a stack with st(0) being the register at the top of the stack. It “grows” from higher number registers to lower. The TOP ﬁeld (bits 13 – 11) in the FPU Status Word holds the (absolute) register number that is currently the top of the stack. If the stack is full, i.e., fpr0 is the top of the stack, a push causes the TOP ﬁeld to roll over, and the next item goes into register fpr7. (The value that was in fpr7 is lost.)

The instructions that read data from memory automatically push the value onto the top of the register stack. Arithmetic instructions are provided that operate on the value(s) on the top of the stack. For example, the faddp instruction adds the two values on the top of the stack and leaves their sum on the top. The stack has one less value on it. The original two values are gone.

Many ﬂoating point instructions allow the programmer to access any of the ﬂoating point registers, %st(i), where i = 0…7, relative to the top of the stack. Fortunately, the programmer does not need to keep track of where the top is. When using this format, %st(i) refers to the ith register from the top of the stack. For example, if fpr3 is the current top of the stack, the instruction

fadd %st(2), %st(0)

will add the value in the fpr5 register to the value in the fpr3 register, leaving the result in the fpr3 register.

Table 14.5 provides some examples of the ﬂoating point instruction set. Notice that the instructions that deal only with the ﬂoating point register stack do not use the size suﬃx letter, s.

mnemonic	source	destination	meaning

fadds	memﬂoat		add memﬂoat to st(0)

faddp			add st(0) to st(1) and pop register stack

fchs			change sign of st(0)

fcoms	memﬂoat		compare st(0) with memﬂoat

fcomp			compare st(0) with st(1) and pop register stack

fcos			replace st(0) with its cosine

fdivs	memﬂoat		divide st(0) by memﬂoat

fdivp			divide st(0) by st(1), store result in st(1), and pop register stack

filds	memint		convert integer at memint to 80-bit ﬂoat and push onto register stack

fists		memint	convert 80-bit ﬂoat at st(0) to int and store at memint

flds	memﬂoat		convert ﬂoat at memint to 80-bit ﬂoat and push onto register stack

fmuls	memﬂoat		multiply st(0) by memﬂoat

fmulp			multiply st(0) by st(1), store result in st(1), and pop register stack

fsin			replace st(0) with its sine

fsqrt			replace st(0) with its square root

fsts		memint	convert 80-bit ﬂoat at st(0) to s size ﬂoat and store at memint

fsubs	memﬂoat		subtract memﬂoat from st(0)

fsubp			subtract st(0) from st(1) and pop register stack


s = s, l, t

Table 14.5: A sampling of x87 ﬂoating point instructions. Size characters are: s = 32-bit, l = 64-bit, t = 80-bit.

To avoid ambiguity the gnu assembler requires a single letter suﬃx on the ﬂoating point instructions that access memory. The suﬃxes are:

’s’	for single precision – 32-bit
’l’	for long (or double) precision – 64-bit
’t’	for ten-byte – 80-bit

Most of the ﬂoating point instructions have several variants. See [2] – [6] and [14] – [18] for details. In general,

Data cannot be moved directly between the integer and ﬂoating point registers. Only data stored in memory or another ﬂoating point register can be pushed onto the ﬂoating point register stack.
st(0) is always involved when performing ﬂoating point arithmetic.
Many ﬂoating point instructions have a pop variant. The mnemonic includes a ‘p’ after the basic mnemonic, immediately before the size character. For example,
fistl someplace(%ebp)

converts the 80-bit ﬂoating point number in st(0) to a 32-bit integer and stores it at the speciﬁed memory location. Using the pop variant,
fistpl someplace(%ebp)

does the same thing but also pops one from the ﬂoating point register stack.

Compiling the fraction conversion program of Listing 14.2 in 32-bit mode shows (Listing 14.4) that the compiler uses the x87 ﬂoating-point instructions. This ensures backward compatibility since the x86-32 architecture does not need to include SSE instructions.

1        .file  "frac2float.c"
2        .section      .rodata
3.LC0:
4        .string"Enter two integers: "
5.LC1:
6        .string"%i %i"
7.LC2:
8        .string"%i / %i = %lf\n"
9        .text
10        .globlmain
11        .type  main, @function
12main:
13        pushl  %ebp
14        movl  %esp, %ebp
15        andl  $-16, %esp
16        subl  $64, %esp
17        movl  $.LC0, (%esp)
18        call  printf
19        leal  52(%esp), %eax   # address of x
20        movl  %eax, 8(%esp)
21        leal  48(%esp), %eax   # address of y
22        movl  %eax, 4(%esp)
23        movl  $.LC1, (%esp)
24        call  __isoc99_scanf
25        movl  48(%esp), %eax   # load y
26        movl  %eax, 44(%esp)   # needs to be in memory to
27        fildl  44(%esp)         # convert to 80-bit float and push
28        movl  52(%esp), %eax   # load x
29        movl  %eax, 44(%esp)
30        fildl  44(%esp)         # convert to 80-bit float and push
31        fdivrp%st, %st(1)       # (long double)x / (long double) y
32        fstpl  56(%esp)         # store z
33        movl  52(%esp), %edx   # load x
34        movl  48(%esp), %eax   # load y
35        fldl  56(%esp)         # load z
36        fstpl  12(%esp)         # put z on call stack
37        movl  %edx, 8(%esp)    # put x on call stack
38        movl  %eax, 4(%esp)    # put y on call stack
39        movl  $.LC2, (%esp)
40        call  printf
41        movl  $0, %eax
42        leave
43        ret
44        .size  main, .-main
45        .ident"GCC: (Ubuntu/Linaro 4.7.0-7ubuntu3) 4.7.0"
46        .section      .note.GNU-stack,"",@progbits

Listing 14.4: Converting a fraction to a ﬂoat (gcc assembly language, 32-bit).

We add comments to lines 19 and 21 to show where the x and y variables are located in the stack frame.

19        leal  52(%esp), %eax   # address of x
20        movl  %eax, 8(%esp)
21        leal  48(%esp), %eax   # address of y
22        movl  %eax, 4(%esp)
23        movl  $.LC1, (%esp)
24        call  __isoc99_scanf

Rather than actually push the arguments onto the stack, enough space was allocated on the stack (line 15) to directly store the values in the location where they would be if they had been pushed there. This is more eﬃcient that pushing each argument.

Casting an int to a float requires a conversion in the storage format. This conversion is done by the x87 FPU as an integer is pushed onto the ﬂoating point register stack using the fildl instruction. This conversion can only be done to an integer that is stored in memory. The compiler uses a location on the call stack to temporarily store each integer so it can be converted:

25        movl  48(%esp), %eax   # load y
26        movl  %eax, 44(%esp)   # needs to be in memory to
27        fildl  44(%esp)         # convert to 80-bit float and push
28        movl  52(%esp), %eax   # load x
29        movl  %eax, 44(%esp)
30        fildl  44(%esp)         # convert to 80-bit float and push

Each of the fildl instructions on lines 27 and 30 converts a 32-bit integer to an 80-bit ﬂoat and pushes the ﬂoat onto the x87 register stack. At this point the ﬂoating-point equivalent of x is at the top of the stack, and the ﬂoating-point equivalent of y is immediately below it. Then the ﬂoating-point division instruction:

31 fdivrp%st, %st(1) # (long double)x / (long double) y

divides the number at st(0) (the (0) can be omitted) by the number at st(1) and pops the x87 register stack so that the result is now at the top of the stack.

Finally, the fstpl instruction is used to pop the value oﬀ the top of the x87 register stack and store it in memory — at its proper location on the call stack. The “l” suﬃx indicates that 64 bits of memory should be used for storing the ﬂoating-point value. So the 80-bit value on the top of the x87 register stack is rounded to 64 bits as it is stored in memory. The other three arguments are also stored on the call stack.

32        fstpl  56(%esp)         # store z
33        movl  52(%esp), %edx   # load x
34        movl  48(%esp), %eax   # load y
35        fldl  56(%esp)         # load z
36        fstpl  12(%esp)         # put z on call stack
37        movl  %edx, 8(%esp)    # put x on call stack
38        movl  %eax, 4(%esp)    # put y on call stack

Note that the 32-bit version of printf does not receive arguments in registers, so eax is not used.

14.5.3 3DNow! Floating Point

The 3DNow! instructions use the low-order 64 bits in the same physical registers at the x87. These 64-bit portions are named mmx0, mmx1,…, mmx7. They are used as ﬁxed register, not in a stack conﬁguration. Execution of a 3DNow! instruction changes the TOP ﬁeld (bits 13 – 11) in the MXCSR status register, so the top of stack is lost for any subsequent x87 instructions. The bottom line is that x87 and 3DNow! instructions cannot be used simultaneously.

Another limitation of the 3DNow! instructions set is that it only handles 32-bit ﬂoating point.

These limitations of the 3DNow! instruction set make it essentially obsolete in the x86-64 architecture, so it will not be discussed further in this book.

14.6 Comments About Numerical Accuracy

Beginning programmers often see ﬂoating point arithmetic as more accurate than integer. It is true that even adding two very large integers can cause overﬂow. Multiplication makes it even more likely that the result will be very large and, thus, overﬂow. And when used with two integers, the / operator in C/C++ causes the fractional part to be lost. However, as you have seen in this chapter, ﬂoating point representations have their own set of inaccuracies.

Arithmetically accurate results require a thorough analysis of your algorithm. Some points to consider:

Try to scale the data such that integer arithmetic can be used.
All ﬂoating point computations are performed in 80-bit extended format. So there is no processing speed improvement from using floats instead of doubles.
Try to arrange the order of computations so that similarly sized numbers are added or subtracted.
Avoid complex arithmetic statements, which may obscure incorrect intermediate results.
Choose test data that “stresses” your algorithm. For example, 0.00390625 can be stored exactly in eight bits, but 0.1 has no exact binary equivalent.

14.7 Instructions Introduced Thus Far

This summary shows the assembly language instructions introduced thus far in the book. The page number where the instruction is explained in more detail, which may be in a subsequent chapter, is also given. This book provides only an introduction to the usage of each instruction. You need to consult the manuals ([2] – [6], [14] – [18]) in order to learn all the possible uses of the instructions.

14.7.1 Instructions

data movement:
opcode	source	destination	action	page

cbtw			convert byte to word, al → ax	696

cwtl			convert word to long, ax → eax	696

cltq			convert long to quad, eax → rax	696

cwtd			convert word to long, ax → dx:ax	786

cltd			convert long to quad, eax → edx:eax	786

cqto			convert quad to octuple, rax → rdx:rax	786

cmovcc	%reg/mem	%reg	conditional move	706

movs	$imm/%reg	%reg/mem	move	506

movs	mem	%reg	move	506

movsss	$imm/%reg	%reg/mem	move, sign extend	693

movzss	$imm/%reg	%reg/mem	move, zero extend	693

popw		%reg/mem	pop from stack	566

pushw	$imm/%reg/mem		push onto stack	566


s = b, w, l, q; w = l, q; cc = condition codes

arithmetic/logic:
opcode	source	destination	action	page

adds	$imm/%reg	%reg/mem	add	607

adds	mem	%reg	add	607

ands	$imm/%reg	%reg/mem	bit-wise and	747

ands	mem	%reg	bit-wise and	747

cmps	$imm/%reg	%reg/mem	compare	676

cmps	mem	%reg	compare	676

decs	%reg/mem		decrement	699

divs	%reg/mem		unsigned divide	777

idivs	%reg/mem		signed divide	784

imuls	%reg/mem		signed multiply	775

incs	%reg/mem		increment	698

leaw	mem	%reg	load eﬀective address	579

muls	%reg/mem		unsigned multiply	769

negs	%reg/mem		negate	789

ors	$imm/%reg	%reg/mem	bit-wise inclusive or	747

ors	mem	%reg	bit-wise inclusive or	747

sals	$imm/%cl	%reg/mem	shift arithmetic left	756

sars	$imm/%cl	%reg/mem	shift arithmetic right	751

shls	$imm/%cl	%reg/mem	shift left	756

shrs	$imm/%cl	%reg/mem	shift right	751

subs	$imm/%reg	%reg/mem	subtract	612

subs	mem	%reg	subtract	612

tests	$imm/%reg	%reg/mem	test bits	676

tests	mem	%reg	test bits	676

xors	$imm/%reg	%reg/mem	bit-wise exclusive or	747

xors	mem	%reg	bit-wise exclusive or	747


s = b, w, l, q; w = l, q

program ﬂow control:
opcode	location	action	page

call	label	call function	546

ja	label	jump above (unsigned)	683

jae	label	jump above/equal (unsigned)	683

jb	label	jump below (unsigned)	683

jbe	label	jump below/equal (unsigned)	683

je	label	jump equal	679

jg	label	jump greater than (signed)	686

jge	label	jump greater than/equal (signed)	686

jl	label	jump less than (signed)	686

jle	label	jump less than/equal (signed)	686

jmp	label	jump	691

jne	label	jump not equal	679

jno	label	jump no overﬂow	679

jcc	label	jump on condition codes	679

leave		undo stack frame	580

ret		return from function	583

syscall		call kernel function	587


cc = condition codes

x87 ﬂoating point:
opcode	source	destination	action	page

fadds	memﬂoat		add	859

faddp			add/pop	859

fchs			change sign	859

fcoms	memﬂoat		compare	859

fcomp			compare/pop	859

fcos			cosine	859

fdivs	memﬂoat		divide	859

fdivp		divide/pop	859

filds	memint		load integer	859

fists		memint	store integer	859

flds	memﬂoat		load ﬂoating point	859

fmuls	memﬂoat		multiply	859

fmulp			multiply/pop	859

fsin			sine	859

fsqrt			square root	859

fsts		memint	ﬂoating point store	859

fsubs	memﬂoat		subtract	859

fsubp			subtract/pop	859


s = s, l, t

SSE ﬂoating point conversion:
opcode	source	destination	action	page

cvtsd2si	%xmmreg/mem	%reg	scalar double to signed integer	845

cvtsd2ss	%xmmreg	%xmmreg/%reg	scalar double to single ﬂoat	845

cvtsi2sd	%reg	%xmmreg/mem	signed integer to scalar double	845

cvtsi2sdq	%reg	%xmmreg/mem	signed integer to scalar double	845

cvtsi2ss	%reg	%xmmreg/mem	signed integer to scalar single	845

cvtsi2ssq	%reg	%xmmreg/mem	signed integer to scalar single	845

cvtss2sd	%xmmreg	%xmmreg/mem	scalar single to scalar double	845

cvtss2si	%xmmreg/mem	%reg	scalar single to signed integer	845

cvtss2siq	%xmmreg/mem	%reg	scalar single to signed integer	845

14.7.2 Addressing Modes

__________________________________________________________

register direct:	The data value is located in a CPU register.
	syntax: name of the register with a “%” preﬁx.
	example: movl %eax, %ebx

immediate data:	The data value is located immediately after the instruction. Source operand only.
	syntax: data value with a “$” preﬁx.
	example: movl $0xabcd1234, %ebx

base register plus oﬀset:	The data value is located in memory. The address of the memory location is the sum of a value in a base register plus an oﬀset value.
	syntax: use the name of the register with parentheses around the name and the oﬀset value immediately before the left parenthesis.
	example: movl $0xaabbccdd, 12(%eax)

rip-relative:	The target is a memory address determined by adding an oﬀset to the current address in the rip register.
	syntax: a programmer-deﬁned label
	example: je somePlace

indexed:	The data value is located in memory. The address of the memory location is the sum of the value in the base_register plus scale times the value in the index_register, plus the oﬀset.
	syntax: place parentheses around the comma separated list (base_register, index_register, scale) and preface it with the oﬀset.
	example: movl $0x6789cdef, -16(%edx, %eax, 4)

14.8 Exercises

14-1

(§14.1) Develop an algorithm for converting decimal fractions to binary. Hint: Multiply both sides of Equation 14.1 by two.

14-2

(§14.1) Show that two’s complement works correctly for fractional values. What is the decimal range of 8-bit, two’s complement fractional values? Hint: +0.5 does not exist, but -0.5 does.

14-3

(§14.3) Copy the following program and run it:

1/*
2 * exer14_3.c
3 * Use float for Loop Control Variable?
4 * Bob Plantz - 18 June 2009
5 */
6
7#include <stdio.h>
8
9int main()
10{
11    float number;
12    int counter = 20;
13
14    number = 0.5;
15    while ((number != 0.0) && (counter > 0))
16    {
17        printf("number = %.10f and counter = %i\n", number, counter);
18
19        number -= 0.1;
20        counter -= 1;
21    }
22
23    return 0;
24}

Listing 14.5: Use ﬂoat for Loop Control Variable?

Explain the behavior. What happens if you change the decrement of number from 0.1 to 0.0625? Explain.

14-4

(§14.3 §14.4) Copy the following program and run it:

1/*
2 * exer14_3.c
3 * Are floats accurate?
4 * Bob Plantz - 18 June 2009
5 */
6
7#include <stdio.h>
8
9int main()
10{
11    float fNumber = 2147483646.0;
12    int iNumber = 2147483646;
13
14    printf(" Before adding the float is %f\n", fNumber);
15    printf("         and the integer is %i\n\n", iNumber);
16    fNumber += 1.0;
17    iNumber += 1;
18    printf("After adding 1 the float is %f\n", fNumber);
19    printf("         and the integer is %i\n", iNumber);
20
21    return 0;
22}

Listing 14.6: Are ﬂoats accurate?

Explain the behavior. What is the maximum value of fNumber such that adding 1.0 to it works?

14-5

(§14.4) Convert the following decimal numbers to 32-bit IEEE 754 format by hand:

a) 1.0 b) -0.1 c) 2005.0 d) 0.00390625 e) -3125.3125 f) 0.33 g) -0.67 h) 3.14

14-6

(§14.4) Convert the following 32-bit IEEE 754 bit patterns to decimal.

a) 4000 0000 b) bf80 0000 c) 3d80 0000 d) c180 4000 e) 42c8 1000 f) 3f99 999a g) 42f6 e666 h) c259 48b4

14-7

(§14.4) Show that half the floats (in 32-bit IEEE 754 format) are between -2.0 and +2.0.

14-8

(§14.5) The following C program

1/*
2 * casting.c
3 * Casts two integers to floats and adds them.
4 * Bob Plantz - 18 June 2009
5 */
6
7#include <stdio.h>
8
9int main()
10{
11    int x;
12    double y, z;
13
14    printf("Enter an integer: ");
15    scanf("%i", &x);
16    y = 1.23;
17    z = (double)x + y;
18    printf("%i + %lf = %lf\n", x, y, z);
19
20    return 0;
21}

Listing 14.7: Casting integer to ﬂoat in C.

was compiled with the -S option to produce

1        .file  "casting.c"
2        .section      .rodata
3.LC0:
4        .string"Enter an integer: "
5.LC1:
6        .string"%i"
7.LC3:
8        .string"%i + %lf = %lf\n"
9        .text
10        .globlmain
11        .type  main, @function
12main:
13        pushq  %rbp
14        movq  %rsp, %rbp
15        subq  $48, %rsp
16        movl  $.LC0, %edi
17        movl  $0, %eax
18        call  printf
19        leaq  -20(%rbp), %rax
20        movq  %rax, %rsi
21        movl  $.LC1, %edi
22        movl  $0, %eax
23        call  __isoc99_scanf
24        movabsq$4608218246714312622, %rax
25        movq  %rax, -16(%rbp)
26        movl  -20(%rbp), %eax
27        cvtsi2sd      %eax, %xmm0
28        addsd  -16(%rbp), %xmm0
29        movsd  %xmm0, -8(%rbp)
30        movl  -20(%rbp), %ecx
31        movq  -8(%rbp), %rdx
32        movq  -16(%rbp), %rax
33        movq  %rdx, -40(%rbp)
34        movsd  -40(%rbp), %xmm1
35        movq  %rax, -40(%rbp)
36        movsd  -40(%rbp), %xmm0
37        movl  %ecx, %esi
38        movl  $.LC3, %edi
39        movl  $2, %eax
40        call  printf
41        movl  $0, %eax
42        leave
43        ret
44        .size  main, .-main
45        .ident"GCC: (Ubuntu/Linaro 4.7.0-7ubuntu3) 4.7.0"
46        .section      .note.GNU-stack,"",@progbits

Listing 14.8: Casting integer to ﬂoat in assembly language.

Identify the assembly language sequence that performs the C sequence

15 y = 1.23;
16 z = (double)x + y;

and describe what occurs.

[next] [prev] [prev-tail] [front] [up]

Chapter 14Fractional Numbers

14.1 Fractions in Binary

14.2 Fixed Point ints

14.3 Floating Point Format

14.4 IEEE 754

14.5 Floating Point Hardware

14.5.1 SSE2 Floating Point

14.5.2 x87 Floating Point Unit

14.5.3 3DNow! Floating Point

14.6 Comments About Numerical Accuracy

14.7 Instructions Introduced Thus Far

14.7.1 Instructions

14.7.2 Addressing Modes

14.8 Exercises

Chapter 14
Fractional Numbers