So far in this book we have used only integers for numerical values. In this chapter you will see two methods for storing fractional values — fixed point and floating point. Storing numbers in fixed point format requires that the programmer keep track of the location of the binary point within the bits allocated for storage. In the floating point format, the number is essentially written in scientific notation, and both the significand and the exponent are stored.
Before discussing the storage formats, we need to think about how fractional values are stored in binary. The concept is quite simple. We can extend Equation 2.6:
d₋₁ × 2⁻¹ = 1 × 0.5
d₋₂ × 2⁻² = 0 × 0.25
d₋₃ × 2⁻³ = 1 × 0.125
d₋₄ × 2⁻⁴ = 1 × 0.0625

0.1011₂ = 0.5₁₀ + 0.125₁₀ + 0.0625₁₀ = 0.6875₁₀
See Exercise 14-1 for an algorithm to convert decimal fractions to binary. We assume that you can convert the integral part and that Equation 14.1 is suﬃcient for converting from binary to decimal.
Although any integer can be represented as a sum of powers of two, an exact representation of fractional values in binary is limited to sums of inverse powers of two. For example, consider an 8-bit representation of the fractional value 0.9. Converting to binary, we can see that

0.9₁₀ = 0.1 1100 1100 1100…₂

where the repeating group 1100 means that this bit pattern repeats forever.
Rounding off fractional values in binary is very simple. If the next bit to the right is one, add one at the bit position where we are rounding off. In the above example, we are rounding off to eight bits. The ninth bit to the right of the binary point is zero, so we do not add one in the eighth bit position. Thus, we use

0.9₁₀ ≈ 0.11100110₂

which gives a round off error of

0.9₁₀ − 0.11100110₂ = 0.9₁₀ − 0.8984375₁₀ = 0.0015625₁₀
We note here that two’s complement also works correctly for storing negative fractional values. You are asked to show this in Exercise 14-2.
In a ﬁxed point format, the storage area is divided into the integral part and the fractional part. The programmer must keep track of where the binary point is located. For example, we may decide to divide a 32-bit int in half and use the high-order 16 bits for the integral part and the low-order 16 bits for the fractional part.
My bank provides me with printed deposit slips that use this method. There are seven boxes for numerals. There is also a decimal point printed just to the left of the rightmost two boxes. Note that the decimal point does not occupy a box. That is, there are no digits allocated for the decimal point. So the bank assumes up to ﬁve decimal digits for the “dollars” part (rather optimistic), and the rightmost two decimal digits represent the “cents” part. The bank’s printing tells me how they have allocated the digits, but it is my responsibility to keep track of the location of the decimal point when ﬁlling in the digits.
One advantage of a ﬁxed point format is that integer instructions can be used for arithmetic computations. Of course, the programmer must be very careful to keep track of which bits are allocated for the integral part and which for the fractional part. And the range of possible values is restricted by the number of bits.
An example of using ints for ﬁxed point addition is shown in Listing 14.1.
The numbers are input to the nearest 1/16th inch, so the programmer has allocated four bits for the fractional part. This leaves 28 bits for the integral part. After the integral part is read, the stored number must be shifted four bit positions to the left to put it in the high-order 28 bits. Then the fractional part (in number of sixteenths) is added into the low-order four bits with a simple bit-wise or operation. Printing the answer also requires some bit shifting and some masking to ﬁlter out the fractional part.
This is clearly a contrived example. A program using floats would work just as well and be somewhat easier to write. However, the program in Listing 14.1 uses integer instructions, which execute faster than ﬂoating point. The hardware issues have become less signiﬁcant in recent times. Modern CPUs use various parallelization schemes such that a mix of ﬂoating point and integer instructions may actually execute faster than only integer instructions. Fixed point arithmetic is often used in embedded applications where the CPU is small and may not have ﬂoating point capabilities.
The most important concept in this section is that ﬂoating point numbers are not real numbers.3 Real numbers include the continuum of all numbers from −∞ to +∞. You already understand that computers are ﬁnite, so there is certainly a limit on the largest values that can be represented. But the problem is much worse than simply a size limit.
As you will see in this section, ﬂoating point numbers comprise a very small subset of real numbers. There are signiﬁcant gaps between adjacent ﬂoating point numbers. These gaps can produce the following types of errors:
To make matters worse, these errors can occur in intermediate results, where they are very diﬃcult to debug.
The idea behind ﬂoating point formats is to think of numbers written in scientiﬁc format. This notation requires two numbers to completely specify a value — a signiﬁcand and an exponent. To review, a decimal number is written in scientiﬁc notation as a signiﬁcand times ten raised to an exponent. For example,
1,024 = 1.024 × 10³        (14.7)
−0.000089372 = −8.9372 × 10⁻⁵        (14.8)
Notice that the number is normalized such that only one digit appears to the left of the decimal point. The exponent of 10 is adjusted accordingly.
If we agree that each number is normalized and that we are working in base 10, then each ﬂoating point number is completely speciﬁed by three items:
That is, in the above examples
The advantage of using a ﬂoating point format is that, for a given number of digits, we can represent a larger range of values. To illustrate this, consider a four-digit, unsigned decimal system. The range of integers that could be represented is
Now, let’s allocate two digits for the signiﬁcand and two for the exponent. We will restrict our scheme to unsigned numbers, but we will allow negative exponents. So we will need to use one of the exponent digits to store the sign of the exponent. We will use 0 for positive and 1 for negative. For example, 3.9 × 10−4 would be stored
where each box holds one decimal digit. Some other examples are:
1000 represents 1.0 × 10⁰
3702 represents 3.7 × 10²
9316 represents 9.3 × 10⁻⁶
Our normalization scheme requires that there be a single non-zero digit to the left of the decimal point. We should also allow the special case of 0.0:
A little thought shows that this scheme allows numbers in the range
That is, we have increased the range of possible values by a factor of 1014! However, it is important to realize that in both storage schemes, the integer and the ﬂoating point, we have exactly the same number of possible values — 104.
Although ﬂoating point formats can provide a much greater range of numbers, the distance between any two adjacent numbers depends upon the value of the exponent. Thus, ﬂoating point is generally less accurate than an integer representation, especially for large numbers.
To see how this works, let’s look at a plot of numbers (using our current scheme) in the range
Notice that the larger numbers are further apart than the smaller ones. (See Exercise 14-7 after you read Section 14.4.)
Let us pick some numbers from this range and perform some addition.
9111 represents 9.1 × 10⁻¹
9311 represents 9.3 × 10⁻¹
If we add these values, we get 0.91 + 0.93 = 1.84. Now we need to round oﬀ our “paper” addition in order to ﬁt this result into our current ﬂoating point scheme:
1800 represents 1.8 × 10⁰
On the other hand,
9411 represents 9.4 × 10⁻¹
9311 represents 9.3 × 10⁻¹
and adding these values, we get 0.94 + 0.93 = 1.87. Rounding oﬀ, we get:
1900 represents 1.9 × 10⁰
So we see that starting with two values expressed to the nearest 1/100th, their sum is accurate only to the nearest 1/10th.
To compare this with ﬁxed point arithmetic, we could use the same four digits to store 0.93 this way
It is clear that this storage scheme allows us to perform both additions (0.91 + 0.93 and 0.94 + 0.93) and store the results exactly.
These round oﬀ errors must be taken into account when performing ﬂoating point arithmetic. In particular, the errors can occur in intermediate results when doing even moderately complex computations, where they are very diﬃcult to detect.
Specific floating point formats involve trade-offs between resolution, round off errors, size, and range. The most commonly used formats are those of the IEEE 754 standard. They range in size from four to sixteen bytes. The most common sizes used in C/C++ are floats (4 bytes) and doubles (8 bytes). The x86 processor performs floating point computations using an extended 10-byte format. The results are rounded to the 4-byte format if the programmer uses the float data type or to the 8-byte format for the double data type.
In the IEEE 754 4-byte format, one bit is used for the sign, eight for the exponent, and twenty-three for the signiﬁcand. The IEEE 754 8-byte format speciﬁes one bit for the sign, eleven for the exponent, and ﬁfty-two for the signiﬁcand.
In this section we describe the 4-byte format in order to save ourselves (hand) computation eﬀort. The goal is to get a feel for the limitations of ﬂoating point formats. The normalized form of the number in binary is given by Equation 14.9.
where:  s is the sign bit
        f is the 23-bit fractional part
        e is the 8-bit exponent
The bit patterns for floats and doubles are arranged as shown in Figure 14.1.
As in decimal, the exponent is adjusted such that there is only one non-zero digit to the left of the binary point. In binary, though, this digit is always one. Since it is always one, it need not be stored. Only the fractional part of the normalized value needs to be stored as the signiﬁcand. This adds one bit to the signiﬁcance of the fractional part. The integer part (one) that is not stored is sometimes called the hidden bit.
The sign bit, s, refers to the number. Another mechanism is used to represent the sign of the exponent, e. Your ﬁrst thought is probably to use two’s complement. However, the IEEE format was developed in the 1970s, when ﬂoating point computations took a lot of CPU time. Many algorithms depend upon only the comparison of two numbers, and the computer scientists of the day realized that a format that allowed integer comparison instructions would result in faster execution times. So they decided to store a biased exponent as an unsigned int. The amount of the bias is one-half the range allocated for the exponent. In the case of an 8-bit exponent, the bias amount is 127.
Example 14-a ____________________________________________________________________________________________________
Show how 97.8125 is stored in 32-bit IEEE 754 binary format.
First, convert the number to binary.
97.8125₁₀ = (−1)⁰ × 1100001.1101₂ × 2⁰
Adjust the exponent to obtain the normalized form.

(−1)⁰ × 1.1000011101₂ × 2⁶
Compute s, e + 127, and f.

s = 0
e + 127 = 6 + 127 = 133₁₀ = 10000101₂
f = 10000111010000000000000₂
Finally, use Figure 14.1 to place the bit patterns. (Remember that the hidden bit is not stored; it is understood to be there.)
97.8125 is stored as 0 10000101 10000111010000000000000₂
Example 14-b ____________________________________________________________________________________________________
Using IEEE 754 32-bit format, what decimal number does the bit pattern 3e40000016 represent?
First, convert the hexadecimal to binary, using spaces suggested by Figure 14.1.

0 01111100 10000000000000000000000₂
Now compute the values of s, e, and f.
s = 0
e + 127 = 01111100₂ = 124₁₀, so e = 124 − 127 = −3
f = 10000000000000000000000₂
Finally, plug these values into Equation 14.9. (Remember to add the hidden bit.)
(−1)⁰ × 1.100…0₂ × 2⁻³ = (−1)⁰ × 0.0011₂ × 2⁰ = 0.1875₁₀
Example 14-c ____________________________________________________________________________________________________
Using IEEE 754 32-bit format, what decimal number would the bit pattern 0000000016 represent? (The speciﬁcation states that it is an exception to the format and is deﬁned to represent 0.0. This example provides some motivation for this exception.)
The conversion to binary is trivial. Computing the values of s, e, and f:

s = 0
e + 127 = 00000000₂ = 0₁₀, so e = −127
f = 00000000000000000000000₂
Finally, plug these values into Equation 14.9. (Remember to add the hidden bit.)

(−1)⁰ × 1.000…0₂ × 2⁻¹²⁷ = 2⁻¹²⁷ ≈ 5.9 × 10⁻³⁹
This last example illustrates a problem with the hidden bit — there is no way to represent zero. To address this issue, the IEEE 754 standard has several special cases.
Until the introduction of the Intel 486DX in April 1989, the x87 Floating Point Unit was on a separate chip. It is now included on the CPU chip, although it uses a somewhat different execution architecture than the Integer Unit in the CPU.
In 1997 Intel added MMX™ (Multimedia Extensions) to their processors, which includes instructions that process multiple data values with a single instruction (SIMD). Operations on single data items are called scalar operations, while those on multiple data items in parallel are called vector operations. Vector operations are useful for many multi-media and scientific applications. In this book we will discuss only scalar floating point operations. Originally, MMX only performed integer computations. But in 1998 AMD added the 3DNow!™ extension to MMX, which includes floating point instructions. Intel soon followed suit.
Intel then introduced SSE (Streaming SIMD Extension) on the Pentium III in 1999. Several versions have evolved over the years — SSE, SSE2, SSE3, and SSE4A — as of this writing. There are instructions for performing both integer and ﬂoating point operations on both scalar and vector values.
The x86-64 architecture includes three sets of instructions for working with ﬂoating point values:
All three ﬂoating point instruction sets include a wide variety of instructions to perform the following operations:
In addition, the x87 includes instructions for transcendental functions — sine, cosine, tangent, and arc tangent, and logarithm functions.
We will not cover all the instructions in this book. The following subsections provide an introduction to how each of the three sets of instructions is used. See the manuals  –  and  –  for details.
Most of the SSE2 instructions operate on multiple data items simultaneously — single instruction, multiple data (SIMD). There are SSE2 instructions for both integer and ﬂoating point operations. Integer instructions operate on up to sixteen 8-bit, eight 16-bit, four 32-bit, two 64-bit, or one 128-bit integers at a time. Vector ﬂoating point instructions operate on all four 32-bit or both 64-bit ﬂoats in a register simultaneously. Each data item is treated independently. These instructions are useful for algorithms that do things like process arrays. One SSE2 instruction can operate on several array elements in parallel, resulting in considerable speed gains. Such algorithms are common in multi-media and scientiﬁc applications.
In this book we will only consider some of the scalar ﬂoating-point instructions, which operate on only single data items. The SSE2 instructions are the preferred ﬂoating-point implementation in 64-bit mode. These instructions operate on either 32-bit (single-precision) or 64-bit (double-precision) values. The scalar instructions operate on only the low-order portion of the 128-bit xmm registers, with the high-order 64 or 96 bits remaining unchanged.
SSE includes a 32-bit MXCSR register that has ﬂags for handling ﬂoating-point arithmetic errors. These ﬂags are shown in Table 14.1.
Bits     Name  Meaning                              Default
31 – 18  –     reserved
17       MM    Misaligned Exception Mask            0
15       FZ    Flush-to-Zero for Masked Underflow   0
14 – 13  RC    Floating-Point Rounding Control      00
12       PM    Precision Exception Mask             1
11       UM    Underflow Exception Mask             1
10       OM    Overflow Exception Mask              1
9        ZM    Zero-Divide Exception Mask           1
8        DM    Denormalized-Operand Exception Mask  1
7        IM    Invalid-Operation Exception Mask     1
6        DAZ   Denormals Are Zero                   0
4        UE    Underflow Exception                  0
SSE instructions that perform arithmetic operations and the SSE compare instructions also aﬀect the status ﬂags in the rflags register. Thus the regular conditional jump instructions (Section 10.1.2, page 676) are used to control program ﬂow based on ﬂoating-point computations.
The instruction mnemonics used by the gnu assembler are mostly the same as given in the manuals,  –  and  – . Since they are quite descriptive with respect to operand sizes, a size letter is not appended to the mnemonic, except when one of the operands is in memory and the size is ambiguous. Of course, the operand order used by the gnu assembler is still reversed compared to the manufacturers’ manuals, and the register names are preﬁxed with the “%” character.
A very important set of instructions provided for working with ﬂoating point values are those to convert between integer and ﬂoating point formats. The scalar conversion SSE2 instructions are shown in Table 14.2.
Instruction  Source                Destination          Operation
cvtsd2si     xmm reg. or mem.      32-bit integer reg.  convert scalar 64-bit float to 32-bit integer
cvtsd2ss     xmm reg. or mem.      xmm reg.             convert scalar 64-bit float to 32-bit float
cvtsi2sd     integer reg. or mem.  xmm reg.             convert 32-bit integer to scalar 64-bit float
cvtsi2sdq    integer reg. or mem.  xmm reg.             convert 64-bit integer to scalar 64-bit float
cvtsi2ss     integer reg. or mem.  xmm reg.             convert 32-bit integer to scalar 32-bit float
cvtsi2ssq    integer reg. or mem.  xmm reg.             convert 64-bit integer to scalar 32-bit float
cvtss2sd     xmm reg. or mem.      another xmm reg.     convert scalar 32-bit float to 64-bit float
cvtss2si     xmm reg. or mem.      32-bit integer reg.  convert scalar 32-bit float to 32-bit integer
cvtss2siq    xmm reg. or mem.      64-bit integer reg.  convert scalar 32-bit float to 64-bit integer
Data movement and arithmetic instructions must distinguish between scalar and vector operations on values in the xmm registers. The low-order portion of the register is used for scalar operations. Vector operations are performed on multiple data values packed into a single register. See Table 14.3 for a sampling of SSE2 data movement and arithmetic instructions.
|addps||xmm reg. or mem.||xmm reg.||add packed 32-bit ﬂoats|
|addpd||xmm reg. or mem.||xmm reg.||add packed 64-bit ﬂoats|
|addss||xmm reg. or mem.||xmm reg.||add scalar 32-bit ﬂoats|
|addsd||xmm reg. or mem.||xmm reg.||add scalar 64-bit ﬂoats|
|divps||xmm reg. or mem.||xmm reg.||divide packed 32-bit ﬂoats|
|divpd||xmm reg. or mem.||xmm reg.||divide packed 64-bit ﬂoats|
|divss||xmm reg. or mem.||xmm reg.||divide scalar 32-bit ﬂoats|
|divsd||xmm reg. or mem.||xmm reg.||divide scalar 64-bit ﬂoats|
|movss||xmm reg. or mem.||xmm reg.||move scalar 32-bit ﬂoat|
|movss||xmm reg.||xmm reg. or mem.||move scalar 32-bit ﬂoat|
|movsd||xmm reg. or mem.||xmm reg.||move scalar 64-bit ﬂoat|
|movsd||xmm reg.||xmm reg. or mem.||move scalar 64-bit ﬂoat|
|mulps||xmm reg. or mem.||xmm reg.||multiply packed 32-bit ﬂoats|
|mulpd||xmm reg. or mem.||xmm reg.||multiply packed 64-bit ﬂoats|
|mulss||xmm reg. or mem.||xmm reg.||multiply scalar 32-bit ﬂoats|
|mulsd||xmm reg. or mem.||xmm reg.||multiply scalar 64-bit ﬂoats|
|subps||xmm reg. or mem.||xmm reg.||subtract packed 32-bit ﬂoats|
|subpd||xmm reg. or mem.||xmm reg.||subtract packed 64-bit ﬂoats|
|subss||xmm reg. or mem.||xmm reg.||subtract scalar 32-bit ﬂoats|
|subsd||xmm reg. or mem.||xmm reg.||subtract scalar 64-bit ﬂoats|
Notice that the code for the basic operation is followed by a “p” or “s” for “packed” or “scalar.” This character is then followed by a “d” or “s” for “double” (64-bit) or “single” (32-bit) data item.
We will use the program in Listing 14.2 to illustrate a few ﬂoating point operations.
Compiling this program in 64-bit mode produced the assembly language in Listing 14.3.
Before the division is performed, both integers must be converted to ﬂoating point. This takes place on lines 25 – 28:
The cvtsi2sd instruction on lines 26 and 28 converts a signed integer to a scalar double-precision floating point value. The signed integer can be either 32 or 64 bits and can be located in a general purpose register or in memory. The double-precision float will be stored in the low-order 64 bits of the specified xmm register. The high-order 64 bits of the xmm register are not changed.
The division on line 29 leaves the result in the low-order 64 bits of xmm0, which is then stored in z:
The movapd instruction moves the entire 128 bits, and the movsd instruction moves only the low-order 64 bits.
The floating point arguments are passed in the registers xmm0, xmm1, …, xmm7 in left-to-right order. So the value of z is loaded into the xmm0 register for passing to the printf function, and the number of floating point values passed to it must be stored in eax:
The x87 FPU has eight 80-bit data registers, its own status register, and its own stack pointer. Floating point values are stored in the floating point data registers in an extended format:
So there are 64 bits (63 – 0) for the signiﬁcand. Since there is no hidden bit in the extended format, one of these bits, bit 63, is required for the integer part.
Example 14-d ____________________________________________________________________________________________________
Show how 97.8125 is stored in 80-bit extended IEEE 754 binary format.
First, convert the number to binary.
97.8125₁₀ = (−1)⁰ × 1100001.1101₂ × 2⁰
Adjust the exponent to obtain the normalized form.

(−1)⁰ × 1.1000011101₂ × 2⁶
Compute s and e + 16383.

s = 0
e + 16383 = 6 + 16383 = 16389₁₀ = 100000000000101₂
Filling in the bit patterns as speciﬁed above:
97.8125 is stored as 0 100000000000101 1 1000011101000000000000…0₂
Compare this with the 32-bit format in Example 14-a above.
The 16-bit Floating Point Unit Status Word register shows the results of floating point operations. The meaning of each bit is shown in Table 14.4.

Bits     Name  Meaning
7        ES    error summary status
8        C0    condition code 0
9        C1    condition code 1
10       C2    condition code 2
11 – 13  TOP   top of stack
14       C3    condition code 3
Figure 14.2 shows a pictorial representation of the x87 ﬂoating point registers. The absolute locations are named fpr0, fpr1,…,fpr7 in this ﬁgure.
The ﬂoating point registers are accessed by program instructions as a stack with st(0) being the register at the top of the stack. It “grows” from higher number registers to lower. The TOP ﬁeld (bits 13 – 11) in the FPU Status Word holds the (absolute) register number that is currently the top of the stack. If the stack is full, i.e., fpr0 is the top of the stack, a push causes the TOP ﬁeld to roll over, and the next item goes into register fpr7. (The value that was in fpr7 is lost.)
The instructions that read data from memory automatically push the value onto the top of the register stack. Arithmetic instructions are provided that operate on the value(s) on the top of the stack. For example, the faddp instruction adds the two values on the top of the stack and leaves their sum on the top. The stack has one less value on it. The original two values are gone.
Many floating point instructions allow the programmer to access any of the floating point registers, %st(i), where i = 0…7, relative to the top of the stack. Fortunately, the programmer does not need to keep track of where the top is. When using this format, %st(i) refers to the ith register from the top of the stack. For example, if fpr3 is the current top of the stack, the instruction

        fadd    %st(2), %st

will add the value in the fpr5 register to the value in the fpr3 register, leaving the result in the fpr3 register.
Table 14.5 provides some examples of the ﬂoating point instruction set. Notice that the instructions that deal only with the ﬂoating point register stack do not use the size suﬃx letter, s.
add memﬂoat to st(0)
add st(0) to st(1) and pop register stack
change sign of st(0)
compare st(0) with memﬂoat
compare st(0) with st(1) and pop register stack
replace st(0) with its cosine
divide st(0) by memﬂoat
divide st(0) by st(1), store result in st(1), and pop register stack
convert integer at memint to 80-bit ﬂoat and push onto register stack
convert 80-bit ﬂoat at st(0) to int and store at memint
convert float at memfloat to 80-bit float and push onto register stack
multiply st(0) by memﬂoat
multiply st(0) by st(1), store result in st(1), and pop register stack
replace st(0) with its sine
replace st(0) with its square root
convert 80-bit float at st(0) to s size float and store at memfloat
subtract memﬂoat from st(0)
subtract st(0) from st(1) and pop register stack
s = s, l, t
To avoid ambiguity the gnu assembler requires a single letter suﬃx on the ﬂoating point instructions that access memory. The suﬃxes are:
's'  for single precision – 32-bit
'l'  for long (or double) precision – 64-bit
't'  for ten-byte – 80-bit
Most of the floating point instructions have several variants. See  –  and  –  for details. In general,

        fistl   memint

converts the 80-bit floating point number in st(0) to a 32-bit integer and stores it at the specified memory location. Using the pop variant,

        fistpl  memint

does the same thing but also pops the floating point register stack.
Compiling the fraction conversion program of Listing 14.2 in 32-bit mode shows (Listing 14.4) that the compiler uses the x87 ﬂoating-point instructions. This ensures backward compatibility since the x86-32 architecture does not need to include SSE instructions.
We add comments to lines 19 and 21 to show where the x and y variables are located in the stack frame.
Rather than actually pushing the arguments onto the stack, enough space was allocated on the stack (line 15) to store the values directly in the locations where they would be if they had been pushed. This is more efficient than pushing each argument.
Casting an int to a float requires a conversion in the storage format. This conversion is done by the x87 FPU as an integer is pushed onto the ﬂoating point register stack using the fildl instruction. This conversion can only be done to an integer that is stored in memory. The compiler uses a location on the call stack to temporarily store each integer so it can be converted:
Each of the fildl instructions on lines 27 and 30 converts a 32-bit integer to an 80-bit ﬂoat and pushes the ﬂoat onto the x87 register stack. At this point the ﬂoating-point equivalent of x is at the top of the stack, and the ﬂoating-point equivalent of y is immediately below it. Then the ﬂoating-point division instruction:
divides the number at st(0) (the (0) can be omitted) by the number at st(1) and pops the x87 register stack so that the result is now at the top of the stack.
Finally, the fstpl instruction is used to pop the value oﬀ the top of the x87 register stack and store it in memory — at its proper location on the call stack. The “l” suﬃx indicates that 64 bits of memory should be used for storing the ﬂoating-point value. So the 80-bit value on the top of the x87 register stack is rounded to 64 bits as it is stored in memory. The other three arguments are also stored on the call stack.
Note that the 32-bit version of printf does not receive arguments in registers, so eax is not used.
The 3DNow! instructions use the low-order 64 bits of the same physical registers as the x87. These 64-bit portions are named mmx0, mmx1, …, mmx7. They are used as fixed registers, not in a stack configuration. Execution of a 3DNow! instruction changes the TOP field (bits 13 – 11) in the x87 Status Word register, so the top of stack is lost for any subsequent x87 instructions. The bottom line is that x87 and 3DNow! instructions cannot be used simultaneously.
Another limitation of the 3DNow! instruction set is that it handles only 32-bit floating point values.
These limitations of the 3DNow! instruction set make it essentially obsolete in the x86-64 architecture, so it will not be discussed further in this book.
Beginning programmers often see ﬂoating point arithmetic as more accurate than integer. It is true that even adding two very large integers can cause overﬂow. Multiplication makes it even more likely that the result will be very large and, thus, overﬂow. And when used with two integers, the / operator in C/C++ causes the fractional part to be lost. However, as you have seen in this chapter, ﬂoating point representations have their own set of inaccuracies.
Arithmetically accurate results require a thorough analysis of your algorithm. Some points to consider:
This summary shows the assembly language instructions introduced thus far in the book. The page number where the instruction is explained in more detail, which may be in a subsequent chapter, is also given. This book provides only an introduction to the usage of each instruction. You need to consult the manuals ( – ,  – ) in order to learn all the possible uses of the instructions.
|cbtw||convert byte to word, al → ax||696|
|cwtl||convert word to long, ax → eax||696|
|cltq||convert long to quad, eax → rax||696|
|cwtd||convert word to double, ax → dx:ax||786|
|cltd||convert long to double, eax → edx:eax||786|
|cqto||convert quad to octuple, rax → rdx:rax||786|
|movsss||$imm/%reg||%reg/mem||move, sign extend||693|
|movzss||$imm/%reg||%reg/mem||move, zero extend||693|
|popw||%reg/mem||pop from stack||566|
|pushw||$imm/%reg/mem||push onto stack||566|
s = b, w, l, q; w = l, q; cc = condition codes
|leaw||mem||%reg||load eﬀective address||579|
|ors||$imm/%reg||%reg/mem||bit-wise inclusive or||747|
|ors||mem||%reg||bit-wise inclusive or||747|
|sals||$imm/%cl||%reg/mem||shift arithmetic left||756|
|sars||$imm/%cl||%reg/mem||shift arithmetic right||751|
|xors||$imm/%reg||%reg/mem||bit-wise exclusive or||747|
|xors||mem||%reg||bit-wise exclusive or||747|
s = b, w, l, q; w = l, q
program ﬂow control:
|ja||label||jump above (unsigned)||683|
|jae||label||jump above/equal (unsigned)||683|
|jb||label||jump below (unsigned)||683|
|jbe||label||jump below/equal (unsigned)||683|
|jg||label||jump greater than (signed)||686|
|jge||label||jump greater than/equal (signed)||686|
|jl||label||jump less than (signed)||686|
|jle||label||jump less than/equal (signed)||686|
|jne||label||jump not equal||679|
|jno||label||jump no overﬂow||679|
|jcc||label||jump on condition codes||679|
|leave||undo stack frame||580|
|ret||return from function||583|
|syscall||call kernel function||587|
cc = condition codes
x87 ﬂoating point:
|flds||memﬂoat||load ﬂoating point||859|
|fsts||memﬂoat||ﬂoating point store||859|
s = s, l, t
SSE ﬂoating point conversion:
|cvtsd2si||%xmmreg/mem||%reg||scalar double to signed integer||845|
|cvtsd2ss||%xmmreg||%xmmreg/%reg||scalar double to single ﬂoat||845|
|cvtsi2sd||%reg||%xmmreg/mem||signed integer to scalar double||845|
|cvtsi2sdq||%reg||%xmmreg/mem||signed integer to scalar double||845|
|cvtsi2ss||%reg||%xmmreg/mem||signed integer to scalar single||845|
|cvtsi2ssq||%reg||%xmmreg/mem||signed integer to scalar single||845|
|cvtss2sd||%xmmreg||%xmmreg/mem||scalar single to scalar double||845|
|cvtss2si||%xmmreg/mem||%reg||scalar single to signed integer||845|
|cvtss2siq||%xmmreg/mem||%reg||scalar single to signed integer||845|
register:
The data value is located in a CPU register.
syntax: name of the register with a “%” preﬁx.
example: movl %eax, %ebx
immediate data:
The data value is located immediately after the instruction. Source operand only.
syntax: data value with a “$” preﬁx.
example: movl $0xabcd1234, %ebx
base register plus oﬀset:
The data value is located in memory. The address of the memory location is the sum of a value in a base register plus an oﬀset value.
syntax: use the name of the register with parentheses around the name and the oﬀset value immediately before the left parenthesis.
example: movl $0xaabbccdd, 12(%eax)
rip-relative:
The target is a memory address determined by adding an oﬀset to the current address in the rip register.
syntax: a programmer-deﬁned label
example: je somePlace
indexed:
The data value is located in memory. The address of the memory location is the sum of the value in the base_register plus scale times the value in the index_register, plus the oﬀset.
syntax: place parentheses around the comma separated list (base_register, index_register, scale) and preface it with the oﬀset.
example: movl $0x6789cdef, -16(%edx, %eax, 4)
(§14.1) Develop an algorithm for converting decimal fractions to binary. Hint: Multiply both sides of Equation 14.1 by two.
(§14.1) Show that two’s complement works correctly for fractional values. What is the decimal range of 8-bit, two’s complement fractional values? Hint: +0.5 does not exist, but -0.5 does.
(§14.3) Copy the following program and run it:
Explain the behavior. What happens if you change the decrement of number from 0.1 to 0.0625? Explain.
(§14.3 §14.4) Copy the following program and run it:
Explain the behavior. What is the maximum value of fNumber such that adding 1.0 to it works?
(§14.4) Convert the following decimal numbers to 32-bit IEEE 754 format by hand:
a) 1.0 b) -0.1 c) 2005.0 d) 0.00390625 e) -3125.3125 f) 0.33 g) -0.67 h) 3.14
(§14.4) Convert the following 32-bit IEEE 754 bit patterns to decimal.
a) 4000 0000 b) bf80 0000 c) 3d80 0000 d) c180 4000 e) 42c8 1000 f) 3f99 999a g) 42f6 e666 h) c259 48b4
(§14.4) Show that half the floats (in 32-bit IEEE 754 format) are between -2.0 and +2.0.
(§14.5) The following C program
was compiled with the -S option to produce
Identify the assembly language sequence that performs the C sequence
and describe what occurs.