Since writing the original version of this book, the default for the gcc compiler has changed to produce Position Independent Code (PIC) and Position Independent Executables (PIE), which are not covered in this book. To make things match the examples in this book:
While reading this chapter, you should also consult the info resources available in most GNU/Linux installations for both the make and the as programs. Appendix B provides a general tutorial for writing Makefiles, but you need to get the details from info. info is especially important for learning about as’s assembler directives.
You should also reread the Development Environment section.
Creating a program in assembly language is essentially the same as creating one in a high-level compiled language like C, C++, Java, FORTRAN, etc. We will begin the chapter by looking in detail at the steps involved in creating a C program. Then we will look at which of these steps apply to assembly language programming.
You have probably learned how to program using an Integrated Development Environment (IDE), which incorporates several programs within a single user interface:
You enter your source code in the text editor part, click on a “build” button to compile and link your program, then click on a “run” button to load and execute the program. There is typically a “debug” button that loads and executes the program under control of the debugger program if you need to debug it. The individual steps of program preparation are obscured by the IDE user interface. In this book we use the GNU programming environment in which each step is performed explicitly.
Several excellent text editors exist for GNU/Linux, each with its own “personality.” My “favorite” changes from time to time. I recommend trying several that are available to you and deciding which one you prefer. You should avoid using a word processor to create source files because it will add formatting to the text (unless you explicitly specify text-only). Text editors I have used include:
GUI interfaces are available for both vi and emacs. Any of these, and many other, text editors would be an excellent choice for the programming covered in this book. Don’t spend too much time trying to pick the “best” one.
The GNU programming tools are executed from the command line instead of a graphical user interface (GUI). (IDEs for Linux and Unix are typically GUI frontends that execute GNU programming tools behind the scenes.) The GNU compiler, gcc, creates an executable program by performing several distinct steps [22]. The description here assumes a single C source file, filename.c.
Programs written in C are organized into functions. Each function has a name that is unique within the program. Program execution begins with the function named “main.”
Let us consider the minimum C program, Listing 7.1.
The only thing this program does is return a zero.
Despite the fact that this program accomplishes very little, some instructions need to be executed just to return zero. In order to see what takes place, we first translate this program from C to assembly language with the GNU/Linux command:
This creates the file doNothingProg1.s (see Listing 7.2), which contains the assembly language generated by the gcc compiler. The two compiler options used here have the following meanings:
This is not easy to read, even for an experienced assembly language programmer. So we will start with the program in Listing 7.3, which was written in assembly language by a programmer (rather than by a compiler). Naturally, the programmer has added comments to improve readability.
After examining what the assembly language programmer did we will return to Listing 7.2 and look at the assembly language generated by the compiler.
Assembly language provides a set of mnemonics that have a one-to-one correspondence to the machine language. A mnemonic is a short, English-like group of characters that suggests the action of the instruction. For example, “mov” is used to represent the instruction that copies (“moves”) a value from one place to another. Thus, the machine instruction
copies the entire 64-bit value in the rsp register to the rbp register. Even if you have never seen assembly language before, the mnemonic representation of this instruction in Listing 7.2,
probably makes much more sense to you than the machine code. (The ‘q’ suffix on “mov” means a quadword (64 bits) is being moved.)
The first thing to notice is that assembly language is line-oriented. That is, there is only one assembly language statement on each line, and normally no statement spans more than one line. (A statement can continue onto subsequent lines, but this requires a special line-continuation character.) This differs from the “free form” nature of C/C++, where the line structure is irrelevant. In fact, good C/C++ programmers take advantage of this to improve the readability of their code.
Next, notice that the pattern of each line falls into one of three categories:
label: operation operand(s) #comment
The assembler requires at least one space or tab character to separate the fields. When writing assembly language, your program will be much easier to read if you use the tab key to move from one field to the next.
Let us consider each field:
Different operations require differing numbers of operands — zero, one, two, or three.
The rules for creating an identifier are very similar to those for C/C++. Each identifier consists of a sequence of alphanumeric characters and may include other printable characters such as “.”, “_”, and “$”. The first character must not be a numeral. An identifier may be any length, and all characters are significant. Case is also significant. For example, “myLabel” and “MyLabel” are different. Compiler-generated labels begin with the “.” character, and many system related names begin with the “_” character. It is a good idea to avoid beginning your own labels with the “.” or the “_” character so that you do not inadvertently create one that is already in use by the system.
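The identifier rules above can be expressed as a short check. The following is a minimal sketch in C, not the code that as itself uses. The function name is_valid_identifier is hypothetical, and the sketch accepts only the characters named in this section (letters, numerals, “.”, “_”, and “$”).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s follows the identifier rules
 * described in this section, 0 otherwise. This is a sketch of the
 * rules as stated here, not the assembler's own implementation. */
int is_valid_identifier(const char *s)
{
    assert(s != NULL);

    /* An identifier needs at least one character, and the first
     * character must not be a numeral. */
    if (s[0] == '\0' || isdigit((unsigned char)s[0]))
        return 0;

    /* Every character must be alphanumeric or one of ".", "_", "$". */
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '.' && c != '_' && c != '$')
            return 0;
    }
    return 1;
}
```

With this sketch, is_valid_identifier("myLabel") and is_valid_identifier(".LFB0") both return 1, while is_valid_identifier("1abc") returns 0 because the name begins with a numeral.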
The assembler program, as, will translate the file doNothingProg2.s (see Listing 7.3) into machine code and provide the memory allocation information for the operating system to use when the program is executed. We will first describe the contents of this file, then look at the GNU commands to convert it into an executable program.
Now we turn attention to the specific file in Listing 7.3, doNothingProg2.s. On line 5 you recognize
as an assembler directive because it starts with a period character. It directs the assembler to place whatever follows in the text section.
What does “text section” mean? When a source code file is translated into machine code, an object file is produced. The object file organization follows the Executable and Linking Format (ELF). ELF files can be seen from two different points of view. Programs that store information in ELF files store it in sections. The ELF standard specifies many different types of sections, each depending on the type of information stored in it.
The .text directive specifies that when the following assembly language statements are translated into machine instructions, they should be stored in a text section in the object file. Text sections are used to store program instructions in machine code format.
GNU/Linux divides memory into different segments for specific purposes when a program is loaded from the disk. The four general categories are:
The operating system needs to view an ELF file as a set of segments. One of the functions of the ld program is to group sections together into segments so that they can be loaded into memory. Each segment contains one or more sections. This grouping is generally accomplished by arrays of pointers to the file, not necessarily by physically moving the sections. That is, there is still a section view of the ELF file remaining. So the information stored in an ELF file is grouped into sections, but it may or may not also be grouped into segments.
When the operating system loads the program into memory, it uses the segment view of the ELF file. Thus the contents of all the text sections will be loaded into the text segment of the program process.
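One place where the two views show up is the ELF file header, which records how many section headers and how many program (segment) headers the file contains. The following is a minimal sketch, assuming a GNU/Linux system where <elf.h> provides the Elf64_Ehdr structure. The three function names are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Ehdr, ELFMAG, SELFMAG (GNU/Linux) */
#include <string.h>

/* Returns 1 if the header begins with the ELF magic number. */
int is_elf(const Elf64_Ehdr *h)
{
    assert(h != NULL);
    return memcmp(h->e_ident, ELFMAG, SELFMAG) == 0;
}

/* Section view: the file contains a section header table. */
int has_section_view(const Elf64_Ehdr *h)
{
    assert(h != NULL);
    return h->e_shnum > 0;
}

/* Segment view: the file contains a program header table, which is
 * what the operating system uses when it loads the program. */
int has_segment_view(const Elf64_Ehdr *h)
{
    assert(h != NULL);
    return h->e_phnum > 0;
}
```

An object file produced by as typically has section headers but no program headers, so has_segment_view would return 0 for it. A linked executable normally has both. The readelf program reports the same information with its -h option.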
This has been a very simplistic overview of ELF sections and segments. We will touch on the subject again briefly in Section 8.1. Further details can be found by reading the man page for elf and sources like [13] and [21]. The readelf program is also useful for learning about ELF files. It is included in the binutils collection of the GNU binary tools, so it is installed along with as and ld.
The assembler directive on line 6
has one operand, the identifier “main.” As you know, all C/C++ programs start with the function named “main.” In this book, we also start our assembly language programs with a main function and execute them within the C/C++ runtime environment. The .globl directive makes the name globally known, analogous to defining an identifier outside a function body in C/C++. That is, code outside this file can refer to this name. When a program is executed, the operating system does some preliminary set up of system resources. It then starts program execution by calling a function named “main,” so the name must be global in scope.
The assembler directive on line 7
has two operands: a name and a type. The name is entered into the symbol table (see Section 7.3). In addition to the machine code, the object file contains the symbol table along with information about each symbol. The ELF format recognizes two types of symbols: data and function. The .type directive is used here to specify that the symbol main is the name of a function.
None of these three directives get translated into actual machine instructions, and none of them occupy any memory in the finished program. Rather, they are used to describe the characteristics of the statements that follow.
What follows next in Listing 7.3 are the actual assembly language instructions. They will occupy memory when they are translated. The first instruction is on line 8:
It illustrates the use of all four fields on a line of assembly language.
The “quadword” part of this instruction means that 64 bits are moved. As you will see in more detail later, as requires that a single letter be appended to most instructions:
“b” | ⇒ | “byte” | ⇒ | operand is 8 bits |
“w” | ⇒ | “word” | ⇒ | operand is 16 bits |
“l” | ⇒ | “long” | ⇒ | operand is 32 bits |
“q” | ⇒ | “quadword” | ⇒ | operand is 64 bits |
to specify the size of the operand(s).
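The suffix table can be written as a small lookup. This is only a sketch to make the mapping concrete. The function names operand_bits and operand_bytes are hypothetical and are not part of as.

```c
#include <assert.h>

/* Hypothetical helper: maps an AT&T size suffix to the operand width
 * in bits. Returns 0 if the character is not one of the four suffixes. */
int operand_bits(char suffix)
{
    switch (suffix) {
    case 'b': return 8;   /* byte */
    case 'w': return 16;  /* word */
    case 'l': return 32;  /* long */
    case 'q': return 64;  /* quadword */
    default:  return 0;   /* not a size suffix */
    }
}

/* Operand width in bytes. The caller must pass a valid suffix. */
int operand_bytes(char suffix)
{
    int bits = operand_bits(suffix);
    assert(bits != 0);    /* precondition: suffix is b, w, l, or q */
    return bits / 8;
}
```

For example, the “q” in movq gives operand_bits('q') equal to 64, which is the full width of a register such as rbp.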
The value in the rbp register is an address. In 64-bit mode addresses can be 64 bits long, and we have to save the entire address.
The next line
uses only three of the fields.
As the name of this “program” implies, it does not do anything, but it still must return to the operating system. GNU/Linux expects the main function to return an integer to it, and the return value is placed in the eax register. Zero means that the program executed with no errors. This may not make a lot of sense to you at this point, but it should become clearer later in the book. Returning the integer zero to the operating system is accomplished on line 12:
Even though the CPU is in 64-bit mode, 64-bit integers are seldom needed. So the default behavior of the environment is to use 32 bits for ints. 64-bit ints can be specified in C/C++ with either the long or the long long modifier. In assembly language the programmer would use quadwords for such integers. (As pointed out on page 460, this instruction also zeros the high-order 32 bits of the rax register. But you should not write code that depends upon this behavior.)
The first two instructions in this function,
form a prologue to the actual processing that is performed by the function. They change some values in registers and use the call stack. Before returning to the operating system, it is essential that an epilogue be executed to restore the values. The following two instructions accomplish this.
Finally, this function must return to the function that called it, which is back in the operating system.
As you can see from this example, even a function that does nothing requires several instructions. The most commonly used assembly language instruction is
        movs    source, destination
where s denotes the size of the operand:
s | meaning | number of bits |
b | byte | 8 |
w | word | 16 |
l | longword | 32 |
q | quadword | 64 |
In the Intel syntax, the size of the data is determined by the operand, so the size character (b, w, l, or q) is not appended to the instruction, and the order of the operands is reversed:
Intel® Syntax:
        mov     destination, source
The mov instruction copies the bit pattern from the source operand to the destination operand. The bit pattern of the source operand is not changed. If the destination operand is a register and its size is less than 64 bits, the effect on the other bits in the register is shown in Table 7.1.
size | destination bits | remaining bits |
8 | 7 – 0 | 63 – 8 are unchanged |
8 | 15 – 8 | 63 – 16, 7 – 0 are unchanged |
16 | 15 – 0 | 63 – 16 are unchanged |
32 | 31 – 0 | 63 – 32 are set to 0 |
The mov instruction does not affect the rflags register. In particular, neither the CF nor the OF flags are affected. No more than one of the two operands may be a memory location. Thus, in order to move a value from one memory location to another, it must be moved from the first memory location into a register, then from that register into the second memory location. (Accessing data in memory will be covered in Sections 8.1 and 8.4.)
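The effect of each row of Table 7.1 can be modeled with ordinary integer operations. The following C functions are a sketch that simulates the register contents after a move. They do not execute any mov instructions, and the names (mov8_low, mov8_high, mov16, mov32, check_table_7_1) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit move into bits 7-0 (for example, %al): bits 63-8 are unchanged. */
uint64_t mov8_low(uint64_t reg, uint8_t value)
{
    return (reg & ~UINT64_C(0xff)) | value;
}

/* 8-bit move into bits 15-8 (for example, %ah): other bits are unchanged. */
uint64_t mov8_high(uint64_t reg, uint8_t value)
{
    return (reg & ~UINT64_C(0xff00)) | ((uint64_t)value << 8);
}

/* 16-bit move into bits 15-0 (for example, %ax): bits 63-16 are unchanged. */
uint64_t mov16(uint64_t reg, uint16_t value)
{
    return (reg & ~UINT64_C(0xffff)) | value;
}

/* 32-bit move into bits 31-0 (for example, %eax): bits 63-32 are set to 0. */
uint64_t mov32(uint64_t reg, uint32_t value)
{
    (void)reg;  /* none of the old contents survive a 32-bit move */
    return value;
}

/* Reproduces each row of Table 7.1 on one starting value. */
void check_table_7_1(void)
{
    const uint64_t old = UINT64_C(0x1122334455667788);

    assert(mov8_low(old, 0xAA)      == UINT64_C(0x11223344556677AA));
    assert(mov8_high(old, 0xAA)     == UINT64_C(0x112233445566AA88));
    assert(mov16(old, 0xAABB)       == UINT64_C(0x112233445566AABB));
    assert(mov32(old, 0xAABBCCDD)   == UINT64_C(0x00000000AABBCCDD));
}
```

Only the 32-bit case discards the rest of the register. This is why the 32-bit move of zero into eax in our “do nothing” program also clears the high-order half of rax.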
The other instructions used in this “do nothing” program — pushq, popq, and ret — use the call stack. The call stack will be discussed in Section 8.2, which will then allow us to discuss these instructions. For now, you should memorize how to use them as “boilerplate” for the prologue and epilogue of each function.
If you have any experience with x86 assembly language, the syntax used by the GNU assembler, as, will look a little strange to you. In principle, the syntax is arbitrary. A programmer could invent any sort of assembly language and write a program that would translate it into the appropriate machine code. But most CPU manufacturers publish a manual with a suggested assembly language syntax for their CPU.
Most assemblers for the x86 CPUs follow the syntax suggested by Intel®, but as uses the AT&T syntax. It is not radically different from Intel’s. Some of the more striking differences are:
feature | AT&T | Intel® |
operand order: | source, destination | destination, source |
register names: | prefixed with the “%” character, e.g., %eax | just the name, e.g., eax |
literal values: | prefixed with the “$” character, e.g., $123 | just the value, e.g., 123 |
operand size: | use the b, w, l, or q suffix on opcode to denote byte, word, long, or quadword | determined by the register specification (more complicated if operand is stored in memory) |
The GNU assembler, as, does not require the size suffix on instructions in all cases. From the info documentation for as:
It is recommended that you get in the habit of using the size suffix letters when you begin writing your own assembly language. This will help you to avoid introducing obscure bugs in your code.
The assembler directives are typically not specified by the CPU manufacturer, so you will see a much wider variety of syntax, depending on the particular assembler program. We will not try to list any differences here.
The GNU assembler, as, also supports the Intel® syntax. The assembler directive .intel_syntax says that the following assembly language is written in the Intel® syntax; .att_syntax says it is written in AT&T syntax. Using Intel® syntax, the assembly language code in Listing 7.3 would be written
Intel® Syntax:
main:   push    rbp
        mov     rbp, rsp
        mov     eax, 0
        mov     rsp, rbp
        pop     rbp
        ret
Keep in mind that gcc produces assembly language in AT&T syntax, so you will undoubtedly find it easier to use that when you write your own code. The .intel_syntax directive might be useful if somebody gives you an entire function written in Intel® syntax assembly language.
The syntax rules for our particular assembler, as, are described in an on-line manual that is in the GNU info format. as supports some two dozen computer architectures, so it is a challenge to wade through the info manual to find what you need. On the other hand, it provides the most up to date information. And it is especially important for learning how to use assembler directives because they are specific to the assembler.
Now would be a good time to start learning how to use info for as. As you encounter new assembly language concepts in this book, also look them up in info for as. If you are unfamiliar with info, at the GNU/Linux prompt, simply type
for a nice tutorial.
First, notice that the compiler-generated labels (.LFB0, .LFE0) each begin with a period character, just like assembler directives. You can tell that they are labels because of the “:” immediately following.
If you compare the assembly language program in Listing 7.3 with that generated by the compiler in Listing 7.2, you can see that the compiler includes much more information in the file. Most of this information will not be used elsewhere in this book. We explain it here for completeness.
The first line,
identifies the name of the C source file. When you write in assembly language this information clearly does not apply.
There are six .cfi_… (Call Frame Information) directives in this code on lines 7, 9, 10, 12, 15, and 17. These are used to provide information that is helpful in error recovery and debugging the program when it is running. We will not use them in this book. The gcc option to generate assembly language code without them is -fno-asynchronous-unwind-tables. This will be discussed in more detail below.
The eight lines
set up the call stack for this function. The use of the call stack will be explained in more detail in Section 8.2 on page 551 and in subsequent sections.
The labels generated by the compiler, .LFB0 and .LFE0, are used to mark the beginning and end of the program to improve the linking efficiency. Our programs will be very simple, so we will not need such labels.
Notice that the rest of the line after each of the labels main, .LFB0, and .LFE0 is blank. The assembler does not generate any machine code for these label-only lines, so they do not take up any memory. The next thing that comes in memory is the
instruction. Thus, both labels, main and .LFB0, apply to the address where this instruction is located.
We can use the -fno-asynchronous-unwind-tables option to turn off the .cfi directives feature, as shown in Listing 7.4. The GNU/Linux command is:
which gives the compiler-generated assembly language in Listing 7.4.
Lines 19 – 21 in Listing 7.2 are the same as lines 11 – 13 in Listing 7.4. They also use directives that do not apply to the programs we will be writing in this book.
Finally, you may have noticed that the main label is on a line by itself in Listing 7.2 but not in Listing 7.3. When there is only a label on a line, no machine instructions are generated, and no memory is allocated. Thus, the label really applies to the next line. It is common to place labels on their own line so that longer, easier to read labels can be used while still keeping the operations visually lined up in a column. This technique is illustrated in Listing 7.5.
The gcc compiler provides a set of options that will allow you to generate a listing that shows both the assembly language and the corresponding C statement(s). This makes it easier to see the assembly language that the compiler generates to implement each C statement. Compiling the program in Listing 7.1 with the command:
generates the assembly language code in Listing 7.6.
The “-g” option tells the compiler to include symbols for debugging. “-Wa,” passes the immediately following options to the assembly phase of the compilation process. Thus, the options passed to the assembler are “-adhls”, which cause the assembler to generate a listing with the following characteristics:
As you can see above, the secondary letters can be combined with one “-a.” The listing is written to standard output, which can be redirected to a file. We gave this file the “.lst” file extension because it cannot be assembled.
The x86-64 processors can also run in 32-bit mode. Most GNU/Linux distributions also provide a 32-bit version. Some distributions are only available in 32-bit.
The gcc option to compile a program for 32-bit mode is -m32. Listing 7.7 shows the assembly language generated by the GNU/Linux command:
The only differences between the 32-bit and 64-bit versions are that all the instructions use the “l” suffix to indicate “longword” because addresses are 32 bits, and only the 32-bit portion of the registers is used. That is, esp instead of rsp, etc.
An assembly language programmer would comment the code, as shown in Listing 7.8.
This assembly language version also includes the instruction to restore the stack pointer near the end of the function:
Since this function does not use the stack, this instruction is not required, but including it is a good habit to establish as you start to write your own functions in assembly language.
We present a highly simplified view of how assemblers and linkers work here. The goal of this presentation is to introduce the concepts. Most assemblers and linkers have capabilities that go far beyond the concepts described here (e.g., macro expansion, dynamic load/link). We leave a more thorough discussion of assemblers and linkers to a book on systems programming.
An assembler must perform the following tasks:
Since the numeric value of an address may be required before an instruction can be translated into machine language, there is a problem with forward references to memory locations. For example, a code sequence like:
creates a problem for the assembler when it needs to translate the
instruction on line 3. (Don’t forget that assembly language is line-oriented; translation is done one line at a time.) When this code sequence is executed, the immediately preceding instruction (cmpb $'y', response) compares the byte stored at location response with the character ‘y’. If they are not equal, i.e., a ‘y’ is not stored at location response, the jne instruction causes program flow to jump to location noChange. In order to accomplish this action, the translation of this instruction (the machine code) must include a numerical value that specifies how far to jump. That is, it must include the distance, in number of bytes, between the jne instruction and the memory location labeled noChange on line 23. In order to compute this distance, the assembler must determine the address that corresponds to the label noChange when it translates this instruction, but the assembler has not even encountered the noChange label, much less determined its corresponding address.
The simplest solution is to use a two-pass assembler:
Algorithm 7.1 is a highly simplified description of how the first pass of an assembler works.
The symbol table is carried from the first pass to the second. The second pass also consults a table of operation codes, which provides the machine code corresponding to each instruction mnemonic. A highly simplified description of the second pass is given in Algorithm 7.2.
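The two passes can be sketched in C on a toy program. This is an illustration under stated assumptions, not how as is implemented: each source line is reduced to an optional label, a made-up instruction size, and an optional jump target, and all of the names (struct line, first_pass, second_pass, demo_forward_reference) are hypothetical. On the x86 the jump distance is measured from the address of the instruction that follows the jump, so the second pass subtracts that address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 16

/* One line of a toy program. All fields are illustrative assumptions. */
struct line {
    const char *label;   /* label defined on this line, or NULL */
    int size;            /* bytes of machine code the line produces */
    const char *target;  /* label this line jumps to, or NULL */
};

struct symbol { const char *name; int address; };
struct symtab { struct symbol entry[MAX_SYMBOLS]; int count; };

/* Returns the address recorded for name, or -1 if it is not in the table. */
int lookup(const struct symtab *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entry[i].name, name) == 0)
            return t->entry[i].address;
    return -1;
}

/* Pass 1: advance a location counter and record the address of each label. */
void first_pass(const struct line *prog, int n, struct symtab *t)
{
    int location = 0;
    t->count = 0;
    for (int i = 0; i < n; i++) {
        if (prog[i].label != NULL) {
            assert(t->count < MAX_SYMBOLS);
            t->entry[t->count].name = prog[i].label;
            t->entry[t->count].address = location;
            t->count++;
        }
        location += prog[i].size;
    }
}

/* Pass 2: compute the jump distance for each line that has a target.
 * Returns the number of targets not found in the symbol table; those
 * references would be left for the linker to fill in. */
int second_pass(const struct line *prog, int n,
                const struct symtab *t, int *disp)
{
    int location = 0;
    int unresolved = 0;
    for (int i = 0; i < n; i++) {
        location += prog[i].size;       /* address of the following line */
        disp[i] = 0;
        if (prog[i].target != NULL) {
            int address = lookup(t, prog[i].target);
            if (address < 0)
                unresolved++;
            else
                disp[i] = address - location;
        }
    }
    return unresolved;
}

/* Toy version of the forward reference discussed above. The sizes are
 * made-up values, not real x86 encodings. */
int demo_forward_reference(void)
{
    const struct line prog[] = {
        { NULL,       7, NULL       },  /* cmpb  $'y', response */
        { NULL,       2, "noChange" },  /* jne   noChange (forward) */
        { NULL,       5, NULL       },  /* some other instruction */
        { "noChange", 1, NULL       },  /* noChange: ... */
    };
    struct symtab table;
    int disp[4];

    first_pass(prog, 4, &table);
    int unresolved = second_pass(prog, 4, &table, disp);
    assert(unresolved == 0);
    return disp[1];   /* distance stored in the jne instruction */
}
```

In this toy program the label noChange ends up at address 14 and the instruction after the jne is at address 9, so the distance is 5 bytes. The second pass can compute it only because the first pass has already placed noChange in the symbol table.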
Look again at the code sequence above. On line 14 there is the instruction:
This call to the write function is a reference to a memory label outside the file being assembled. Thus, the assembler has no way to determine the address of write for the symbol table during the first pass. The only thing the assembler can do during the second pass is to leave enough memory space for the address of write when it assembles this instruction. The actual address will have to be filled in later in order to create the entire program. Filling in these references to external memory locations is the job of the linker program.
The algorithm for linking functions together is very similar to that of the assembler. The same forward reference problem exists. Again, the simplest solution is to use a two-pass linker program.
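A two-pass linker can be sketched in the same spirit. The following C code is a toy model under stated assumptions, not the way ld works: each object file is reduced to a size, at most one global name defined at its start, and at most one external name that it refers to. For illustration it pretends that write is defined in a second object file rather than in a library. All of the names (struct module, link_pass1, link_pass2, demo_link) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GLOBALS 16

/* One object file, reduced to what this sketch needs. The layout is an
 * illustrative assumption, not the ELF format. */
struct module {
    int size;             /* bytes of machine code in the module */
    const char *defines;  /* global name defined at offset 0, or NULL */
    const char *calls;    /* external name the module refers to, or NULL */
    int base;             /* load address, filled in by pass 1 */
    int resolved;         /* address of calls, filled in by pass 2 */
};

struct global  { const char *name; int address; };
struct globals { struct global entry[MAX_GLOBALS]; int count; };

/* Returns the address of a global name, or -1 if it is not defined. */
static int find_global(const struct globals *g, const char *name)
{
    for (int i = 0; i < g->count; i++)
        if (strcmp(g->entry[i].name, name) == 0)
            return g->entry[i].address;
    return -1;
}

/* Pass 1: place the modules one after another and record where each
 * global name ends up. */
void link_pass1(struct module *m, int n, struct globals *g)
{
    int location = 0;
    g->count = 0;
    for (int i = 0; i < n; i++) {
        m[i].base = location;
        if (m[i].defines != NULL) {
            assert(g->count < MAX_GLOBALS);
            g->entry[g->count].name = m[i].defines;
            g->entry[g->count].address = location;
            g->count++;
        }
        location += m[i].size;
    }
}

/* Pass 2: fill in each external reference from the global table.
 * Returns the number of references that remain undefined. */
int link_pass2(struct module *m, int n, const struct globals *g)
{
    int undefined = 0;
    for (int i = 0; i < n; i++) {
        m[i].resolved = -1;
        if (m[i].calls != NULL) {
            m[i].resolved = find_global(g, m[i].calls);
            if (m[i].resolved < 0)
                undefined++;
        }
    }
    return undefined;
}

/* The first module refers to a name defined in a later module, which
 * is the forward reference problem at the linking level. */
int demo_link(void)
{
    struct module m[] = {
        { 40, "main",  "write", 0, -1 },
        { 24, "write", NULL,    0, -1 },
    };
    struct globals g;

    link_pass1(m, 2, &g);
    int undefined = link_pass2(m, 2, &g);
    assert(undefined == 0);
    return m[0].resolved;   /* address filled in for the call to write */
}
```

Here the first module occupies addresses 0 through 39, so write is placed at address 40, and that is the value filled in for the call in main.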
The highly simplified algorithms in Algorithms 7.3 and 7.4 also provide for loading the entire program into memory. The functions are linked together as they are loaded. In practice, this is seldom done. For example, the GNU linker, ld, does not load the program into memory. Instead, it creates another machine language file: the executable program. The executable program file contains all the functions of the program with all the cross-function memory references resolved. Thus ld is a link editor program.
Getting even more realistic, many of the functions used by a program are not even included in the executable program file. They are loaded as required when the program is executing. The link editor program must provide dynamic links for the executable program file.
However, you can get the general idea of linking separately assembled (or compiled) functions together by studying the algorithms in Algorithms 7.3 and 7.4. In particular, notice that the assembler (or compiler) must include other information in addition to machine code in the object file. The additional information includes:
Since we are concerned with assembly language in this book, let us go through the steps of creating a program for the assembly language source code in Listing 7.5. Figure 7.1 is a screen shot of what I did with my typing in boldface. The notation I use here assumes that I am doing this for a class named CS 252, and my instructor has specified that each project should be submitted in a directory named CS252lastNameNN, where lastName is the student’s surname and NN is the project number. I have appended .0 to the project folder name for my own use. As I develop my project, subsequent versions will be numbered .1, .2, ….
bob$ mkdir CS252plantz01.0
bob$ cd CS252plantz01.0/
bob$ ls
bob$ pwd
/home/bob/CS252/CS252plantz01.0
bob$ gedit doNothingProg.s
This is where I used gedit to enter the program from Listing 7.5, saved the program, and quit gedit.
bob$ ls
doNothingProg.s
bob$ as --gstabs -o doNothingProg.o doNothingProg.s
bob$ ls
doNothingProg.o doNothingProg.s
bob$ gcc -o doNothingProg doNothingProg.o
bob$ ls
doNothingProg doNothingProg.o doNothingProg.s
bob$ ./doNothingProg
bob$
____________________________________________________________________________________
Let us go through the steps in Figure 7.1 one line at a time, explaining each line.
bob$ mkdir CS252plantz01.0
I create a directory named “CS252plantz01.0.” All the files that you create for each program should be kept in a separate directory only for that program.
bob$ cd CS252plantz01.0/
I make the newly created subdirectory the current working directory.
bob$ ls
bob$ pwd
/home/bob/CS252/CS252plantz01.0
These two commands show that the new subdirectory is empty and where my current working directory is located within the file hierarchy.
bob$ gedit doNothingProg.s
This starts up the gedit program and creates a new file named “doNothingProg.s.” I typed my program using the text editor, saved the file, and quit the text editor. You may use any text editor. Avoid using a word processor because it will probably add formatting codes to the file.
bob$ ls
doNothingProg.s
This shows that I have created the file, doNothingProg.s.
bob$ as --gstabs -o doNothingProg.o doNothingProg.s
bob$ ls
doNothingProg.o doNothingProg.s
On the first line, I invoke the assembler, as. The --gstabs option directs the assembler to include debugging information with the output file. We will very definitely make use of the debugger! The -o option is followed by the name of the output (object) file. You should always use the same name as the source file, but with the .o extension. The second command simply shows the new file that has been created in my directory.
bob$ gcc -o doNothingProg doNothingProg.o
bob$ ls
doNothingProg doNothingProg.o doNothingProg.s
Next I link the object file. Even though there is only one object file, this step is required in order to bring in the GNU/Linux libraries needed to create an executable program. As in as, the -o option is used to specify the name of a file. In the linking case, this will be the name of the final product of our efforts.
The linker, ld, could be invoked directly to perform this step, but the drawback is that you must also explicitly specify all the libraries that are used. By using gcc for the linking, the appropriate libraries are automatically included in the linking.
bob$ ./doNothingProg
bob$
Finally, I execute the program (which does nothing).
This summary shows the assembly language instructions introduced thus far in the book. It should be sufficient for doing the exercises in the current chapter. The page number where the instruction is explained in more detail, which may be in a subsequent chapter, is also given. The summary will be repeated and updated, as appropriate, at the end of each succeeding chapter in the book. This book provides only an introduction to the usage of each instruction. You need to consult the manuals ([2] – [6], [14] – [18]) in order to learn all the possible uses of the instructions.
data movement:
opcode | source | destination | action | page |
movs | $imm/%reg | %reg/mem | move | 506 |
movs | mem | %reg | move | 506 |
popw | | %reg/mem | pop from stack | 566 |
pushw | $imm/%reg/mem | | push onto stack | 566 |
s = b, w, l, q; w = l, q

arithmetic/logic:
opcode | source | destination | action | page |
cmps | $imm/%reg | %reg/mem | compare | 676 |
cmps | mem | %reg | compare | 676 |
incs | | %reg/mem | increment | 698 |
s = b, w, l, q

program flow control:
opcode | location | action | page |
call | label | call function | 546 |
je | label | jump equal | 679 |
jmp | label | jump | 691 |
jne | label | jump not equal | 679 |
ret | | return from function | 583 |
The functions you are asked to write in these exercises are not complete programs. You can check that you have written a valid function by writing a main function in C that calls the function you have written in assembly language. Compile the main function with the -c option so that you get the corresponding object (.o) file. Assemble your assembly language file. Make sure that you specify the debugging options when compiling/assembling. Use the linking phase of gcc to link the .o files together. Run your program under gdb and set a breakpoint in your assembly language function. (Hint: you can specify the source file name in gdb commands.) Now you can verify that your assembly language function is being called. If the function returns a value, you can print that value in the main function using printf.
(§7.2) Write the C function:
in assembly language. Make sure that it assembles with no errors. Use the -S option to compile f.c and compare gcc’s assembly language with yours.
(§7.2) Write the C function:
in assembly language. Make sure that it assembles with no errors. Use the -S option to compile g.c and compare gcc’s assembly language with yours.
(§7.2) Write the C function:
in assembly language. Make sure that it assembles with no errors. Use the -S option to compile h.c and compare gcc’s assembly language with yours.
(§7.2) Write three assembly language functions that do nothing but return an integer. They should each return different, non-zero, integers. Write a C main function to test your assembly language functions. The main function should capture each of the return values and display them using printf.
(§7.2) Write three assembly language functions that do nothing but return a character. They should each return different characters. Write a C main function to test your assembly language functions. The main function should capture each of the return values and display them using printf.
(§7.2, §6.5) Write an assembly language function that returns four characters. The return value is always in the eax register in our environment, so you can store four characters in it. The easiest way to do this is to determine the hexadecimal value for each character, then combine them so you can store one 32-bit hexadecimal value in eax.
Write a C main function to test your assembly language function. The main function should capture the return values and display them using the write system call.
Explain the order in which they are displayed.