Linux allows multiple programs to be executing concurrently, and each of the programs is accessing the hardware resources of the computer. One of the jobs of the OS is to manage the hardware resources in such a way that the programs do not interfere with one another. In this chapter we introduce the CPU features that enable Linux to carry out this management task.
The read system call is a good example of a program using the services of the OS. It requests input from the keyboard. The OS handles all input from the keyboard, so the read function must first request keyboard input from the OS. One of the reasons this request must be funneled through the OS is that other programs may also be requesting input from the keyboard, and the OS needs to ensure that each program gets the keyboard input intended for it.
Once the request for input has been made, it would be very inefficient for the OS to wait until a user strikes a key. So the OS allows another program to use the CPU, and the keyboard notifies the OS when a key has been struck. To avoid losing a character, this notification interrupts the CPU so that the OS can read the character from the keyboard.
Another example comes from something you probably did not intend to do. Unless you are a perfect programmer, you have probably seen a “segmentation fault.” This can occur when your program attempts to access memory that has not been allocated for your program. I have gotten this error (yes, I still make programming mistakes!) when I have made a mistake using the stack, or when I dereference a register that contains a bad address.
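The segmentation fault described above can be provoked on purpose. The following is a minimal sketch in C, assuming a Linux system with the POSIX sigaction, sigsetjmp, and mmap functions. The function name write_to_forbidden_page is hypothetical, introduced only for this illustration. It maps one page that the program is not allowed to touch, writes to it, and catches the SIGSEGV signal that the OS delivers in response.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf resume_point;   /* where execution resumes after the fault */

/* Runs when the OS delivers SIGSEGV to this process. */
static void on_segv(int signal_number)
{
    (void)signal_number;
    siglongjmp(resume_point, 1);
}

/* Hypothetical helper: returns 1 if the write was reported as a
   segmentation fault, 0 if the write succeeded, -1 if setup failed. */
int write_to_forbidden_page(void)
{
    struct sigaction action, previous;
    memset(&action, 0, sizeof action);
    action.sa_handler = on_segv;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGSEGV, &action, &previous) != 0)
        return -1;

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    /* One page that may be neither read nor written. */
    char *page = mmap(NULL, page_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        sigaction(SIGSEGV, &previous, NULL);
        return -1;
    }

    int outcome;
    if (sigsetjmp(resume_point, 1) == 0) {
        *(volatile char *)page = 'x';   /* the CPU raises an exception here */
        outcome = 0;                    /* reached only if no fault occurred */
    } else {
        outcome = 1;                    /* we arrived here from on_segv */
    }

    munmap(page, page_size);
    sigaction(SIGSEGV, &previous, NULL);   /* restore the earlier handler */
    return outcome;
}
```

Without the handler, the default action for SIGSEGV terminates the program, which is the behavior most programmers see.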
We can summarize these three types of events: a software interrupt, in which a program requests a service from the OS (the read system call); an interrupt, in which a device outside the CPU needs attention (the keyboard); and an exception, in which an instruction in the program causes an error (the segmentation fault).
In response to any of these events, the CPU performs an operation that is very similar to the call instruction. The value in the rip register is pushed onto the stack, and another address is placed in the rip register. The net effect is that a function is called, just as in the call instruction, but the address of the called function is specified in a different way, and additional information is pushed onto the stack. Before describing the differences, we discuss what ought to occur in order for the OS to deal with each of these events.
Keyboard input is a good place to start the discussion. It is impossible to know exactly when someone will strike a key on the keyboard, or how soon the next key will be struck. For example, if a key is struck while the CPU is executing a cmpb instruction that is immediately followed by a je instruction, then in order to avoid losing the keystroke, we would like to read the character immediately after the cmpb instruction is executed but before the CPU starts working on the je instruction.
The function that reads the character from the keyboard is called an interrupt handler or simply handler. Handlers are part of the OS. In Linux they can either be built into the kernel or loaded as separate modules as needed.
The timing, between the two instructions, means that the CPU will acknowledge an interrupt only between instruction execution cycles. Just before executing the je instruction the rip register holds the address of the je instruction, and it is that address that gets pushed onto the stack. That is, since calling a handler occurs automatically and does not involve fetching an instruction, the current value of rip pushed onto the stack is the correct return address from the handler.
There is another important issue. It is almost certain that the rflags register will be changed by the handler that gets called. When program control returns to the je instruction (which is supposed to depend on the state of the rflags register as a result of executing the cmpb instruction), there is little chance that the program will do what the programmer intended. Thus we conclude that in addition to saving the rip register, the CPU must also save the rflags register.
The next issue is the question of how the CPU knows the address of the appropriate handler to call. In the call instruction, the address of the function to call is specified as an operand to the instruction, usually as a label that the assembler and linker resolve to an address.
Since the keyboard has no knowledge of the software, there must be some other mechanism for specifying the address of the handler to call. The answer to this problem is that addresses of interrupt handlers are stored in an Interrupt Descriptor Table (IDT). Each possible interrupt in the system is associated with a unique entry in the IDT.
The entries in the IDT are data structures (128 bits in 64-bit mode, 64 bits in 32-bit mode) called gate descriptors. In addition to the handler address, they contain information that the CPU uses to help protect the integrity of the OS.
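To make the 128-bit figure concrete, here is a sketch in C of how a 64-bit mode gate descriptor can be laid out. The field boundaries follow the published x86-64 layout, but the struct name, field names, and helper functions are assumptions made for this illustration and are not taken from any kernel source.

```c
#include <stdint.h>

/* A 64-bit mode gate descriptor: 16 bytes = 128 bits.
   The handler address is split into three pieces. */
struct gate_descriptor {
    uint16_t offset_low;     /* handler address, bits 0-15 */
    uint16_t selector;       /* code segment selector used by the handler */
    uint8_t  ist;            /* bits 0-2: interrupt stack table index */
    uint8_t  type_attr;      /* bits 0-3 type, bits 5-6 privilege, bit 7 present */
    uint16_t offset_middle;  /* handler address, bits 16-31 */
    uint32_t offset_high;    /* handler address, bits 32-63 */
    uint32_t reserved;       /* must be zero */
};

_Static_assert(sizeof(struct gate_descriptor) == 16,
               "a gate descriptor occupies 128 bits");

/* Hypothetical helper: build an interrupt gate (type 0xE, present). */
struct gate_descriptor make_gate(uint64_t handler, uint16_t selector,
                                 unsigned privilege_level)
{
    struct gate_descriptor gate = {0};
    gate.offset_low    = (uint16_t)(handler & 0xFFFF);
    gate.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
    gate.offset_high   = (uint32_t)(handler >> 32);
    gate.selector      = selector;
    gate.type_attr     = (uint8_t)(0x80 | ((privilege_level & 3u) << 5) | 0x0E);
    return gate;
}

/* Reassemble the handler address from its three pieces. */
uint64_t gate_handler_address(const struct gate_descriptor *gate)
{
    return (uint64_t)gate->offset_low
         | ((uint64_t)gate->offset_middle << 16)
         | ((uint64_t)gate->offset_high << 32);
}

/* Extract the two-bit privilege field. */
unsigned gate_privilege_level(const struct gate_descriptor *gate)
{
    return (gate->type_attr >> 5) & 3u;
}
```

The split address is a historical artifact: the 64-bit format extends the older 64-bit-wide descriptor rather than replacing it.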
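After the CPU has completed execution of the current instruction, the following actions must occur when there is an interrupt from a device external to the CPU: the contents of the rip and rflags registers must be saved; the address of the handler associated with the interrupt must be located in the IDT; and program control must be transferred to that handler.
We next consider exceptions. These are typically the result of a number that the CPU cannot deal with. Examples discussed in this chapter are division by zero and a reference to a memory address that has not been allocated to the program.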
In a perfect world, the application software would include all the checks that would prevent the occurrence of many of these errors. The reality is that no program is perfect, so some of these errors will occur.
When they do occur, it is the responsibility of the OS to take an appropriate action. The currently executing instruction may have caused the exception to occur. So the CPU often reacts to an exception in the midst of a normal instruction execution cycle. The actions that the CPU must take in response to an exception are essentially the same as those for an interrupt: save the rip and rflags registers, find the address of the handler in the IDT, and transfer control to the handler.
Not all exceptions are due to actual program errors. For example, when a program references an address in another part of the program that has not yet been loaded into memory, it causes a page fault exception. The OS must provide a handler that loads the appropriate part of the program from the disk into memory, then continues with normal program execution.
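Page faults that are not errors can be observed from an ordinary program. The sketch below, which assumes Linux and the POSIX getrusage and mmap functions, counts the page faults the OS services when freshly mapped memory is touched for the first time. Note the substitution: it uses anonymous memory rather than program code loaded from disk, so it shows "minor" faults (no disk access), but the exception mechanism is the same. The function name faults_from_first_touch is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of page faults serviced so far without disk I/O, or -1. */
static long minor_faults(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_minflt;
}

/* Hypothetical helper: map page_count pages, touch each one once, and
   return how many additional page faults were recorded (or -1). */
long faults_from_first_touch(size_t page_count)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t length = page_count * page_size;

    unsigned char *region = mmap(NULL, length, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    long before = minor_faults();
    for (size_t i = 0; i < length; i += page_size)
        ((volatile unsigned char *)region)[i] = 1;  /* first access to the page */
    long after = minor_faults();

    munmap(region, length);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

On a typical Linux system the returned count is close to page_count, but the exact number depends on the kernel configuration, so the sketch makes no promise about it.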
The usefulness of the interrupt/exception handling mechanism for requesting OS services is not apparent until we discuss privilege levels. As mentioned above, one of the jobs of the OS is to keep concurrently executing programs from interfering with one another. It uses the privilege level mechanism in the CPU to do this.
At any given time, the CPU is running in one of four possible privilege levels. The levels, from most privileged to least, are:
Level 0: Provides direct access to all hardware resources. Restricted to the lowest-level operating system functions, e.g., BIOS, memory management.
Level 1: Somewhat restricted access to hardware resources. Might be used by library routines and software that controls I/O devices.
Level 2: More restricted access to hardware resources. Might be used by library routines and software that controls I/O devices.
Level 3: No direct access to hardware resources. Applications programs run at this level.
The OS needs to have direct access to all the hardware, so it executes at privilege level 0. Application programs should be limited, so they execute at privilege level 3. The CPU includes a mechanism for recognizing the privilege level of the memory associated with each program. A program can access memory whose privilege level is the same as or lower than its own (a numerically equal or higher level), but not memory at a higher privilege level. Thus, an application program (running at level 3) cannot access memory that belongs to the OS.
Gate descriptors include privilege level information in addition to the handler address. The CPU’s interrupt/exception mechanism automatically switches to this privilege level when it calls the handler function. Thus, for example, the keyboard might interrupt during the execution of an application program running at privilege level 3, but its handler function would execute at privilege level 0.
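The privilege rule can be written down in a few lines. This is a simplified model in C, not the CPU's full protection check: it assumes only that the privilege level is the low two bits of a code segment selector and that lower numbers mean more privilege. The function names are hypothetical, and the selector values mentioned afterward are ones commonly seen on x86-64 Linux, offered only as illustrations.

```c
#include <stdint.h>

/* The privilege level is encoded in the low two bits of a
   code segment selector. */
unsigned privilege_level_of(uint16_t selector)
{
    return selector & 0x3u;
}

/* Simplified rule from the text: code may access memory at its own
   privilege level or at a less privileged (numerically higher) level. */
int may_access(unsigned code_level, unsigned memory_level)
{
    return code_level <= memory_level;
}
```

For example, privilege_level_of(0x33) yields 3 (an application) and privilege_level_of(0x10) yields 0 (the OS), so may_access(3, 0) is false while may_access(0, 3) is true.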
The software interrupt allows an application program to use OS services while still allowing the OS to control this access. The instruction is int $n, where n specifies the nth entry in the IDT.
Older versions of the Linux kernel used int $0x80 to make system calls. The code corresponding to the desired action is loaded into eax and the arguments are loaded into the proper registers before the system call is executed. The recommended technique for making system calls is discussed in Section 15.6 on page 878.
Each entry in the IDT is called a vector. The CPU is hardwired to associate vectors 0 – 31 with specific exceptions. For example, vector number 0 represents a divide-by-zero exception. Vector number 14 is a page fault exception.
Vectors 32 – 255 can be assigned to interrupts, both external and the int $n instruction. These assignments are determined by the OS programmers.
During OS initialization, the address of a handler function is stored in the gate descriptor corresponding to the vector number it is designed to handle. Other information in the gate descriptor causes the CPU to switch to a higher (numerically lower) privilege level, so the handler function has appropriate access to the hardware.
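The idea of installing handlers by vector number can be modeled in user space. The sketch below is a toy in C: an array of function pointers stands in for the IDT, and an ordinary function call stands in for the CPU's dispatch. None of the names come from Linux, and a real IDT holds gate descriptors, not bare function pointers.

```c
#include <stddef.h>

#define VECTOR_COUNT    256   /* vectors 0-255 */
#define FIRST_OS_VECTOR 32    /* vectors 0-31 are fixed by the CPU */

typedef int (*handler_fn)(int vector);

/* Toy stand-in for the IDT: one handler slot per vector. */
static handler_fn toy_idt[VECTOR_COUNT];

/* Vectors 0-31 have meanings assigned by the hardware. */
int is_cpu_defined_vector(int vector)
{
    return vector >= 0 && vector < FIRST_OS_VECTOR;
}

/* What the OS does during initialization: record the handler. */
int install_handler(int vector, handler_fn handler)
{
    if (vector < 0 || vector >= VECTOR_COUNT || handler == NULL)
        return -1;
    toy_idt[vector] = handler;
    return 0;
}

/* What the CPU does when the event occurs: look up and call.
   Returns -1 if no handler has been installed for the vector. */
int dispatch(int vector)
{
    if (vector < 0 || vector >= VECTOR_COUNT || toy_idt[vector] == NULL)
        return -1;
    return toy_idt[vector](vector);
}

/* Sample handler for experimentation: reports which vector ran. */
int echo_vector(int vector)
{
    return vector;
}
```

After install_handler(14, echo_vector), a call to dispatch(14) runs the sample handler, while dispatch on a vector with no installed handler returns -1.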
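Whenever an interrupt or exception occurs the CPU executes an exception processing cycle, which consists of the following actions: the contents of the rflags and rip registers, together with the current privilege level information, are pushed onto the stack; the vector number is used to locate the corresponding gate descriptor in the IDT; and the CPU switches to the privilege level given in the gate descriptor and loads the handler address from the descriptor into rip.
The CPU then continues with a normal instruction processing cycle: fetch the instruction at the address in rip, and so on. Thus, control will transfer to the handler function.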
Depending upon the nature of an exception and what actually caused it, CPU execution may or may not be returned to the program that was executing when the exception occurred.
There is one more part of this puzzle. Since the ret instruction simply pops the value at the top of the stack into the rip register, it will not work for the OS's handler function. The CPU has another instruction, iret, that correctly pops everything off the stack into the rip and rflags registers and restores the privilege level to where it was before the handler function was invoked. (The privilege level information was also stored on the stack.)
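The difference between ret and iret can be seen in a small simulation. The following C sketch models only what the text describes: rip, rflags, and a privilege level saved on a stack. The real CPU saves additional registers, and the struct and function names here are hypothetical.

```c
#include <stdint.h>

/* A drastically simplified CPU: only the state discussed in the text. */
struct toy_cpu {
    uint64_t rip;
    uint64_t rflags;
    unsigned level;        /* current privilege level */
    uint64_t stack[16];    /* no bounds checking; this is only a sketch */
    int      top;          /* index of the next free stack slot */
};

static void push(struct toy_cpu *cpu, uint64_t value)
{
    cpu->stack[cpu->top++] = value;
}

static uint64_t pop(struct toy_cpu *cpu)
{
    return cpu->stack[--cpu->top];
}

/* Entering a handler: save the interrupted state, then switch. */
void toy_interrupt(struct toy_cpu *cpu, uint64_t handler, unsigned handler_level)
{
    push(cpu, cpu->level);
    push(cpu, cpu->rflags);
    push(cpu, cpu->rip);
    cpu->rip = handler;
    cpu->level = handler_level;
}

/* ret restores rip only, leaving rflags and the level on the stack. */
void toy_ret(struct toy_cpu *cpu)
{
    cpu->rip = pop(cpu);
}

/* iret restores everything that toy_interrupt saved. */
void toy_iret(struct toy_cpu *cpu)
{
    cpu->rip = pop(cpu);
    cpu->rflags = pop(cpu);
    cpu->level = (unsigned)pop(cpu);
}
```

If a handler changes rflags and returns with toy_ret, the interrupted program resumes with the wrong flags, at the handler's privilege level, and with two stale values still on the stack; toy_iret leaves the simulated CPU exactly as it was before the interrupt.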
Using a software interrupt to invoke one of the services provided by the OS is something of an overkill. The x86-64 architecture includes another instruction that causes the CPU to change privilege levels but neither uses the stack nor goes through the IDT, thus saving execution time. The instruction is syscall.
We first introduced it in Section 8.6 (page 589) to perform I/O.
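The syscall instruction causes the CPU to copy the address of the next instruction (the value in rip) into the rcx register, copy the rflags register into r11, switch to privilege level 0, and load rip with the address of the OS's system call entry point, which the OS stored in a special CPU register during initialization.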
Now the CPU has been switched to privilege level 0, and the OS has control and can enforce orderly use of the hardware.
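Since syscall saves rip and rflags in rcx and r11, any code that executes the instruction must treat those two registers as overwritten. The sketch below shows this from C using GCC/Clang inline assembly on x86-64 Linux. The function name raw_write is hypothetical, and on other platforms the sketch falls back to the C library's syscall function so that it still compiles.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke the write system call directly.
   On x86-64 Linux an error is returned as a negative errno value;
   the fallback branch follows the C library convention instead
   (-1 with errno set). */
long raw_write(int fd, const void *buffer, size_t count)
{
#if defined(__x86_64__) && defined(__linux__)
    long result;
    __asm__ volatile ("syscall"
                      : "=a"(result)            /* return value comes back in rax */
                      : "a"((long)SYS_write),   /* system call code in rax */
                        "D"((long)fd),          /* first argument in rdi */
                        "S"(buffer),            /* second argument in rsi */
                        "d"(count)              /* third argument in rdx */
                      : "rcx", "r11", "memory"); /* syscall overwrites rcx and r11 */
    return result;
#else
    return syscall(SYS_write, fd, buffer, count);
#endif
}
```

A call such as raw_write(1, "ok\n", 3) writes three bytes to standard output and returns the number of bytes written.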
The program in Listing 15.1 illustrates the use of syscall to do system calls without using the C libraries. See Exercise 15-1 for using syscall within the C runtime environment.
In Section 8.1 (page 544) we saw how to call the write system call function to write characters to standard out (the screen). write and the other system call functions are simply C wrappers that load the proper code in eax and the arguments into the appropriate registers.
Several system call codes are shown in Table 15.1.
|system call||eax||rdi||rsi||rdx||return value (rax)|
|read||0||file descriptor (e.g., 0 for standard input)||pointer to storage area||number of bytes to read||number of bytes read|
|write||1||file descriptor (e.g., 1 for standard output)||pointer to first byte||number of bytes to write||number of bytes written|
|open||2||pointer to||flags||mode||file descriptor|
For additional system call codes see the unistd_64.h file on your system. The arguments for each system call are given in the man page for the corresponding C version. For example, man 2 write describes the write system call.
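There is a complementary instruction, sysret, which the OS executes in order to return from a system call.
The sysret instruction causes the CPU to copy the value in rcx back into rip, copy the value in r11 back into rflags, and switch back to privilege level 3.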
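We summarize the differences between a call instruction and an interrupt/exception. The similarities are that both push a return address (the value in rip) onto the stack, and both load rip with the address of the function to be executed.
The additional features of the interrupt/exception are that the rflags register and the privilege level information are also pushed onto the stack, the address of the handler is taken from the IDT rather than from an operand of the instruction, the CPU may switch privilege levels, and the handler must return with iret rather than ret.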
This summary shows the assembly language instructions introduced thus far in the book. The page number where the instruction is explained in more detail, which may be in a subsequent chapter, is also given. This book provides only an introduction to the usage of each instruction. You need to consult the manuals in order to learn all the possible uses of the instructions.
|cbtw||convert byte to word, al → ax||699|
|cwtl||convert word to long, ax → eax||699|
|cltq||convert long to quad, eax → rax||699|
|cwtd||convert word to long, ax → dx:ax||788|
|cltd||convert long to quad, eax → edx:eax||788|
|cqto||convert quad to octuple, rax → rdx:rax||788|
|movsss||$imm/%reg||%reg/mem||move, sign extend||696|
|movzss||$imm/%reg||%reg/mem||move, zero extend||696|
|popw||%reg/mem||pop from stack||568|
|pushw||$imm/%reg/mem||push onto stack||568|
s = b, w, l, q; w = l, q; cc = condition codes
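The difference between the sign-extending and zero-extending moves in this table can be checked with C integer conversions, which compilers for x86-64 typically implement with these instructions. This is a small sketch; the function names are hypothetical, and the conversion of an out-of-range value to int8_t is assumed to wrap in two's complement, as it does with GCC and Clang.

```c
#include <stdint.h>

/* Widen a byte by copying its sign bit into the upper bits
   (the effect of a sign-extending move). */
int32_t sign_extend_byte(uint8_t value)
{
    return (int32_t)(int8_t)value;
}

/* Widen a byte by filling the upper bits with zeros
   (the effect of a zero-extending move). */
uint32_t zero_extend_byte(uint8_t value)
{
    return (uint32_t)value;
}
```

The two functions agree for bytes below 0x80 and differ for bytes with the high bit set: the bit pattern 0x80 becomes 0xffffff80 when sign-extended and 0x00000080 when zero-extended.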
|leaw||mem||%reg||load effective address||581|
|ors||$imm/%reg||%reg/mem||bit-wise inclusive or||750|
|ors||mem||%reg||bit-wise inclusive or||750|
|sals||$imm/%cl||%reg/mem||shift arithmetic left||759|
|sars||$imm/%cl||%reg/mem||shift arithmetic right||754|
|xors||$imm/%reg||%reg/mem||bit-wise exclusive or||750|
|xors||mem||%reg||bit-wise exclusive or||750|
s = b, w, l, q; w = l, q
program ﬂow control:
|iret||return from kernel function||878|
|ja||label||jump above (unsigned)||686|
|jae||label||jump above/equal (unsigned)||686|
|jb||label||jump below (unsigned)||686|
|jbe||label||jump below/equal (unsigned)||686|
|jg||label||jump greater than (signed)||689|
|jge||label||jump greater than/equal (signed)||689|
|jl||label||jump less than (signed)||689|
|jle||label||jump less than/equal (signed)||689|
|jne||label||jump not equal||682|
|jno||label||jump no overflow||682|
|jcc||label||jump on condition codes||682|
|leave||undo stack frame||582|
|ret||return from function||585|
|syscall||call kernel function||589|
|sysret||return from kernel function||883|
cc = condition codes
x87 floating point:
|flds||memint||load floating point||861|
|fsts||memint||floating point store||861|
s = b, w, l, q; w = l, q
SSE floating point conversion:
|cvtsd2si||%xmmreg/mem||%reg||scalar double to signed integer||847|
|cvtsd2ss||%xmmreg||%xmmreg/%reg||scalar double to single float||847|
|cvtsi2sd||%reg||%xmmreg/mem||signed integer to scalar double||847|
|cvtsi2sdq||%reg||%xmmreg/mem||signed integer to scalar double||847|
|cvtsi2ss||%reg||%xmmreg/mem||signed integer to scalar single||847|
|cvtsi2ssq||%reg||%xmmreg/mem||signed integer to scalar single||847|
|cvtss2sd||%xmmreg||%xmmreg/mem||scalar single to scalar double||847|
|cvtss2si||%xmmreg/mem||%reg||scalar single to signed integer||847|
|cvtss2siq||%xmmreg/mem||%reg||scalar single to signed integer||847|
register direct:
The data value is located in a CPU register.
syntax: name of the register with a “%” prefix.
example: movl %eax, %ebx
immediate data:
The data value is located immediately after the instruction. Source operand only.
syntax: data value with a “$” prefix.
example: movl $0xabcd1234, %ebx
base register plus offset:
The data value is located in memory. The address of the memory location is the sum of a value in a base register plus an offset value.
syntax: use the name of the register with parentheses around the name and the offset value immediately before the left parenthesis.
example: movl $0xaabbccdd, 12(%eax)
rip-relative:
The target is a memory address determined by adding an offset to the current address in the rip register.
syntax: a programmer-defined label
example: je somePlace
indexed:
The data value is located in memory. The address of the memory location is the sum of the value in the base_register plus scale times the value in the index_register, plus the offset.
syntax: place parentheses around the comma separated list (base_register, index_register, scale) and preface it with the offset.
example: movl $0x6789cdef, -16(%edx, %eax, 4)
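The address arithmetic for the memory addressing modes above can be written as a one-line formula. The C sketch below simply evaluates base + index × scale + offset; the function name is hypothetical, and the hardware restricts scale to 1, 2, 4, or 8, which this sketch does not check.

```c
#include <stdint.h>

/* Address computed by the indexed addressing mode:
   offset(base_register, index_register, scale). */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int64_t offset)
{
    return base + index * scale + (uint64_t)offset;
}
```

With a base of 0x1000, an index of 3, a scale of 4, and an offset of -16, the function yields 0x1000 + 12 - 16 = 0xffc. Setting index to 0 gives the base register plus offset mode.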
(§15.6) Modify the program in Listing 15.1 so that it uses the C environment. That is, turn it into a main function using the prototype int main(int argc, char **argv);. argc is the number of space-delimited strings on the command line, including the command to execute the program. argv is a pointer to an array of pointers to each of the command line strings.