Assemblers and Linkers

Section 9.5 Assemblers and Linkers

In this section we look at what assemblers and linkers do to process source code and create an executable program. The goal of this presentation is to introduce the concepts. Most assemblers and linkers have capabilities that go far beyond the concepts described here (e.g., macro expansion, dynamic load/link). We leave a more thorough discussion of assemblers and linkers to a book on systems programming.

Subsection 9.5.1 Assemblers

An assembler must perform the following tasks:

Translate assembly language mnemonics into machine language.
Translate symbolic names for addresses into numeric addresses.

Since the numeric value of an address may be required before an instruction can be translated into machine language, there is a problem with forward references to memory locations. For example, let us say we have an algorithm that gives a correct result if the r3 register contains \(100\text{.}\) In that case, we want our function to return \(0\text{.}\) Otherwise, it should return \(-1\text{.}\) So we might end the function with code like (you can probably guess that cmp means “compare” and bne means “branch if not equal”):

        mov     r0, 0           @ assume a good result
        cmp     r3, 100         @ is it good?
        bne     doReturn        @ yes, return 0
        mov     r0, -1          @ no, return a -1
doReturn:
        sub     sp, fp, 0       @ delete allocated memory
        ldr     fp, [sp], 4     @ restore caller's frame pointer
        bx      lr              @ back to caller

This creates a problem for the assembler when it needs to determine the address of the label doReturn on the line

bne     doReturn        @ yes, return 0

because it has not yet encountered the label itself. So it does not know how far away that location is.

The simplest solution to this problem is to use a two-pass assembler

The first pass builds a symbol table, which provides an address for each memory label.
The second pass performs the actual translation into machine language, consulting the symbol table for numeric values of the symbols.

Algorithm 9.5.1 is a highly simplified description of how the first pass of an assembler works.

Algorithm 9.5.1. First pass of a two-pass assembler.

\(\displaystyle locationCounter = 0\)
Get first line of source code
While (more lines of code)
- If (line has a label)
  - \(\displaystyle symbolTable.symbol = label\)
  - \(\displaystyle symbolTable.location = locationCounter\)
- Determine number of bytes required by line when assembled
- \(\displaystyle locationCounter = locationCounter + number of bytes\)
- Get next line of source code

The symbol table is carried from the first pass to the second. The second pass also consults a table of operation codes, which provides the machine code corresponding to each instruction mnemonic. A highly simplified description of the second pass is given in Algorithm 9.5.2.

Algorithm 9.5.2. Second pass of a two-pass assembler.

\(\displaystyle locationCounter = 0\)
Get first line of source code
While (more lines of code)
- If (line is an instruction)
  - Find machine code from opCodeTable
  - Find symbol value form symbolTable
  - Assemble instruction into machine code
- Else
  - Carry out assembler directive
- Write machine code to object file
- Determine number of bytes used
- \(\displaystyle locationCounter = locationCounter + number of bytes\)
- Get next line of source code

Subsection 9.5.2 Linkers

You have already seen functions that call other functions by name, which are located in another file or library. The assembler has no way to determine the address of these other labels for the symbol table during the first pass. The only thing the assembler can do during the second pass is to leave enough memory space for the address when it assembles this instruction. The actual address will have to be filled in later in order to create the entire program. Filling in these references to external memory locations is the job of the linker program.

Actually, many of the functions used by a program are not even included in the executable program file. They are loaded as required when the program is executing. The linker program must provide dynamic links for the executable program file.

The mechanisms for implementing dynamic function loading is beyond the scope of this book. However, you can get the general idea of linking separately assembled (or compiled) functions together by studying Algorithm 9.5.3 and Algorithm 9.5.4. First, notice that the assembler (or compiler) must include other information in addition to machine code in the object file. The additional information includes:

The name of the function.
The name of each external memory reference.
The location relative to the beginning of the function where the external memory reference is made.

Algorithm 9.5.3. First pass of a two-pass assembler.

\(\displaystyle locationCounter = 0\)
Open first object file
While (more object files)
- globalSymbolTable.symbol = functionName
- globalSymbolTable.location = locationCounter
- Determine number of bytes required by the function.
- \(\displaystyle locationCounter = locationCounter + number of bytes\)
- Open next object file

Algorithm 9.5.4. Second pass of a two-pass assembler.

\(\displaystyle memoryPointer = address from Operating System\)
Open first object file
While (more object files)
- While (more machine code)
  - *memoryPointer = byte of code read from object file
  - memoryPointer = memoryPointer + 1
- While (more external memory references)
  - Get value from globalSymbolTable
  - Determine where value should be stored
  - Store value in code that was just loaded
- Open next object file

In Algorithm 9.5.4 “*memoryPointer” is the C/C++ syntax to specify that the variable, memoryPointer, holds the address of the value being referenced. The ‘*’ character is the dereference operator. So in this algorithm, the byte of code that is read from the object file is stored at whatever address is currently in the memoryPointer variable. The next step in the algorithm adds \(1\) to this variable, so it then points to the next byte in memory. Variables used in this way are called pointer variables.