Section 2.13 Storing Characters
Almost all programs perform a great deal of text string manipulation. Text strings are made up of arrays of characters. The first program you wrote was probably a “Hello world” program. If you wrote it in C, you used a statement like:
printf("Hello world\n");
or in C++:
cout << "Hello world" << endl;
When translating either of these statements into machine code, the compiler must do two things:
store each of the characters in a location in memory where the control unit can access them, and
generate the machine instructions to write the characters on the screen.
We start by considering how a single character is stored in memory. There are many codes for representing characters, but the most common one is the American Standard Code for Information Interchange (ASCII, pronounced “ask' e”). It uses seven bits to represent each character. Table 2.13.1 shows the bit patterns for each character in hexadecimal. If you are at your computer, you can generate this table by typing the command man ascii.
bit pat. | char | bit pat. | char | bit pat. | char | bit pat. | char |
\(\hex{00}\) | NUL (Null) | \(\hex{20}\) | (space) | \(\hex{40}\) | @ | \(\hex{60}\) | ` |
\(\hex{01}\) | SOH (Start of Heading) | \(\hex{21}\) | ! | \(\hex{41}\) | A | \(\hex{61}\) | a |
\(\hex{02}\) | STX (Start of Text) | \(\hex{22}\) | " | \(\hex{42}\) | B | \(\hex{62}\) | b |
\(\hex{03}\) | ETX (End of Text) | \(\hex{23}\) | # | \(\hex{43}\) | C | \(\hex{63}\) | c |
\(\hex{04}\) | EOT (End of Transmit) | \(\hex{24}\) | $ | \(\hex{44}\) | D | \(\hex{64}\) | d |
\(\hex{05}\) | ENQ (Enquiry) | \(\hex{25}\) | % | \(\hex{45}\) | E | \(\hex{65}\) | e |
\(\hex{06}\) | ACK (Acknowledge) | \(\hex{26}\) | & | \(\hex{46}\) | F | \(\hex{66}\) | f |
\(\hex{07}\) | BEL (Bell) | \(\hex{27}\) | ' | \(\hex{47}\) | G | \(\hex{67}\) | g |
\(\hex{08}\) | BS (Backspace) | \(\hex{28}\) | ( | \(\hex{48}\) | H | \(\hex{68}\) | h |
\(\hex{09}\) | HT (Horizontal Tab) | \(\hex{29}\) | ) | \(\hex{49}\) | I | \(\hex{69}\) | i |
\(\hex{0a}\) | LF (Line Feed) | \(\hex{2a}\) | * | \(\hex{4a}\) | J | \(\hex{6a}\) | j |
\(\hex{0b}\) | VT (Vertical Tab) | \(\hex{2b}\) | + | \(\hex{4b}\) | K | \(\hex{6b}\) | k |
\(\hex{0c}\) | FF (Form Feed) | \(\hex{2c}\) | , | \(\hex{4c}\) | L | \(\hex{6c}\) | l |
\(\hex{0d}\) | CR (Carriage Return) | \(\hex{2d}\) | - | \(\hex{4d}\) | M | \(\hex{6d}\) | m |
\(\hex{0e}\) | SO (Shift Out) | \(\hex{2e}\) | . | \(\hex{4e}\) | N | \(\hex{6e}\) | n |
\(\hex{0f}\) | SI (Shift In) | \(\hex{2f}\) | / | \(\hex{4f}\) | O | \(\hex{6f}\) | o |
\(\hex{10}\) | DLE (Data-Link Escape) | \(\hex{30}\) | 0 | \(\hex{50}\) | P | \(\hex{70}\) | p |
\(\hex{11}\) | DC1 (Device Control 1) | \(\hex{31}\) | 1 | \(\hex{51}\) | Q | \(\hex{71}\) | q |
\(\hex{12}\) | DC2 (Device Control 2) | \(\hex{32}\) | 2 | \(\hex{52}\) | R | \(\hex{72}\) | r |
\(\hex{13}\) | DC3 (Device Control 3) | \(\hex{33}\) | 3 | \(\hex{53}\) | S | \(\hex{73}\) | s |
\(\hex{14}\) | DC4 (Device Control 4) | \(\hex{34}\) | 4 | \(\hex{54}\) | T | \(\hex{74}\) | t |
\(\hex{15}\) | NAK (Negative ACK) | \(\hex{35}\) | 5 | \(\hex{55}\) | U | \(\hex{75}\) | u |
\(\hex{16}\) | SYN (Synchronous idle) | \(\hex{36}\) | 6 | \(\hex{56}\) | V | \(\hex{76}\) | v |
\(\hex{17}\) | ETB (End of Trans. Block) | \(\hex{37}\) | 7 | \(\hex{57}\) | W | \(\hex{77}\) | w |
\(\hex{18}\) | CAN (Cancel) | \(\hex{38}\) | 8 | \(\hex{58}\) | X | \(\hex{78}\) | x |
\(\hex{19}\) | EM (End of Medium) | \(\hex{39}\) | 9 | \(\hex{59}\) | Y | \(\hex{79}\) | y |
\(\hex{1a}\) | SUB (Substitute) | \(\hex{3a}\) | : | \(\hex{5a}\) | Z | \(\hex{7a}\) | z |
\(\hex{1b}\) | ESC (Escape) | \(\hex{3b}\) | ; | \(\hex{5b}\) | [ | \(\hex{7b}\) | { |
\(\hex{1c}\) | FS (File Separator) | \(\hex{3c}\) | < | \(\hex{5c}\) | \ | \(\hex{7c}\) | | |
\(\hex{1d}\) | GS (Group Separator) | \(\hex{3d}\) | = | \(\hex{5d}\) | ] | \(\hex{7d}\) | } |
\(\hex{1e}\) | RS (Record Separator) | \(\hex{3e}\) | > | \(\hex{5e}\) | ^ | \(\hex{7e}\) | ~ |
\(\hex{1f}\) | US (Unit Separator) | \(\hex{3f}\) | ? | \(\hex{5f}\) | _ | \(\hex{7f}\) | DEL |
This is not the sort of table that you would memorize. However, you should become familiar with some of its general characteristics. In particular, notice that the numeric characters, 0–9, are in a contiguous sequence in the code, \(\hex{30}\)–\(\hex{39}\text{.}\) The same is true of the lower case alphabetic characters, a–z, and of the upper case characters, A–Z. Notice that the lower case alphabetic characters are numerically higher than the upper case: each lower case letter is exactly \(\hex{20}\) higher than its upper case counterpart.
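The contiguous ranges make simple character arithmetic possible. The following is a minimal sketch in C, assuming the ASCII encoding of Table 2.13.1; the function names are hypothetical and chosen only for illustration (the C standard library provides isdigit and toupper for real programs).

```c
#include <stdbool.h>

/* All three helpers assume the ASCII encoding shown in Table 2.13.1. */

/* True if c is one of the characters '0' through '9' (0x30-0x39). */
bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Numeric value of a digit character.
   The caller is expected to check is_ascii_digit(c) first. */
int digit_value(char c) {
    return c - '0';   /* the digits are contiguous, so '7' - '0' is 7 */
}

/* Convert a lower case letter to upper case; anything else is unchanged. */
char to_upper_ascii(char c) {
    if (c >= 'a' && c <= 'z') {
        return (char)(c - ('a' - 'A'));   /* 'a' - 'A' is 0x20 in ASCII */
    }
    return c;
}
```

With these definitions, digit_value('7') evaluates to 7 and to_upper_ascii('h') evaluates to 'H', while a character such as '!' passes through to_upper_ascii unchanged.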
The codes in the left-hand column of Table 2.13.1, \(\hex{00}\)–\(\hex{1f}\text{,}\) define control characters. The ASCII code was developed in the 1960s for transmitting data from a sender to a receiver. If you read some of the names of the control characters, you can imagine how they could be used to control the “dialog” between the sender and receiver. Many of them can be generated on a keyboard by holding the control key down while pressing an alphabetic key. For example, ctrl-d generates an EOT (End of Transmission) character.
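The relationship between a letter and its control character can be read off the table: D is \(\hex{44}\) and EOT is \(\hex{04}\text{.}\) By the usual terminal convention, which this section does not spell out, the control key keeps only the low five bits of the letter's code. A small sketch of that convention, with a hypothetical function name:

```c
/* Hypothetical helper: the code conventionally sent for ctrl + letter.
   Keeping only the low five bits maps 0x40-0x5f and 0x60-0x7f
   onto the control range 0x00-0x1f. */
unsigned char control_code(char letter) {
    return (unsigned char)(letter & 0x1f);
}
```

Under this assumption control_code('D') and control_code('d') both give 0x04 (EOT), and control_code('J') gives 0x0a (LF).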
ASCII codes are usually stored in the rightmost seven bits of an eight-bit byte. The eighth bit (the highest-order bit) is called the parity bit. It can be used for error detection in the following way. The sender and receiver would agree ahead of time whether to use even parity or odd parity. Even parity means that an even number of ones is always transmitted in each character; odd parity means that an odd number of ones is transmitted. Before transmitting a character in the ASCII code, the sender would adjust the eighth bit such that the total number of ones matched the even or odd agreement. When the code was received, the receiver would count the ones in each eight-bit byte. If the sum did not match the agreement, the receiver knew that one of the bits in the byte had been received incorrectly. Of course, if two bits had been incorrectly received, the error would pass undetected, but the chances of this double error are remarkably small. Modern communication systems are much more reliable, and parity is seldom used when sending individual bytes.
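The even parity scheme described above can be sketched in a few lines of C. This is only an illustration under the stated assumptions (seven-bit code in the low bits, parity in the highest-order bit); the function names are hypothetical.

```c
#include <stdint.h>

/* Count the one bits in a byte. */
static int count_ones(uint8_t byte) {
    int ones = 0;
    while (byte != 0) {
        ones += byte & 1;
        byte >>= 1;
    }
    return ones;
}

/* Sender side: keep the 7-bit ASCII code in bits 0-6 and set bit 7
   if needed so that the total number of ones is even. */
uint8_t add_even_parity(uint8_t ascii_code) {
    uint8_t code = ascii_code & 0x7f;
    if (count_ones(code) % 2 != 0) {
        code |= 0x80;
    }
    return code;
}

/* Receiver side: returns 1 if the byte has an even number of ones. */
int even_parity_ok(uint8_t received) {
    return count_ones(received) % 2 == 0;
}
```

For example, 'A' (0x41) already has two one bits and is sent unchanged, while 'C' (0x43) has three and is sent as 0xc3. Flipping any single bit of 0xc3 makes even_parity_ok return 0, but flipping two bits leaves the count even, so the error goes undetected, as the text notes.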
In some environments the high-order bit is used to provide a code for special characters. A little thought will show you that even all eight bits will not support all languages, e.g., Greek, Russian, Chinese. Work on the Unicode character standard began in 1987, its first version was published in 1991, and it has evolved over the years. It allows more than one byte per character so it can handle other alphabets. Unicode is backwards compatible with ASCII: its first 128 code points are the ASCII characters. We will only use ASCII in this book.
A computer system that uses an ASCII video system can be programmed to send a byte to the screen. The video system interprets the bit pattern as an ASCII code (from Table 2.13.1) and displays the corresponding character on the screen.
Getting back to the text string, "Hello world\n", the compiler would store this as a constant array of characters. There needs to be a way to specify the length of this array. In a C-style string this is accomplished by using the sentinel character NUL at the end of the string. So the compiler must allocate thirteen bytes for this string: twelve for the characters and one for the NUL. An example of how this string is stored in memory is shown in Figure 2.13.2. Notice that C uses the LF character as a single newline character even though the C syntax requires that the programmer write two characters, “\n”. The area of memory shown includes the three bytes immediately following the text string.
Address | Contents |
\(\hex{4004a1}:\) | \(\hex{48}\) |
\(\hex{4004a2}:\) | \(\hex{65}\) |
\(\hex{4004a3}:\) | \(\hex{6c}\) |
\(\hex{4004a4}:\) | \(\hex{6c}\) |
\(\hex{4004a5}:\) | \(\hex{6f}\) |
\(\hex{4004a6}:\) | \(\hex{20}\) |
\(\hex{4004a7}:\) | \(\hex{77}\) |
\(\hex{4004a8}:\) | \(\hex{6f}\) |
\(\hex{4004a9}:\) | \(\hex{72}\) |
\(\hex{4004aa}:\) | \(\hex{6c}\) |
\(\hex{4004ab}:\) | \(\hex{64}\) |
\(\hex{4004ac}:\) | \(\hex{0a}\) |
\(\hex{4004ad}:\) | \(\hex{00}\) |
\(\hex{4004ae}:\) | \(\hex{25}\) |
\(\hex{4004af}:\) | \(\hex{73}\) |
\(\hex{4004b0}:\) | \(\hex{00}\) |
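You can produce a listing like Figure 2.13.2 yourself. The sketch below is one way to do it in C; the names greeting and dump_bytes are hypothetical, and the addresses printed on your system will almost certainly differ from those in the figure.

```c
#include <stddef.h>
#include <stdio.h>

/* The same text as in the printf example; the compiler appends the NUL. */
static const char greeting[] = "Hello world\n";

/* Print each byte of a buffer as its address and a two-digit hex value. */
void dump_bytes(const char *buffer, size_t count) {
    for (size_t i = 0; i < count; i++) {
        printf("%p: %02x\n",
               (const void *)&buffer[i],
               (unsigned int)(unsigned char)buffer[i]);
    }
}
```

Calling dump_bytes(greeting, sizeof greeting) prints thirteen lines, because sizeof counts the terminating NUL as well as the twelve visible and newline characters. The last two values are 0a (LF) and 00 (NUL).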
In Pascal the length of the string is specified by the first byte in the string. It is taken to be an 8-bit unsigned integer, which limits such a string to 255 characters. So C-style strings are typically processed by sentinel-controlled loops, and count-controlled string processing loops are more common in Pascal. The C++ string class has additional features, such as keeping track of the length, but the actual text is still stored as a C-style text string that the C++ string instance manages.
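The two loop styles can be compared side by side. The following sketch is written in C for both cases; the Pascal-style layout (a length byte followed by the characters) is modeled with a plain array of unsigned char, and both function names are hypothetical.

```c
#include <stddef.h>

/* Sentinel-controlled loop: walk the array until the NUL is found. */
size_t c_string_length(const char *text) {
    size_t length = 0;
    while (text[length] != '\0') {
        length++;
    }
    return length;
}

/* Count-controlled loop: the first byte holds the length (0-255),
   so the number of iterations is known before the loop starts.
   Copies the characters into out and appends a NUL; out must have
   room for at least 256 bytes. Returns the length. */
size_t pascal_to_c(const unsigned char *pstring, char *out) {
    size_t length = pstring[0];
    for (size_t i = 0; i < length; i++) {
        out[i] = (char)pstring[1 + i];
    }
    out[length] = '\0';
    return length;
}
```

For the string from Figure 2.13.2, c_string_length("Hello world\n") is 12: the sentinel itself is not counted. A Pascal-style string holding Hello would be the six bytes 5, 'H', 'e', 'l', 'l', 'o', with no terminator needed.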