Figure 5-1 Sequential processing. (Instruction 1 occupies cycles 1 through 4
with the Fetch, Decode, Execute, and Write steps; instruction 2 does not begin
until cycle 5 and completes in cycle 8.)
The processor architecture defines how software should run, and part of this is the expectation that
programs will execute instructions one at a time. However, there are
many instructions within programs that could be executed in parallel
or at least overlapped. Microarchitectures that take advantage of this
can provide higher performance, but to do this while providing soft-
ware compatibility, the illusion of linear execution must be maintained.
Pipelining provides higher performance by allowing execution of dif-
ferent instructions to overlap.
The earliest processors did not have sufficient transistors to support
pipelining. They processed instructions serially, one at a time, exactly as
the architecture defined. A very simple processor might break down
each instruction into four steps.
1. Fetch. The next instruction is retrieved from memory.
2. Decode. The type of operation required is determined.
3. Execute. The operation is performed.
4. Write. The instruction results are stored.
All modern processors use clock signals to synchronize their operation
both internally and when interacting with external components. The
operation of a simple sequential processor allocating one clock cycle per
instruction step would appear as shown in Fig. 5-1.
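As a minimal sketch (not from the original text), the schedule of Fig. 5-1 can
be reproduced in a few lines of Python, assuming exactly one clock cycle per
step and no overlap between instructions. The function and step names below
are illustrative only.

    # Cycle-by-cycle schedule of a strictly sequential (non-pipelined) processor.
    STEPS = ["Fetch", "Decode", "Execute", "Write"]

    def sequential_schedule(num_instructions):
        """Map (instruction, step) to the clock cycle in which it runs."""
        schedule = {}
        cycle = 1
        for instr in range(1, num_instructions + 1):
            for step in STEPS:
                schedule[(instr, step)] = cycle
                cycle += 1
        return schedule

    # Instruction 1 occupies cycles 1-4 and instruction 2 cycles 5-8, matching Fig. 5-1.
    print(sequential_schedule(2))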
A pipelined processor improves performance by noting that separate
parts of the processor are used to carry out each of the instruction steps (see
Fig. 5-2). With some added control logic, it is possible to begin the next
instruction as soon as the last instruction has completed the first step.
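Under the same illustrative assumptions, a pipelined version of the sketch
issues each instruction one cycle after its predecessor, so the four steps of
consecutive instructions overlap as in Fig. 5-2.

    # Cycle-by-cycle schedule of an ideal four-step pipeline (no stalls assumed).
    STEPS = ["Fetch", "Decode", "Execute", "Write"]

    def pipelined_schedule(num_instructions):
        """Map (instruction, step) to its clock cycle in an ideal pipeline."""
        schedule = {}
        for instr in range(1, num_instructions + 1):
            for offset, step in enumerate(STEPS):
                # Instruction n enters Fetch in cycle n, one cycle behind instruction n - 1.
                schedule[(instr, step)] = instr + offset
        return schedule

    # Instruction 1 occupies cycles 1-4 while instruction 2 occupies cycles 2-5.
    print(pipelined_schedule(2))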
We can imagine individual instructions as balls that must roll from
one end of a pipe to another to complete. Processing them sequentially
is like adding a new ball only after the last comes out. By allowing mul-
tiple balls to be in transit inside the pipe at the same time, the rate
at which balls can be loaded into the pipe improves. This improvement hap-
pens even though the total time it takes one ball to roll the length of the
pipe has not changed. In processor terms, the instruction bandwidth has
improved even though instruction latency has remained the same.
Our simple sequential processor completed an instruction only every
4 cycles, but ideally a pipelined processor could complete an instruction
every cycle.
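To put rough numbers on this (again only a sketch, under the idealized
assumption of one cycle per step and no stalls), the total cycles needed for a
stream of instructions can be compared directly: latency is 4 cycles in both
cases, but the pipelined total grows by only one cycle per additional
instruction.

    def cycles_to_complete(num_instructions, pipelined):
        """Total cycles to finish an instruction stream, assuming no stalls."""
        if pipelined:
            # The first instruction still takes 4 cycles (latency is unchanged);
            # each later instruction completes one cycle after the one before it.
            return 4 + (num_instructions - 1)
        return 4 * num_instructions

    # For 100 instructions: 400 cycles sequentially versus 103 cycles pipelined.
    for n in (1, 2, 100):
        print(n, cycles_to_complete(n, pipelined=False), cycles_to_complete(n, pipelined=True))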