Page 188 - A Practical Guide from Design Planning to Manufacturing
shows the number of clock cycles allocated for each step in the execution
pipeline for a total of 20 cycles. This is the best-case pipeline length;
many instructions take much longer. Most importantly, 20 cycles is the
branch mispredict penalty. It is the minimum number of cycles required
to fetch and execute a branch, to determine if the branch prediction
was correct, and then to start fetching from the correct address if the
prediction was wrong. Intel has not provided details about the number
of cycles in the front-end pipeline. The following sections go through each
of the steps in these two pipelines.
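
To get a feel for why the mispredict penalty matters, the following minimal Python sketch estimates how much a 20-cycle penalty adds to the average cycles per instruction (CPI). Only the 20-cycle figure comes from the text; the function name and the workload numbers (base CPI, fraction of branches, mispredict rate) are hypothetical values chosen for illustration, not measurements of the Pentium 4.

```python
MISPREDICT_PENALTY = 20  # cycles, the execution pipeline length quoted above


def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate,
                           penalty=MISPREDICT_PENALTY):
    """Average CPI once branch mispredict stalls are added.

    All three workload parameters are assumptions supplied by the caller.
    """
    stall_cycles = branch_fraction * mispredict_rate * penalty
    return base_cpi + stall_cycles


if __name__ == "__main__":
    # Hypothetical workload: 1 in 5 instructions is a branch,
    # and 5 percent of those branches are mispredicted.
    cpi = cycles_per_instruction(base_cpi=1.0,
                                 branch_fraction=0.2,
                                 mispredict_rate=0.05)
    print(f"average CPI: {cpi:.2f}")
```

Under these assumed numbers, mispredicted branches alone add 0.2 cycles to every instruction on average, which is why a long pipeline depends so heavily on accurate branch prediction.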
Documentation of the details of each pipeline stage of the Pentium 4
is not always complete. The following description is meant to be a
reasonable estimation of the processor’s operation based on what has been
publicly reported by Intel and others.[11, 12, 13]
Instruction prefetch
When a computer is turned off, all the instructions of all the software
are retained on the hard drive or some other nonvolatile storage. Before
any program is run, its instructions must first be loaded into main memory
and then read by the processor. The operating system performs the job
of loading applications into memory. The OS treats the instructions of
other programs like data and executes its own instructions on the
processor to copy them into main memory. The OS then hands control of the
processor over to the new program by setting the processor’s instruction
pointer (IP) to the memory address of the program’s first instruction.
This is when the processor begins running a new program. The
processor has decoded none of these instructions yet, so they are all
macroinstructions. The instruction prefetcher has the task of providing a steady
stream of macroinstructions to be decoded into uops.
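
The loading step described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: `ToyMachine` and `load_program` are hypothetical names, memory is a flat byte array with no virtual addressing, and the program's first instruction is assumed to sit at the start of its image.

```python
class ToyMachine:
    """Hypothetical machine: flat byte-addressed memory plus an instruction pointer."""

    def __init__(self, memory_size):
        self.memory = bytearray(memory_size)
        self.ip = 0


def load_program(machine, image, load_address):
    """Copy a program image into memory as plain data, then point the IP at it."""
    end = load_address + len(image)
    if load_address < 0 or end > len(machine.memory):
        raise ValueError("program does not fit in memory")
    # To the loader, the program's instructions are just bytes to copy.
    machine.memory[load_address:end] = image
    # Handing over control: the IP now holds the address of the first instruction.
    machine.ip = load_address
    return machine.ip


if __name__ == "__main__":
    machine = ToyMachine(memory_size=64)
    image = bytes([0x90, 0x90, 0xC3])  # stand-in for bytes read from disk
    load_program(machine, image, load_address=16)
    print(f"IP = {machine.ip}")
```

Nothing in the copy step distinguishes instructions from any other data. The bytes only become macroinstructions once the instruction pointer refers to them and the processor begins fetching.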
Keeping the processor busy requires loading instructions long before
they are actually reached by the program’s execution. The prefetcher starts
at the instruction pointer address and begins requesting 32 bytes at a
time from the L2 cache. The instruction pointer address is actually a
virtual address, which is submitted to the instruction translation lookaside
buffer (ITLB) to be converted into a physical address before reading from
the cache. A miss in the ITLB causes an extra memory read to load the
ITLB with the needed page translation before continuing. When looking
up the physical address of the needed page, it may be discovered that the
page is not currently in memory. This causes the processor to signal a
page fault exception and hand control back over to the operating system
to execute the instructions needed to load the page before returning control.
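
The translation step can be illustrated with a small Python model. It is a sketch under stated assumptions, not a description of the actual hardware: the 4096-byte page size, the `ToyITLB` class, the `prefetch` function, and the dictionary used as a page table are all hypothetical, and the model fetches strictly sequential addresses without any branch prediction. Only the 32-byte fetch size and the miss and page fault behavior follow the text.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
FETCH_BYTES = 32   # bytes requested per fetch, as described above


class PageFault(Exception):
    """Raised when the needed page is not currently in memory."""


class ToyITLB:
    """Hypothetical ITLB: caches virtual page to physical frame translations."""

    def __init__(self, page_table):
        # page_table maps virtual page number -> frame number,
        # or None when the page is not resident in memory.
        self.page_table = page_table
        self.entries = {}
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn not in self.entries:
            # ITLB miss: an extra memory read fetches the page translation.
            self.misses += 1
            frame = self.page_table.get(vpn)
            if frame is None:
                # A real processor would hand control to the OS here.
                raise PageFault(f"virtual page {vpn} is not in memory")
            self.entries[vpn] = frame
        return self.entries[vpn] * PAGE_SIZE + offset


def prefetch(itlb, start_address, num_fetches):
    """Return the physical addresses of successive 32-byte fetches."""
    return [itlb.translate(start_address + i * FETCH_BYTES)
            for i in range(num_fetches)]


if __name__ == "__main__":
    itlb = ToyITLB(page_table={0: 7, 1: None})
    print(prefetch(itlb, start_address=4032, num_fetches=2))
    print(f"ITLB misses: {itlb.misses}")
```

In this model the two fetches fall in the same page, so only the first one misses in the ITLB. Continuing the fetch stream into virtual page 1, which is marked as not resident, raises `PageFault`, standing in for the exception that returns control to the operating system.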
11. “IA-32 Architecture Reference Manual.”
12. Hinton et al., “Microarchitecture of the Pentium 4.”
13. Shanley, The Unabridged Pentium 4.