4.4 / PENTIUM 4 CACHE ORGANIZATION 141
Table 4.4 Intel Cache Evolution

Problem | Solution | Processor on which Feature First Appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create separate back-side bus (BSB) that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. | Pentium Pro
(same problem as the preceding row) | Move L2 cache on to the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Add external L3 cache. | Pentium III
(same problem as the preceding row) | Move L3 cache on-chip. | Pentium 4

set-associative organization. All of the Pentium processors include two on-chip L1
caches, one for data and one for instructions. For the Pentium 4, the L1 data cache
is 16 KBytes, using a line size of 64 bytes and a four-way set-associative organiza-
tion. The Pentium 4 instruction cache is described subsequently. The Pentium II
also includes an L2 cache that feeds both of the L1 caches. The L2 cache is eight-
way set associative with a size of 512 KB and a line size of 128 bytes. An L3 cache
was added for the Pentium III and became on-chip with high-end versions of the
Pentium 4.
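The cache parameters quoted above determine the number of sets and the way an address divides into tag, set index, and byte offset. The following is a minimal sketch of that arithmetic, not a description of Intel's implementation. The function name cache_geometry and the 32-bit address width are assumptions made only for illustration; the sizes, line lengths, and associativities are the ones given in the text.

```python
def cache_geometry(size_bytes, line_bytes, ways, addr_bits=32):
    """Derive set count and address field widths for a set-associative cache.

    addr_bits=32 is an illustrative assumption, not a figure from the text.
    All sizes are assumed to be powers of two.
    """
    lines = size_bytes // line_bytes          # total number of cache lines
    sets = lines // ways                      # lines are grouped 'ways' per set
    offset_bits = (line_bytes - 1).bit_length()   # log2(line size)
    index_bits = (sets - 1).bit_length()          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


# Pentium 4 L1 data cache: 16 KBytes, 64-byte lines, four-way set associative
l1d = cache_geometry(16 * 1024, 64, 4)

# L2 cache: 512 KB, 128-byte lines, eight-way set associative
l2 = cache_geometry(512 * 1024, 128, 8)

print(l1d)  # 256 lines in 64 sets; 6 offset bits, 6 index bits, 20 tag bits
print(l2)   # 4096 lines in 512 sets; 7 offset bits, 9 index bits, 16 tag bits
```

Under these assumptions the L1 data cache holds 256 lines arranged as 64 sets of four, and the L2 cache holds 4096 lines arranged as 512 sets of eight. A wider address would lengthen only the tag field.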
Figure 4.18 provides a simplified view of the Pentium 4 organization, highlighting
the placement of the three caches. The processor core consists of four major
components:
• Fetch/decode unit: Fetches program instructions in order from the L2 cache,
decodes these into a series of micro-operations, and stores the results in the L1
instruction cache.
• Out-of-order execution logic: Schedules execution of the micro-operations
subject to data dependencies and resource availability; thus, micro-operations
may be scheduled for execution in a different order than they were fetched
from the instruction stream. As time permits, this unit schedules speculative
execution of micro-operations that may be required in the future.

