Page 188 - A Practical Guide from Design Planning to Manufacturing

Microarchitecture  161

        shows the number of clock cycles allocated for each step in the execution
        pipeline, for a total of 20 cycles. This is the best-case pipeline length;
        many instructions take much longer. Most importantly, 20 cycles is the
        branch mispredict penalty: the minimum number of cycles required
        to fetch and execute a branch, determine whether the branch prediction
        was correct, and then start fetching from the correct address if the
        prediction was wrong. Intel has not provided details about the number
        of cycles in the front-end pipeline. The following sections go through each
        of the steps in these two pipelines.
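
          To see why the mispredict penalty matters, its average cost can be
        estimated with a little arithmetic. The sketch below is illustrative only:
        the function name and the workload figures (20 percent branches, 5 percent
        of them mispredicted) are assumptions chosen for the example, not
        measurements of the Pentium 4. Only the 20-cycle penalty comes from the text.

```python
def mispredict_overhead(branch_fraction, mispredict_rate, penalty_cycles=20):
    """Average extra cycles per instruction lost to branch mispredictions.

    branch_fraction -- fraction of executed instructions that are branches
    mispredict_rate -- fraction of those branches that are predicted wrongly
    penalty_cycles  -- minimum cycles lost per mispredict (20 in the text)
    """
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload, assumed for illustration only.
extra_cycles_per_instruction = mispredict_overhead(0.20, 0.05)
print(f"about {extra_cycles_per_instruction:.2f} extra cycles per instruction")
```

          Under these assumed figures, roughly one extra cycle is lost for every
        five instructions, and since 20 cycles is only the minimum penalty, the
        real cost could be higher.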
          Documentation of the details of each pipeline stage of the Pentium 4
        is not always complete. The following description is meant to be a
        reasonable estimation of the processor’s operation based on what has been
        publicly reported by Intel and others. 11, 12, 13


        Instruction prefetch
        When a computer is turned off, all the instructions of all the software
        are retained on the hard drive or some other nonvolatile storage. Before
        any program is run, its instructions must first be loaded into main memory
        and then read by the processor. The operating system (OS) performs the job
        of loading applications into memory. The OS treats the instructions of
        other programs like data and executes its own instructions on the
        processor to copy them into main memory. The OS then hands control of the
        processor over to the new program by setting the processor’s instruction
        pointer (IP) to the memory address of the program’s first instruction.
        This is when the processor begins running the new program. The
        processor has decoded none of these instructions yet, so they are all
        macroinstructions. The instruction prefetcher has the task of providing a
        steady stream of macroinstructions to be decoded into uops.
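
          The loading step described above can be sketched in a few lines. This
        is a toy model, not how any real operating system is written: the
        function load_program, the 64-byte memory, and the load address of 16 are
        all hypothetical, and the sketch assumes the program's first instruction
        sits at the start of its image.

```python
def load_program(memory, image, load_address):
    """Copy a program image into memory as plain data.

    Returns the address of the program's first instruction, which this
    sketch assumes is the first byte of the image.
    """
    memory[load_address:load_address + len(image)] = image
    return load_address


# Toy main memory and a three-byte program image (x86 NOP, NOP, RET).
memory = bytearray(64)
image = bytes([0x90, 0x90, 0xC3])

# The "OS" copies the image, then hands over control by setting the IP.
instruction_pointer = load_program(memory, image, load_address=16)
```

          Until the copy finishes, the program's bytes are handled exactly like
        any other data. Only once the instruction pointer refers to them are they
        fetched as macroinstructions.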
          Keeping the processor busy requires loading instructions long before
        they are actually reached by the program’s execution. The prefetcher starts
        at the instruction pointer address and begins requesting 32 bytes at a
        time from the L2 cache. The instruction pointer address is actually a
        virtual address, which is submitted to the instruction translation lookaside
        buffer (ITLB) to be converted into a physical address before reading from
        the cache. A miss in the ITLB causes an extra memory read to load the
        ITLB with the needed page translation before continuing. When looking
        up the physical address of the needed page, it may be discovered that the
        page is not currently in memory. This causes the processor to signal a
        page fault exception and hand control back over to the operating system
        to execute the instructions needed to load the page before returning control.
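
          The translation path just described (ITLB lookup, page table read on
        a miss, page fault if the page is not resident) can be modeled roughly as
        follows. This is a simplified sketch under stated assumptions: 4 KB pages,
        a single-level page table held in a dictionary, an ITLB of unlimited size,
        and fetches that simply advance 32 bytes from the instruction pointer. The
        class and function names are hypothetical and do not reflect the actual
        Pentium 4 hardware organization.

```python
PAGE_SIZE = 4096   # assumed page size of 4 KB
FETCH_BYTES = 32   # bytes requested per prefetch, as stated in the text


class PageFault(Exception):
    """Raised when the needed page is not currently in memory."""


class Translator:
    """Toy model of virtual-to-physical translation through an ITLB."""

    def __init__(self, page_table):
        # Maps virtual page number -> physical frame number,
        # or None when the page is not resident in memory.
        self.page_table = page_table
        self.itlb = {}          # cached translations (unbounded in this sketch)
        self.itlb_misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.itlb:
            # ITLB miss: an extra memory read fetches the translation.
            self.itlb_misses += 1
            frame = self.page_table.get(page)
            if frame is None:
                # Page not in memory: control returns to the OS.
                raise PageFault(page)
            self.itlb[page] = frame
        return self.itlb[page] * PAGE_SIZE + offset


def prefetch_addresses(translator, instruction_pointer, count):
    """Physical addresses of `count` sequential 32-byte fetches."""
    return [
        translator.translate(instruction_pointer + i * FETCH_BYTES)
        for i in range(count)
    ]


# Virtual page 0 is resident in frame 5; virtual page 1 is not in memory.
translator = Translator({0: 5, 1: None})
addresses = prefetch_addresses(translator, 0, 3)
```

          In this model the three fetches share one page, so only the first
        one misses in the ITLB. A fetch that crosses into virtual page 1 would
        raise PageFault, standing in for the handoff to the operating system.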


          11. “IA-32 Architecture Reference Manual.”
          12. Hinton et al., “Microarchitecture of the Pentium 4.”
          13. Shanley, The Unabridged Pentium 4.