Page 184 - A Practical Guide from Design Planning to Manufacturing

Microarchitecture  157


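        Figure 5-17 Trace cache. [Pentium III: macroinstructions are held in the instruction cache and pass through instruction decode before execution as microinstructions; branches redirect fetch to the instruction cache. Pentium 4: macroinstructions pass through instruction decode and the resulting microinstructions are held in the trace cache; branches redirect fetch to the trace cache.]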

        that will be stored in the data cache. These values are forwarded to
        main memory, from which they can be displayed to the user.
          The microarchitecture of the Pentium 4 limits the performance impact
        of having to translate macroinstructions to uops by storing uops in a
        trace cache. See Fig. 5-17.
          The Pentium III microarchitecture stores macroinstructions in its
        instruction cache. These are then translated before execution. If one of
        the instructions is a branch, the new program path must be fetched from
        the instruction cache and translated before execution can continue. In the
        Pentium 4, instructions are translated to uops before being stored in the
        cache. This instruction cache is called a trace cache because it no longer
        stores instructions as they exist in main memory, but instead stores
        their decoded uop translations. When a branch is executed, the already
        translated uops are immediately fetched from the trace cache. The instruction
        decode delay is no longer part of the branch mispredict penalty, and
        performance is improved. The disadvantage is that decoded instructions
        require more storage, so the trace cache must be larger to achieve
        the same hit rate as an instruction cache.
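        The difference in branch redirect cost can be illustrated with a toy
        model. This is a minimal sketch, not a model of either real pipeline:
        the cycle counts FETCH_CYCLES and DECODE_CYCLES, the program in MEMORY,
        and the "uop;uop;uop" encoding that lets the toy decoder split a
        macroinstruction are all hypothetical, and memory latency on a miss is
        ignored. The function names redirect_icache and redirect_tcache are
        likewise invented for illustration.

```python
# Toy model of a branch redirect with and without a trace cache.
# All latencies and encodings are hypothetical, for illustration only;
# they are not Pentium III or Pentium 4 figures.

FETCH_CYCLES = 2   # hypothetical: read one entry from either cache
DECODE_CYCLES = 3  # hypothetical: translate a macroinstruction into uops

# Hypothetical program memory: branch target -> macroinstruction, with its
# uops written as "uop;uop;uop" so the toy decoder can split them.
MEMORY = {0x40: "load;add;store", 0x80: "cmp;jcc"}


def decode(macro: str) -> list[str]:
    """Toy decoder: split a macroinstruction into its uops."""
    return macro.split(";")


def redirect_icache(icache: dict[int, str], target: int) -> tuple[list[str], int]:
    """Pentium III style: the cache holds macroinstructions, so every
    redirect pays for both fetch and decode before execution resumes.
    Memory latency on a miss is ignored to keep the comparison simple."""
    macro = icache.setdefault(target, MEMORY[target])
    return decode(macro), FETCH_CYCLES + DECODE_CYCLES


def redirect_tcache(tcache: dict[int, list[str]], target: int) -> tuple[list[str], int]:
    """Pentium 4 style: the cache holds already-decoded uops. A hit skips
    decode; a miss decodes once and stores the uops for later redirects.
    Memory latency on a miss is again ignored."""
    if target in tcache:
        return tcache[target], FETCH_CYCLES
    uops = decode(MEMORY[target])
    tcache[target] = uops
    return uops, FETCH_CYCLES + DECODE_CYCLES


if __name__ == "__main__":
    icache: dict[int, str] = {}
    tcache: dict[int, list[str]] = {}
    for _ in range(2):  # the same branch target is taken twice
        _, ic = redirect_icache(icache, 0x40)
        _, tc = redirect_tcache(tcache, 0x40)
        print(ic, tc)  # prints "5 5", then "5 2"
```

        On the first redirect both front ends must decode, but on the second
        the trace cache supplies uops directly and the decode cycles disappear
        from the redirect. The trace cache entry for 0x40 also holds three uops
        where the instruction cache held one macroinstruction, a crude picture
        of the extra storage the text describes.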
          In general, the use of uops allows only the translation circuits on the
        processor to see a macroinstruction. The rest of the processor operates only
        on uops, which can be changed from one processor design to the next as
        suits the microarchitecture while maintaining software compatibility. The
        use of uops has allowed processors supporting CISC architectures to adopt
        all of the performance-enhancing features found in RISC processors, and it
        has made the distinction between CISC and RISC processors less important.
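        The separation between translation and execution can be sketched in a
        few lines. In this sketch, assuming a made-up two-operand syntax
        "op dst, src" with bracketed memory operands, only translate() ever sees
        a macroinstruction, and execute() works purely on uops. The Uop record,
        the uop names, and the temporary registers tmp0 and tmp1 are
        hypothetical and do not reflect any real processor's uop encoding.

```python
# Sketch of a translation layer: only translate() sees macroinstructions;
# execute() understands nothing but uops. The "op dst, src" syntax, the uop
# names, and the temporary registers tmp0/tmp1 are hypothetical and do not
# reflect Intel's actual uop encoding.
import operator
from typing import NamedTuple


class Uop(NamedTuple):
    op: str                # "load", "store", or an ALU operation
    dst: str               # register name, or memory address for a store
    srcs: tuple[str, ...]  # registers, or a memory address for a load


ALU_OPS = {"add": operator.add, "sub": operator.sub}


def translate(macro: str) -> list[Uop]:
    """Split a CISC-style two-operand instruction into RISC-like uops.
    Bracketed operands such as "[x]" are memory references."""
    op, operands = macro.split(maxsplit=1)
    dst, src = (s.strip() for s in operands.split(","))
    uops: list[Uop] = []
    if src.startswith("["):            # memory source: load it first
        uops.append(Uop("load", "tmp1", (src[1:-1],)))
        src = "tmp1"
    if dst.startswith("["):            # memory destination: load, operate, store
        addr = dst[1:-1]
        uops.append(Uop("load", "tmp0", (addr,)))
        uops.append(Uop(op, "tmp0", ("tmp0", src)))
        uops.append(Uop("store", addr, ("tmp0",)))
    else:                              # register destination: a single ALU uop
        uops.append(Uop(op, dst, (dst, src)))
    return uops


def execute(uops: list[Uop], regs: dict[str, int], mem: dict[str, int]) -> None:
    """A toy execution engine that never sees a macroinstruction."""
    for u in uops:
        if u.op == "load":
            regs[u.dst] = mem[u.srcs[0]]
        elif u.op == "store":
            mem[u.dst] = regs[u.srcs[0]]
        else:
            regs[u.dst] = ALU_OPS[u.op](regs[u.srcs[0]], regs[u.srcs[1]])


if __name__ == "__main__":
    regs, mem = {"eax": 5}, {"x": 10}
    program = translate("add [x], eax")
    for u in program:
        print(u)
    execute(program, regs, mem)
    print(mem["x"])  # 15
```

        A register-to-register add becomes a single uop, while an add with a
        memory destination becomes a load, an add, and a store. Because
        execute() depends only on the Uop format, the uop set could be changed
        in a new design while translate() keeps accepting the same
        macroinstructions, which is the compatibility argument made above.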

        Reorder, retire, and replay
        Any processor that allows out-of-order execution must have some
        mechanism for putting instructions back in order after execution. Software
        is written assuming that instructions will be executed in the order
        specified in the program. To produce the expected behavior, interrupts and
        exceptions must be handled at the proper times. Branch prediction
        means that some instructions will be executed that an in-order processor