Page 161 - A Practical Guide from Design Planning to Manufacturing

Chapter Five

        while that instruction is waiting, the processor can execute instructions
        from the second thread, which do not depend upon the first thread’s
        results. The processor spends less time idle and more time performing
        useful work.
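A toy model shows why interleaving a second thread reduces idle time. This Python sketch rests on simplifying assumptions and is not how HyperThreading hardware is actually built. It issues at most one instruction per cycle. It treats every instruction as dependent on the previous instruction in its own thread. It gives lower-numbered threads priority. The `simulate` function and the latency values are hypothetical; a latency of 5 cycles stands in for a long wait such as a cache miss.

```python
from typing import List, Tuple


def simulate(threads: List[List[int]]) -> Tuple[int, int]:
    """Toy single-issue core shared by several threads (hypothetical model).

    Each thread is a list of instruction latencies in cycles. An instruction
    may issue only after the previous instruction of the same thread has
    completed; at most one instruction issues per cycle, and lower-numbered
    threads get priority. Returns (total cycles, cycles that issued work).
    """
    next_instr = [0] * len(threads)  # index of each thread's next instruction
    ready_at = [0] * len(threads)    # cycle at which that instruction may issue
    cycle = 0
    busy = 0
    while any(n < len(t) for n, t in zip(next_instr, threads)):
        for i, latencies in enumerate(threads):
            if next_instr[i] < len(latencies) and ready_at[i] <= cycle:
                # Issue now; the thread's next instruction waits for completion.
                ready_at[i] = cycle + latencies[next_instr[i]]
                next_instr[i] += 1
                busy += 1
                break  # only one instruction issues per cycle
        cycle += 1
    # Run time ends when the last issued instruction completes.
    return max(ready_at, default=0), busy


thread_a = [1, 1, 5, 1]  # the 5-cycle instruction models a long wait
thread_b = [1, 1, 1, 1]

print(simulate([thread_a]))            # (8, 4): 4 idle cycles
print(simulate([thread_b]))            # (4, 4)
print(simulate([thread_a, thread_b]))  # (8, 8): no idle cycles
```

With these latencies, thread A alone needs 8 cycles and thread B alone needs 4. Running them one after the other therefore takes 12 cycles. When the two threads are interleaved, both finish in 8 cycles, because B's instructions fill the cycles that A spends waiting.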
          Out-of-order issue and superscalar issue are purely microarchitectural
        changes, which improve performance without any change in software.
        HyperThreading is an example of how architecture and microarchitecture
        working together can achieve even more performance at the cost of
        requiring new software.


        Designing for Performance

        What we really want from a fast computer is to quickly run the programs
        that we care about. Better performance simply means less program run
        time.

        $$\text{Performance} \propto \frac{1}{\text{run time}} = \frac{\text{frequency} \times \text{instructions per cycle}}{\text{instruction count}}$$
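The performance relation can be made concrete with a small calculation. In this minimal Python sketch, the function names and the example numbers (3 billion instructions, an average IPC of 1.5, a 2 GHz clock) are hypothetical illustrations and do not come from the text.

```python
def run_time_seconds(instruction_count: float, ipc: float, frequency_hz: float) -> float:
    """Run time = instruction count / (IPC * frequency)."""
    cycles = instruction_count / ipc  # total clock cycles the program needs
    return cycles / frequency_hz      # cycles divided by cycles per second


def performance(instruction_count: float, ipc: float, frequency_hz: float) -> float:
    """Performance taken as the reciprocal of run time."""
    return 1.0 / run_time_seconds(instruction_count, ipc, frequency_hz)


# Hypothetical program: 3e9 instructions, average IPC of 1.5, 2 GHz clock.
print(f"run time = {run_time_seconds(3e9, 1.5, 2e9):.2f} s")  # run time = 1.00 s
# Doubling IPC (or frequency) halves run time.
print(f"run time = {run_time_seconds(3e9, 3.0, 2e9):.2f} s")  # run time = 0.50 s
```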

          To improve performance, we must increase frequency or average
        instructions per cycle (IPC), or reduce the number of instructions
        required. Choices in architecture affect IPC and instruction count.
        Choices in microarchitecture seek to improve frequency or IPC. Increasing
        either improves performance, but in general a change that improves one
        makes the other worse. The easiest way to improve frequency is to
        increase the pipeline depth, dividing the execution of each instruction
        into smaller, faster cycles.
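The frequency-versus-IPC trade-off can be quantified by comparing a new design against an old one, with each factor expressed as a ratio of new to old. The `speedup` helper and the percentages below are hypothetical. They illustrate one point: a frequency gain improves performance only when it outweighs the accompanying IPC loss.

```python
def speedup(freq_ratio: float, ipc_ratio: float, instr_ratio: float = 1.0) -> float:
    """Relative performance of a new design versus an old one.

    Each ratio is new/old. Since performance is proportional to
    frequency * IPC / instruction count, the speedup is the product of the
    frequency and IPC ratios divided by the instruction-count ratio.
    """
    return freq_ratio * ipc_ratio / instr_ratio


# Hypothetical change: +25% frequency, -10% IPC, same software.
print(f"{speedup(1.25, 0.90):.3f}")   # 1.125 (net gain)
# Hypothetical change: +25% frequency, -25% IPC, same software.
print(f"{speedup(1.25, 0.75):.4f}")   # 0.9375 (net loss)
```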
          The examples earlier in this chapter had a pipeline depth of 4,
        dividing instructions into 4 cycles. If instructions are balls rolling
        down a pipe, this means the time between adding new balls to the pipe is
        1/4 the time a ball takes to roll the length of the pipe. Adding new balls
        at this rate means there will always be 4 balls in the pipe at one time.
        If a pipeline depth of 8 had been chosen instead, the time between adding
        new balls would be 1/8 the time to roll the entire length, and there would
        be 8 balls in the pipe at one time. Doubling the pipeline depth has doubled
        the rate at which balls are added by cutting in half the time between new
        balls.
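The ball analogy can be checked with a small timing model. This Python sketch assumes that a ball enters every `roll_time / depth` and stays in the pipe for exactly `roll_time`. The function name and its parameters are illustrative. It uses exact fractions so that floating-point rounding cannot miscount balls sitting exactly on an interval boundary.

```python
from fractions import Fraction


def balls_in_flight(depth: int, roll_time: Fraction = Fraction(1),
                    t: Fraction = Fraction(100)) -> int:
    """Count balls inside the pipe at time t (hypothetical model).

    Ball k enters at k * interval, where interval = roll_time / depth
    (the cycle time), and leaves roll_time later. t should be at least
    roll_time so that the pipe has filled.
    """
    interval = roll_time / depth
    count = 0
    k = 0
    while k * interval <= t:         # every ball that has entered by time t
        entry = k * interval
        if t < entry + roll_time:    # ...and has not yet rolled out
            count += 1
        k += 1
    return count


for depth in (4, 8):
    interval = Fraction(1) / depth   # time between new balls, in units of roll time
    print(depth, interval, balls_in_flight(depth))
# 4 1/4 4
# 8 1/8 8
```

Halving the interval from 1/4 to 1/8 of the roll time doubles the number of balls entering per unit time. The number of balls in the pipe at once doubles as well.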
          The time between adding new balls to the pipe is equivalent to a
        processor's cycle time, the inverse of processor frequency. Each
        instruction has a certain amount of logic delay determined by the type of
        instruction. Increasing the pipeline depth divides this computation
        among more cycles, allowing each cycle to be shorter. Ideally, doubling
        the pipeline depth would allow twice the frequency. In reality, some
        amount of delay overhead is added with each new pipestage. This