Page 420 - DSP Integrated Circuits

9.6 Shared-Memory Architectures with Bit-Serial PEs

        accessed simultaneously, as illustrated in Figure 9.19. Memory must therefore be
        assigned so that access conflicts are avoided.
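
        The assignment problem can be sketched as a graph-coloring task: variables that
        are accessed in the same time slot must be placed in different memories. The
        following is a minimal sketch under that assumption, using a simple greedy
        heuristic that does not guarantee the smallest number of memories. The function
        names, the variable names, and the access schedule are hypothetical.

```python
from collections import defaultdict


def assign_memories(access_slots):
    """Greedily map each variable to a memory index so that variables
    accessed in the same time slot never share a memory."""
    # Conflict graph: two variables conflict if some slot accesses both.
    conflicts = defaultdict(set)
    for slot in access_slots:
        for v in slot:
            conflicts[v].update(u for u in slot if u != v)

    assignment = {}
    # Heuristic (an assumption of this sketch): place the variables
    # with the most conflicts first; ties are broken by name.
    for v in sorted(conflicts, key=lambda x: (-len(conflicts[x]), x)):
        used = {assignment[u] for u in conflicts[v] if u in assignment}
        memory = 0
        while memory in used:
            memory += 1
        assignment[v] = memory
    return assignment


def has_conflict(access_slots, assignment):
    """Return True if any slot accesses the same memory twice."""
    for slot in access_slots:
        memories = [assignment[v] for v in slot]
        if len(memories) != len(set(memories)):
            return True
    return False


# Hypothetical schedule: each set lists the variables accessed together.
slots = [{"x0", "x1"}, {"x1", "x2"}, {"x0", "x2"}, {"x3", "x0"}]
assignment = assign_memories(slots)
```

        In this schedule x0, x1, and x2 are pairwise in conflict, so at least three
        memories are required, while x3 only has to avoid the memory holding x0.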
            As discussed in Chapter 8, a shared-memory architecture can only accommodate
        a few PEs, since the cycle time for the memories and the execution time for the
        processing elements are of the same order.
            The imbalance becomes even more severe for DSP algorithms, since the
        processing elements often require many input values from the memories for each
        operation. It is not efficient to use PEs so fast that the memories become the
        bottleneck. Generally, PEs that are faster than necessary consume more power and
        chip area. An efficient way to obtain a balanced architecture is to increase the
        execution time of the PEs as much as the application requirements allow. The cost
        of the system can thereby be reduced.
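
        The balance argument can be illustrated with a deliberately simple model, which
        is an assumption of this sketch and not taken from the text: every operation
        needs a fixed number of memory accesses, each access occupies the memory for one
        cycle, and there is no other overhead. The function name and all timing figures
        are hypothetical.

```python
def max_pes_per_memory(t_pe_ns, t_mem_ns, accesses_per_op):
    """Upper bound on the number of PEs one memory can keep busy.

    Assumed model: during one PE execution time (t_pe_ns) each PE needs
    accesses_per_op memory cycles of t_mem_ns each, so the memory can
    serve at most t_pe_ns / (accesses_per_op * t_mem_ns) PEs.
    """
    return t_pe_ns // (accesses_per_op * t_mem_ns)


# Hypothetical figures: 20 ns memory cycle, 3 accesses per operation.
fast_pe = max_pes_per_memory(t_pe_ns=40, t_mem_ns=20, accesses_per_op=3)
slow_pe = max_pes_per_memory(t_pe_ns=640, t_mem_ns=20, accesses_per_op=3)
print(fast_pe, slow_pe)  # prints "0 10"
```

        With the fast PE the memory cannot even sustain a single PE, whereas a PE that
        is 16 times slower lets the same memory serve ten PEs in this model.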


        9.6.1 Minimizing the Cost
        The throughput per unit chip area is $1/(A\,T_{PE})$. As a measure of cost we use the
        reciprocal,

            \[ C = A\,T_{PE} \]

            Many different PE implementations are possible at the same cost, $A\,T_{PE}$.
        Hence, the designer can trade chip area against execution time for PEs at the same
        cost. For example, we can use bit-serial instead of bit-parallel PEs. In Chapter 11,
        we will discuss the implementation of bit-serial PEs. A bit-serial PE typically uses
        less chip area than the corresponding bit-parallel PE, but has a longer execution
        time. For the sake of argument, we make the highly pessimistic assumption that


            Thus, the cost of executing an algorithm is


            If the speed of the PEs is reduced, then the number of PEs has to be increased
        proportionally to perform the same amount of work. Thus, this simplified analysis
        indicates that the total chip area for the PEs is almost the same for bit-serial and
        bit-parallel arithmetic. A more accurate comparison that also involves other
        important factors (e.g., power consumption) is very difficult. In practice, the
        number of PEs can be more closely adjusted to the actual throughput requirement in
        the bit-serial case. Further, the chip area required for routing wires, etc., can be
        significantly reduced. However, the parallelism in the algorithm must be high enough
        to allow a sufficient number of bit-serial PEs to be employed.
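
        The granularity argument can be made concrete with a small numerical sketch.
        The figures are illustrative assumptions and are not taken from the text: a
        bit-parallel PE with 16 area units and an execution time of 1 time unit, and a
        bit-serial PE with 1/16 of the area and 16 times the execution time, so that
        both have the same area-time cost. The required throughput of 2.3 operations
        per time unit and the function names are likewise hypothetical.

```python
import math


def pes_needed(ops_per_time_unit, t_pe):
    """Smallest whole number of PEs that sustains the required throughput."""
    return math.ceil(ops_per_time_unit * t_pe)


def total_area(ops_per_time_unit, area, t_pe):
    """Total PE area: number of PEs times the area of one PE."""
    return pes_needed(ops_per_time_unit, t_pe) * area


# Assumed, illustrative PE parameters (same area-time cost for both).
WORD_LENGTH = 16
A_PARALLEL, T_PARALLEL = 16, 1
A_SERIAL = A_PARALLEL // WORD_LENGTH   # 1 area unit
T_SERIAL = T_PARALLEL * WORD_LENGTH    # 16 time units

required = 2.3  # operations per time unit (hypothetical workload)

print(pes_needed(required, T_PARALLEL), total_area(required, A_PARALLEL, T_PARALLEL))
# prints "3 48"
print(pes_needed(required, T_SERIAL), total_area(required, A_SERIAL, T_SERIAL))
# prints "37 37"
```

        Both variants need roughly the same area, but rounding up to whole PEs wastes
        most of a 16-unit PE in the bit-parallel case and less than one area unit in the
        bit-serial case. The sketch also shows the condition stated above: the algorithm
        must offer enough parallelism to keep 37 PEs busy.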

        9.6.2 Uniform Memory Access Rate

        To obtain a uniform memory access pattern we propose a shared-memory architecture
        with a multibus ICN, with each logical memory connected to its own bus. Further, the
        PEs are provided with a set of cache memories connected to the buses, as illustrated
        in Figure 9.20. Each cache memory is split into two parts, one of which