Page 396 - DSP Integrated Circuits

8.9 Shared-Memory Architectures


        Figure 8.27. Operation of the memories is skewed in time. The general idea is to
        arrange the memories so that K sequential memory accesses fall in K distinct
        memories, allowing K accesses to be under way simultaneously. In fact, this
        arrangement is equivalent to pipelining the memory accesses. The effective
        memory cycle time is reduced by a factor of K.
            By choosing K = 2N, a good balance is obtained. Interleaving leads to
        expensive overhead because of the necessary duplication of decoders, sense
        amplifiers, etc. Furthermore, losses in processing efficiency may be incurred
        by memory access conflicts.
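
        The interleaving idea can be illustrated with a minimal sketch. It assumes
        low-order interleaving (memory index = address mod K), which the text does not
        specify; the names bank_and_offset and effective_cycle_time and the 100 ns
        cycle time are hypothetical, chosen only for illustration.

```python
def bank_and_offset(address, k):
    """Assumed low-order interleaving: consecutive addresses map to
    consecutive memories, wrapping around after k of them."""
    return address % k, address // k


def effective_cycle_time(t_mem, k):
    """Ideal, conflict-free case: k accesses overlap in time, so one
    access completes every t_mem / k."""
    return t_mem / k


K = 4
# K sequential addresses, starting at an arbitrary address (here 8).
banks = [bank_and_offset(a, K)[0] for a in range(8, 8 + K)]
print(banks)                           # [0, 1, 2, 3]
print(effective_cycle_time(100.0, K))  # 25.0
```

        The sketch covers only the ideal case. It models neither the hardware overhead
        nor the access conflicts mentioned above.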
            Another method of reducing the right-hand side of inequality (8.1), by a
        factor of two, is to use two separate sets of memories. Results from the PEs
        are written into one set while the other set is used for reading values which
        will become inputs to the PEs. The role of these two sets of memories
        alternates every other PE cycle. In this way memory conflicts are avoided.

        Figure 8.27 Interleaving of K memories
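
        The two alternating memory sets can be sketched in software as follows. This is
        an illustrative model only: the class PingPongMemory and its methods are
        hypothetical names, and the timing of the role swap is left to the caller.

```python
class PingPongMemory:
    """Two memory sets: one is read while the other is written."""

    def __init__(self, size):
        self.sets = [[0] * size, [0] * size]
        self.read_set = 0  # index of the set the PEs currently read from

    def read(self, addr):
        return self.sets[self.read_set][addr]

    def write(self, addr, value):
        # Writes always go to the set that is not being read.
        self.sets[1 - self.read_set][addr] = value

    def swap(self):
        """Exchange the roles of the two sets."""
        self.read_set = 1 - self.read_set


mem = PingPongMemory(4)
mem.sets[mem.read_set][:] = [1, 2, 3, 4]   # initial input data

for _ in range(2):                          # two processing passes
    for addr in range(4):
        mem.write(addr, 2 * mem.read(addr)) # stand-in for the PE operation
    mem.swap()

result = [mem.read(a) for a in range(4)]
print(result)  # [4, 8, 12, 16]
```

        Because reads and writes never touch the same set within a pass, they cannot
        conflict with each other.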
            Vector PEs usually employ interleaving of memories for accessing sequences
        of vector elements placed in consecutive memories. The memory elements do not
        need to be consecutive. It is enough that the access pattern is known in advance.
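
        One way to check a known access pattern for conflicts is sketched below. It
        again assumes low-order interleaving (memory index = address mod K), which the
        text does not state, and conflict_free is a hypothetical helper. Under that
        assumption, a constant-stride pattern avoids conflicts when the stride and K
        share no common factor.

```python
from math import gcd


def conflict_free(addresses, k):
    """True if every window of k successive accesses touches k distinct
    memories, assuming memory index = address mod k."""
    banks = [a % k for a in addresses]
    return all(len(set(banks[i:i + k])) == k
               for i in range(len(banks) - k + 1))


K = 4
stride3 = [3 * i for i in range(8)]  # gcd(3, 4) == 1
stride2 = [2 * i for i in range(8)]  # gcd(2, 4) == 2

print(conflict_free(stride3, K), gcd(3, K))  # True 1
print(conflict_free(stride2, K), gcd(2, K))  # False 2
```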

        8.9.3 Reducing Communications

        Most schemes to reduce the memory bandwidth requirement exploit some inherent
        property of the DSP algorithm to reduce the number of memory accesses. We
        will discuss only the most common techniques.

        Broadcasting
        The number of memory read cycles can be reduced if all PEs operate on the same
        input data. Figure 8.28 shows an array of processing elements supported by K
        memories [1]. The outputs from the memories are broadcast to the PEs. Hence,
        only one memory read cycle is needed. The required number of write cycles
        depends on the algorithm. Only one write cycle is needed if the results from
        the PEs are written into different memories such that access conflicts do not
        occur (N ≤ K), but N cycles are needed if all results are written into the
        same memory. Hence, we have



            Thus, a good balance between processing capability and communication
        bandwidth can be obtained for certain types of algorithms. This type of
        architecture can easily be time-shared among several input channels. However,
        it has too large a processing capacity for many applications.

        Figure 8.28 Broadcasting of data
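
        The read and write cycle counts described for broadcasting can be tallied with
        a small sketch. It only counts cycles as the text describes them and does not
        reproduce the inequality. The function memory_cycles and its write_targets
        argument are hypothetical, and each memory is assumed to accept one write per
        cycle.

```python
from collections import Counter


def memory_cycles(write_targets):
    """write_targets[i] is the index of the memory that PE i writes to
    (at least one PE is required).

    Returns (read_cycles, write_cycles) for one broadcast operation,
    assuming each memory accepts one write per cycle.
    """
    read_cycles = 1  # all PEs receive the same broadcast data
    # The most heavily loaded memory determines the number of write cycles.
    write_cycles = max(Counter(write_targets).values())
    return read_cycles, write_cycles


# N = 4 PEs writing into 4 different memories: no conflicts.
print(memory_cycles([0, 1, 2, 3]))  # (1, 1)
# N = 4 PEs all writing into the same memory: N write cycles.
print(memory_cycles([0, 0, 0, 0]))  # (1, 4)
```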