Page 395 - DSP Integrated Circuits

380                                              Chapter 8 DSP Architectures


            Shared-memory architectures are well suited for tightly coupled DSP
        algorithms, for example, recursive algorithms with complicated data
        dependencies. Unfortunately, the shared-memory architecture can accommodate
        only a small number of processors due to the memory bandwidth bottleneck.
        In many DSP applications the required workload is too large for a single
        shared-memory architecture based on current circuit technologies.
        Fortunately, top-down synthesis techniques tend to produce systems composed
        of loosely coupled subsystems that are tightly coupled internally.
        Typically, a signal processing system consists of a mix of subsystems in
        parallel and cascade. The system is often implemented using a message-based
        architecture since the subsystems usually have relatively low
        intercommunication requirements, while the tightly coupled subsystems are
        implemented using shared-memory architectures. Generally, the subsystems
        are fully synchronous, while global communication may be asynchronous.

        Figure 8.26 Multiprocessor architecture

        8.9.1 Memory Bandwidth Bottleneck

        The major limitation of shared-memory architecture is the well-known memory
        bandwidth bottleneck. Each processor must be allocated two memory time slots:
        one for receiving inputs and the other for storing the output value into the
        memories. To fully utilize N processors with execution time TPE, the
        following inequality must hold:

            $$T_{PE} \geq 2\,N\,T_M \qquad (8.1)$$

        where TM is the cycle time for the memories. However, TPE and TM are of the
        same order. Hence, very few processors can be kept busy because of this
        memory bandwidth bottleneck. Thus, there is a fundamental imbalance between
        computational capacity and communication bandwidth in shared-memory
        architecture. In the following sections we will discuss some methods to
        counteract this imbalance and reduce the implementation cost.
           For simplicity, we assume that the PEs perform only constant-time
        operations and that the read and write times of the memories are equal,
        TM = TR = TW. According to inequality (8.1), there are only three factors
        that can be modified by the design. In Section 8.9.2, we will discuss
        methods of reducing the cycle time of the memories, and in Section 8.9.3,
        we will discuss methods of reducing the number of memory-PE transactions.
        Finally, in Chapter 9, we will propose an efficient method based on slow
        PEs.
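
        As a rough illustration of the three design factors, the sketch below
        assumes that inequality (8.1) has the form TPE >= 2 N TM, where the factor
        2 is the number of memory time slots per operation described above. The
        function name, its parameters, and the timing values (in nanoseconds) are
        hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def max_busy_processors(t_pe: float, t_m: float, transactions: int = 2) -> int:
    """Largest N satisfying t_pe >= transactions * N * t_m.

    Assumes the bound has the form T_PE >= 2 * N * T_M, with the constant 2
    generalized to `transactions` (memory time slots per PE operation).
    """
    if t_pe <= 0 or t_m <= 0 or transactions <= 0:
        raise ValueError("all arguments must be positive")
    return math.floor(t_pe / (transactions * t_m))


if __name__ == "__main__":
    # Hypothetical baseline: T_PE = 100 ns, T_M = 25 ns, two slots per operation.
    print("baseline:          ", max_busy_processors(100, 25))
    # Factor 1: shorter effective memory cycle time.
    print("faster memory:     ", max_busy_processors(100, 12.5))
    # Factor 2: fewer memory-PE transactions per operation.
    print("fewer transactions:", max_busy_processors(100, 25, transactions=1))
    # Factor 3: slower PEs, so each PE needs its memory slots less often.
    print("slower PEs:        ", max_busy_processors(400, 25))
    # T_PE equal to T_M: not even one PE can be kept fully busy.
    print("T_PE == T_M:       ", max_busy_processors(25, 25))
```

        Under these assumed numbers, halving the memory cycle time and halving
        the number of transactions have the same effect on the bound, because
        both enter the inequality as a product with N.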


        8.9.2 Reducing the Memory Cycle Time
        The effective cycle time can be reduced by interleaving memories [9]. Each mem-
        ory in the original architecture is substituted with K memories, as illustrated in