Page 395 - DSP Integrated Circuits
380 Chapter 8 DSP Architectures
Shared-memory architectures are well
suited for tightly coupled DSP algo-
rithms—for example, recursive algorithms
with complicated data dependencies.
Unfortunately, the shared-memory archi-
tecture can accommodate only a small
number of processors due to the memory
bandwidth bottleneck. In many DSP appli-
cations the required work load is too large
for a single shared-memory architecture
based on current circuit technologies. Fortunately, top-down synthesis techniques
tend to produce systems composed of loosely coupled subsystems that are tightly
coupled internally. Typically, a signal processing system consists of a mix of
subsystems in parallel and cascade. The system is often implemented using a
message-based architecture since the subsystems usually have relatively low
intercommunication requirements, while the tightly coupled subsystems are
implemented using shared-memory architectures. Generally, the subsystems are
fully synchronous, while global communication may be asynchronous.

Figure 8.26 Multiprocessor architecture
8.9.1 Memory Bandwidth Bottleneck
The major limitation of shared-memory architecture is the well-known memory
bandwidth bottleneck. Each processor must be allocated two memory time slots:
one for receiving inputs and the other for storing the output value into the memo-
ries. To fully utilize N processors with execution time TPE, the following inequality
must hold:

$$T_{PE} \geq 2 N T_M \qquad (8.1)$$

where TM is the cycle time for the memories. However, TPE and TM are of the same
order. Hence, very few processors can be kept busy because of this memory band-
width bottleneck. Thus, there is a fundamental imbalance between computational
capacity and communication bandwidth in shared-memory architecture. In the fol-
lowing sections we will discuss some methods to counteract this imbalance and
reduce the implementation cost.
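
The size of the imbalance can be illustrated with a small numerical sketch. It assumes that inequality (8.1) has the form TPE >= 2 N TM, which follows from the two memory time slots allocated to each processor. The function names, the `slots_per_op` parameter, and the simple utilization model are illustrative assumptions and are not taken from the text.

```python
import math


def max_busy_processors(t_pe: float, t_m: float, slots_per_op: int = 2) -> int:
    """Largest N that satisfies t_pe >= slots_per_op * N * t_m.

    t_pe: execution time of one PE operation (any time unit)
    t_m:  memory cycle time (same unit as t_pe)
    slots_per_op: memory time slots per operation (assumed 2: one read, one write)
    """
    if t_pe <= 0 or t_m <= 0 or slots_per_op <= 0:
        raise ValueError("times and slot count must be positive")
    return math.floor(t_pe / (slots_per_op * t_m))


def pe_utilization(n: int, t_pe: float, t_m: float, slots_per_op: int = 2) -> float:
    """Fraction of time N processors can be kept busy under this simple model.

    The memory can serve one operation every slots_per_op * t_m, while
    N processors request N operations every t_pe.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return min(1.0, t_pe / (slots_per_op * n * t_m))


if __name__ == "__main__":
    # Hypothetical timings in nanoseconds, chosen only for illustration.
    print(max_busy_processors(t_pe=100, t_m=10))
    print(max_busy_processors(t_pe=40, t_m=25))
    print(pe_utilization(n=10, t_pe=100, t_m=10))
```

When TPE and TM are of the same order, as in the second call, the bound falls below one processor, which is the bottleneck described above.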
For simplicity, we assume that the PEs perform only constant-time operations
and that the memory read and write times are equal, TM = TR = TW. According to
inequality (8.1), there are only three factors that can be modified by the design. In
section 8.9.2, we will discuss methods of reducing the cycle time of the memories,
and in section 8.9.3, we will discuss methods of reducing the number of memory-
PE transactions. Finally, in Chapter 9, we will propose an efficient method based
on slow PEs.
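
One way to see the three factors is to solve the bound for N. The rearrangement below assumes that inequality (8.1) has the form TPE >= 2 N TM, where the factor 2 counts the memory time slots per operation. The symbol n_T for that count is introduced here for illustration only.

```latex
\[
  N \;\le\; \frac{T_{PE}}{n_T \, T_M},
  \qquad n_T = 2 \text{ in inequality (8.1)}
\]
```

Under this reading, N can be increased by reducing TM (section 8.9.2), by reducing the number of memory-PE transactions n_T (section 8.9.3), or by increasing TPE, that is, by using slower PEs (Chapter 9).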
8.9.2 Reducing the Memory Cycle Time
The effective cycle time can be reduced by interleaving memories [9]. Each mem-
ory in the original architecture is substituted with K memories, as illustrated in