Page 396 - DSP Integrated Circuits
P. 396
8.9 Shared-Memory Architectures 381
Figure 8.27. Operation of the memories is skewed in time. The general idea is to
arrange the memories so that K sequential memory accesses fall in K distinct
memories, allowing K accesses to be under way simultaneously. In fact, this
arrangement is equivalent to pipelining the memory accesses. The effective mem-
ory cycle time is reduced by a factor K.
By choosing K = 2N, a good balance is obtained.
Interleaving leads to expensive overhead because of
the necessary duplication of decoders, sense amplifi-
ers, etc. Furthermore, losses in processing efficiency
may be incurred by memory access conflicts.
Another method of reducing the right-hand side of
inequality (8.1), by a factor of two, is to use two sepa-
rate sets of memories. Results from the PEs are writ-
ten into one set while the other set is used for reading
values which will become inputs to the PEs. The role of
these two sets of memories alternates every other PE
cycle. In this way memory conflicts are avoided. Figure 8.27 Interleaving
of K memories
Vector PEs usually employ interleaving of memo-
ries for accessing sequences of vector elements placed
in consecutive memories. The memory elements do not need to be consecutive. It is
enough that the access pattern is known in advance.
8.9.3 Reducing Communications
Most schemes to reduce the memory bandwidth requirement exploit some inher-
ent property of the DSP algorithm to reduce the number of memory accesses. We
will discuss only the most common techniques.
Broadcasting
The number of memory read cycles can be
reduced if all PEs operate on the same input
data. Figure 8.28 shows an array of processing
elements supported by K memories [1]. The
outputs from the memories are broadcast to
the PEs. Hence, only one memory read cycle is
needed. The required number of write cycles
depends on the algorithm. Only one write cycle
is needed if the results from the PEs are writ-
ten into different memories such that access
conflicts do not occur (N < K), but N cycles are
needed if all results are written into the same
memory. Hence, we have
Thus, a good balance between processing Figure 8.28 Broadcasting of data
capability and communication bandwidth can
be obtained for certain types of algorithms. This type of architecture can easily be
time-shared among several input channels. However, it has too large a processing
capacity for many applications.