
General Description of AST's CDSP-k5 Customizable Complex DSP Core
(preliminary information)
The CDSP-k5 is the fifth member in a family of high-performance,
customizable fixed-point DSP cores. It is meant to be used
as an embedded cell in ASICs developed in most 0.25u and smaller process technologies.
It is highly customizable and can be targeted at a large number of technologies
thanks to its parameterized, HDL-only design.
The CDSP-k5 processes two words in parallel (up to 32 bits each), interpreted
either as a complex number or as a pair of real numbers. The CDSP-k5 hosts a 4-unit complex multiplier that sustains
one complex MAC per clock cycle, or two parallel real MACs per clock cycle; in
both cases saturation logic can be enabled to assist these operations.
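The two operating modes described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the core's actual datapath: it assumes the four units are four real multipliers, that the accumulator is twice the word width, and that saturation clamps to the two's-complement range. The names WORD_BITS, ACC_BITS, saturate, complex_mac and dual_real_mac are hypothetical and do not come from the CDSP-k5 documentation.

```python
# Behavioral sketch only; all names and widths here are illustrative assumptions.
WORD_BITS = 16            # assumed word size (the core allows up to 32 bits)
ACC_BITS = 2 * WORD_BITS  # assumed accumulator width


def saturate(value, bits=ACC_BITS):
    """Clamp a signed integer to the two's-complement range of `bits` bits."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def complex_mac(acc, a, b, bits=ACC_BITS):
    """Return acc + a * b for (re, im) integer pairs, using four real multiplies."""
    (acc_re, acc_im), (a_re, a_im), (b_re, b_im) = acc, a, b
    p0 = a_re * b_re  # the four products a 4-multiplier unit could form in parallel
    p1 = a_im * b_im
    p2 = a_re * b_im
    p3 = a_im * b_re
    return (saturate(acc_re + p0 - p1, bits),
            saturate(acc_im + p2 + p3, bits))


def dual_real_mac(acc, a, b, bits=ACC_BITS):
    """Return two independent real MAC results, one per half of each operand pair."""
    return (saturate(acc[0] + a[0] * b[0], bits),
            saturate(acc[1] + a[1] * b[1], bits))
```

In this model, complex_mac((0, 0), (1, 2), (3, 4)) treats the operands as 1+2j and 3+4j, while dual_real_mac on the same pairs accumulates 1*3 and 2*4 separately.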
The CDSP-k5 includes a high performance, customizable hardware acceleration
unit that has been optimized for most of the common DSP algorithms (MAC/FIR/Correlation,
LMS, IIR, FFT/iFFT, Search/Min/Max, Matrix/Vector operations). The modular
design of the core allows stripped-down versions to be easily obtained,
while a number of list-box/check-box customizable features enable on-the-fly
tuning of the design to match the user's specifications. The user-guided
customization process can thus achieve a highly efficient, low power and
small area implementation, making the CDSP-k5 well suited for high-volume,
low-cost applications, while still delivering world-class performance.
Some of the CDSP-k5 general registers can be used to interface with application-specific
hardware accelerators; this offers the user a convenient and effective
way to interact tightly with the internal CDSP structure. Also, both the
ALU and the MAC can be completely replaced and/or complemented with user-defined
hardware structures.
A number of productivity tools have been developed to ease the development and deployment
of DSP applications on the CDSP. These include an Assembly Language Integrated
Development Environment (aIDE) and a collection of standard DSP functions (CDSPLib).
A K&R C Language Integrated Development Environment (cIDE) is currently
under development.
Architectural features:
-
Single-cycle execution for most instructions.
-
Operates directly on complex numbers, or on pairs of real numbers
-
Highly orthogonal, two-operand instruction set, with one operand residing
in a register, and the other in a register or memory location
-
Unified data memory addressing replaces the traditional DSPs' X and Y data
memories
-
Configurable MAC unit optimized for most of the common DSP algorithms,
enabling execution speeds comparable to those of cutting-edge parallel DSP
processors on the market.
-
Saturation logic built into both the ALU and the MAC units
-
Up to four index registers fully featured with modulo and bit-reversed
post-increment addressing capability
-
Zero-cycle Block-repeat capability plus a standard looping instruction
-
Dynamic shift instruction, plus a choice of static shifts (both arithmetic
and logical).
-
Compact code and large addressing space
-
Low power dissipation, achieved by blocking in each clock cycle the logic modules
that are inactive
-
Response in less than one cycle when in wait mode, allowing fast synchronization
with predictable asynchronous events
-
Six internal 64-bit data busses enabling up to six internal complex-data
transfers per cycle, or twelve internal real-data transfers per cycle
-
Special bank-based memory architecture enabling efficient usage of data
types that are smaller than a processor word
-
Synchronous program memory implementable as a RAM/ROM combination, giving
the DSP run-time programmability
-
Interface registers to allow application-specific hardware acceleration
modules to be tightly integrated with the core
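The modulo and bit-reversed post-increment addressing modes listed above can be sketched in software as follows. This models only the resulting address sequences, not how the CDSP-k5 address generators are built; the function names and the base/length parameters are hypothetical.

```python
# Illustrative model of two post-increment addressing modes; names are assumptions.

def post_inc_modulo(index, step, base, length):
    """Return (address to use now, next index) for a circular buffer.

    The buffer occupies `length` words starting at `base`; the index wraps
    around inside that window after each access.
    """
    address = index
    next_index = base + (index - base + step) % length
    return address, next_index


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_sequence(base, bits):
    """Addresses visited when walking a 2**bits-point buffer in bit-reversed order."""
    return [base + bit_reverse(i, bits) for i in range(1 << bits)]
```

Modulo addressing is what lets a FIR delay line live in a fixed circular buffer, and bit-reversed addressing yields the reordered access pattern that radix-2 FFTs need, so neither case requires extra instructions to fix up the index.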
Customizable features include:
-
The size of the processor word (up to 32 bits)
-
The RAM and ROM sizes
-
The number of general registers
-
The number of index registers and the features of the address generators,
including modulo and bit-reversed addressing modes
-
The performance of the MAC unit, ranging from a simple, one result-bit
per cycle multiplier, up to a state-of-the-art, fully pipelined, single-cycle
complex hardware accelerator
-
The saturation and rounding options built into the ALU and the MAC
-
The amount of shifting for the static shift instructions
-
The addressing space (up to 2 GW)
-
The number, size and operation mode of the communication ports
-
And more...
Performance for a typical 0.25u/3V technology implementation:
-
The CDSP-k5 is implemented in two versions: a 4-stage pipeline version,
CDSP-k5-4, and a 6-stage pipeline version, CDSP-k5-6. The 4-stage version
consolidates (chains) memory accesses with internal DSP processing in the
same clock cycle, while the 6-stage version pipelines the memory accesses
and internal DSP processing; as a result, the 6-stage version runs at twice
the clock speed of the 4-stage version. The critical path inside
the CDSP-k5 is less than 5ns, leading to 200MHz operation for the 6-stage
version (CDSP-k5-6), and 100MHz operation for the 4-stage version (CDSP-k5-4).
-
The CDSP-k5 operates at a sustained rate of 100MIPS at 100MHz, or 200MIPS
at 200MHz. Very high performance is achieved for typical DSP algorithms by having up
to eight internal arithmetic units, plus two address generator units, working
in parallel every clock cycle. The CDSP's ALU and MAC units have
been designed with special emphasis on efficient usage of the hardware
resources during typical DSP algorithms, leading to effective speeds of 2GOPS (Giga Operations
Per Second). Examples of algorithms that fully utilize
this computing power are Complex FIR, Complex Correlation, Complex Matrix
Multiplication and Complex Energy calculation. For other algorithms such as
Real FIR, Real Correlation, Real Energy calculation, FFT, iFFT, LMS-based
Complex FIR update and Echo Cancellation, speeds between 1GOPS and 1.5GOPS
are obtained.
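The clock and throughput figures quoted above can be cross-checked with simple arithmetic. The sketch below rests on one assumed reading of the text, which the source does not state explicitly: that the 2GOPS figure counts one operation per cycle for each of the eight arithmetic units and the two address generators, ten operations per cycle in total. The function names are hypothetical.

```python
# Back-of-the-envelope check; the ops-per-cycle accounting is an assumption.

def max_clock_mhz(critical_path_ns):
    """Highest clock frequency (MHz) allowed by a critical path given in ns."""
    return 1000.0 / critical_path_ns


def peak_gops(clock_mhz, ops_per_cycle):
    """Operations per second, in units of 10**9, at a given clock and parallelism."""
    return clock_mhz * ops_per_cycle / 1000.0


# Assumed accounting: 8 arithmetic units + 2 address generators busy every cycle.
OPS_PER_CYCLE_FULL = 8 + 2

clock = max_clock_mhz(5.0)                   # 5ns critical path, 6-stage version
full = peak_gops(clock, OPS_PER_CYCLE_FULL)  # fully utilized case
```

Under the same assumption, the 1GOPS to 1.5GOPS range quoted for the other algorithms would correspond to an average of 5 to 7.5 operations per cycle at 200MHz.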