VLSI efficient RNS scalers and arbitrary modulus residue generators
Low, Jeremy Yung Shern
Date of Issue: 2014
School of Electrical and Electronic Engineering
Carry propagation has been identified as the main timing bottleneck in the datapath elements of application-specific digital signal processors that use the conventional positional number system. The Residue Number System (RNS), being a non-weighted number system, is a good remedy for this problem. RNS has been highly successful in speeding up the datapath of digital signal processing applications dominated by additions/subtractions and multiplications, such as digital filters, convolution, channelizers, equalizers and discrete transforms. Despite its rapid development, RNS is still relatively inefficient at handling some inter-modulo operations, such as data scaling, directly in the residue domain. As a workaround, a hybrid RNS-binary system has been widely used in the embedded inner-product-step-processor (IPSP) architecture of these applications to scale the intermediate results in the binary domain and so prevent overflow errors. This has resulted in a large body of research on residue-to-binary conversion, especially for special moduli sets with good number-theoretic properties. Such a hybrid number system is not ideal, as it relies heavily on an efficient RNS reverse converter, which is itself an inter-modulo operation that undermines the modularity and parallelism of RNS. This thesis tackles the RNS scaling problem by eliminating the area-intensive and slower residue-to-binary converter, for a more efficient implementation of a true RNS-based IPSP. A novel area-efficient, high-speed and precise RNS scaler for the celebrated three-moduli set {2^n - 1, 2^n, 2^n + 1} is proposed in this thesis. The new scaling algorithm is formulated based on the Chinese Remainder Theorem, and it uniquely exploits the number-theoretic properties of this moduli set and the fixed scaling factor of to overcome the complexity associated with the hardware implementation of this inter-modulo operation. The proposed RNS scaler has an area complexity of and a time complexity of .
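The hybrid RNS-binary workaround described above can be illustrated with a small software model. This is only a sketch for intuition, not the scaler proposed in the thesis: the function names (`moduli_set`, `to_rns`, `from_rns`, `rns_mul`, `hybrid_scale`) are hypothetical, and the scaling constant `k` is a free parameter chosen for the example. It shows that multiplication is carry-free across channels, whereas scaling in the hybrid scheme needs a full Chinese Remainder Theorem reverse conversion.

```python
from math import prod


def moduli_set(n):
    """Three pairwise-coprime moduli of the form 2^n - 1, 2^n, 2^n + 1."""
    return (2**n - 1, 2**n, 2**n + 1)


def to_rns(x, moduli):
    """Forward conversion: one independent residue per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Reverse conversion via the Chinese Remainder Theorem (Python 3.8+)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M


def rns_mul(a, b, moduli):
    """Channel-wise multiplication: no carries cross between moduli."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def hybrid_scale(residues, moduli, k):
    """Hybrid scaling: leave the residue domain, divide by k, convert back.

    This is the reference behaviour a true RNS scaler must reproduce
    without the intermediate binary conversion.
    """
    x = from_rns(residues, moduli)
    return to_rns(x // k, moduli)


if __name__ == "__main__":
    mods = moduli_set(4)  # (15, 16, 17), dynamic range 4080
    p = rns_mul(to_rns(37, mods), to_rns(91, mods), mods)
    print(from_rns(p, mods))                          # product recovered via CRT
    print(from_rns(hybrid_scale(p, mods, 16), mods))  # product scaled down by 16
```

The `from_rns` step touches every residue channel at once, which is why the abstract treats the reverse converter as the costly part of the hybrid scheme.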
The integer scaled output in normal binary number representation is also generated as a byproduct of this new formulation. Hence, the expensive residue-to-binary converter can be saved if the result after scaling is also required by a normal binary number system. To extend the usefulness of this moduli set to RNS-based adaptive signal processing applications, another elegant algorithm is developed to enable programmable power-of-two RNS scaling for the first time. With a variable scaling factor of 2^r, , up to one-third of the dynamic range of an integer can be arbitrarily scaled directly in the residue domain. The architecture can be implemented entirely in combinational circuits without lookup tables, making it easy to merge and pipeline with other circuits within the RNS. Its simplicity contributes to the simultaneous reduction of area, delay and power consumption in its hardware implementation. As the binary scaled output can also be made available by the proposed method, it also eases the magnitude comparison of scaled integers. While an RNS constructed from moduli of the form 2^n and 2^n ± 1 possesses good number-theoretic properties, it has limited parallelism and asymmetrical modulus wordlengths, because only a limited number of such moduli can be selected while fulfilling the relative primality requirement of RNS. The size of one or more moduli has to be increased as the dynamic range increases, which degrades overall system performance for high dynamic range computations. Rather than enlarging one or more moduli to extend the dynamic range of an RNS, the cardinality of the RNS can be increased by using an arbitrary moduli set, as the complexity of the RNS-to-binary converter is insensitive to cardinality once the cardinality exceeds a certain threshold. For high dynamic range applications, a valid RNS can be formed with relative ease by selecting as many moduli as desired from the plentiful small integers.
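The last point, building a high dynamic range RNS from many small pairwise-coprime moduli, can be made concrete with a short sketch. The greedy selection below and the name `select_moduli` are illustrative assumptions, not a procedure taken from the thesis. The 64-bit range and six-bit modulus width in the usage lines are example parameters only.

```python
from math import gcd, prod


def select_moduli(range_bits, max_modulus_bits):
    """Greedily pick pairwise-coprime moduli, largest first, until their
    product (the dynamic range) reaches at least 2**range_bits.

    Hypothetical helper: a simple greedy pass, not an optimal search.
    """
    chosen = []
    for candidate in range(2**max_modulus_bits - 1, 1, -1):
        # Keep the candidate only if it is coprime to every modulus so far.
        if all(gcd(candidate, m) == 1 for m in chosen):
            chosen.append(candidate)
            if prod(chosen) >= 2**range_bits:
                return chosen
    raise ValueError("not enough coprime moduli below the width limit")


if __name__ == "__main__":
    moduli = select_moduli(range_bits=64, max_modulus_bits=6)
    print(moduli)
    print(prod(moduli).bit_length())  # bits covered by the dynamic range
```

Every modulus stays within the same small wordlength, so the residue channels are balanced, in contrast to the three-moduli case where each modulus must grow with the dynamic range.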
This thesis also addresses the performance bottleneck of generating residues for different arbitrary moduli of diverse cyclic periodicity, so that the advantages of a balanced, high-cardinality general RNS are not offset by hardware implementation overheads. A new approach to the unitary design of efficient residue generators for arbitrary moduli is thus presented. The proposed design requires at most seven stages of carry save addition (CSA), one lookup table (LUT) with an input of no more than seven bits, and a small modular adder, for input wordlengths as large as 64 bits and moduli up to six bits wide, making it the fastest, smallest and most power-efficient residue generator architecture for any input and modulus within these ranges. It significantly reduces the area of the fastest memory-based design and the delay of the most efficient memoryless design reported thus far. Moreover, the disparity caused by the inconsistent periodicity of different moduli in the latter design is minimized by the proposed depth-constrained CSA tree, periodicity-independent LUT and modified modular adder. The latter are made possible by the ingenious use of the distributive property in place of the periodicity property to limit the width of the CSA tree, so that large LUTs and the modular adder tree can be eliminated.
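The two properties contrasted above can be sketched in software. This is a behavioural model for intuition only and does not reproduce the CSA tree, LUT or modular adder of the proposed architecture. The function names are hypothetical. The periodicity property relies on 2^P ≡ 1 (mod m) for an odd modulus m, so P-bit slices of the input can simply be added. The distributive property replaces each bit weight 2^i by its residue modulo m and holds for any modulus.

```python
def period_of_two(m):
    """Smallest P > 0 with 2**P % m == 1. Requires an odd modulus m >= 3."""
    if m < 3 or m % 2 == 0:
        raise ValueError("periodicity folding needs an odd modulus >= 3")
    p, v = 1, 2 % m
    while v != 1:
        v = (v * 2) % m
        p += 1
    return p


def residue_by_periodicity(x, m):
    """Fold x into P-bit slices and add them, since 2**P == 1 (mod m)."""
    P = period_of_two(m)
    mask = (1 << P) - 1
    total = 0
    while x:
        total += x & mask
        x >>= P
    return total % m  # final small reduction


def residue_by_distributivity(x, m, width=64):
    """Sum the residues of the weights of the set bits: |sum x_i * |2^i|_m|_m."""
    weights = [pow(2, i, m) for i in range(width)]  # each weight is below m
    total = sum(weights[i] for i in range(width) if (x >> i) & 1)
    return total % m


if __name__ == "__main__":
    x = 0xDEADBEEFCAFEF00D  # example 64-bit input
    for m in (7, 37, 59):
        print(m, period_of_two(m),
              residue_by_periodicity(x, m),
              residue_by_distributivity(x, m))
```

The period P varies widely from one modulus to another, so the number of slices to add differs per modulus. The weight-residue form keeps every addend below m regardless of the period, which is the kind of uniformity the abstract attributes to the distributive property.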
DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing