Low-Voltage, Low-Power CMOS Arithmetic Circuits for Energy Efficient VLSI Applications

Gu Jiangmin

School of Electrical & Electronic Engineering

A thesis submitted to the Nanyang Technological University in fulfilment of the requirement for the degree of Master of Engineering

2005
Acknowledgements

I wish to thank my supervisor, Assistant Professor Chang Chip Hong, for his efficient supervision, active academic motivation and easy-going communication throughout the research program. In particular, I wish to thank him for teaching me the philosophies of research and intangible skills, which are the most invaluable knowledge I have acquired in the research program.

I would also like to thank Associate Professor Yeo Kiat Seng for the discussion and great help in the development of the algorithm, practical layout and the fabrication of the scalar product macrocell.

I would like to thank Ms Zhang Mingyan and Ms He Yajuan for their long-standing friendship, thorough discussions, pointed suggestions and invaluable help pertaining to my research. I would like to thank Ms Gong Yuanyuan and Mr. Shi Lei for their tedious but indispensable work, which sped up the research schedule. I would also like to thank Mr. Xu Pengfei for both his firm friendship and kind help pertaining to the research.

To the staff and other students in Integrated System Research Laboratory, IC Design Laboratories 1 and 2, Center for Integrated Circuits and Systems and Center for High Performance Embedded Systems, I wish to convey my appreciation to all of them for their kind and friendly assistance.

I would like to express my gratitude to my parents for their consistent support and encouragement during all periods of my study.
Summary

We present in this thesis several algorithms and designs, together with their IC implementations, of low-voltage, low-power CMOS arithmetic circuits for energy efficient VLSI applications. The design methodologies span the circuit, architecture and algorithm levels.

At the circuit level, we study basic gates including the XOR gate and propose two new gates: an XOR-XNOR gate in the pass-transistor logic design style that efficiently generates the XOR and XNOR functions simultaneously, and a gate in a novel complementary CMOS style with regular layout that generates a carry output with good drivability. Based on these two gates, we propose two novel 1-bit adder cells, Hybrid-1 and Hybrid-2. The structures of 4-2 and 5-2 compressors, which are the main building blocks of multipliers, are analyzed and different CMOS logic style circuit implementations of their constituent modules are explored. We propose a new 4-2 compressor using the novel XOR-XNOR gate and a novel 5-2 compressor architecture of 4Δ delay, both of which are able to function down to 0.6V and feature high speed and low power characteristics.

At the algorithm and architecture level, we present a new VLSI circuit design algorithm for scalar product evaluation. It produces a novel full bit-parallel architecture for the scalar product macrocell featuring low interconnect complexity, improved power efficiency and highly efficient VLSI area utilization. More importantly, its layout regularity and scalability widen its performance advantage in the deep submicron regime over the conventional VLSI design of a vector processing unit for scalar product computation. The chip is fabricated on Chartered Semiconductor Manufacturing 0.18μm CMOS technology. To test the scalar product macrocell, several auxiliary circuits, including a unique delay detection circuit, are proposed to enable accurate delay measurement of the core under the constraint of limited IO pins.

We exploit the redundant binary system and propose another new redundant binary multiplier based on the newly developed Covalent Redundant Binary Booth Encoding (CRBBE) algorithm. This algorithm fully exploits the characteristics of the Booth encoded numbers to overcome the problem of generating hard multiples and achieves a compatible reduction of Redundant Binary (RB) partial product without inducing any correction vector. Compared with the existing Redundant Binary Signed Digit (RBSD) multiplier, which is also used to address the hard multiple problem, the proposed CRBBE multiplier has only half the number of partial products to be summed if the same radix Booth encoders are used. Conversely, for the same number of RB partial products, the CRBBE multiplier is much simpler, thus less power consuming and faster than the RBSD multiplier and the RB partial product generation with normal Booth encoding (NBE-RBPPG) multiplier, which uses many high fan-in gates.

To optimize and simulate the arithmetic cells, we first propose an optimization procedure to size the transistors of all arithmetic cells so that a fair comparison can be performed, because the transistor sizing for optimal performance is technology dependent. The optimization procedure uses a sweeping method to search iteratively for the optimized transistor sizes, and has been proven to be a convergent algorithm. We also propose reasonable simulation structures to compare the basic arithmetic cells, such as full adders, 4-2 and 5-2 compressors, in an environment realistic to their actual deployment in the most frequently used parallel multiplier structure.
# Contents

Acknowledgements
Summary
Contents
List of Figures
List of Tables
List of Abbreviations

## Chapter 1 Introduction

1.1 Motivation
1.2 Objective
1.3 Research Originality
1.4 Organization

## Chapter 2 Literature Review

2.1 Low Power Techniques for Digital Circuit Design
   2.1.1 Overview of power consumption of CMOS circuit
   2.1.2 Low power design techniques
2.2 Design of XOR and XNOR Cells
2.3 Design of Full Adders
2.4 Digital Multiplier Architectures
   2.4.1 Normal binary multipliers
   2.4.2 Redundant binary multipliers

## Chapter 3 Design and Performance Evaluation of Low Voltage Low Power Full Adder Cells

3.1 Introduction
3.2 Review of Existing Full Adder Cells
3.3 Full Adder Architecture and Its Building Blocks
   3.3.1 XOR/XNOR module
   3.3.2 XOR module for sum output
   3.3.3 Carry generator module
   3.3.4 Circuit Structure of Hybrid 1 and Hybrid 2
3.4 Simulation Results
   3.4.1 Simulation environment
   3.4.2 Transistor sizing optimization
   3.4.3 Simulation results and analysis
3.5 Summary

## Chapter 4 Low Voltage, Low Power 4-2 and 5-2 Compressors for Fast Arithmetic Circuits

4.1 Introduction
4.2 4-2 Compressor
4.3 5-2 Compressor
   4.3.1 5-2 Compressor Architectures
   4.3.2 Circuit of the proposed 5-2 compressor architectures
4.4 Simulation Results
   4.4.1 Simulation environment
   4.4.2 Simulation results of 4-2 compressors
# List of Figures

<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>FIGURE 2.1</td>
<td>CAPACITIVE LOAD MODEL</td>
</tr>
<tr>
<td>FIGURE 2.2</td>
<td>POWER OPTIMIZATION HIERARCHY</td>
</tr>
<tr>
<td>FIGURE 2.3</td>
<td>CLASSICAL CMOS XOR-XNOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.4</td>
<td>MIRROR CMOS XOR-XNOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.5</td>
<td>WANG'S CROSS-BACK XOR-XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.6</td>
<td>MODIFIED CROSS-BACK XOR-XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.7</td>
<td>CMOS+ XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.8</td>
<td>MODIFIED CMOS+ XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.9</td>
<td>CPL XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.10</td>
<td>LATCHED CPL XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.11</td>
<td>DPL XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.12</td>
<td>TRANSMISSION GATE XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.13</td>
<td>PASS TRANSISTOR XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.14</td>
<td>POWERLESS-XOR AND GROUNDLESS-XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.15</td>
<td>DRIVEABILITY ENHANCEMENTS FOR XOR/XNOR GATES</td>
</tr>
<tr>
<td>FIGURE 2.16</td>
<td>MODIFIED CPL XOR-XNOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.17</td>
<td>FEEDBACK XOR-XNOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.18</td>
<td>CHENG'S XOR-XNOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.19</td>
<td>MIRROR CMOS 3-INPUT XOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.20</td>
<td>FANG'S 3-INPUT XOR GATE</td>
</tr>
<tr>
<td>FIGURE 2.21</td>
<td>CLASSICAL CMOS FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.22</td>
<td>MIRROR CMOS FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.23</td>
<td>FAST CARRY CMOS FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.24</td>
<td>MODIFIED CPL FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.25</td>
<td>LATCHED CARRY FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.26</td>
<td>STRUCTURE OF FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 2.27</td>
<td>CLASSIFICATION OF MULTIPLIERS</td>
</tr>
<tr>
<td>FIGURE 3.1</td>
<td>BLOCK DIAGRAM OF FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 3.2</td>
<td>EXISTING FULL ADDER CIRCUITS</td>
</tr>
<tr>
<td>FIGURE 3.3</td>
<td>THREE-MODULE STRUCTURED FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 3.4</td>
<td>CIRCUIT IMPLEMENTATIONS OF XOR/XNOR MODULE</td>
</tr>
<tr>
<td>FIGURE 3.5</td>
<td>SLOW STEP OUTPUT TRANSITIONS WHEN AB CHANGE FROM 01, 10 TO 00 FOR THE XOR/XNOR CIRCUIT OF FIG. 3.4(d) AT SUPPLY VOLTAGE OF 0.8V</td>
</tr>
<tr>
<td>FIGURE 3.6</td>
<td>XOR MODULE CIRCUITS FOR SUM OUTPUT</td>
</tr>
<tr>
<td>FIGURE 3.7</td>
<td>IMPLEMENTATIONS OF THE CARRY GENERATOR MODULE USING MUX</td>
</tr>
<tr>
<td>FIGURE 3.8</td>
<td>PROPOSED CARRY GENERATOR MODULE FOR FULL ADDER</td>
</tr>
<tr>
<td>FIGURE 3.9</td>
<td>PROPOSED 1-BIT FULL ADDER CIRCUIT - HYBRID 1</td>
</tr>
<tr>
<td>FIGURE 3.10</td>
<td>PROPOSED 1-BIT FULL ADDER CIRCUIT - HYBRID 2</td>
</tr>
<tr>
<td>FIGURE 3.11</td>
<td>SIMULATION SETUP SUGGESTED IN [SHA02]</td>
</tr>
<tr>
<td>FIGURE 3.12</td>
<td>PROPOSED SIMULATION SETUP</td>
</tr>
<tr>
<td>FIGURE 3.13</td>
<td>WAVEFORM SNAPSHOTS OF THE CIRCUITS WITH ATTACHED BUFFERS (1.8V, 100MHz)</td>
</tr>
<tr>
<td>FIGURE 3.14</td>
<td>WAVEFORM SNAPSHOTS OF THE CIRCUITS WITHOUT ATTACHED BUFFERS (1.8V, 100MHz)</td>
</tr>
<tr>
<td>FIGURE 3.15</td>
<td>OPTIMIZATION FLOWCHART</td>
</tr>
<tr>
<td>FIGURE 3.16</td>
<td>POWER, DELAY AND POWER-DELAY-PRODUCT COMPARISON OF FULL ADDER CIRCUITS</td>
</tr>
<tr>
<td>FIGURE 3.17</td>
<td>NORMALIZED DELAY DIFFERENCE FACTORS BETWEEN HYBRID-2 AND C-CMOS</td>
</tr>
<tr>
<td>FIGURE 3.18</td>
<td>COMPARISON OF OUTPUT WAVEFORMS BETWEEN HYBRID-2 AND C-CMOS</td>
</tr>
<tr>
<td>FIGURE 3.19</td>
<td>LAYOUT OF THE HYBRID-1 CELL</td>
</tr>
<tr>
<td>FIGURE 3.20</td>
<td>LAYOUT OF THE HYBRID-2 CELL</td>
</tr>
</tbody>
</table>
FIGURE 4.1 4-2 COMPRESSOR
FIGURE 4.2 CONVENTIONAL 4-2 COMPRESSOR - 4Δ
FIGURE 4.3 LOGIC LEVEL OPTIMIZED CMOS 4-2 COMPRESSOR
FIGURE 4.4 LOGIC DECOMPOSITION OF 4-2 COMPRESSOR - 3Δ
FIGURE 4.5 LAYOUT OF THE NEW 4-2 COMPRESSOR USING THE PROPOSED XOR-XNOR CELL
FIGURE 4.6 5-2 COMPRESSOR
FIGURE 4.7 5-2 COMPRESSOR BASED ON CASCADED FULL ADDERS - 6Δ
FIGURE 4.8 LOGIC DECOMPOSITION OF 4-2 COMPRESSOR - 3Δ
FIGURE 4.9 PROPOSED 5-2 COMPRESSOR ARCHITECTURE - 4Δ
FIGURE 4.10 IMPLEMENTATIONS OF THE CGEN1 MODULE
FIGURE 4.11 LAYOUT OF THE PROPOSED 5-2 COMPRESSOR
FIGURE 4.12 SIMULATION ENVIRONMENTS
FIGURE 4.13 PERFORMANCES OF 4-2 COMPRESSORS (DESIGNS 2 - 6)
FIGURE 4.14 PERFORMANCES OF 4-2 COMPRESSORS (DESIGNS 1, 3, 6 - 9)
FIGURE 4.15 PERFORMANCES OF 5-2 COMPRESSORS (DESIGNS 1 - 5)
FIGURE 4.16 PERFORMANCES OF 5-2 COMPRESSORS (DESIGNS 6 - 9)
FIGURE 4.17 PERFORMANCES OF 5-2 COMPRESSORS (DESIGNS 10 - 13)
FIGURE 4.18 PERFORMANCES OF DIFFERENT 5-2 COMPRESSOR ARCHITECTURES WITH BBB CONFIGURATION (DESIGNS 1, 6, 10)
FIGURE 4.19 PERFORMANCES OF DIFFERENT 5-2 COMPRESSOR ARCHITECTURES WITH EBB CONFIGURATION (DESIGNS 2, 7, 11)
FIGURE 4.20 PERFORMANCES OF DIFFERENT 5-2 COMPRESSOR ARCHITECTURES WITH HYBRID CONFIGURATION (DESIGNS 3, 8, 12)
FIGURE 4.21 PERFORMANCES OF DIFFERENT 5-2 COMPRESSOR ARCHITECTURES OF DUAL-RAIL LOGIC CONFIGURATION (DESIGNS 4, 5, 9, 13)

FIGURE 5.1 TRADITIONAL ALGORITHM FOR VLSI DESIGN OF SCALAR PRODUCT IP CORE
FIGURE 5.2 STORED CARRY FORMAT
FIGURE 5.3 PROPOSED ALGORITHM FOR VLSI DESIGN OF VECTOR MULTIPLIER
FIGURE 5.4 PROPOSED ARCHITECTURE FOR THE SCALAR PRODUCT CORE
FIGURE 5.5 RECTANGULAR STRUCTURE OF PPA
FIGURE 5.6 WALLACE TREE STRUCTURE OF PPA
FIGURE 5.7 PARALLELOGRAM STRUCTURE OF VA
FIGURE 5.8 WALLACE TREE STRUCTURE OF VA
FIGURE 5.9 FLOOR PLANNING OF PPA WITH PPG
FIGURE 5.10 FLOOR PLANNING OF THE PROPOSED BIT-PARALLEL SCALAR PRODUCT CORE
FIGURE 5.11 FLOOR-PLANNING OF NORMAL 16x16-BIT MULTIPLIER
FIGURE 5.12 FLOOR PLANNING OF THE CONVENTIONAL SCALAR PRODUCT MACROCELL
FIGURE 5.13 LAYOUT OF THE PARTIAL PRODUCT ACCUMULATOR
FIGURE 5.14 LAYOUT OF THE SCALAR PRODUCT MACROCELL
FIGURE 5.15 LOCAL AND GLOBAL POWER GROUND LINES ARRANGEMENT
FIGURE 5.16 INPUT REGISTERS
FIGURE 5.17 THE 36-INPUT AND GATE
FIGURE 5.18 DELAY DETECTION CIRCUIT
FIGURE 5.19 LAYOUT OF DELAY DETECTION CIRCUIT
FIGURE 5.20 COMPARISON OF POWER DISSIPATION (mW/10MHz)
FIGURE 5.21 COMPARISON OF WORST-CASE DELAY (ns)
FIGURE 5.22 COMPARISON OF POWER EFFICIENCY (pJ/10MHz)

FIGURE 6.1 STRUCTURE OF BOOTH DIGITAL MULTIPLIER
FIGURE 6.2 NBE-RBPPG BOOTH ENCODER
FIGURE 6.3 ONE DIGIT REDUNDANT BINARY PARTIAL PRODUCT GENERATOR FOR NBE-RBPPG
FIGURE 6.4 RB ADDER FOR GENERATING THE K-TH RB DIGIT OF THE 5M HARD MULTIPLE
FIGURE 6.5 RBSD BOOTH-2 ENCODER AND RBPP GENERATOR
FIGURE 6.6 RBSD BOOTH-3 ENCODER AND RBPP GENERATOR
FIGURE 6.7 RBSD BOOTH-4 ENCODER AND RBPP GENERATOR
FIGURE 6.8 CRBBE-1 ENCODER
FIGURE 6.9 CRBBE-1 RB PARTIAL PRODUCT GENERATOR
FIGURE 6.10 CRBBE-1.5 ENCODER
FIGURE 6.11 CRBBE-1.5 RB PARTIAL PRODUCT GENERATOR
FIGURE 6.12 CRBBE-2 ENCODERS
FIGURE 6.13 RB PARTIAL PRODUCT GENERATOR FOR CRBBE-2
FIGURE 6.14 SCHEMATIC DIAGRAMS OF DIFFERENT RBAS
FIGURE 6.15 CIRCUIT TO INHIBIT (1,1) INPUT FOR POSITIVE-NEGATIVE CODING
FIGURE 6.16 CIRCUIT IMPLEMENTATION OF RBA4
FIGURE 6.17 CIRCUIT IMPLEMENTATION OF RB-NB CONVERTER
FIGURE 6.18 PROPOSED CRBBE 54 × 54-BIT RB MULTIPLIER
FIGURE 6.19 NBE-RBPPG 54 × 54-BIT RB MULTIPLIER
FIGURE 6.20 RBSD 54 × 54-BIT RB MULTIPLIER
List of Tables

TABLE 3.1 TRANSISTOR SIZES (µm) OF FULL ADDERS OPTIMIZED FOR POWER-DELAY PRODUCT
TABLE 3.2 POWER, DELAY AND POWER-DELAY-PRODUCT COMPARISON OF FULL ADDER CELLS
TABLE 3.3 THE SILICON AREA OF THE FULL ADDERS
TABLE 3.4 THE PRE- AND POST-LAYOUT SIMULATION RESULTS OF THE HYBRID-1 AND HYBRID-2 CELLS

TABLE 4.1 CONFIGURATIONS OF THE SIMULATED 4-2 COMPRESSORS
TABLE 4.2 COMPARISON OF DELAY (ns) OF 4-2 COMPRESSORS
TABLE 4.3 COMPARISON OF POWER (µW) OF 4-2 COMPRESSORS
TABLE 4.4 COMPARISON OF POWER EFFICIENCY (fJ) OF 4-2 COMPRESSORS
TABLE 4.5 CONFIGURATIONS OF THE SIMULATED 5-2 COMPRESSORS
TABLE 4.6 COMPARISON OF DELAY (ns) OF 5-2 COMPRESSORS
TABLE 4.7 COMPARISON OF POWER (µW) OF 5-2 COMPRESSORS
TABLE 4.8 COMPARISON OF POWER EFFICIENCY (fJ) OF 5-2 COMPRESSORS

TABLE 5.1 THE SIZES OF DIFFERENT LEAF CELLS (ON CSM 0.18µm CMOS PROCESS)
TABLE 5.2 FLOOR PLAN CHARACTERISTICS OF THE TWO ARCHITECTURES
TABLE 5.3 ESTIMATION OF THE WORST CASE DELAY
TABLE 5.4 COMPARISON OF POWER, DELAY, AND POWER EFFICIENCY

TABLE 6.1 SIGN-MAGNITUDE CODING
TABLE 6.2 POSITIVE-NEGATIVE CODING
TABLE 6.3 POSITIVE-NEGATIVE COMPLEMENT CODING
TABLE 6.4 BOOTH-1 ENCODING
TABLE 6.5 BOOTH-2 ENCODING
TABLE 6.6 BOOTH-3 ENCODING
TABLE 6.7 BOOTH-4 ENCODING
TABLE 6.8 RBSD BOOTH-3 ENCODING
TABLE 6.9 RBSD BOOTH-4 ENCODING
TABLE 6.10 PERMISSIBLE PAIRS OF CONTIGUOUS DIGITS (Dj⁻¹, Dj) IN BOOTH-1 ENCODED NUMBER
TABLE 6.11 PROPOSED COVALENT RB BOOTH-1 ENCODED DUPLET (Dj⁻¹, Dj)
TABLE 6.12 PERMISSIBLE PAIRS OF CONTIGUOUS DIGITS (Dj⁻¹, Dj) IN BOOTH-2 ENCODED NUMBER
TABLE 6.13 PROPOSED COVALENT RB BOOTH-2 ENCODED DUPLETS (Dj⁻¹, Dj)
TABLE 6.14 PERMISSIBLE PAIRS OF CONTIGUOUS DIGITS (Dj⁻¹, Dj) IN COMPOSITE BOOTH-1 AND BOOTH-2 ENCODED NUMBER
TABLE 6.15 PROPOSED COVALENT RB BOOTH-1.5 ENCODED DUPLETS (Dj⁻¹, Dj)
TABLE 6.16 CARRY-FREE ADDING RULES FOR RBA
TABLE 6.17 SIMULATION RESULTS OF RB 54x54-BIT MULTIPLIERS
# List of Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ALU</td>
<td>Arithmetic Logic Unit</td>
</tr>
<tr>
<td>ASIC</td>
<td>Application Specific Integrated Circuit</td>
</tr>
<tr>
<td>BEPPG</td>
<td>Booth Encoding and Partial Product Generation</td>
</tr>
<tr>
<td>C-CMOS</td>
<td>Classical Complementary Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>CPL</td>
<td>Complementary Pass-Transistor Logic</td>
</tr>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
</tr>
<tr>
<td>CRBBE</td>
<td>Covalent Redundant Binary Booth Encoding</td>
</tr>
<tr>
<td>CSA</td>
<td>Carry Save Adder</td>
</tr>
<tr>
<td>DPL</td>
<td>Dual Pass Transistor Logic</td>
</tr>
<tr>
<td>DSM</td>
<td>Deep Sub-Micron</td>
</tr>
<tr>
<td>DSP</td>
<td>Digital Signal Processor</td>
</tr>
<tr>
<td>FA</td>
<td>Full Adder</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>IO</td>
<td>Input/Output</td>
</tr>
<tr>
<td>LSI</td>
<td>Large Scale Integration</td>
</tr>
<tr>
<td>MAC</td>
<td>Multiply-Accumulator</td>
</tr>
<tr>
<td>MOS</td>
<td>Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>MOSFET</td>
<td>Metal Oxide Semiconductor Field-Effect Transistor</td>
</tr>
<tr>
<td>MUX</td>
<td>Multiplexer</td>
</tr>
<tr>
<td>NB</td>
<td>Normal Binary</td>
</tr>
<tr>
<td>NBE-RBPPG</td>
<td>Normal Binary Booth Encoding and Redundant Binary Partial Product Generation</td>
</tr>
<tr>
<td>NMOS</td>
<td>N-type Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>PDA</td>
<td>Personal Digital Assistant</td>
</tr>
<tr>
<td>PDP</td>
<td>Power-Delay Product</td>
</tr>
<tr>
<td>PMOS</td>
<td>P-type Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>PPG</td>
<td>Partial Product Generator</td>
</tr>
<tr>
<td>QoS</td>
<td>Quality of Silicon</td>
</tr>
<tr>
<td>RB</td>
<td>Redundant Binary</td>
</tr>
<tr>
<td>RBA</td>
<td>Redundant Binary Adder</td>
</tr>
<tr>
<td>RBSD</td>
<td>Redundant Binary Signed Digit</td>
</tr>
<tr>
<td>RCA</td>
<td>Ripple Carry Adder</td>
</tr>
<tr>
<td>RFIC</td>
<td>Radio Frequency Integrated Circuit</td>
</tr>
<tr>
<td>RISC</td>
<td>Reduced Instruction Set Computer</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction Multiple Data</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
<tr>
<td>TFA</td>
<td>Transmission Function Adder</td>
</tr>
<tr>
<td>TGA</td>
<td>Transmission Gate Adder</td>
</tr>
<tr>
<td>ULSI</td>
<td>Ultra-Large Scale Integration</td>
</tr>
<tr>
<td>VLSI</td>
<td>Very Large Scale Integration</td>
</tr>
<tr>
<td>XOR</td>
<td>Exclusive Or</td>
</tr>
<tr>
<td>XNOR</td>
<td>Exclusive Nor</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

1.1 Motivation

The Metal Oxide Semiconductor Field-Effect Transistor (MOSFET) was first proposed in the 1930s [YEO02]. Integrated circuits (ICs) were found to be much smaller and more power efficient than the discrete components used to build electronic systems in the 1950s. Around 1980, integrated circuits based on Metal Oxide Semiconductor (MOS) technology came to dominate the market over other technologies, such as bipolar technology. By 1990, about two thirds of IC industry sales were of MOS technology, of which more than 90% was Complementary MOS (CMOS) technology. Nowadays, IC technology encompasses a whole host of innovative devices and systems that perform a wide variety of tasks, whether visible or hidden, and has greatly changed the way we live.

In the 1960s, Gordon Moore, an industry pioneer, presciently presented the now well-known Moore’s Law [FER89, WOL98], which predicted that the number of transistors in a single integrated circuit chip would double every 18 months. The prediction held well over the following several decades. To date, the transistor count per chip doubles about once a year [WOL98]. The engineering profession classifies integration technology by orders of magnitude of the integrated transistor count. Initially, when circuits contained a few hundred transistors on a single chip, they were called “integrated circuits”. As the technology grew, a single chip could integrate thousands of transistors, which is known as large scale integration (LSI). When the chip grew beyond 10,000 transistors, the term very large scale integration (VLSI) was used. The ultra-large scale integration (ULSI) era arrived when a single chip could easily integrate 1,000,000 to 5,000,000 transistors [FER89]. It was reported in 2000 that a single-chip multiprocessor DSP contained 1.6 billion transistors [ACK00]. It is estimated that at the end of the 2010s, by using 50nm transistors operating below one volt, a system-on-chip (SoC) will grow to 4 billion transistors [BEN02].

Conventionally, low-power circuit designers focused on only a few niche applications such as wrist watches, pocket calculators, electronic thermometers and some integrated sensors. Currently, the low-power design paradigm is expanding to high performance applications because power dissipation is becoming one of the most important design constraints. Regardless of why designers lower the power consumption of their targeted applications, minimizing the overall system power has emerged as the number one daunting task. As a matter of fact, the reasons can be narrowed down to two: portability and reliability [ELR97]. Many portable systems that draw power from batteries are in daily use, such as laptop computers, digital video cameras and cellular phones. The requirement for low power stems from the need both to reduce the weight and volume of the devices and to lengthen the time between recharges. Reliability is another major concern for high-speed digital systems, which often generate a large amount of heat due to high clock frequencies and dense device integration on chip. Expensive cooling systems are therefore required to ensure the proper functioning of the system. The goal of low power design from this perspective is to lower the system cost and make the system operate with higher stability.

With the rapid development of VLSI/ULSI technology, the minimum feature size of current mature processes has shrunk to well below 1μm, the deep sub-micron (DSM) regime, and the cross-coupling capacitance between adjacent lines, both within the same layer and across different layers, is becoming more and more important. It has increased from 40% to 70% of the total capacitance as the technology migrates from 0.35μm to 70nm. In addition, interconnections increasingly affect chip power dissipation and create more potential crosstalk problems [KHA01, FER89, ELR97].
1.2 Objective

The preliminary goal of this research is to investigate and characterize high performance, low power arithmetic circuits and macrocells, which are often the stumbling blocks for timing closure and power consumption in digital systems. The techniques for low power design are explored from the device level, such as the basic XOR and XNOR gates, to the algorithmic level involving digital multiplication and scalar product computation, for both the classical two’s complement number system and the less familiar redundant binary system. Apart from CMOS circuit level optimization strategies with different logic styles, the development of new circuit structures and the derivation of ingenious architectures from novel algorithms, this research project also aims at providing new insight into design tradeoffs that involve power, delay and area metrics, and at establishing a decision model that aids the designer in constructing an arithmetic circuit from building blocks with favorable characteristics. It is envisaged that the proposed design methodologies, optimization methods and simulation environments will be useful in advancing the domain knowledge of VLSI design into the DSM regime.

In a nutshell, the objective of this research is to design new arithmetic circuits with the following Quality of Silicon (QoS) attributes:

a. low power and high energy efficiency (i.e., low power-delay product)
b. low voltage operation
c. high performance
d. high area efficiency

A complete layout of a sufficiently complex full custom arithmetic macrocell, which is designed based on a new algorithm and composed of optimized novel primitive cells, is also targeted at 0.18μm CMOS technology so that more convincing post-layout simulations with back-annotated parasitics can be used to validate our proposed design methodologies.

1.3 Research originality

The major contributions of the research results reported in this thesis can be categorized into the following three major areas: (1) circuit structure (2) algorithm and architecture, and (3) optimization, simulation and testing methods.

Among the contributions towards new circuit structures, we have proposed a pass-transistor logic gate that efficiently generates the XOR and XNOR functions simultaneously, and another gate, in a novel complementary CMOS style with regular layout, that generates a carry output with good drivability. Using these two basic gates, we designed two novel 1-bit adder cells, Hybrid-1 and Hybrid-2. The simulation results demonstrate that, with the improved Module 1 (i.e. the proposed XOR-XNOR gate), the Hybrid-2 cell is the most power-efficient among a number of state-of-the-art 1-bit adder cells over a wide range of supply voltages. The energy efficiency of the Hybrid-2 cell is most pronounced at sub-1V operation. The layout of the Hybrid-2 cell shows that it is also area efficient. As the main building blocks of multipliers, the structures of 4-2 and 5-2 compressors are analyzed and different CMOS logic style circuit implementations of their constituent modules are explored. A novel 5-2 compressor architecture of 4Δ delay is also proposed. The simulation results show that the 4-2 and 5-2 compressors constructed with the novel XOR* cell are able to function down to 0.6V and feature high speed and low power characteristics. Our proposed 5-2 compressor architecture outperforms all the other reported architectures over the range of voltages simulated, particularly when it is configured with the proposed circuits for the XOR* and the carry generator modules. Better performance against other architectures is also attained almost irrespective of the logic styles used for the circuit implementation of their constituent modules.

Among the contributions towards new algorithms and architectures, we have formulated a new algorithm for the design of a VLSI circuit for scalar product evaluation. The algorithm leads to a novel full bit-parallel architecture of scalar product macrocell featuring low interconnect complexity, improved power efficiency and highly efficient VLSI area utilization. More importantly, the layout regularity and scalability enhance its performance superiority in the deep submicron regime well above the conventional VLSI design of a vector processing unit for scalar product computation. In average power consumption, worst case delay and power efficiency, our post-layout circuit surpasses even the pre-layout circuit of the conventional architecture when these circuits are simulated using Synopsys Nanosim over supply voltages from 0.7V to 3.3V based on Chartered CSM 0.18μm CMOS technology. The design consists of 226560 transistors with the core occupying an area of 1430μm x 1560μm. The full chip with peripheral circuits occupies an area of 2600μm x 2900μm. The post-layout simulation results of the IP core at 1.8V show a worst case delay of 6.92ns and a power consumption of 65mW at 50 MHz data rate.

By exploiting the redundant binary (RB) system, another new redundant binary multiplier is proposed based on the newly developed Covalent Redundant Binary Booth Encoding (CRBBE) algorithm. This algorithm fully exploits the characteristics of the Booth encoded numbers to overcome the problem of generating hard multiples and achieves a compatible reduction of RB partial products without inducing any correction vector. Compared with the Redundant Binary Signed Digit (RBSD) multiplier, which is also used to address the hard multiple problem, the proposed CRBBE multiplier has only half the number of partial products to be summed if the same radix Booth encoders are used. Conversely, for the same number of RB partial products, the CRBBE encoder, say Booth-2 CRBBE, and its corresponding RB partial product generator (PPG) are much simpler than that of the RBSD multiplier, say Booth-4 RBSD encoder and RB PPG, which uses many high fan-in gates. The simulation results show that the CRBBE multiplier consumes lower power and computes faster than the normal binary Booth encoder, RB partial product generator (NBE-RBPPG) and RBSD multipliers.

Among the contributions towards optimization, simulation and testing methods, we have proposed a systematic optimization procedure to size the transistors of all arithmetic cells in order to perform a fair comparison, because the transistor sizing for optimal performance is technology dependent. The optimization procedure employs a greedy sweeping strategy to search for the optimized transistor sizes in iterations, which has been proven to be a convergent algorithm.

During the simulation of the full adder, a reasonably simple architecture is proposed to simulate the adder cell in an environment realistic to its actual deployment in the most frequently used parallel multiplier structure. This overcomes the bias of overestimating the low voltage performance of full adders, which are often evaluated in isolation without regard to how they are deployed in an actual circuit. Similarly, in order to realistically assess and compare the figures of merit of different configurations of 4-2 and 5-2 compressors at various supply voltages, new simulation environments for the compressors are established to ensure that the measured performances are still sustainable when these cells are integrated in an accumulation tree.

To test the scalar product macrocell, some auxiliary circuits including a unique delay detection circuit are proposed to enable accurate delay measurement of the core with the constraint of limited IO pins. Otherwise, if traditional testing method is employed, many more additional IO pins are required, which is impractical in our case.

The above contributions have led to the publications listed in the author's publications towards the end of the thesis.

1.4 Organization

Seven chapters are presented in the thesis. Chapter 1 sets out the motivation and objectives of our research. The results presented in this thesis are summarized into three categories of significant contributions.

Chapter 2 studies the causes of power consumption in CMOS integrated circuits and the related formulae, followed by a discussion on various methods to achieve low power consumption while sustaining the performance. The characteristics of different logic styles are also presented. This is followed by a detailed discussion on the features and the shortcomings of the various circuit implementations of XOR and XNOR gates, which are the primitive elements used for the construction of full adders and compressors. A study of the different full adder circuits and a brief introduction to the use of high input compressors are provided, which serve as a preface to the more in-depth discussion of Chapters 3 and 4. The last part of this chapter outlines the history and evolution of normal binary and redundant binary digital multipliers, which are fundamental to the research topics presented in Chapters 5 and 6.

Chapter 3 focuses on the design of full adder cells. It explores the existing full adder designs in different logic styles. Different circuit implementations of the three constituent modules of full adder are analyzed, before the proposed hybrid full adder cells are described. To optimize and compare the performance of the different full adders, a tree structured setup is proposed as the simulation environment. Also, a new transistor optimization procedure is described and the optimized sizing parameters at two different supply voltages for various adder cells being evaluated are provided. The circuits are simulated for power, delay and power-delay product performances and the results are analyzed and compared.

Chapter 4 investigates the performances of several fast 4-2 and 5-2 compressors. The 4-2 compressors constructed around the proposed XOR-XNOR cell exhibit superior power efficiency when they are compared with other configurations of the same architecture. Next, the 5-2 compressor architectures and their underlying building modules are scrutinized. A new fast 5-2 compressor architecture is proposed and its various configurations from different logic styles are investigated and compared against several well known 5-2 compressor architectures. Special setups for the simulation of 4-2 and 5-2 compressors are also described. These setups emulate a realistic application environment that truly reflects the operability of the compressors in a tree-structured multiplier. Finally, the simulation results for various configurations of the proposed architectures and their contenders are presented and analyzed.

Chapter 5 proposes the new algorithm for scalar product computation. Based on the algorithm, the architecture, the floor planning and the delay estimation of scalar product IP core are detailed. The same model of estimation is also applied to the conventional architecture for comparison. The layout in compliance with the CSM 0.18μm CMOS process technology of the proposed scalar product IP core equivalent to 16 full-width multiply-accumulation operations of 16-bit operands and the design of its auxiliary circuit are presented. A comparison of the pre- and post-layout simulation results is provided at the end of the chapter.

Chapter 6 presents our research on the novel redundant binary multiplier. It introduces the notion of RB systems before the structure of the RB multiplier is described in terms of its constituent modules. The existing algorithms and implementations of Booth encoders and partial product generators are explored. An ingenious covalent redundant binary Booth encoding algorithm is proposed. The circuit implementations for different radix encoders and accompanying partial product generators are also described. A 54×54-bit multiplier is constructed based on the proposed covalent redundant binary Booth encoder and RB partial product generator, together with the existing redundant binary adder (RBA) circuit for the RBA summing tree and RB-NB converter. The structure of the RB multiplier using the proposed algorithm is compared with the classic structures of other RB multipliers. Finally, the simulation results are analyzed and discussed.

Finally, Chapter 7 highlights the virtues of the design methodologies used in the novel low power arithmetic cells and macrocells to conclude the thesis. The potential relevant topics for further research, design challenges and the role of the techniques discussed in the dissertation on advanced technologies are outlined.
Chapter 2

Literature Review

Since our project aims at the design of low power, high performance arithmetic circuits and their applications, a literature review on low power techniques and the fundamentals of computer arithmetic circuits is essential. This chapter is organized as follows: Section 2.1 studies the causes of power consumption in CMOS integrated circuits and the related formulae, followed by a discussion on different methods to achieve low power consumption while sustaining the performance. The characteristics of different logic styles are also presented. Section 2.2 gives a detailed discussion of the features and the shortcomings of the various circuit implementations of XOR and XNOR gates, which are the primitive elements of the full adder. A study of the different full adder circuits and a brief introduction to the use of high input compressors are provided in Section 2.3, which will serve as a preface to the more in-depth discussions of Chapters 3 and 4. Section 2.4 narrates the history and evolution of normal binary and redundant binary digital multipliers, which are fundamental to the research presented in Chapters 5 and 6.

2.1 Low power techniques for digital circuit design

Over the last decade, the semiconductor industry has witnessed an explosive growth in the integration of sophisticated personalized devices and multimedia-based applications into mobile electronic gadgetry, such as notebook computers, cellular phones, digital cameras, digital video recorders, MP3 music players and personal digital assistants (PDAs). The improved functionality and increased versatility of these portable applications set a stringent requirement on the throughput and a tight constraint on the amount of power that can be dissipated due to the limited battery lifetime. Although new types of high energy capacity rechargeable batteries are continually being developed, the advancement of battery technology is far slower than the escalating density of integrated circuits, where millions and millions of transistors are integrated into one chip, driving the semiconductor industry from the VLSI era to the ULSI era. As CMOS process technology shrinks, the unity gain cut-off frequency, $f_T$, of the transistors becomes comparable with that of the GaAs bipolar technology, so that it is now practical to design sub-1V radio frequency integrated circuits (RFICs) based solely on the matured low cost, low power CMOS process [TSA02]. The impact of this is that the front end wireless communication circuitries, traditionally based on analog circuit techniques, are now being transferred into the digital domain, which increases by leaps and bounds the rate of data to be processed by the arithmetic circuits in the digital signal processor.

Low power techniques are also important for high-performance processing systems, which may not be portable or battery powered, such as workstations, desktop personal computers, mainframe computers and other applications. Since they rely on ultra high clock speed and extreme chip integration density to deliver the level of performance needed to remain competitive, the amount of heat generated by the immense computing power is becoming daunting. Lowering the power consumption can reduce the cost associated with packaging, cooling and fans, which are costly to build, operate and maintain for complex cooling requirements. Another incentive for low-power design is the issue of reliability. There is a well-defined relationship between the junction temperature, the performance and the reliability of an integrated circuit. For example, the delay increases by 14% when the junction temperature increases from 85°C to 125°C [GEO01]. The reliability of an integrated circuit degrades as an exponential function of junction temperature. Silicon interconnect fatigue, package related failure, electrical parameter shift, electromigration and junction fatigue are among the failure mechanisms that can be induced by high on-chip temperature and are difficult to manage [BEL95]. Therefore, reducing the power consumption plays a key role in enhancing integrated circuit reliability.
2.1.1 Overview of power consumption of CMOS circuit

The average power dissipation in digital CMOS circuits is composed of three main sources, which are the dynamic power or switching power consumption, the short-circuit power consumption, and the leakage power consumption, shown in (2.1). In some logic styles, there are continuous current paths between the power supply and the ground, such as bias circuits to provide the necessary operating point to other transistors. In this case, a fourth source, namely, static power is consumed. Since the circuits presented in this thesis use only those logic styles that have no static power dissipation, we will consider only the former three sources of power consumption [BEL95, ELR97].

\[ P_{\text{avg}} = P_{\text{dynamic}} + P_{\text{short-circuit}} + P_{\text{leakage}} \tag{2.1} \]

Dynamic or switching power consumption represents the power dissipated during a switching event. In CMOS logic circuits, the dynamic power consumption arises when the capacitive loads \( C_{\text{load1}} \) and \( C_{\text{load2}} \), which are connected to the voltage source and the ground respectively, are charged and discharged through the PMOS and NMOS transistors to complete an output voltage transition from logic "0" to "1" or from "1" to "0", where logic "1" and "0" are usually the supply "vdd" and the system ground "gnd". The load \( C_{\text{load1}} \) exists between the output and the PMOS substrate, and comprises components such as the capacitance between the PMOS drain area and substrate of the driving stage, the capacitance between the PMOS gate and substrate of the loading stage, and the capacitance between the output interconnecting lines and the power supply lines. A similar situation exists for the load \( C_{\text{load2}} \). Fig. 2.1 shows the capacitive load model of an inverter for the calculation of dynamic power consumption.

![Capacitive load model](image)
For $C_{\text{load}2}$, when the output goes from logic “0” to “1”, the power supply charges it through the PMOS transistor $P_1$. Part of the energy is dissipated in the PMOS transistor and part of it is stored in $C_{\text{load}2}$. At the same time, the already charged $C_{\text{load}1}$ is discharged through $P_1$, releasing the energy previously stored in it. When the output returns from logic “1” to “0”, the charged $C_{\text{load}2}$ discharges through the NMOS transistor $N_1$, dissipating its stored energy, and $C_{\text{load}1}$ is charged again through $N_1$. Similarly, a part of the energy is stored in $C_{\text{load}1}$, waiting to be released in the next “0” to “1” transition, and the other part is dissipated in $N_1$ during the charging of $C_{\text{load}1}$. Therefore, in each complete cycle of transitions from “0” to “1” to “0”, the dissipated energy is:

$$C_{\text{load}1}V_{\text{dd}}V + C_{\text{load}2}V_{\text{dd}}V = C_{\text{load}}V_{\text{dd}}V$$

where $C_{\text{load}} = C_{\text{load}1} + C_{\text{load}2}$, and $V$ is the voltage swing of the transition, which in the general case of a partial transition may be smaller than $V_{\text{dd}}$.

In general, the number of transitions per unit time corresponds to the system clock frequency $f_{\text{clk}}$, but a node does not make a transition in every clock cycle. We therefore define a transition probability, $\alpha$, as the probability that a node makes a transition from “0” to “1” and then back to “0” in a given clock cycle. With $V_i$ denoting the voltage swing at node $i$, the dynamic power $P_{\text{dynamic},i}$ for the node $i$ is given by

$$P_{\text{dynamic},i} = \alpha_i C_{\text{load},i}V_{\text{dd}}V_i f_{\text{clk}}$$

For a complete circuit, the dynamic power $P_{\text{dynamic}}$ is:

$$P_{\text{dynamic}} = \left( \sum_{i=1}^{\text{# of nodes}} \alpha_i C_{\text{load},i}V_{\text{dd}}V_i \right) f_{\text{clk}}$$

For nodes with full swing, $V_i = V_{\text{dd}}$ and each term is proportional to $V_{\text{dd}}^2$.
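The node-by-node summation can be illustrated with a minimal Python sketch. It assumes the per-cycle energy $C_{\text{load}}V_{\text{dd}}V$ derived above, with the swing $V$ defaulting to $V_{\text{dd}}$ (full swing). The function name, the node tuples and all numerical values are hypothetical and chosen only for illustration; they are not taken from the circuits in this thesis.

```python
def dynamic_power(nodes, vdd, f_clk):
    """Sum alpha * C_load * Vdd * V over all nodes, then multiply by f_clk.

    Each node is a tuple (alpha, c_load, v_swing). A v_swing of None means
    the node swings rail to rail, i.e. V = Vdd (an assumption of this sketch).
    """
    energy_per_cycle = 0.0
    for alpha, c_load, v_swing in nodes:
        swing = vdd if v_swing is None else v_swing
        energy_per_cycle += alpha * c_load * vdd * swing
    return energy_per_cycle * f_clk


# Hypothetical nodes: (transition probability, load in farads, swing in volts)
example_nodes = [
    (0.5, 10e-15, None),   # 10 fF node switching every other cycle
    (0.25, 20e-15, None),  # 20 fF node switching one cycle in four
]

p_at_1v8 = dynamic_power(example_nodes, vdd=1.8, f_clk=50e6)
p_at_0v9 = dynamic_power(example_nodes, vdd=0.9, f_clk=50e6)
print(f"P(1.8 V) = {p_at_1v8:.3e} W, P(0.9 V) = {p_at_0v9:.3e} W")
```

With full-swing nodes, halving the supply voltage in this sketch reduces the dynamic power by a factor of four, which is the quadratic dependence exploited by low voltage design.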

The short-circuit power consumption is caused when both the NMOS and the PMOS transistors in the circuit conduct simultaneously for a short amount of time during switching, forming a direct current path between the power supply and the ground. The short-circuit power is inevitable because the real input voltage waveforms always have finite rise and fall time. The short-circuit current, which passes through both the NMOS and PMOS devices during switching, does not contribute to the charging or discharging of the capacitances in the output node. The short-circuit power will be prevalent if the output load capacitance is small, and/or if the input signal rise and fall time is large.

Consider a symmetrical CMOS inverter with $k = k_n = k_p$ and $V_T = V_{T,n} = |V_{T,p}|$ for both transistors, and with a very small capacitive load. If the inverter is driven with an input voltage waveform with equal rise and fall time ($\tau = t_{\text{rise}} = t_{\text{fall}}$), we have the following time-averaged short-circuit current provided by the power supply:

$$I_{\text{short-circuit}} = \frac{1}{12} \frac{k \tau f_{\text{clk}} (V_{dd} - 2V_T)^3}{V_{dd}} \tag{2.3a}$$

Therefore, the short-circuit power dissipation is:

$$P_{\text{short-circuit}} = \frac{1}{12} k \tau f_{\text{clk}} (V_{dd} - 2V_T)^3 \tag{2.3b}$$

It is noted from (2.3b), which is for small capacitive load model, that the short-circuit power dissipation is linearly proportional to the input signal rise and fall time, as well as to the transconductance of the transistors. Hence, to reduce the short-circuit current, an effective way is to shorten the input transition time.
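The linear dependence on the input transition time can be checked numerically with the following Python sketch of the short-circuit power expression for the small-load model. The function name and the parameter values are hypothetical. Returning zero when $V_{dd} \le 2V_T$ is an assumption of this sketch, reflecting that the two transistors then never conduct simultaneously.

```python
def short_circuit_power(k, tau, f_clk, vdd, v_t):
    """Short-circuit power of a symmetrical inverter with a very small load.

    Implements (1/12) * k * tau * f_clk * (Vdd - 2*V_T)**3.
    For Vdd <= 2*V_T the sketch returns 0 (assumption: no simultaneous
    conduction of the NMOS and PMOS transistors).
    """
    if vdd <= 2.0 * v_t:
        return 0.0
    return k * tau * f_clk * (vdd - 2.0 * v_t) ** 3 / 12.0


# Hypothetical device and signal parameters, for illustration only.
common = dict(k=100e-6, f_clk=50e6, vdd=1.8, v_t=0.5)
p_slow_edge = short_circuit_power(tau=0.4e-9, **common)
p_fast_edge = short_circuit_power(tau=0.2e-9, **common)
print(f"slow edge: {p_slow_edge:.3e} W, fast edge: {p_fast_edge:.3e} W")
```

Halving the rise and fall time halves the short-circuit power in this model, and the cubic term shows why the component shrinks rapidly as the supply voltage approaches $2V_T$.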

For an inverter with a large capacitive load, the output node largely retains its voltage until the input node finishes its transition. Although the transistor that is then turned on has a large voltage difference between its drain and source, the power it dissipates is attributed to the dynamic component because the other transistor has already been turned off.

There are two types of leakage currents: reverse-bias diode leakage on the transistor drains and sub-threshold leakage through the channel of an off device. For the first type, the diode leakage occurs when one transistor is turned off but another transistor pulls up or down the drain of the former transistor, making it reverse biased with respect to the bulk potential. For example, when an inverter has a logic "1" input, the NMOS transistor is turned on, pulling the drains of both PMOS and NMOS to low voltage. At the same time, the PMOS transistor is turned off with its drain-to-bulk voltage equal to \(-V_{dd}\). The resulting current is

\[
I_{\text{reverse}} = A_d J_S \left( e^{\frac{qV_{bias}}{kT}} - 1 \right) \tag{2.4}
\]

where \(V_{bias}\) is the reverse bias voltage magnitude on the PN junction, \(A_d\) is the area of the drain diffusion, and \(J_S\) is the reverse saturation current density, which is dependent on the technology and weakly dependent on the supply voltage. Usually the reverse-bias diode leakage accounts for only a small fraction of the total power consumption in most chips. However, it may be significant for a chip that spends most of its time in standby operation, since this power is always being dissipated even when there is no switching activity.

The second type of leakage is the subthreshold leakage, which occurs due to carrier diffusion between the source and the drain when the gate-source voltage, \(V_{gs}\), has exceeded the weak inversion point but is still below the threshold voltage \(V_t\), above which carrier drift is dominant. The current in the subthreshold region is given by:

\[
I_{ds(\text{subthreshold})} = Ke^{\frac{V_{gs} - V_t}{nV_T}} \left( 1 - e^{-\frac{V_{ds}}{V_T}} \right) \tag{2.5}
\]

where \(K\) is a function of the technology, \(V_T\) is the thermal voltage \((kT/q)\), \(V_t\) is the threshold voltage and \(n\) is the subthreshold slope factor. For \(V_{ds} \gg V_T\), \((1 - e^{-\frac{V_{ds}}{V_T}}) \approx 1\); that is, the drain to source leakage current is independent of the drain-source voltage \(V_{ds}\) once \(V_{ds}\) is larger than about 0.1V.
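The 0.1V figure can be checked with a short Python sketch that evaluates the thermal voltage $kT/q$ and the drain-voltage factor $1 - e^{-V_{ds}/(kT/q)}$. The temperature of 300K and the function names are assumptions made for illustration only.

```python
import math

BOLTZMANN = 1.380649e-23            # Boltzmann constant in J/K
ELECTRON_CHARGE = 1.602176634e-19   # elementary charge in C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage kT/q in volts (300 K is an assumed default)."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def vds_factor(v_ds, temp_kelvin=300.0):
    """Drain-voltage factor 1 - exp(-Vds / (kT/q)) of the subthreshold current."""
    return 1.0 - math.exp(-v_ds / thermal_voltage(temp_kelvin))


for v_ds in (0.05, 0.1, 0.2, 0.3):
    print(f"Vds = {v_ds:.2f} V -> factor = {vds_factor(v_ds):.4f}")
```

At room temperature the thermal voltage is roughly 26mV, so the factor is already within a few percent of unity at 0.1V and is practically equal to one at a few hundred millivolts.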

### 2.1.2 Low power design techniques

Several methods can be applied to optimize the power dissipation of digital systems. These methods can be divided into five different levels, which are device or process level, circuit or logic level, architecture level, algorithm level, and system level. Device characteristics, device geometries and interconnect properties are the significant factors in lowering the power consumption. Circuit-level measures such as the proper choice of circuit design styles, reduction of the voltage swing and clocking strategies can be used to reduce power dissipation at the transistor level. Architecture-level measures include smart power management of various system blocks, utilization of pipelining and parallelism, and design of bus structures. Fig. 2.2 shows the hierarchical levels of optimization techniques.

![Power optimization hierarchy](image)

Figure 2.2 Power optimization hierarchy

The switching power dissipation in CMOS digital integrated circuits is a strong function of the power supply voltage, approximately quadratic in \( V_{dd} \). Therefore, reducing \( V_{dd} \) is a direct and effective way to reduce the power consumption. However, the savings in power dissipation come with a significant increase in circuit delay. For a CMOS inverter, the propagation delay is given by:

\[
\tau = \frac{C_{load}}{k_n(V_{dd} - V_{T,n})} \left[ \frac{2V_{T,n}}{V_{dd} - V_{T,n}} + \ln \left( \frac{4(V_{dd} - V_{T,n})}{V_{dd}} - 1 \right) \right] \tag{2.6}
\]

The delay \( \tau \) increases when the power supply voltage \( V_{dd} \) is decreased, provided all other parameters remain constant. To compensate for the negative effect of lowering the supply voltage on circuit performance, a feasible measure at the device level is to scale down the threshold voltage of the transistor accordingly. However, at some point, the benefit of threshold voltage and supply voltage reduction is offset by an increase in the leakage currents, resulting in the existence of an optimal threshold voltage for a given level of logic complexity. In so doing, the effect of the reduced noise margin must also be taken into consideration.
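The trade-off between supply voltage and delay expressed by (2.6) can be explored with the following Python sketch. The function name and the parameter values (load capacitance, transconductance parameter and threshold voltage) are hypothetical. The validity check on the logarithm argument is an assumption of this sketch: the expression is only meaningful when $V_{dd}$ exceeds $\frac{4}{3}V_{T,n}$.

```python
import math


def inverter_delay(c_load, k_n, vdd, v_tn):
    """Propagation delay of a CMOS inverter according to (2.6)."""
    overdrive = vdd - v_tn
    if overdrive <= 0.0:
        raise ValueError("Vdd must exceed the threshold voltage")
    log_arg = 4.0 * overdrive / vdd - 1.0
    if log_arg <= 0.0:
        raise ValueError("expression (2.6) is not valid this close to V_Tn")
    bracket = 2.0 * v_tn / overdrive + math.log(log_arg)
    return c_load / (k_n * overdrive) * bracket


# Hypothetical parameters: 50 fF load, k_n = 200 uA/V^2, V_Tn = 0.5 V.
params = dict(c_load=50e-15, k_n=200e-6, v_tn=0.5)
for vdd in (3.3, 1.8, 1.2, 1.0):
    delay = inverter_delay(vdd=vdd, **params)
    print(f"Vdd = {vdd:.1f} V -> delay = {delay * 1e12:.0f} ps")

d_1v8 = inverter_delay(vdd=1.8, **params)
d_1v0 = inverter_delay(vdd=1.0, **params)
```

With these assumed values the delay grows several times when the supply drops from 1.8V to 1.0V, and lowering `v_tn` in the call recovers part of the loss, which is the motivation for scaling the threshold voltage together with the supply.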

The circuit level optimization techniques involve optimizing place and route, transistor sizing, reduced swing logic, logic minimization, logic level power down, etc. As semiconductor technology advances, the effect of the parasitic capacitance of the interconnects on the latency, power consumption and VLSI area becomes non-trivial in submicron and deep submicron technologies [KAT00, SYL98]. As IC designs move into the system-on-chip era, the continuous quest for more functionality and the increasing emphasis on dedicated computation-intensive intellectual property cores have resulted in aggressive downscaling of the transistors to provide higher levels of integration. This has the positive effect of reducing the internal gate delay, but at the same time, the parasitic capacitance of the interconnect wires increases commensurately. It is observed that, for a 0.35μm technology, the gate delay is about 100ps, the local interconnection delay is about 150ps, and the global interconnection delay is about 1000ps. If the technology advances to 0.07μm, the gate delay is decreased to 10ps and the local interconnection delay is decreased to 50ps, whereas the global interconnection delay is increased to 6000ps. The share of the coupling capacitance between adjacent lines separated by minimum spacing increases from 40% to 70% [KAT00]. Sylvester and Keutzer [SYL98] reported that, for a 1mm metal-1 line, the RC delay in a 0.5μm technology was 15ps as compared to 340ps in a 0.1μm technology. In high performance, power-hungry VLSI and ULSI digital circuits, reducing the parasitic capacitance has emerged as a key design premise for the reduction of dynamic power dissipation, which accounts for more than 90% of the total power in a CMOS device. The parasitic capacitance can be reduced by using fewer and smaller devices as well as fewer and shorter interconnects. Glitches and spurious transitions can also be minimized by equalizing the delay paths to the gate inputs [NAN99]. Therefore, at the layout level, the place and route should be optimized such that signals with high switching activity (such as clocks) are assigned short wires, while signals with lower switching activity are allowed to use relatively longer wires.
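To make the shift from gate-dominated to interconnect-dominated delay explicit, the following Python sketch expresses the interconnection delays quoted above from [KAT00] as multiples of the gate delay in the same technology. Only the delay figures come from the text; the dictionary layout and the function name are illustrative.

```python
# Delay figures in picoseconds, as quoted in the text from [KAT00].
DELAYS_PS = {
    "0.35um": {"gate": 100, "local": 150, "global": 1000},
    "0.07um": {"gate": 10, "local": 50, "global": 6000},
}


def interconnect_to_gate_ratio(technology):
    """Return local and global interconnect delay as multiples of gate delay."""
    d = DELAYS_PS[technology]
    return {"local": d["local"] / d["gate"], "global": d["global"] / d["gate"]}


for tech in DELAYS_PS:
    r = interconnect_to_gate_ratio(tech)
    print(f"{tech}: local = {r['local']:.1f}x gate, global = {r['global']:.0f}x gate")
```

The global interconnection delay grows from 10 times to 600 times the gate delay between the two technologies, which is why short wiring for high-activity signals matters increasingly in the deep submicron regime.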

Also at the circuit level, optimized transistor sizing plays a vital role in reducing power consumption. To achieve low power, it is important to equalize all delay paths so that a single critical path does not unnecessarily limit the performance of the entire circuit. As the supply voltage varies, the optimal sizing of the circuits for low power operation is quite different from that for high speed applications.

The logic style helps to optimize the circuit in low power operations [BLA94, CHA92]. Many CMOS structures have been used to implement logic functions. Dynamic logic circuits and static logic circuits are two main categories of the CMOS logic circuits [VAI01, MUR96, QUI01, PIG95, ZIM97]. The static CMOS logic circuits are further divided into complementary CMOS and pass-transistor logic circuits.

Dynamic CMOS circuits are clocked and operated in two phases, a precharge phase and an evaluation phase. Although they are capable of fast evaluation with a reduced number of transistors compared to their static CMOS counterparts, their large clock loads and high switching activities due to the precharge mechanism make dynamic CMOS circuits consume more power [ZIM97]. When both the precharge PMOS and discharge NMOS logic circuits are in the off state, the dynamic circuit has floating nodes which are vulnerable to alpha particle attacks. The operation margin of a dynamic circuit narrows rapidly with reduction of supply voltage due to its noise susceptibility. Security devices are usually included to guard against noise corrupting the data on the floating nodes, but their inclusion causes a considerable delay time penalty [MUR96].

Complementary (conventional) CMOS circuits are made of a PMOS pull-up network and an NMOS pull-down network controlled by the same set of inputs feeding to the gates of the PMOS/NMOS pairs [VAI01, ZIM97]. The series transistors in the output stage form a weak output driver. This could be corrected by additional output buffers/inverters. The advantage of the Complementary CMOS is its robustness against voltage scaling and transistor sizing, which are essential to provide reliable operation at low voltage and arbitrary transistor sizes. Moreover, layout of the Complementary CMOS circuit is straightforward and area-efficient due to the complementary transistor pairs and smaller number of interconnecting wires.

The basic difference between the pass-transistor logic [ASS91] and the Complementary CMOS logic styles is that the source side of the pass logic transistor network is connected to some input signals instead of the power lines [VAI01, ZIM97]. The advantage is that one pass-transistor network (either PMOS or NMOS) is sufficient to implement the logic function, which results in smaller number of transistors and smaller input load. However, pass-transistor logic has an inherent threshold voltage drop problem. The output is a weak logic "1" when "1" is passed through a NMOS and is a weak logic "0" when "0" is passed through a PMOS. Unlike Complementary CMOS, pass-transistor logic is sensitive to voltage scaling and transistor sizing, thus limiting its robustness. In other words, efficiency and reliable operation of logic gates are not necessarily guaranteed at low voltage or reduced transistors' sizes [RAB96]. Thus, transistor sizing is crucial for correct gate operation and the optimization process is less intuitive. Usually, the layout of pass-transistor logic circuit is not as straightforward, as it requires more space to segregate the different diffusion areas, which increases the routing complexity of the interconnecting lines. Consequently, it has poor area efficiency due to irregular transistor arrangements and complex wirings.
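The threshold voltage drop and its sensitivity to voltage scaling can be illustrated with a first-order Python sketch. It assumes that an NMOS pass transistor with its gate at $V_{dd}$ passes a logic "1" only up to $V_{dd} - V_{T,n}$ and that a PMOS pass transistor with its gate at ground passes a logic "0" only down to $|V_{T,p}|$; body effect is ignored. The function names and the voltage values are hypothetical.

```python
def nmos_pass_high(vdd, v_tn):
    """First-order output level when an NMOS passes a logic '1' (no body effect)."""
    return max(vdd - v_tn, 0.0)


def pmos_pass_low(v_tp_abs):
    """First-order output level when a PMOS passes a logic '0' (no body effect)."""
    return abs(v_tp_abs)


# Hypothetical threshold of 0.5 V at two supply voltages.
for vdd in (1.8, 0.9):
    high = nmos_pass_high(vdd, v_tn=0.5)
    print(f"Vdd = {vdd:.1f} V: weak '1' = {high:.2f} V ({100 * high / vdd:.0f}% of Vdd)")
```

Under these assumptions the weak "1" falls from about 72% of the supply at 1.8V to about 44% at 0.9V, which illustrates why reliable operation of pass-transistor gates is not guaranteed at low voltage.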

The transmission gate logic circuit is a special kind of pass-transistor logic circuit [VAI01, WES93]. It is built by connecting a PMOS transistor and an NMOS transistor in parallel, which are controlled by complementary control signals. The PMOS and NMOS transistors provide the path for the input logic "1" and "0", respectively, when they are turned on simultaneously. Thus, there is no voltage drop problem whether a "1" or a "0" is passed through it. The main disadvantage of transmission gate logic is that it requires double the number of transistors, or more, compared with the standard pass-transistor logic to implement the same circuit. Chapters 3 and 4 of this thesis will further demonstrate how the circuits of basic gates and arithmetic cells are optimized at the circuit/logic level.

There is a great deal of freedom in optimizing architectures for low power, including shutting down the unused blocks, using parallel processing blocks and applying pipelining techniques. For example, by duplicating the operating unit, the speed requirement of each unit is decreased by a factor of 2, which means the power supply voltage can be lowered while maintaining the required throughput. The total capacitance is increased by a factor of about 2. The reduction in power dissipation therefore comes mainly from the lowering of the supply voltage. A similar situation applies to pipelined circuits. The two obvious costs of this approach are the greater area and the higher latency.
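The duplication argument can be made concrete with a small Python sketch based on the proportionality of dynamic power to $C V_{dd}^2 f$. The capacitance factor of 2 and the halved operating frequency follow the text; the scaled supply ratio of 0.6 is a hypothetical value chosen only for illustration, since the achievable voltage depends on the delay characteristics of the actual circuit.

```python
def relative_power(cap_factor, vdd_ratio, freq_factor):
    """Dynamic power relative to the reference design, using P ~ C * Vdd^2 * f."""
    return cap_factor * vdd_ratio ** 2 * freq_factor


# Duplicated unit: about twice the capacitance, each unit clocked at half rate.
same_voltage = relative_power(cap_factor=2.0, vdd_ratio=1.0, freq_factor=0.5)
# Hypothetical supply scaling to 60% of the original voltage.
scaled_voltage = relative_power(cap_factor=2.0, vdd_ratio=0.6, freq_factor=0.5)

print(f"parallel, same Vdd:   {same_voltage:.2f} x reference power")
print(f"parallel, scaled Vdd: {scaled_voltage:.2f} x reference power")
```

Without voltage scaling the duplicated design consumes about the same power as the reference, so the saving in this sketch comes entirely from the squared supply ratio.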

In the algorithmic approach, power consumption is effectively minimized by decreasing the number of operations to decrease the hardware usage and the required operational speed. A good algorithm also makes the hardware design straightforward and regular.

In the system level optimization, the clocks used system wide can be set to a low frequency, while the on-chip phase locked loop provides the circuit with the required high frequency clock. High level integration of off-chip peripheral integrated circuits into one single chip is also a viable alternative, such as the System-on-Chip (SOC).

Chapters 5 and 6 will demonstrate, by means of practical arithmetic applications, the low power optimization approaches at architectural and algorithmic levels.

2.2 Design of XOR and XNOR cells

When the power supply voltage is lowered to decrease the power dissipation, various problems due to the low voltage arise, such as driving capability weakening, logic failure due to output voltage drop in some logic styles, noise margin reduction and so on. Therefore, the design of the basic gates needs to be carefully tailored to ensure that they will be functioning in low voltage low power applications.
The exclusive or (XOR) and exclusive nor (XNOR) gates are the most frequently used fundamental units in a variety of digital circuits, such as half adder, full adder, comparator, parity checker, filters, multipliers and so forth. The functions of XOR and XNOR gates are shown in (2.7) and (2.8) respectively:

$$A \oplus B = A\overline{B} + \overline{A}B \qquad (2.7)$$

$$A \odot B = AB + \overline{A}\,\overline{B} \qquad (2.8)$$

Many XOR and XNOR circuits have been proposed in the last decade [BUI00(1), BUI00(2), BUI02, CHE99, FAN96, LEE97(2), RAD01, VES99(1), VES99(2), WAN94, YU00]. Fig. 2.3 shows the classical CMOS design of the XOR circuit, which implements the Boolean function (2.9). This design uses many transistors, occupies comparatively more silicon area and has a longer delay.

$$A \oplus B = \overline{AB}\,(A + B) = (\overline{A} + \overline{B})(A + B) = A\overline{B} + \overline{A}B \qquad (2.9)$$
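
The equivalence of the forms in (2.7) to (2.9) can be checked exhaustively over the four input combinations. The snippet below is a minimal sketch with hypothetical function names; it models only the Boolean functions, not the transistor circuits.

```python
from itertools import product


def xor_sop(a, b):
    """Sum-of-products form of (2.7): A.B' + A'.B"""
    return (a & (1 - b)) | ((1 - a) & b)


def xnor_sop(a, b):
    """Sum-of-products form of (2.8): A.B + A'.B'"""
    return (a & b) | ((1 - a) & (1 - b))


def xor_classical(a, b):
    """Factored form of (2.9): (A.B)' . (A + B)"""
    return (1 - (a & b)) & (a | b)


for a, b in product((0, 1), repeat=2):
    # All XOR forms must agree, and XNOR must be the complement of XOR.
    assert xor_sop(a, b) == xor_classical(a, b) == a ^ b
    assert xnor_sop(a, b) == 1 - (a ^ b)
    print(a, b, xor_sop(a, b), xnor_sop(a, b))
```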

Figure 2.3 Classical CMOS XOR-XNOR gate

The mirror CMOS style circuits are shown in Fig. 2.4. They are simpler than the classical CMOS version, but they need complementary signals for all their inputs, which implies that two additional inverters are necessary to generate them.

Wang et al. [WAN94] proposed what are conceivably the most frequently used XOR and XNOR gate designs, shown in Fig. 2.5. The goal was to use non-complementary inputs to generate the needed functions. These designs are so important and fundamental that many other XOR and XNOR gate designs are improved and optimized versions of them. The cross back PMOS transistors provide a strong "1" output when the inputs are "01" or "10". When the input is "11", the two series NMOS transistors pull the output down to a strong "0" while the cross back PMOS transistors are turned off. The only exception is the input "00", for which a weak "0" is passed to the output by the PMOS transistors. This threshold voltage drop becomes a problem when the circuit is used in low voltage applications. The XNOR circuit encounters the same problem with the input "11". This problem leaves room for further improvement by other researchers.

![Figure 2.4 Mirror CMOS XOR and XNOR gates](image)

![Figure 2.5 Wang's Cross-back XOR and XNOR gates](image)
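
The behaviour described above for the cross-back XOR gate can be summarized in a small level model. This is only a sketch: the supply voltage and PMOS threshold magnitude below are assumed illustrative values, the body effect is ignored, and the function name is hypothetical.

```python
VDD = 1.0  # ASSUMED supply voltage in volts, for illustration only
VTP = 0.4  # ASSUMED PMOS threshold magnitude in volts, for illustration only


def wang_xor_level(a, b):
    """Return (output voltage, strength) of the cross-back XOR for inputs a, b."""
    if a == 1 and b == 1:
        # Both series NMOS transistors conduct and pull the output to ground.
        return 0.0, "strong 0"
    if a != b:
        # The PMOS gated by the low input passes the high input unattenuated.
        return VDD, "strong 1"
    # a == b == 0: the PMOS transistors pass a "0", which stops |Vtp| above ground.
    return VTP, "weak 0"


for a in (0, 1):
    for b in (0, 1):
        print(a, b, wang_xor_level(a, b))
```

In this model the weak "0" occupies a fixed fraction of a volt, so it takes up a growing share of the logic swing as the supply voltage is reduced.
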
Fig. 2.6 shows one modified version of Wang’s cross back XOR and XNOR gates. One NMOS transistor and one PMOS transistor are added to the previous XOR and XNOR gates, respectively, to address the weak logic problems. When the input “00” is applied to the XOR gate of Fig. 2.6(a), the added NMOS transistor turns on to pass a strong logic “0” to the output. With these additions, both gates are able to work at very low voltage. The price of solving this problem is the introduction of one additional complementary input signal.

![Diagram of XOR and XNOR gates](image)

**Figure 2.6 Modified Cross-back XOR and XNOR gates**

Fig. 2.7 shows the CMOS+ XOR and XNOR gates, each of which is composed of one transmission gate and two pass transistors. In the XOR circuit design, when input B is “0”, the transmission gate is turned on with the help of the complementary input of B, and input A is passed on to the output to deliver the strong logic. When the input B is “1”, the transmission gate is turned off, and the two pass transistors function as an inverter. So the inverted signal of input A appears on the output. In any case, the output always has strong logic.
Fig. 2.8 shows a similar but modified design of CMOS+ circuits in Fig. 2.7. The two pass transistors are replaced by a gated inverter, whose control terminals are connected to the complementary inputs.

If the XOR and XNOR gates are implemented in CPL logic style, simpler circuits are obtained in Fig. 2.9, where only two transistors are used to realize the functions. Unfortunately, they suffer from the weak logic problems when the output is “1”. Nevertheless, they are fundamental to the construction of other enhanced circuits.
To overcome the non-full swing problem, a weak feedback latch, which consists of one inverter and one PMOS transistor for feedback shown in Fig. 2.10, is inserted between the output and the CPL circuit. With the latch, the output will always provide the strong signals and is suitable for cascading. However, the circuits cannot work at very low voltage because the latch may be driven by a weak logic signal at the beginning of some input transitions. When the supply voltage is sufficiently low, the weak signal does not have enough drivability to turn on the latch.

The DPL style XOR and XNOR gates, as shown in Fig. 2.11, eliminate the weak logic by adding a PMOS counterpart to each NMOS transistor. Although each input needs a pair of complementary signals, the preceding stage that drives the inputs of these circuits experiences a balanced load because each input signal is connected to one transistor’s gate and another transistor’s source.

The transmission gate XOR and XNOR, as shown in Fig. 2.12, are very similar to the DPL style, except that their input loads are not as balanced as those of the DPL circuits.

Fig. 2.13 shows yet another pass transistor style circuit. One inverter is used to generate the complementary signal of one input. The two pass transistors function either as another inverter to generate a strong output or as two separate pass transistors, one of which passes a weak signal. Therefore, these gates are also not suitable for low voltage applications.

Modified pass transistor XOR and XNOR gates, named powerless-XOR and groundless-XNOR, are shown in Fig. 2.14. The difference between the circuits of Fig. 2.14 and those of Fig. 2.13 is that the original power connected PMOS transistor in the XOR gate and the original ground connected NMOS transistor in the XNOR gate are now connected to one of the input signals. Therefore, some power is saved compared with the circuits of Fig. 2.13, but the weak logic problem still exists.

For brevity, we show only the functional part of each XOR and XNOR design in Figs. 2.4 to 2.14, without adding inverters or buffers to the outputs. However, many of the designs, such as the DPL style and the transmission gate style, suffer from inadequate driving capability even though they can output strong logic levels. To improve the drivability, it is common practice to add an inverter to the design, for example an XNOR gate followed by an inverter to realize an XOR gate with improved driving capability, or vice versa, as shown in Fig. 2.15. The circuits with weak logic outputs can also be enhanced to some extent by adding output inverters.

In some circuits, an independent XOR or XNOR gate with a single output is not enough, and complementary XOR and XNOR outputs are preferred. The methods in Fig. 2.15, which use inverters to generate complementary outputs, also meet this requirement, but the outputs are not balanced in terms of timing and/or driving capability. Therefore, several dedicated XOR-XNOR gates have been designed. One of them is the modified CPL style, which is shown in Fig. 2.16. The previously presented CPL XOR and XNOR gates are connected together by a pair of feedback PMOS transistors, which restore both outputs to strong signals. This is because, for any input combination, at least one output is strong and drives the other output to a strong level through the feedback PMOS transistors. The drawback is that complementary pairs of inputs are still needed.

![Diagram](image-url)
Another solution is to combine Wang's XOR and XNOR circuits and logically optimize the result to produce the circuit shown in Fig. 2.17. However, this design does not perform well at low voltage, especially during the transition from "01" to "00" or from "10" to "11". A detailed analysis is presented in Chapter 3.

![Figure 2.17 Feedback XOR-XNOR gate](image)

One solution to this problem is to use Cheng's XOR-XNOR gate [CHE99], in which two pass transistors and an inverter are added. When the input changes from "01" to "00" or from "10" to "11", one of the added pass transistors turns on and a strong logic level (i.e., a strong "0" through an NMOS or a strong "1" through a PMOS) is passed to drive the feedback transistors.

![Figure 2.18 Cheng's XOR-XNOR gate](image)
Since a full adder adds three bits, its sum output is the XOR of these input bits or, equivalently, their XNOR, as shown in (2.10):

\[ \text{sum} = a \oplus b \oplus c = a \odot b \odot c \qquad (2.10) \]
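
The equivalence in (2.10) holds because the two inversions introduced by cascading two XNOR operations cancel. A brief exhaustive check, written as a sketch with a hypothetical helper `xnor`, is:

```python
from itertools import product


def xnor(x, y):
    """Two-input exclusive-NOR of single bits."""
    return 1 - (x ^ y)


for a, b, c in product((0, 1), repeat=3):
    sum_xor = a ^ b ^ c
    sum_xnor = xnor(xnor(a, b), c)
    # Both forms equal the least significant bit of a + b + c.
    assert sum_xor == sum_xnor == (a + b + c) % 2
```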

Some papers proposed 3-input XOR gates for the sum generation. Several designs are indeed the cascade of the XOR or XNOR gates presented earlier. There are two dedicated designs of 3-input XOR gates. One is the mirror CMOS style, which is shown in Fig. 2.19. Similar to the XOR gate, every input requires a complementary signal pair. The design uses 20 transistors without counting the number of transistors for the inverters.

Figure 2.19 Mirror CMOS 3-input XOR gate

Another design is proposed by Fang [FAN96]. Two pairs of PMOS cross back transistors and two pairs of NMOS cross back transistors are used. However, the circuit has weak logic problem on both its internal nodes and the output nodes. It fails to work at low voltage, especially when the three inputs are “000” or “111”.

Due to the unsatisfactory performances of the existing circuits of 3-input XOR gate, they are not popular in arithmetic circuit design.

2.3 Design of full adders

One bit full adder is another fundamental arithmetic unit widely used for addition, multiplication, digital filtering operations in microprocessors and digital signal processors. From its function equations, full adder is composed of exclusive-or circuits and carry generation circuits. A variety of different designs of full adders have been reported [BUI00(2), BUI02, CHE99, FAN96, KO95, LU01, QUI01, RAD01, RAD99, SAY02, SHA97, SHA98, SHA99, SHA00, SHA02, VES99(1), WAN94, WEY02, YU00, ZHA03, ZHU92, ZIM97].

Fig. 2.21 shows the classical CMOS style full adder. It uses two parts of the complex gates to generate the carry and sum signals. Each part is the implementation of the basic Boolean equations of the full adder function, where the sum generation utilizes the logic of the carry generation circuit.
Fig. 2.22 is the mirror CMOS circuit, which is an optimized version of the classical CMOS circuit. The PMOS transistor block mirrors the NMOS transistor block. Both the classical and mirror CMOS full adder circuits use 28 transistors each, and all of their inputs are non-differential, which is preferable in many applications.

Fig. 2.23 shows a CMOS full adder with fast carry, where there is one XOR gate delay from the carry input to the sum output and two NAND gates delay from the carry input to the carry output.
A modified CPL style full adder is shown in Fig. 2.24, which uses two independent circuits with a total of 32 transistors to realize the generation of sum and carry. Since all input and output signals are differential, the modified CPL full adders can be cascaded, such as in the tree structure multiplier, without the need of any additional inverters to generate the complementary signals.

Fig. 2.25 shows the 1-bit full adder of latched CPL style using 22 transistors, which is a direct latched CPL translation of the modified CPL full adder. As opposed to the modified CPL style, the input of latched CPL full adder is differential while the output is non-differential. Besides, although the outputs are strengthened by the weak feedback latches, several internal nodes of the latched CPL style full adder still suffer from the weak logic problem.

![Latched CPL full adder diagram](image)

**Figure 2.25 Latched CPL full adder**

A large number of recently published full adders have the three-module structure shown in Fig. 2.26, which is analyzed by Shams and Bayoumi [SHA97]. Module 1 in the structure generates the internal complementary signals, which are the exclusive-or and exclusive-nor of two of the inputs, usually inputs \( a \) and \( b \). Based on these two internal signals, the designs of the sum generator in Module 2 and the carry generator in Module 3 are simplified and optimized. Therefore, many authors have proposed new circuits for each of these modules and combined them to form new full adders. Some typical cases are discussed in Chapter 3.
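
The three-module decomposition can be sketched at the logic level as follows. This is a behavioural illustration only: the function names are hypothetical, and the multiplexer form chosen here for the sum and carry modules is one common option, assumed for the example rather than taken from a specific design in the literature reviewed above.

```python
from itertools import product


def module1(a, b):
    """Module 1: co-generate the XOR and XNOR of two inputs."""
    h = a ^ b
    return h, 1 - h


def module2_sum(h, h_bar, c):
    """Module 2: sum = h XOR c, written as a selection controlled by h and h_bar."""
    return (h_bar & c) | (h & (1 - c))


def module3_carry(h, h_bar, a, c):
    """Module 3: carry = c when a != b, otherwise a (assumed multiplexer form)."""
    return (h & c) | (h_bar & a)


def full_adder(a, b, c):
    h, h_bar = module1(a, b)
    return module2_sum(h, h_bar, c), module3_carry(h, h_bar, a, c)


for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert a + b + c == s + 2 * carry
```
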
In multipliers, full adders are used as 3-2 counters to compress the partial products in the carry save adder tree. In order to achieve more regular structure and lower latency of the partial product accumulation stage in the multiplier, 4-2 and 5-2 compressors have been widely employed nowadays for high speed multipliers. Owing to its regular interconnection, 4-2 compressor is ideal for the construction of regularly structured Wallace tree with low complexity [RAD00, WAN95]. Several 4-2 compressor circuits have been proposed for low power applications [GU03(1), HSI98, MAR99, MEH91, PRA01, RAD00]. Some of them are able to operate at low supply voltages but require excessive number of transistors due to their complementary CMOS structures, others use smaller number of transistors but fail to function at ultra low voltage, or lack the driving capability to drive the next level of subcircuits. Higher input compressors have also been studied by researchers [GU03(2), KWO00, MEH91, PRA01] and fast 5-2 compressors have been increasingly employed in large word-size multipliers and high precision multiply-accumulators [HSI98, KWO00, RAD00]. Analogous to the analysis of full adders, the 4-2 and 5-2 compressors can also be decomposed into several modules [GU03(2), KWO00, PRA01]. The detailed structures and performance will be studied in Chapter 4.
2.4 Digital multiplier architectures

2.4.1 Normal binary multipliers

Multipliers are widely used in many areas. Digital filters and multimedia signal processing applications often involve a large number of multiplications. In different types of microprocessors, the multiplier is an indispensable core in the arithmetic logic unit (ALU) and floating point co-processors. Two basic operations are involved in a digital multiplier: the generation of partial products and their accumulation. There are two ways to speed up multiplication. One is to reduce the number of partial products; the other is to accelerate the accumulation. Obviously, a smaller number of partial products also reduces the complexity and computation time needed to accumulate them. Multiplication can be considered as a series of repeated additions. Therefore, multipliers are more expensive, operate more slowly and consume more power than adders or subtractors.

The multiplier architectures can be broadly classified into two categories: serial and parallel, as shown in Fig. 2.27. The chip area, computational speed and power dissipation are critical factors for multiplier design [ABU96]. As both operands are input to the multiplier serially, the circuitry of a serial multiplier [PAR00, PEK03, LU95, CHA97] is small. Hence, the chip area, the hardware cost as well as the power consumption can be minimized. However, serial multiplier is much slower than parallel multipliers. Pipelining is a way to increase the speed of serial multiplier [CHO02, FAN00].
In parallel multipliers [MEH91, WAL64, WAN95], both operands are input to the multiplier in a parallel manner. The circuitry uses larger area and is more complex than the serial multipliers. There are two types of parallel multipliers: array multiplier [PAR00, RAB01, BRA63, BAU73] and tree multiplier [DAD65, WAL64]. Array multipliers such as Braun multiplier [BRA63] and Baugh Wooley multiplier [BAU73] have regular layout, whereas tree multipliers such as Dadda [DAD65] and Wallace [WAL64] multiplier have higher speed. These multipliers are widely used in reduced instruction set computer (RISC) CPUs, DSPs and graphics accelerators.

Another hybrid type of multiplier is called serial-parallel multiplier [PAR00], which is used to trade between hardware simplification and computing speed. In this architecture,
one operand is entered serially and the other is stored in parallel with a fixed number of bits.

The Braun multiplier, invented by Edward Louis Braun in 1963 [BRA63], is a relatively simple form of parallel multiplier. It is a direct implementation of the paper-and-pencil method by which we would perform the multiplication by hand. The Braun multiplier is also commonly known as the carry save array (CSA) multiplier. This multiplier is well suited for multiplying two unsigned numbers. The iterative structure consists of an array of AND gates and adders without any sequential logic or registers. The regular layout makes it ideal for VLSI and ASIC realization.
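
The array structure described above can be sketched behaviourally as an AND array feeding rows of carry-save full adders, followed by a final ripple-carry row. The code below is a simplified model with hypothetical function names. It places one adder per bit weight in every row for clarity, so it does not reproduce the exact cell count or placement of the array in [BRA63].

```python
def full_adder(a, b, c):
    """Single-bit full adder returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def braun_multiply(x, y, n):
    """Multiply two unsigned n-bit integers using AND gates and adder rows."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    width = 2 * n
    sums = [0] * width     # saved sum bits, indexed by bit weight
    carries = [0] * width  # saved carry bits, indexed by bit weight
    for i in range(n):
        # Row i of the AND array: partial product bits at weights i .. i+n-1.
        pp = [0] * width
        for j in range(n):
            pp[i + j] = xb[j] & yb[i]
        new_sums = [0] * width
        new_carries = [0] * width
        for w in range(width):
            s, c = full_adder(sums[w], carries[w], pp[w])
            new_sums[w] = s
            if w + 1 < width:
                new_carries[w + 1] = c  # carry is saved, not propagated
        sums, carries = new_sums, new_carries
    # Final ripple-carry row merges the saved sum and carry vectors.
    result, carry = 0, 0
    for w in range(width):
        s, carry = full_adder(sums[w], carries[w], carry)
        result |= s << w
    return result


print(braun_multiply(13, 11, 4))
```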

The Baugh-Wooley multiplier was designed by Bruce A. Wooley and Charles R. Baugh in 1973 [BAU73]. This multiplier is actually an improved version of the Braun multiplier, as the hardware structures of both multipliers are very similar. However, the Baugh-Wooley multiplier is able to operate with both unsigned and signed numbers. It is conjectured that the invention of the Baugh-Wooley multiplier has contributed significantly to the advent of computer arithmetic, because it is the first fast multiplier capable of performing both unsigned and signed multiplications. Although the Baugh-Wooley multiplier is time consuming and less efficient when dealing with large operands, it is nonetheless a good candidate even today when the operands are less than 16 bits wide.

The first tree multiplier was introduced in 1964 by Wallace [WAL64]. He suggested the notion of a carry-save adder (CSA) tree as a way to efficiently and progressively reduce the multi-operand additions in the multiplication process to a final stage of two operand addition. The Wallace tree multiplier employs full and half adders to add up the partial products simultaneously in parallel. Later, Dadda [DAD65] suggested an optimal compression scheme using counters of different sizes (mainly 3-2 counters, which are full adders, and 2-2 counters, which are half adders) and showed that different schemes of cell allocation, including the one introduced by Wallace, require different numbers of cells (counters). A natural VLSI layout of either the Wallace or the Dadda architecture is to distribute the cells such that the lengths of the interconnections are as short as possible. However, as the sum and carry signals need to be communicated to non-adjacent cells and propagate downwards across non-adjacent stages, the wiring and layout are irregular and more complicated than those of the array multipliers, and the long wiring interconnections have the potential of introducing crosstalk and markedly reducing the performance under DSM process technology [LEE97(1), KAT00, SYL98, WAL00].
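
The column compression idea can be sketched as follows. This is a simplified Wallace-style reduction with hypothetical function names: every column is compressed greedily with 3-2 and 2-2 counters in each stage, and the final two-operand addition is modelled with integer arithmetic. It is not Dadda's counter allocation, and it does not model wiring or layout.

```python
def reduce_columns(columns):
    """One reduction stage: apply 3-2 and 2-2 counters to every column."""
    nxt = [[] for _ in range(len(columns) + 1)]
    for w, col in enumerate(columns):
        i = 0
        while len(col) - i >= 3:  # full adder (3-2 counter)
            a, b, c = col[i:i + 3]
            nxt[w].append(a ^ b ^ c)
            nxt[w + 1].append((a & b) | (b & c) | (a & c))
            i += 3
        if len(col) - i == 2:     # half adder (2-2 counter)
            a, b = col[i:i + 2]
            nxt[w].append(a ^ b)
            nxt[w + 1].append(a & b)
        elif len(col) - i == 1:   # a lone bit passes through unchanged
            nxt[w].append(col[i])
    return nxt


def wallace_multiply(x, y, n):
    """Return (product, number of reduction stages) for unsigned n-bit x, y."""
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((x >> j) & 1) & ((y >> i) & 1))
    stages = 0
    while max(len(col) for col in columns) > 2:
        columns = reduce_columns(columns)
        stages += 1
    # Final two-operand addition, modelled here with integer arithmetic.
    product = sum(bit << w for w, col in enumerate(columns) for bit in col)
    return product, stages


print(wallace_multiply(13, 11, 4))
```

Each counter sends its carry to the next higher column of the following stage, which is the source of the cross-stage, non-adjacent wiring discussed above.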

As early as 1951, A. D. Booth [BOO51] introduced the Booth multiplier. The Booth algorithm provides a simple way to generate the product of two signed binary numbers by means of Radix-2 arithmetic. The drawback of Booth's algorithm is that it becomes inefficient when there is a great number of isolated 1's in the operands. In 1961, MacSorley [MAC61] proposed the modified Booth algorithm, which is also known as Radix-4 Booth encoding. Since the modified Booth algorithm is capable of reducing the number of partial products by half, it enables efficient hardware implementation of digital multipliers with reduced logic depth and logic complexity. Since its introduction, the modified Booth algorithm has evolved to become a ubiquitous algorithm in prevailing high-speed multipliers, especially those that have to operate with large operands.
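
The halving of the partial product count by Radix-4 Booth encoding can be illustrated with a short behavioural sketch. The function names are hypothetical, the multiplier is assumed to be an n-bit two's complement number with n even, and the partial products are summed with ordinary integer arithmetic rather than an adder tree.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement multiplier into n/2 digits in {-2..2}."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    bits = [(y >> i) & 1 for i in range(n)]  # two's complement bits, LSB first
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, n, 2):
        # Overlapping triplet (y[i+1], y[i], y[i-1]) -> -2*y[i+1] + y[i] + y[i-1]
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def booth_multiply(x, y, n):
    """Sum one partial product (digit * x) per radix-4 digit of y."""
    digits = booth_radix4_digits(y, n)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))


print(booth_radix4_digits(7, 4))   # radix-4 digits of 7, least significant first
print(booth_multiply(-5, 7, 4))
```

Each digit selects one of the multiples 0, ±x or ±2x, all of which can be formed by shifting and negation.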

Most hardware architectures take advantage of the two basic operations in the multiplication algorithm, i.e., the generation of the partial products and the accumulation of these partial products. Thus, there are two lucid ways to speed up multiplication. One is to reduce the number of partial products [OKL96]; the other is to accelerate the accumulation process by minimizing its latency [PAR97, SRI92]. High radix Booth algorithm has the advantage of reducing the number of partial products to be added, while the Wallace/Dadda approach speeds up the addition of the partial products. When these two techniques are combined in a hybrid fashion, they can yield a multiplier that is much faster than the traditional Wallace/Dadda or Booth multiplier [MIL92, KHO99]. Today, this method is commonly used to realize high-speed multiplier because theoretically, it is the fastest solution. However, due to the complicated inter and intra stage interconnections, the VLSI layout of the adder tree has been irregular and obscured by the mingling cell connectivity, leading to highly inefficient utilization of silicon area.
Furthermore, as the size of the multiplier increases, the interconnection problem becomes exponentially complicated and difficult to optimize.

To avoid the cross-stage interconnection, one method is to use 4-2 compressors [GU03(1), MAR99, RAD00, WEI81] instead of the conventional full adders (also known as 3-2 counters) to reduce the partial products. This approach was first introduced by Weinberger [WEI81] in 1981 as a means to speed up the column compression of the dot matrix representation of the adder tree in parallel multipliers, since a 4-2 compressor can reduce four inputs of the same weight to two. It produces a much more regular structure than a design based on the 3-2 counter. This drastically simplifies the interconnection because the partial products are added up in the form of a binary tree without having the wires leapfrog across cells of non-adjacent stages. However, the avoidance of cross-stage interconnection by 4-2 compressors is achieved at the expense of long lateral communication wiring within each stage. Song and De Micheli developed the 6-2 and 9-2 high input compressor families, which are built on 4-2 compressors and 3-2 counters [SON91]. Although the speed of multiplication can be increased by employing higher order counters or compressors to reduce the number of adder tree stages, there is an optimal number of partial products (depending on the operand width and the radix of Booth encoding used) below which the complexity and latency of the high order counter and compressor cells offset the benefits they provide [BON96, GU03(2), PRA01, SON91]. It should be noted that the height and width of the silicon area, to a first order estimation, depend on the number of stages and the width of the stage with the largest number of cells. Because high input compressors are wider and larger than the simple 3-2 counter, the maximum length of wiring for multipliers built on high input compressors may exceed the width ratios for optimal area efficiency [WAN95].
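
The function of a 4-2 compressor can be sketched with the conventional construction from two cascaded full adders. This is a behavioural model with hypothetical function names, not one of the optimized compressor circuits cited above. The horizontal carry `cout` is independent of `cin`, which is why no carry ripples along a row of compressors.

```python
from itertools import product


def full_adder(a, b, c):
    """Single-bit full adder (3-2 counter) returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """4-2 compressor: four equal-weight inputs plus cin -> (sum, carry, cout)."""
    s1, cout = full_adder(x1, x2, x3)      # cout does not depend on cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
    # carry and cout both have twice the weight of the inputs.
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
```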

2.4.2 Redundant binary multipliers

As an alternative means of carry propagation free accumulation, the redundant binary (RB) representation, which is one of the signed digit representations, was first introduced by Avizienis [AVI61] in 1961. Takagi [TAK85] proposed to apply this new arithmetic to fast multiplication and Edamatsu [EDA88] implemented it in VLSI. The physical design of the RB partial product accumulation is more regular and simpler than that of the Wallace tree multiplier due to its good repeatability and the elimination of the cross-stage interconnections of differently weighted digits. In view of the non-trivial effect of the parasitic capacitance of the interconnects on the latency, power consumption, VLSI area and crosstalk problem in deep submicron devices [SYL98, KAT00], there has been a rekindling of interest in RB multipliers [SHI97] owing to the fine granularity of the RBAs and their carry free addition property [HAR87, MAK96, SAK00, BES02(1), BES02(2)].

Unfortunately, a number of new RB multiplier proposals that claimed superior performance were later proved to be incorrect or misleading in their fundamental concept. In the Radix-64 Booth encoded RB multiplier proposed in [LEE02], the authors argued that Radix-64 Booth encoding was the optimum radix because their proposed RB multiplier shows excellent performance in comparison with other published results based on Booth encoders of different radices. However, these comparisons were not carried out compatibly under the same process technology. It is obvious that in their scheme the critical delay of the Booth Encoding and Partial Product Generation (BEPPG) stages contributes almost 41% of the total delay time, which is much higher than the 26% reported in [MAK96]. Therefore, in terms of the transistor counts, their scheme is the minimum. As for Radix-64, the claim that this is the optimum choice in Booth encoding is definitely not sustainable. In [KIM01], a carry-free equivalent bit conversion algorithm was proposed for the RB-to-NB conversion in an attempt to eliminate the final carry propagate adder. However, due to a flaw in the truth table for the carry-free RB-to-NB conversion algorithm, a carry chain in the conversion stage was erroneously neglected [KIM03]. The errors have been detected by several other researchers [ERC03, RUL03] and it was proven that carry propagation is ineluctable in any multiplication process [RUL03]. As a matter of fact, for most RB multipliers, the critical path includes the RB-to-NB conversion. In [CHO01], a direct-conversion scheme without any carry propagation was also proposed to minimize this critical path for parallel architectures. Although the latency of this converter is a constant independent of word size, carry propagation has been re-introduced into the revised adding rule. Therefore, the claim is misleading, as the original carry-free addition property was completely abolished in the RBA tree of this multiplier.

We can safely conclude from the above experience that the parallel transformation from any redundant number representation to binary number without incurring some degree of carry propagation is impossible. Some mechanism has to be catered for the carry ripple. It is a matter of trade-off and efficiency. In RB multiplier, it is preferred to maintain the carry-free addition property in the RBA tree instead of annihilating this property by changing the adding rules to improve the reverse conversion efficiency. With the carry-free RBA tree, the carry propagation will inevitably be imposed on the final RB-to-NB conversion stage to some extent. New RB multiplier architecture, particularly one based on novel RB Booth encoders and partial product generators, is one of our research topics detailed in Chapter 6.
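
The need for carry propagation in the reverse conversion can be seen in a small example. The sketch below uses a common textbook view of the conversion, in which the RB number is split into a positive part and a negative part that are then subtracted. The function name is hypothetical and this is not necessarily the converter adopted later in this thesis.

```python
def rb_to_nb(digits):
    """Convert RB digits (LSB first, each in {-1, 0, 1}) to an integer.

    The number is split into X+ (positions holding 1) and X- (positions
    holding -1); the value is X+ - X-.
    """
    plus = sum(1 << i for i, d in enumerate(digits) if d == 1)
    minus = sum(1 << i for i, d in enumerate(digits) if d == -1)
    # The subtraction below is where borrows (carries) must propagate.
    return plus - minus


# The representation is redundant: both digit vectors encode the value 1.
print(rb_to_nb([1, 0]), rb_to_nb([-1, 1]))

# The digits (1, 0, 0, -1), MSB first, encode 8 - 1 = 7 = 0111 in binary:
# the -1 in the lowest position changes every higher bit of the result.
print(rb_to_nb([-1, 0, 0, 1]))
```
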
Chapter 3

Design and Performance Evaluation of Low Voltage Low Power Full Adder Cells

3.1 Introduction

Adder is an elementary unit of DSP. It is used in ALU, floating-point unit, digital filters, linear transformations, correlators, etc. Full and half adder cells are the cores dominating the critical paths of complex arithmetic operations like subtraction, multiplication, division, exponentiation, etc [CHE94, KO95, KWO00, NAN99, REN93, SHA02, SON91]. Thus, their performance and complexity at transistor level directly influence the overall performance of the system. Due to the pervasive use of portable electronic devices, and the cost of current packaging technologies, power dissipation has become a vital design criterion. Very often, the optimization tasks are restricted by how best the arithmetic operators are implemented in the cell library provided to the designer for the synthesis. The total power consumption of the system will decrease when power efficient adder cells are used in the critical path of compound arithmetic circuits.

In this chapter, two new low voltage low power full adder cells are proposed. Both cells employ the same complementary CMOS based carry out generation circuit coupled with a chosen sum generation circuit. These fabrics of hybrid logic design styles feature not only full swing and balanced outputs but also strong output drivability. The increase in area due to the higher transistor count of the complementary CMOS output stage is largely compensated by its area efficient layout. To enable the adder to operate at ultra-low supply voltage, the pass logic circuit that co-generates the intermediate XOR and XNOR outputs has been improved in the second adder cell to overcome the threshold voltage drop problem. A cascaded circuit structure is introduced to simulate the proposed 1-bit full adder cells and several other recently reported designs operating in a realistic application environment. Simulation results based on the TSMC 0.18\( \mu \text{m} \) CMOS process parameters show that some cells that survive in stand-alone operation may still fail when cascaded in a larger circuit, either due to the lack of drivability or unsatisfactory speed of operation. Our proposed full adder cells are shown to be power-efficient through rigorous simulations.

The role of full adders in computer arithmetic can be classified into two main categories. One category involves the chain structured applications [ALI02, CHA95], such as the ripple carry adder (RCA). In these applications, the critical path often traverses from the carry-in to the carry-out of the full adders. The generation of the carry-out signal must therefore be fast. Otherwise, the slower carry-out generation will not only extend the worst case delay, but also create more glitches in the later stages and hence dissipate more power. The other category involves the tree structured applications, which are frequently used in Wallace tree multipliers, array multipliers and multiplierless digital filters [CHA95, SON91, WAN95]. Full adders in these applications form a tree of several layers to compress the partial products to one stored carry format number before a final carry propagation adder converts it to a normal two's complement number. For high speed and low power operation, it is required that the outputs, i.e., the sum and carry-out of the full adder, be generated simultaneously to minimize the glitches in the lower stages. The published literature on full adders pays no attention to the specific requirements and the stringent drivability of the latter application. In this chapter, we target the tree structured application for the evaluation of full adders, with the optimization and simulation pursued in the proposed tree structure simulation environment. Two new adder cells composed of hybrid CMOS logic styles in their constituent modules are proposed for use in the applications of the second category. For a fair comparison with other full adder cells, a systematic and unified approach to size the transistors of different full adder cells for optimal power-delay product performance is also suggested.
This chapter is organized as follows: Section 3.2 explores the existing full adder designs in different logic styles. In Section 3.3, different circuits for three constituent modules of the full adder are analyzed. It is followed by the description of the proposed hybrid full adder cells. To optimize and compare the performance of different full adders, a tree structure setup is proposed for the simulation environment in Section 3.4. Also, a new transistor optimization procedure is described and the optimized sizing parameters at two different supply voltages for various adder cells being evaluated are provided. The circuits are simulated for power, delay and power-delay product performances and the results are analyzed and compared. Section 3.5 summarizes and concludes the work presented in this chapter.

3.2 Review of existing full adder cells

A full adder has three inputs and two outputs, as shown in Fig. 3.1. The three inputs, \(a, b, c\), and one of the outputs, \(sum\), are of the same weight, while the other output, \(carry\), weighs one binary bit higher than the others. Their relationship is therefore given by (3.1):

\[
a + b + c = sum + 2 \cdot carry \tag{3.1}
\]

Figure 3.1 Block diagram of full adder (inputs \(a\), \(b\), \(c\); outputs \(carry\), \(sum\))

The following equations are most frequently utilized to generate the outputs.

\[
sum = a \oplus b \oplus c \tag{3.2}
\]

\[
\begin{aligned}
carry &= ab + bc + ac \\
      &= ab + (a \oplus b)c
\end{aligned} \tag{3.3}
\]
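
As a quick sanity check of (3.1) to (3.3), the following minimal Python sketch enumerates all eight input combinations. The function name `full_adder` is only illustrative and is not part of the original work.

```python
from itertools import product


def full_adder(a, b, c):
    """Behavioural model of a 1-bit full adder, following (3.2) and (3.3)."""
    s = a ^ b ^ c                                  # (3.2)
    carry_majority = (a & b) | (b & c) | (a & c)   # first form of (3.3)
    carry_propagate = (a & b) | ((a ^ b) & c)      # second form of (3.3)
    # Both carry expressions must agree for every input combination.
    assert carry_majority == carry_propagate
    return s, carry_majority


# Check the weight relationship (3.1) over the whole truth table.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert a + b + c == s + 2 * carry
```

The check confirms that the two carry expressions in (3.3) are interchangeable, which is what allows the carry to be produced from the XOR of \(a\) and \(b\).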
Chapter 3   

Design and Performance Evaluation of Low Voltage Low Power Full Adder Cells

Fig. 3.2 shows the low power 1-bit adder cells [RAD01, SHA99, SHA02, ZIM97] implemented in several variants of static CMOS logic styles. The classical CMOS full adder (C-CMOS), consisting of 28 transistors (henceforth ‘n transistors’ is abbreviated as nT), is shown in Fig. 3.2(a). It is based on the mirrored CMOS structure with PMOS pull-up and NMOS pull-down transistors. The buffers at the last stage provide the required driving power to the cascaded cells. However, the two outputs are not balanced because carry is used to generate the internal signal \( \overline{\text{sum}} \), so the sum output is often generated later than the carry output. This skew between the sum and carry outputs causes the carry-save outputs of the Wallace/Dadda tree [WAN95] to present an irregular input arrival profile to the final adder, making it difficult to exploit the advantage of a fast carry propagate adder. The delay of a fast carry propagate adder accounts for 25% to 35% of the total delay of a tree-structured multiplier, and its power and delay are typically characterized assuming an even or unimodal input arrival profile. Equalizing the different paths of the adder cells in a tree-structured multiplier also helps to minimize glitches and conserve power.

Fig. 3.2(b) shows the complementary pass transistor logic (CPL) [CHA95] full adder with swing restoration. Its dual-rail structure uses 32 transistors. Output inverters are also used to ensure its drivability. A transmission function full adder (TFA) [ZHU92] based on the transmission function theory is shown in Fig. 3.2(c). A transmission gate adder (TGA) [WES93], shown in Fig. 3.2(d), uses CMOS transmission gates to implement the XOR and multiplexer functions. The latter two circuits have fewer transistors than the former two full adders. Adder circuits with even fewer transistors have been reported, mostly exploiting non-full-swing pass transistors with swing-restored transmission gate techniques. These are exemplified by the state-of-the-art 14T design [VES99(1)] in Fig. 3.2(e) and the 10T design [BUI02] in Fig. 3.2(f). The above adders differ not only in their transistor counts but also in the way their intermediate signals are generated. They will be simulated under the same process technology and environment to compare with our proposed 1-bit full adder cells on various key performance factors.

Figure 3.2 Existing full adder cells: (a) C-CMOS, (b) CPL, (c) TFA, (d) TGA, (e) 14T, (f) 10T

Unfortunately, different logic styles tend to favor one performance aspect at the expense of others. Therefore, besides power, delay and power-delay product, other performance criteria, such as the working voltage range, output balancing, full voltage swing at all nodes, and driving capability [BLA94, VA101, SHA99, SHA02, ZIM97], are taken into consideration in the design and evaluation of adder cells for the tree-structured applications.

3.3 Full adder architecture and its building blocks

Most full adder designs, such as those of Fig. 3.2(c) to Fig. 3.2(f), can be partitioned into three modules for analysis: the XOR/XNOR block, which generates the XOR and/or XNOR signals that control the other blocks; the XOR block, which generates the sum output; and the carry generator block, usually implemented as a multiplexer, which outputs the carry. The block diagram is shown in Fig. 3.3.
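
The three-module partition can be mirrored in a small behavioural model. The sketch below is only a logic-level illustration under the assumption that the carry multiplexer selects \(c\) when \(a \oplus b = 1\) and \(a\) otherwise; the function names are hypothetical and no circuit-level behaviour (drive strength, signal swing) is modelled.

```python
from itertools import product


def xor_xnor_module(a, b):
    """First module: produce the XOR and XNOR of the two primary inputs."""
    x = a ^ b
    return x, 1 - x


def sum_module(xor_ab, c):
    """Second module: XOR the intermediate signal with the third input."""
    return xor_ab ^ c


def carry_module(xor_ab, a, c):
    """Third module: a 2-to-1 multiplexer controlled by the XOR signal."""
    return c if xor_ab else a


def modular_full_adder(a, b, c):
    # The XNOR output is unused in this logic-level model; in a circuit it
    # would serve as the complementary control signal of the other modules.
    xor_ab, _xnor_ab = xor_xnor_module(a, b)
    return sum_module(xor_ab, c), carry_module(xor_ab, a, c)


for a, b, c in product((0, 1), repeat=3):
    s, carry = modular_full_adder(a, b, c)
    assert a + b + c == s + 2 * carry
```

The multiplexer form works because the carry equals either input when \(a = b\), and equals \(c\) when \(a \neq b\).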

A full adder is considered a 3-2 counter when it is used as a carry-save adder (CSA) in a tree-structured multiplier. Some of the circuits used to implement its constituent modules are also applicable to similar modules of the higher-input counters or compressors used in the CSA tree. The design of 4-2 and 5-2 compressors is the main topic of the next chapter, which will also refer to the relevant circuits presented in this section.
3.3.1 XOR/XNOR module

Fig. 3.4 shows several designs [BUI02, RAD01, RAD00, SHA97, SHA00, SHA02] for the XOR/XNOR modules. The numbers next to the transistors are their optimized sizes used for the compressors described in Chapter 4. The design of Fig. 3.4(a) has the fewest transistors and consumes very low power [SHA97, SHA00, SHA02]. However, it generates a weak logic ‘1’ at the xnor node when the primary inputs are both ‘1’s, which prevents it from functioning reliably at low supply voltage. The design of Fig. 3.4(b) is able to operate at low voltage, but it is not power efficient [SHA97, SHA00, SHA02]. Both designs, Fig. 3.4(a) and 3.4(b), use inverters to generate the complementary XOR and XNOR signals, so their outputs are heavily skewed in time.
The design of Fig. 3.4(c) consists of two cross back XOR and XNOR cells [SHA97, SHA00, SHA02]. It is able to generate the complementary XOR and XNOR outputs simultaneously. However, it performs non-full-swing operations for some input patterns, causing the corresponding outputs to be degraded by one threshold voltage, $V_{th}$. For example, the xnor output transmits a weak logic "1" when both inputs are "1"s, whereas the xor output transmits a weak logic "0" when both inputs are "0"s. When the power supply voltage is lower than 1V, the weak logic transmission will slow down the charging and discharging of the driven circuits or, worse, fail to turn the driven transistors on or off as desired. Therefore, it is also not a suitable candidate for low voltage operation.
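
The effect of a one-threshold drop on the driven stage can be illustrated with a first-order estimate. The sketch below is a deliberately crude model, not a circuit simulation: it ignores body effect and subthreshold conduction, and it assumes the approximate threshold voltages quoted later in this chapter for the simulated process (about 0.48V for NMOS and 0.46V for PMOS). All function names are hypothetical.

```python
VTH_N = 0.48  # assumed NMOS threshold voltage (V), as quoted for the process
VTH_P = 0.46  # assumed magnitude of the PMOS threshold voltage (V)


def weak_one(vdd, vth_n=VTH_N):
    """Level of a logic '1' passed through an NMOS pass transistor."""
    return vdd - vth_n


def weak_zero(vth_p=VTH_P):
    """Level of a logic '0' passed through a PMOS pass transistor."""
    return vth_p


def nmos_turns_on(gate_voltage, vth_n=VTH_N):
    """First-order test: a grounded-source NMOS conducts if Vgs > Vth."""
    return gate_voltage > vth_n


def pmos_turns_on(gate_voltage, vdd, vth_p=VTH_P):
    """First-order test: a PMOS with its source at VDD conducts if Vsg > |Vth|."""
    return (vdd - gate_voltage) > vth_p


for vdd in (1.8, 1.0, 0.8):
    one, zero = weak_one(vdd), weak_zero()
    print(f"VDD={vdd:.1f}V  weak '1'={one:.2f}V drives NMOS: {nmos_turns_on(one)}  "
          f"weak '0'={zero:.2f}V drives PMOS: {pmos_turns_on(zero, vdd)}")
```

Under these assumptions the degraded levels still switch the driven transistors at 1.8V, only marginally at 1.0V, and not at all at 0.8V, which is consistent with the qualitative argument above.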

The combined XOR-XNOR cell of Fig. 3.4(d) was proposed in [RAD01, RAD00]. It is a low power circuit with the fewest transistors that can output XOR and XNOR concurrently. It eliminates the transmission of weak logic for certain input patterns by virtue of the feedback PMOS-NMOS transistors in the middle of the circuit. Nevertheless, it is still not suitable for low voltage applications for the following reason. When the inputs change from "01" or "10" to "00" or "11", the feedback transistors that were originally turned off have to be turned on by both a weak logic driver and a high impedance driver. This transition takes a long time at a very low supply voltage slightly above $2|V_{thp}|$. Let us analyze the process of the inputs changing from "01" or "10" to "00". When the current input is "01" or "10", logic "0" is passed through the NMOS transistor and logic "1" through the PMOS transistor. Both feedback transistors are off. However, when the next input "00" arrives, a weak "0" ($V = |V_{thp}|$) is passed through two PMOS pass transistors to the xor output, while the xnor output stays at high impedance at the beginning of the transition. This weak "0" turns on the feedback PMOS so that the xnor output is pulled up to logic "1", which in turn switches on the feedback NMOS to discharge the xor output completely to ground. In the first half of the transition, the feedback transistors are driven by the cross-back transistors, while in the second half, the feedback transistors drive themselves by positive feedback. Thus a staircase-shaped voltage waveform occurs at the output node xor during the transition, as shown in Fig. 3.5. A similar situation occurs at the transition from "01" or "10" to "11". Owing to the lower $V_{th}$ and high electron mobility of the NMOS transistor, the entire process is faster than in the previous case, which relies heavily on the PMOS switch. In either case, these slow transitions increase the short circuit current of the following stage tremendously, which leads to a rise in the power dissipation.

![Figure 3.5 Slow step output transitions when ab change from 01, 10 to 00 for the XOR/XNOR circuit of Fig. 3.4(d) at supply voltage of 0.8V](image)

To overcome the weakness of the circuit in Fig. 3.4(d), we propose a new circuit for the XOR/XNOR module. The proposed circuit, shown in Fig. 3.4(e), is also able to generate the xor and xnor outputs simultaneously. Two series PMOS transistors are added to solve the slow transition problem when the inputs change from "01" to "00", while another two series NMOS transistors are added to solve the problem when the inputs change from "10" to "11". When the inputs change to "00", the xnor output obtains a strong "1" through the two series pull-up PMOS transistors connected to the power supply, avoiding the high-impedance state that occurs in Fig. 3.4(d). Similarly, the xor output obtains a strong "0" through the two series pull-down NMOS transistors connected to ground when the inputs change to "11". Therefore, for any input combination, at least one node carries a strong logic level to drive the feedback transistors. Hence, the proposed circuit is rather robust against delay driven voltage scaling. Fig. 3.4(f) shows the use of a dual-rail multiplexer, such as the CPL and DPL multiplexers, to construct the XOR-XNOR module, which can be used for the fully multiplexer-based implementation [PRA01].

### 3.3.2 XOR module for sum output

The XOR module that generates the *sum* output usually does not need to provide the XNOR output, but it should provide sufficient output current to drive the next stage of full adders or compressors (to be discussed in the next chapter). Fig. 3.6 shows two possible XOR gate designs for this simple XOR module [SHA97, SHA00, SHA02]. Fig. 3.6(a) is a low power design of the XOR function, but its limited drivability prevents it from being used as an output module in full adders and compressors. Fig. 3.6(b) is an XNOR circuit followed by an inverter, which has stronger driving capability than Fig. 3.6(a). The numbers enclosed in brackets are the sizes of the transistors used in the 5-2 compressors described in Chapter 4.

![Figure 3.6 XOR module circuits for sum output](image)

### 3.3.3 Carry generator module

The carry generator modules are usually implemented by multiplexers. Several designs are shown in Fig. 3.7. Fig. 3.7(a) is widely used in low power full adder cells [SHA97, SHA00, SHA02]. However, its driving capability is somewhat limited, which causes signal degradation when many stages are cascaded, so it is not suitable for use in the tree-structured application. Fig. 3.7(b) improves the driving capability by adding an output buffer at the expense of increased power dissipation [SHA97, SHA00, SHA02]. The output buffer is formed by two cascaded inverters, with the first inverter downsized to half the size of the output inverter in order to cut down the power dissipation.

![Diagram of carry generator module using MUX](image)

**Figure 3.7** Implementations of the carry generator module using MUX

The circuit of Fig. 3.7(c) is a multiplexer implemented in standard complementary CMOS logic style [GU03(1), ZIM97]. Being a complementary CMOS circuit, it is robust against both voltage scaling and transistor sizing. Despite having one inverter fewer than the design of Fig. 3.7(b), this circuit still delivers sufficient drive to its succeeding circuits through the output inverter. The total number of transistors of this circuit is 10. Although this is two more than that of Fig. 3.7(b), the silicon areas occupied by both circuits are almost the same. This is because Fig. 3.7(b) requires more space to segregate the different diffusion areas, which increases the routing complexity of the interconnecting lines. The regularity of the layout of the circuit of Fig. 3.7(c) is evident from the diagram on its right. The multiplexer circuits of Fig. 3.7(d) and Fig. 3.7(e) are implemented in CPL and DPL logic styles, respectively [PRA01, ZIM97]. As they are dual-rail circuits, complementary pairs of primary inputs and outputs need to be generated. Although they can generate full-swing outputs, their pass transistor structure means that they will not provide adequate drivability if many such circuits are cascaded, particularly at low supply voltage. Therefore, if these circuits are used at the output ports of the full adder or compressor, output buffers are required to strengthen the signals.

A new complementary CMOS logic style circuit, shown in Fig. 3.8, is suggested for the implementation of the carry generator module of the full adder. It resembles the circuit of Fig. 3.7(c), except that an additional input \(c\) is introduced and the original inputs \(a\) and \(b\) are rearranged. Compared with the design of Fig. 3.7(c), this design lowers the switched capacitance of the preceding XOR/XNOR module because the fan-out of both the XOR and XNOR outputs is reduced from 2 to 1. It is also easier to lay out because the crossing lines of \(\overline{sel}\) and \(sel\) are avoided. Its robustness against voltage scaling and transistor sizing (high noise margins) enables it to operate reliably at low voltage and with arbitrary (even minimal) transistor sizes.

![Figure 3.8 Proposed carry generator module for full adder](image.png)
3.3.4 Circuit Structure of Hybrid 1 and Hybrid 2

The XOR/XNOR module of Fig. 3.4(d) in 6T, together with the XOR module of Fig. 3.6(b) and the carry generator module of Fig. 3.8 form a novel 1-bit full adder cell, abbreviated as Hybrid-1, which is shown in Fig. 3.9. The proposed XOR/XNOR module of Fig. 3.4(e), together with the XOR module of Fig. 3.6(b) and the carry generator module of Fig. 3.8 form another new 1-bit full adder cell, abbreviated as Hybrid-2, as shown in Fig. 3.10.

Figure 3.9 Proposed 1-bit full adder circuit - Hybrid 1
3.4 Simulation results

For a fair comparison, different full adders are optimized and simulated in a practical environment and with a unified optimization procedure based on the same process technology.

3.4.1 Simulation environment

It has been common practice to treat the adder cell as a stand-alone cell in simulation [BUI02, RAD01, VES99(1), ZHU92]. It is also not unusual for adder cells that perform well in such simulations to fail upon actual deployment because of a lack of driving power. This is because adder cells are normally cascaded to form a useful arithmetic circuit. Therefore, the adder cells must possess sufficient drivability to provide the next cell with clean inputs [SHA02]. In short, the driving cell must provide almost full-swing outputs to the driven cells. Otherwise, the performance of the circuit will degrade dramatically, or the circuit will become non-operative at low supply voltage. For this reason, the TFA, TGA, 14T and 10T adder cells cannot be cascaded without additional buffers attached to the outputs of each cell. This will be further verified later.

The authors of [SHA02] suggested a circuit structure made of four cascaded adder cells, as shown in Fig. 3.11. This structure emulates circuits such as regular multipliers and binary adders that use full adder cells as the building block. The inputs are fed from buffers (two cascaded inverters) to give more realistic input signals, and the outputs are loaded with buffers to give a proper loading condition. All the required input-pattern-to-input-pattern transitions are included in the test patterns. The power consumption is measured for the four cascaded adder cells together with the intermediate buffers, while the delay is measured from the moment the inputs are applied to the first cell until the later of the sum and carry signals of the fourth cell is produced. However, this structure has some shortcomings. Firstly, although the first adder is exercised with all the input-pattern-to-input-pattern transitions, the subsequent adders may not be. Thus, it is not appropriate to consider the four cascaded cells as a whole and then divide the average power by four. As the last three adders are likely to consume less power than the first adder, this simulation tends to produce an optimistic estimate of the power dissipation. It would be better to measure only the power dissipated by the first adder. Secondly, every carry output has two fan-outs while every sum output has only one, so the loading of the two outputs is unbalanced.

![Figure 3.11 Simulation setup suggested in [SHA02]](image)

Figure 3.11 Simulation setup suggested in [SHA02]

Our proposed simulation structure, shown in Fig. 3.12, emulates the tree structure of a parallel multiplier. Altogether 12 identical full adders are used, with the full adder (FA) marked with * being the cell of interest. The input signals of FA* are fed from the outputs of the FAs in the preceding stage, while the outputs of FA* are used to drive an FA in the following stage. This arrangement of full adders ensures that either the carry or the sum output of each FA drives only one input of the FA in the next stage. The reason for cascading three levels of FAs before FA* is to examine the output drivability of the FA cells. If the FAs cannot provide enough driving power, the output signals after three successive stages will become very weak, in which case FA* may fail to function.

The six circuits of Fig. 3.2 (C-CMOS, CPL, TFA, TGA, 14T and 10T) and our proposed adder cells, Hybrid-1 and Hybrid-2, are prototyped and simulated using the TSMC 0.18μm CMOS process with a Level 49 technology file. The threshold voltages of the PMOS and NMOS transistors are around 0.46V and 0.48V, respectively. Star HSPICE is the circuit simulator used. For each simulation, HSPICE generates an average power consumption value. The circuits are simulated at supply voltages ranging from 0.8V to 2.4V. The operating frequency is set at 100MHz. Backward derivation is performed to find a group of input test patterns that presents all 56 different transitions from one input combination to another to FA* [SHA99]. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing. The maximum delay over these 56 transitions is taken as the cell delay.
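
The figure of 56 transitions follows from the 8 possible input combinations of \((a, b, c)\): each can change to any of the 7 others. The sketch below enumerates them and, as an illustration only, builds one input sequence that visits every transition exactly once using Hierholzer's algorithm for an Eulerian circuit. This is an assumption about how such a sequence could be obtained at the inputs of FA*; it does not reproduce the backward derivation through the preceding adder stages, and the function name is hypothetical.

```python
from itertools import permutations, product

patterns = list(product((0, 1), repeat=3))       # 8 input combinations (a, b, c)
transitions = list(permutations(patterns, 2))    # ordered pairs of distinct patterns
assert len(transitions) == 8 * 7 == 56


def covering_sequence(nodes):
    """Eulerian circuit of the complete directed graph without self-loops.

    Every node has equal in-degree and out-degree, so a circuit that uses
    each directed edge (transition) exactly once exists.
    """
    unused = {u: [v for v in nodes if v != u] for u in nodes}
    stack, circuit = [nodes[0]], []
    while stack:
        u = stack[-1]
        if unused[u]:
            stack.append(unused[u].pop())   # follow an unused transition
        else:
            circuit.append(stack.pop())     # dead end: emit node
    return circuit[::-1]


sequence = covering_sequence(patterns)
covered = set(zip(sequence, sequence[1:]))
assert covered == set(transitions)
```

A sequence of 57 patterns is therefore the shortest that exercises all 56 transitions, since each new pattern adds exactly one transition.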

![Waveform snapshots of the circuits with attached buffers (1.8V, 100MHz)](image)

**Figure 3.13 Waveform snapshots of the circuits with attached buffers (1.8V, 100MHz)**

Buffers or inverters are attached to the TFA, TGA, 14T and 10T circuits to enhance their driving capability. Fig. 3.13 and Fig. 3.14 show the output waveforms of those four adders with and without the buffers added, respectively. The distorted output waveforms of Fig. 3.14 demonstrate how important the drivability of the adder cell is to the correct functionality of the circuit.
3.4.2 Transistor sizing optimization

As shown in [SAY02], the transistor sizing for optimal performance is technology dependent. The six contender full adders were simulated under different process technologies in the literature. In order to perform a fair comparison, we propose an optimization procedure to size the transistors of all the full adders, including these six reported full adders, based on the same TSMC 0.18μm CMOS process technology.

The scaling operations are carried out iteratively, transistor by transistor. To provide a good tradeoff between the somewhat conflicting power and delay performances, the goal of the optimization is to minimize the power-delay product (PDP). The power-delay product, a measure of power efficiency expressed in fJ, is defined as the product of the worst-case delay and the average power consumption. This metric provides an indication of the energy expended and the life span of the battery when the circuit is operating at its maximum speed. Suppose a circuit to be optimized is composed of $N$ transistors, labeled from $T_1$ to $T_N$, which are initialized with reasonable sizes at the outset. For a given technology, the channel lengths of all transistors are fixed at the minimum feature size, $0.18\mu m$ in our example, so the only variable to be optimized is the channel width of each transistor. The first optimization run begins by varying the channel width of $T_1$ over $2m+1$ values with a step size of $\psi$ to probe the circuit performance. In other words, the different channel widths of $T_1$ simulated are $l_{1,0} - m\psi, l_{1,0} - (m-1)\psi, ..., l_{1,0}, ..., l_{1,0} + (m - 1)\psi, l_{1,0} + m\psi$, where $l_{1,0}$ is the initial size of $T_1$. The probing sizes for $T_1$ are formally expressed as:

$$l_{1,i} = l_{1,0} + i\psi, \quad i = -m, -m+1, \ldots, 0, \ldots, m-1, m. \tag{3.4}$$

During this run, the sizes of all other transistors remain unchanged.

Suppose that the $j$th channel width $l_{1,j}$ of $T_1$ provides the circuit with the lowest power-delay product in the simulation. We update $T_1$ with $l_{1,j}$ and carry on with the second run for $T_2$. After the second run, $T_2$ will be updated with its best channel width. The process goes on until the last transistor $T_N$ is updated. An iteration is said to be completed when all the transistors have been updated. However, one iteration is not sufficient for the optimization because when a new transistor is sized in the current run, the transistor sizes updated in the previous runs may no longer be optimal. Therefore, more iterations beginning with $T_1$ are needed. The iteration process stops when the performance difference between two successive iterations is smaller than a given error $\varepsilon$. Let $\Theta_{k-1}$ and $\Theta_k$ be the optimized power-delay products at the end of the $(k-1)$-th and $k$-th iterations, respectively. The termination criterion is given by:

$$\left| \frac{\Theta_k - \Theta_{k-1}}{\Theta_k} \right| \leq \varepsilon \tag{3.5}$$

The flow chart of our proposed transistor sizing optimization procedure is shown in Fig. 3.15. In order to obtain enough coverage so that the optimal or quasi-optimal operating point falls within the search region, and to allow for fine calibration, the resolution of the sizing step \( \psi \) may be made variable. A large step size is used for the first few iterations and a smaller step size is used for the remaining iterations.

![Optimization flowchart](image)

**Figure 3.15 Optimization flowchart**
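
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the actual implementation: the circuit simulator is replaced by a hypothetical analytic function `toy_pdp`, the minimum width `w_min` and the default value of `m` are arbitrary choices, and the step schedule simply uses the large step for the first iteration and the small step afterwards.

```python
def optimize_widths(widths, pdp, m=3, steps=(0.05, 0.02), eps=0.01,
                    w_min=0.3, max_iter=50):
    """Size transistors one at a time to reduce the power-delay product.

    widths: initial channel widths (um), one entry per transistor T_1..T_N
    pdp:    callable returning the power-delay product for a list of widths
    """
    widths = list(widths)
    best = pdp(widths)
    history = [best]                      # history[k] plays the role of Theta_k
    for k in range(1, max_iter + 1):
        step = steps[0] if k == 1 else steps[-1]
        previous = best
        for t in range(len(widths)):      # one run per transistor
            base = widths[t]
            for i in range(-m, m + 1):    # probe base + i * step, as in (3.4)
                candidate = round(base + i * step, 4)
                if candidate < w_min:
                    continue
                trial = widths[:t] + [candidate] + widths[t + 1:]
                value = pdp(trial)
                if value < best:          # accept only strict improvements
                    best = value
                    widths[t] = candidate
        history.append(best)
        # Relative-change stopping rule in the spirit of (3.5).
        if best == 0 or abs(best - previous) / best <= eps:
            break
    return widths, history


def toy_pdp(widths):
    """Hypothetical stand-in for a simulator: wider devices cost power
    (more capacitance) but reduce delay (more drive current)."""
    power = 1.0 + sum(widths)
    delay = 0.2 + sum(1.0 / w for w in widths)
    return power * delay


final_widths, trace = optimize_widths([0.5, 0.5, 0.5], toy_pdp)
```

Because a width is replaced only when the probed value strictly improves on the best value so far, the recorded trace can never increase from one iteration to the next.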

In what follows, we verify that our proposed optimization procedure is a convergent algorithm. In the above procedure, the sizing operation is carried out one transistor at a time such that the width of the transistor being tuned is set to a new value only when it improves the circuit's power-delay product. Therefore, the power-delay product at the end of one iteration will never be worse than that of the previous one, that is,

\[
\Theta_k \leq \Theta_{k-1}. \tag{3.6}
\]

Besides, all power-delay product values for a working circuit, including the power-delay product of the starting circuit, \( \Theta_0 \), are finite non-negative numbers, that is,

\[ \Theta_k \geq 0. \tag{3.7} \]

From (3.6) and (3.7), we have
\[ 0 \leq \Theta_k \leq \Theta_0 \quad \text{for any } k. \tag{3.8} \]

We define \( \Delta_k = \Theta_{k-1} - \Theta_k \). From (3.6),
\[ \Delta_k \geq 0. \tag{3.9} \]

Thus, we have
\[
\begin{aligned}
\Theta_K &= \Theta_0 - \sum_{k=1}^{K} (\Theta_{k-1} - \Theta_k) \\
         &= \Theta_0 - \sum_{k=1}^{K} \Delta_k.
\end{aligned} \tag{3.10}
\]

Its limit is given by
\[ \lim_{K \to \infty} \Theta_K = \Theta_0 - \lim_{K \to \infty} \sum_{k=1}^{K} \Delta_k. \]

If the proposed optimization procedure were not convergent, i.e., if \( \lim_{K \to \infty} \Theta_K \) did not exist, then \( \lim_{K \to \infty} \sum_{k=1}^{K} \Delta_k \) would not exist either. From (3.9), the partial sums are non-decreasing, so the only possible situation for the divergence is
\[ \sum_{k=1}^{K} \Delta_k \to \infty \quad \text{as } K \to \infty, \tag{3.11} \]

which implies \( \Theta_K \to -\infty \). This violates (3.8). That establishes the claim that our optimization procedure is convergent.
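
The same conclusion can be stated compactly through the monotone convergence theorem, which is offered here as an equivalent restatement rather than as part of the original argument: a sequence that never increases and is bounded below always has a limit.

```latex
\[
0 \le \Theta_k \le \Theta_{k-1} \ \text{ for all } k \ge 1
\quad\Longrightarrow\quad
\lim_{K \to \infty} \Theta_K = \inf_{k \ge 0} \Theta_k \in [0, \Theta_0].
\]
```

Note that this guarantees only that the sequence of power-delay products settles to a value; it does not by itself show that the value reached is the global optimum over all transistor sizes.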

In our optimization procedure, the starting sizes for the six previously reported full adders are the transistor aspect ratios reported in [BUI02, CHA95, VES99(1), WES93, ZHU92]. For our proposed Hybrid-1 and Hybrid-2 cells, the initial sizes are estimated from standard practices and past experience. The step size of the first iteration in our example is set to 0.05\( \mu \text{m} \), which is around 10% to 20% of the initial channel width. The step size of the subsequent iterations is reduced to 0.02\( \mu \text{m} \). Thus, the final transistor sizes have a precision of about 10% of the channel length, which is 0.18\( \mu \text{m} \) for our targeted technology. The termination error of the iteration process is set to 1%.

Two optimization strategies are adopted in the above transistor sizing procedure to accelerate the process. Firstly, the corresponding PMOS and NMOS transistors in a complementary pair are optimized in successive runs, because the output transitions of the node driven by one transistor are often influenced most by the driving capability of its complementary counterpart. For example, the sizing of p2 and n2 in TGA is carried out in succession. Secondly, series or parallel transistors of the same type that source current to or sink current from the same node are treated equally and can be optimized simultaneously. In C-CMOS, transistors p6, p7 and p8 are an example of the parallel case, while transistors p1 and p2 are an example of the series case.
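
The two strategies amount to choosing the order of the runs and tying some widths together. The sketch below shows one possible way to express this; the function name is hypothetical, the tied groups reuse the C-CMOS examples mentioned above, and the complementary pairings (p1 with n1, and so on) are assumptions made purely for illustration.

```python
def schedule_runs(tied_groups, complementary_pairs):
    """Return the ordered list of sizing runs.

    Each run is a tuple of transistor names that share one width variable.
    Complementary PMOS/NMOS partners are placed in successive runs, and
    transistors in the same tied group are sized together in a single run.
    """
    group_of = {name: tuple(group) for group in tied_groups for name in group}
    runs = []
    for p_name, n_name in complementary_pairs:
        for name in (p_name, n_name):
            run = group_of.get(name, (name,))
            if run not in runs:           # a tied group is scheduled only once
                runs.append(run)
    return runs


# Series pair p1/p2 and parallel group p6/p7/p8 (from the C-CMOS example);
# the NMOS partners listed here are illustrative assumptions.
tied = [("p1", "p2"), ("p6", "p7", "p8")]
pairs = [("p1", "n1"), ("p2", "n2"), ("p6", "n6")]
runs = schedule_runs(tied, pairs)
```

In this illustration eight transistors are covered by five runs, so each iteration needs correspondingly fewer simulations.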

Optimization of the transistor sizing is carried out at two different voltages, 0.8V and 1.8V, for C-CMOS, CPL, TFA, TGA and Hybrid-2. As Hybrid-1 and 14T can only work above 1.0V, these two cells are optimized at 1.0V and 1.8V. The lowest voltage at which 10T can function under the proposed simulation setup is 1.8V, so it is optimized only at 1.8V. The final optimized transistor widths for each full adder cell are listed in Table 3.1.

Table 3.1 Transistor sizes (μm) of full adders optimized for power-delay product

(a) C-CMOS

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
<th>p9</th>
<th>p10</th>
<th>p11</th>
<th>p12</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8V</td>
<td>0.78</td>
<td>0.78</td>
<td>0.8</td>
<td>0.8</td>
<td>0.8</td>
<td>1.2</td>
<td>1.2</td>
<td>1.2</td>
<td>0.54</td>
<td>0.68</td>
<td>0.68</td>
<td>0.68</td>
</tr>
<tr>
<td>1.8V</td>
<td>1.1</td>
<td>1.1</td>
<td>0.8</td>
<td>0.46</td>
<td>0.46</td>
<td>0.46</td>
<td>0.7</td>
<td>0.74</td>
<td>0.74</td>
<td>0.74</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>n1</td>
<td>n2</td>
<td>n3</td>
<td>n4</td>
<td>n5</td>
<td>n6</td>
<td>n7</td>
<td>n8</td>
<td>n9</td>
<td>n10</td>
<td>n11</td>
<td>n12</td>
</tr>
<tr>
<td>0.8V</td>
<td>0.36</td>
<td>0.36</td>
<td>0.4</td>
<td>0.4</td>
<td>0.34</td>
<td>0.46</td>
<td>0.46</td>
<td>0.34</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td></td>
</tr>
<tr>
<td>1.8V</td>
<td>0.4</td>
<td>0.4</td>
<td>0.46</td>
<td>0.3</td>
<td>0.4</td>
<td>0.46</td>
<td>0.3</td>
<td>0.4</td>
<td>0.5</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
</tr>
</tbody>
</table>

(b) CPL

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>n1</th>
<th>n2</th>
<th>n3</th>
<th>n4</th>
<th>n5</th>
<th>n6</th>
<th>n7</th>
<th>n8</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8V</td>
<td>0.5</td>
<td>0.4</td>
<td>0.5</td>
<td>0.5</td>
<td>0.6</td>
<td>0.6</td>
<td>0.41</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.4</td>
<td>0.41</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.4</td>
<td>0.4</td>
<td>0.4</td>
<td>0.4</td>
<td>0.55</td>
<td>0.55</td>
<td>0.45</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.7</td>
<td>0.35</td>
</tr>
<tr>
<td></td>
<td>n9</td>
<td>n10</td>
<td>n11</td>
<td>n12</td>
<td>n13</td>
<td>n14</td>
<td>n15</td>
<td>n16</td>
<td>n17</td>
<td>n18</td>
<td>n19</td>
<td>n20</td>
</tr>
<tr>
<td>0.8V</td>
<td>0.37</td>
<td>0.37</td>
<td>0.3</td>
<td>0.44</td>
<td>0.44</td>
<td>0.58</td>
<td>0.38</td>
<td>0.38</td>
<td>0.4</td>
<td>0.4</td>
<td>0.4</td>
<td>0.3</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.45</td>
<td>0.45</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
</tr>
</tbody>
</table>

### (c) TFA

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8V</td>
<td>0.57</td>
<td>0.6</td>
<td>0.8</td>
<td>0.3</td>
<td>0.45</td>
<td>0.4</td>
<td>0.3</td>
<td>0.3</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.31</td>
<td>0.78</td>
<td>0.31</td>
<td>0.33</td>
<td>0.45</td>
<td>0.3</td>
<td>0.7</td>
<td>0.32</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
<th>p9</th>
<th>p10</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8V</td>
<td>0.3</td>
<td>0.32</td>
<td>0.88</td>
<td>0.51</td>
<td>1.01</td>
<td>0.32</td>
<td>0.6</td>
<td>0.3</td>
<td>0.61</td>
<td>0.34</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.34</td>
<td>0.35</td>
<td>0.78</td>
<td>0.4</td>
<td>0.65</td>
<td>0.3</td>
<td>0.44</td>
<td>0.3</td>
<td>0.5</td>
<td>0.4</td>
</tr>
</tbody>
</table>

### (d) TGA

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
<th>p9</th>
<th>p10</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8V</td>
<td>0.3</td>
<td>0.3</td>
<td>0.31</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.31</td>
<td>0.34</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
</tr>
</tbody>
</table>

### (e) 14T

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0V</td>
<td>1.85</td>
<td>1.85</td>
<td>0.88</td>
<td>1.75</td>
<td>0.75</td>
<td>1.7</td>
<td>1.8</td>
</tr>
<tr>
<td>1.8V</td>
<td>1.7</td>
<td>1.7</td>
<td>0.6</td>
<td>1.7</td>
<td>0.38</td>
<td>1.5</td>
<td>1.7</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>n1</th>
<th>n2</th>
<th>n3</th>
<th>n4</th>
<th>n5</th>
<th>n6</th>
<th>n7</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0V</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.36</td>
<td>1.62</td>
<td>0.75</td>
<td>0.4</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.7</td>
<td>0.7</td>
<td>0.3</td>
<td>0.4</td>
<td>1.5</td>
<td>0.4</td>
<td>0.4</td>
</tr>
</tbody>
</table>

### (f) 10T

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.8V</td>
<td>0.95</td>
<td>0.78</td>
<td>0.7</td>
<td>0.7</td>
<td>0.6</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>n1</th>
<th>n2</th>
<th>n3</th>
<th>n4</th>
<th>n5</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.8V</td>
<td>1.03</td>
<td>0.7</td>
<td>0.7</td>
<td>0.7</td>
<td>0.7</td>
</tr>
</tbody>
</table>

### (g) Hybrid-1

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
<th>p9</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0V</td>
<td>0.9</td>
<td>0.9</td>
<td>0.3</td>
<td>0.4</td>
<td>0.3</td>
<td>0.7</td>
<td>0.7</td>
<td>0.5</td>
<td>0.5</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.71</td>
<td>0.71</td>
<td>0.65</td>
<td>0.3</td>
<td>0.35</td>
<td>0.5</td>
<td>0.5</td>
<td>0.54</td>
<td>0.54</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>p1</th>
<th>p2</th>
<th>p3</th>
<th>p4</th>
<th>p5</th>
<th>p6</th>
<th>p7</th>
<th>p8</th>
<th>p9</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0V</td>
<td>0.32</td>
<td>0.32</td>
<td>0.3</td>
<td>0.4</td>
<td>0.55</td>
<td>0.38</td>
<td>0.38</td>
<td>0.32</td>
<td>0.32</td>
</tr>
<tr>
<td>1.8V</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.3</td>
<td>0.4</td>
<td>0.4</td>
<td>0.4</td>
<td>0.4</td>
</tr>
</tbody>
</table>

3.4.3 Simulation results and analysis

The pre-layout simulation results for the power, delay and power-delay product of the Hybrid-1, Hybrid-2, C-CMOS, CPL, TFA, TGA, 14T and 10T cells over the supply voltage range of 0.8V to 2.4V are listed in Table 3.2 for comparison. The transistor sizes of Table 3.1 optimized at 0.8V and 1V are used for the simulations in the lower supply voltage range of 0.8V to 1.2V, and the transistor sizes optimized at 1.8V are used for the simulations in the higher supply voltage range.

### Table 3.2: Power, delay and power-delay-product comparison of full adder cells

<table>
<thead>
<tr>
<th>Vcc(V)</th>
<th>0.8</th>
<th>1.0</th>
<th>1.2</th>
<th>1.4</th>
<th>1.6</th>
<th>1.8</th>
<th>2.0</th>
<th>2.2</th>
<th>2.4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Power (μW)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Hybrid-1</td>
<td></td>
<td>2.20</td>
<td>4.22</td>
<td>4.32</td>
<td>5.47</td>
<td>7.09</td>
<td>10.6</td>
<td>13.2</td>
<td>17.6</td>
</tr>
<tr>
<td>Hybrid-2</td>
<td>0.918</td>
<td>1.50</td>
<td>2.35</td>
<td>3.49</td>
<td>4.75</td>
<td>6.39</td>
<td>8.71</td>
<td>12.3</td>
<td>15.9</td>
</tr>
<tr>
<td>C-CMOS</td>
<td>0.84</td>
<td>1.45</td>
<td>2.12</td>
<td>3.63</td>
<td>4.91</td>
<td>6.23</td>
<td>8.77</td>
<td>12.4</td>
<td>15.9</td>
</tr>
<tr>
<td>CPL</td>
<td>1.03</td>
<td>1.70</td>
<td>2.64</td>
<td>4.08</td>
<td>5.64</td>
<td>7.72</td>
<td>11.2</td>
<td>14.0</td>
<td>17.7</td>
</tr>
<tr>
<td>TFA</td>
<td>1.50</td>
<td>2.28</td>
<td>3.60</td>
<td>4.56</td>
<td>6.25</td>
<td>8.25</td>
<td>10.6</td>
<td>14.9</td>
<td>17.6</td>
</tr>
<tr>
<td>TGA</td>
<td>1.49</td>
<td>2.20</td>
<td>3.30</td>
<td>4.29</td>
<td>6.10</td>
<td>8.74</td>
<td>10.0</td>
<td>12.6</td>
<td>16.5</td>
</tr>
<tr>
<td>14T</td>
<td></td>
<td>3.66</td>
<td>7.62</td>
<td>8.14</td>
<td>9.82</td>
<td>12.7</td>
<td>18.8</td>
<td>26.0</td>
<td>31.0</td>
</tr>
<tr>
<td>10T</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Delay (ns)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Hybrid-1</td>
<td></td>
<td>8.56</td>
<td>3.63</td>
<td>0.734</td>
<td>0.484</td>
<td>0.376</td>
<td>0.316</td>
<td>0.276</td>
<td>0.256</td>
</tr>
<tr>
<td>Hybrid-2</td>
<td>1.40</td>
<td>0.708</td>
<td>0.479</td>
<td>0.412</td>
<td>0.320</td>
<td>0.275</td>
<td>0.256</td>
<td>0.239</td>
<td>0.231</td>
</tr>
<tr>
<td>C-CMOS</td>
<td>1.42</td>
<td>0.756</td>
<td>0.531</td>
<td>0.397</td>
<td>0.333</td>
<td>0.292</td>
<td>0.269</td>
<td>0.252</td>
<td>0.244</td>
</tr>
<tr>
<td>CPL</td>
<td>0.908</td>
<td>0.468</td>
<td>0.321</td>
<td>0.236</td>
<td>0.197</td>
<td>0.184</td>
<td>0.179</td>
<td>0.172</td>
<td>0.173</td>
</tr>
<tr>
<td>TFA</td>
<td>1.53</td>
<td>0.777</td>
<td>0.511</td>
<td>0.385</td>
<td>0.322</td>
<td>0.288</td>
<td>0.270</td>
<td>0.255</td>
<td>0.252</td>
</tr>
<tr>
<td>TGA</td>
<td>1.42</td>
<td>0.721</td>
<td>0.497</td>
<td>0.383</td>
<td>0.321</td>
<td>0.294</td>
<td>0.274</td>
<td>0.257</td>
<td>0.250</td>
</tr>
<tr>
<td>14T</td>
<td></td>
<td>3.60</td>
<td>2.06</td>
<td>0.902</td>
<td>0.531</td>
<td>0.382</td>
<td>0.303</td>
<td>0.271</td>
<td>0.268</td>
</tr>
<tr>
<td>10T</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>584</td>
</tr>
<tr>
<td>Power Delay Product (fJ)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Hybrid-1</td>
<td></td>
<td>18.83</td>
<td>15.32</td>
<td>3.171</td>
<td>2.647</td>
<td>2.666</td>
<td>3.350</td>
<td>3.643</td>
<td>4.506</td>
</tr>
<tr>
<td>Hybrid-2</td>
<td>1.285</td>
<td>1.062</td>
<td>1.126</td>
<td>1.438</td>
<td>1.520</td>
<td>1.757</td>
<td>2.230</td>
<td>2.940</td>
<td>3.673</td>
</tr>
<tr>
<td>C-CMOS</td>
<td>1.193</td>
<td>1.096</td>
<td>1.126</td>
<td>1.441</td>
<td>1.635</td>
<td>1.819</td>
<td>2.359</td>
<td>3.125</td>
<td>3.880</td>
</tr>
<tr>
<td>CPL</td>
<td>0.935</td>
<td>0.796</td>
<td>0.847</td>
<td>0.963</td>
<td>1.111</td>
<td>1.420</td>
<td>2.005</td>
<td>2.408</td>
<td>3.062</td>
</tr>
<tr>
<td>TFA</td>
<td>2.295</td>
<td>1.772</td>
<td>1.840</td>
<td>1.756</td>
<td>2.012</td>
<td>2.376</td>
<td>2.862</td>
<td>3.799</td>
<td>4.435</td>
</tr>
<tr>
<td>14T</td>
<td></td>
<td>31.33</td>
<td>27.66</td>
<td>5.97</td>
<td>4.75</td>
<td>4.78</td>
<td>5.94</td>
<td>7.18</td>
<td>7.94</td>
</tr>
<tr>
<td>10T</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>60.5</td>
</tr>
</tbody>
</table>

Fig. 3.16 shows the visualized comparison based on the data in Table 3.2.
**Figure 3.16** (a) Average power performance; (b) Worst-case delay performance

The simulation results show that the 10T adder cell fails to function at low voltage. The lowest supply voltage at which it can operate at 100MHz is 1.8V. Its excessive power dissipation and long delay are attributed to the threshold voltage drop problem and the poor driving capability of some internal nodes for input combinations that create non-full-swing transitions.

The speed of the 14T cell degrades faster than that of the other adder cells as the supply voltage is reduced, and so does its power-delay product. The XOR/XNOR module circuit of 14T is the same as that of the Hybrid-1 cell, so both suffer from the threshold voltage drop problem discussed in Section 3.3. Although the Hybrid-1 cell has a higher transistor count, it outperforms 14T in terms of power and power-delay product, due to the merits of its XOR and carry generator modules. In addition, 14T fails to function at 0.8V.

The simulation results also show that C-CMOS, TFA, TGA, CPL and Hybrid-2 can still work reliably when the supply voltage drops to 0.8V. Although TFA and TGA have fewer transistors, their lack of drivability means that additional buffers are required at each output, which increases their short-circuit power as well as their switching power.

It is shown that Hybrid-2 and C-CMOS full adder are the most power efficient cells. Hybrid-2 is slightly faster than C-CMOS and as a result, it exhibits smaller power-delay-product than C-CMOS except at very low voltage of 0.8V. Due to the decoupling of the input and output circuitries, good drivability can be achieved by the proposed Hybrid cell. The full swing restoring transistors in Module 1 of the Hybrid cell overcome the weak logic problem, allowing it to operate reliably down to an ultra low voltage of 0.8V.

Although Hybrid-2 has four more transistors than Hybrid-1, it does not consume more power than Hybrid-1. This is because the elimination of the threshold voltage drop problem has allowed full voltage swing at the intermediate nodes, XOR and XNOR. Thus, spurious transitions and unnecessary switching activities are reduced in the XOR and carry generator modules. Therefore, the overall power consumption of the full adder cell is lower. Since Hybrid-2 cell has better power and delay performance than Hybrid-1 cell, it is not surprising that it is more energy-efficient than Hybrid-1. This is particularly prominent at low supply voltage.

C-CMOS works reliably over the entire voltage range, especially at very low voltage. However, as already mentioned in Section 3.2, it generates the sum using the carry signal as an input, which causes undesirable additional delay. This delay skew also leads to more spurious transitions in the cascaded stage. We define a normalized delay difference factor $\delta$:

$$\delta = \frac{t_{\text{max}} - t_{\text{min}}}{t_{\text{max}}} \tag{3.12}$$

where $t_{\text{max}}$ is the larger delay value of the sum and carry outputs, and $t_{\text{min}}$ is the smaller delay value of the two output signals.
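
As a minimal sketch of how (3.12) is evaluated, the following Python function computes $\delta$ from a pair of sum and carry delays. The function name and the delay values are hypothetical and chosen only for illustration; they are not simulation results from this work.

```python
def delay_difference_factor(t_sum: float, t_carry: float) -> float:
    """Normalized delay difference factor of (3.12)."""
    t_max = max(t_sum, t_carry)  # larger delay of the two outputs
    t_min = min(t_sum, t_carry)  # smaller delay of the two outputs
    if t_max <= 0:
        raise ValueError("delays must be positive")
    return (t_max - t_min) / t_max


# Hypothetical delays in ns, for illustration only
balanced = delay_difference_factor(0.30, 0.27)
skewed = delay_difference_factor(0.30, 0.15)
print(f"balanced outputs: {balanced:.2f}, skewed outputs: {skewed:.2f}")
```

A value of $\delta$ close to 0 indicates well balanced sum and carry outputs, while a value approaching 1 indicates a large skew between them.
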

The $\delta$ values of Hybrid-2 cell and C-CMOS cell are plotted against the supply voltage in Fig. 3.17. It is shown that the outputs of Hybrid-2 cell are more balanced than those of C-CMOS especially at low voltage.

**Figure 3.17** Normalized delay difference factors between Hybrid-2 and C-CMOS

Fig. 3.18 shows a snapshot of the sum and carry outputs of Hybrid-2 and C-CMOS full adder cells. Due to the unbalanced output of C-CMOS adder, its outputs generate more glitches than those of Hybrid-2 cell for the same input transitions.

**Figure 3.18** Comparison of output waveforms between Hybrid-2 and C-CMOS

The skew of the sum and carry outputs causes the carry-save outputs of the Wallace/Dadda tree [WAN95] to have an irregular input arrival profile. Although the fast carry can be consumed in the critical path of the adder in the next stage and the slower sum in the faster path of the next stage to construct an optimized adder tree [OKL96], such a construction removes the flexibility to redistribute the adder cells to maximize the area efficiency and to eliminate cross-stage interconnections, as described by the technique presented in [WAN95]. Therefore, if wiring length and area efficiency are prominent cost function elements, which is the case for large arithmetic circuits in ultra deep submicron technology, balanced delays that do not jeopardize the worst-case delay are a desirable attribute.

Despite being the fastest circuit, CPL consumes higher power than Hybrid-2 and C-CMOS because of its dual-rail structure and the substantial number of internal nodes. The additional inverters used to generate the complement inputs have also increased the power consumption. This excessive overhead offsets the advantages of efficient XOR realization offered by this logic design style. Although CPL has achieved an overall good power-delay product due to its excellent speed, the performance is highly sensitive to the transistor scaling as observed in the optimization process described in Section 3.4.2. Besides, the need to generate complementary signals for all the surrounding circuits also limits the usage of CPL.

The complete layout of the Hybrid-1 cell is shown in Fig. 3.19. Besides exploiting the complementary outputs of the XOR/XNOR module to produce a full-swing carry, the new carry generator circuit has a very regular structure. Compared with the layouts of the other two modules, both of which have only 6 transistors, the area saved by the reduced layout complexity of the carry generator module compensates for its higher transistor count (10 transistors).

The layout of the Hybrid-2 cell is shown in Fig. 3.20. Compared with the Hybrid-1 cell, the layout of its Module 1 is more regular and fits nicely with the other two modules in a complete full adder cell.

**Figure 3.19** Layout of the Hybrid-1 cell

**Figure 3.20** Layout of the Hybrid-2 cell

All full adder cells are laid out with optimized sizing and spacing in compliance with the design rules of the TSMC 0.18 μm CMOS process. The length, width and overall area of the adder cells are listed in Table 3.3. The layouts of the CPL and TGA full adders occupy the most silicon area. The CPL adder needs more metal lines to connect the complementary inputs. Besides, the style of the transistor connection of CPL is not suitable for area-efficient layout. The TGA adder is composed of transmission gates, which occupy more area due to the inefficient usage of the n-type wells. The layout of the Hybrid-2 cell occupies a much smaller silicon area, which is less than 60% of the area of the CPL cell. It is only slightly larger than the 14T and 10T adders. The area advantage of 14T and 10T is due primarily to their smaller number of transistors, but the area gained from the reduced transistor count is offset by the penalty of the irregular circuit structure of pass transistors. Besides, their overall performance is also inferior.

Table 3.3 Length, width and overall area of the full adder cells

<table>
<thead>
<tr>
<th></th>
<th>Hybrid-1</th>
<th>Hybrid-2</th>
<th>C-CMOS</th>
<th>CPL</th>
<th>TFA</th>
<th>TGA</th>
<th>14T</th>
<th>10T</th>
</tr>
</thead>
<tbody>
<tr>
<td>Length (μm)</td>
<td>9.67</td>
<td>7.18</td>
<td>17.37</td>
<td>11.20</td>
<td>9.58</td>
<td>14.07</td>
<td>12.10</td>
<td>11.2</td>
</tr>
<tr>
<td>Width (μm)</td>
<td>8.415</td>
<td>10.97</td>
<td>5.76</td>
<td>12.20</td>
<td>10.08</td>
<td>9.59</td>
<td>5.64</td>
<td>6.30</td>
</tr>
<tr>
<td>Area (μm²)</td>
<td>81.373</td>
<td>78.76</td>
<td>100.05</td>
<td>136.64</td>
<td>96.57</td>
<td>134.88</td>
<td>68.24</td>
<td>70.56</td>
</tr>
</tbody>
</table>
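
The area figures of Table 3.3 can be cross-checked against the product of length and width, together with the Hybrid-2 to CPL area ratio quoted in the text. The short Python sketch below does this; the variable and function names are illustrative assumptions, and the numbers are copied from the table.

```python
# (length in um, width in um, reported area in um^2), copied from Table 3.3
cells = {
    "Hybrid-1": (9.67, 8.415, 81.373),
    "Hybrid-2": (7.18, 10.97, 78.76),
    "C-CMOS": (17.37, 5.76, 100.05),
    "CPL": (11.20, 12.20, 136.64),
    "TFA": (9.58, 10.08, 96.57),
    "TGA": (14.07, 9.59, 134.88),
    "14T": (12.10, 5.64, 68.24),
    "10T": (11.2, 6.30, 70.56),
}


def area_mismatch(length_um, width_um, reported_um2):
    """Absolute difference between length x width and the reported area."""
    return abs(length_um * width_um - reported_um2)


mismatch = {name: area_mismatch(*dims) for name, dims in cells.items()}
ratio_hybrid2_to_cpl = cells["Hybrid-2"][2] / cells["CPL"][2]

for name, diff in mismatch.items():
    print(f"{name}: |L*W - area| = {diff:.3f} um^2")
print(f"Hybrid-2 / CPL area ratio = {ratio_hybrid2_to_cpl:.3f}")
```

The ratio evaluates to roughly 0.58, consistent with the statement that the Hybrid-2 layout occupies less than 60% of the CPL area.
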

The post-layout simulation includes the parasitic information extracted from both the transistors and the interconnections. Owing to the small scale of the 1-bit full adder circuits, the power, delay and power-delay product of the extracted circuits (RC extraction) do not deviate much from those of the pre-layout circuits, especially for the Hybrid-2 cell. This is because of its regular structure and shorter wire length. The pre- and post-layout performance at some standard supply voltages is compared in Table 3.4.

Table 3.4 The pre- and post-layout simulation results of the Hybrid-1 and Hybrid-2 cells

<table>
<thead>
<tr>
<th></th>
<th>Hybrid-1 (pre)</th>
<th>Hybrid-1 (post)</th>
<th>Hybrid-2 (pre)</th>
<th>Hybrid-2 (post)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.4V</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Power (µW)</td>
<td>17.63</td>
<td>19.24</td>
<td>15.92</td>
<td>16.32</td>
</tr>
<tr>
<td>Delay (ps)</td>
<td>256.02</td>
<td>252.41</td>
<td>231.43</td>
<td>234.98</td>
</tr>
<tr>
<td>PDP (fJ)</td>
<td>4.51</td>
<td>4.86</td>
<td>3.68</td>
<td>3.83</td>
</tr>
<tr>
<td><strong>1.8V</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Power (µW)</td>
<td>7.090</td>
<td>7.811</td>
<td>6.392</td>
<td>6.694</td>
</tr>
<tr>
<td>Delay (ps)</td>
<td>376.4</td>
<td>362.8</td>
<td>275.3</td>
<td>278.4</td>
</tr>
<tr>
<td>PDP (fJ)</td>
<td>2.67</td>
<td>2.83</td>
<td>1.76</td>
<td>1.86</td>
</tr>
<tr>
<td><strong>1.0V</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Power (µW)</td>
<td>2.20</td>
<td>2.46</td>
<td>1.50</td>
<td>1.47</td>
</tr>
<tr>
<td>Delay (ps)</td>
<td>856.31</td>
<td>950.86</td>
<td>708.35</td>
<td>675.81</td>
</tr>
<tr>
<td>PDP (fJ)</td>
<td>1.88</td>
<td>2.34</td>
<td>1.06</td>
<td>0.99</td>
</tr>
</tbody>
</table>
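
The power-delay product entries follow directly from the power and delay rows. As a minimal sketch, the Python snippet below recomputes the PDP for the 2.4V rows of Table 3.4; the helper name `pdp_fj` is a hypothetical choice, and the only assumption is the unit conversion 1 µW × 1 ps = 10⁻³ fJ.

```python
def pdp_fj(power_uw: float, delay_ps: float) -> float:
    """Power-delay product in fJ from power in uW and delay in ps."""
    # 1 uW * 1 ps = 1e-18 J = 1e-3 fJ
    return power_uw * delay_ps * 1e-3


# (power in uW, delay in ps, reported PDP in fJ) at 2.4V, copied from Table 3.4
rows_2v4 = {
    "Hybrid-1 (pre)": (17.63, 256.02, 4.51),
    "Hybrid-1 (post)": (19.24, 252.41, 4.86),
    "Hybrid-2 (pre)": (15.92, 231.43, 3.68),
    "Hybrid-2 (post)": (16.32, 234.98, 3.83),
}

for name, (power, delay, reported) in rows_2v4.items():
    print(f"{name}: computed {pdp_fj(power, delay):.2f} fJ, reported {reported} fJ")
```
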

3.5 Summary

For full adder cell design, pass-logic circuits are thought to dissipate minimal power and occupy a smaller area because they use fewer transistors. Thus, the CPL adder is considered to perform better than the CMOS adder in [ZIM97]. However, in our opinion, a pass-logic circuit usually has an irregular structure, which increases the wiring complexity, and its performance is highly susceptible to transistor sizing. On the other hand, the complementary CMOS logic circuit has the advantages of layout regularity and stability at low voltage. Therefore, because different applications impose different design constraints, each logic design style has its place in cell library development.

An optimization procedure is proposed to size the transistors of the full adder cells in order to perform a fair comparison, because the transistor sizing for optimal performance is technology dependent. The optimization procedure uses a sweeping method to search for the optimized transistor sizes iteratively, and has been shown to converge.

In the past, full adders were often evaluated in isolation without regard to how they are deployed in an actual circuit [BUI02, RAD01, VES99(1), ZHU92]. We argue that a 1-bit full adder cell that functions correctly in stand-alone simulation is not sufficient to validate its actual performance, or even its functionality, when it is integrated into a larger circuit. Inadequate consideration in the simulation setup tends to produce optimistic performance figures that exceed the capability of the circuit being simulated. In this chapter, we proposed a reasonably simple architecture to simulate the adder cell in an environment realistic to its actual deployment in the most frequently used parallel multiplier structure.

In summary, two novel 1-bit adder cells, each consisting of XOR/XNOR, sum and carry-out subcircuits, are proposed. The pass logic design style is used to efficiently generate the XOR and XNOR functions simultaneously, and a carry output with good drivability is generated by a novel complementary CMOS style circuit with regular layout. In addition, the last-stage inverter decouples the output from the input to improve the driving capability. Despite having a higher transistor count than the recently reported designs, the two cells have been shown to be highly power efficient over a wide supply voltage range. The balanced sum and carry outputs also offer considerable flexibility in allocating the adder cells in a tree structured circuit, to eliminate as many cross-stage interconnections as possible and to reduce the maximum length of in-stage interconnections without affecting the critical path delay, as there is no discrimination on any port for any legitimate connection [WAN95]. Although its power-delay performance is comparable to that of C-CMOS and poorer than that of CPL, the area-efficient layout makes it a good choice for implementing large tree structured arithmetic circuits when the overall performance and area efficiency are prominent cost function elements.
Chapter 4

Low Voltage, Low Power 4-2 and 5-2 Compressors for Fast Arithmetic Circuits

4.1 Introduction

In the last chapter, we explored different CMOS logic styles for the design and performance evaluation of low voltage, low power full adder cells. These primitive full adder cells are also referred to as 3-2 counters or compressors when they are used to compress the columns of the partial product matrix in the CSA tree of a parallel multiplier. To reduce the number of stages in the CSA tree, 4-2 and 5-2 compressors are frequently employed in large high-speed multipliers. This chapter presents several architectures and designs of low power 4-2 and 5-2 compressors capable of operating at ultra-low supply voltage. The architectures of these compressors are anatomized into their constituent modules, and different static logic styles based on the same advanced submicron CMOS process model are used to realize them. Different configurations of each architecture, which include a number of novel 4-2 and 5-2 compressor designs, are prototyped and simulated to evaluate their performance in speed, power dissipation and power-delay product. The newly developed circuits are based on various configurations of the novel 5-2 compressor architecture with the new carry generation circuit, or on existing architectures configured with the proposed circuit for the XOR-XNOR module. Driving capability has been considered in the design as well as in the simulation setup so that these 4-2 and 5-2 compressor cells can operate reliably in any tree structured parallel multiplier at very low supply voltage. Two new simulation environments are created to ensure that the measured performance reflects realistic circuit operation in the system into which these cells are integrated. Simulation results show that the 4-2 compressor with the proposed XOR-XNOR module and the new fast 5-2 compressor architecture are able to function at supply voltages as low as 0.6V, and outperform many other architectures, including the classical complementary CMOS logic compressors and variants of compressors constructed with various combinations of recently reported superior low-power logic cells [MAR99, PRA01, RAD00, KWO00].

The rest of this chapter is organized as follows. Section 4.2 investigates the performance of several fast 4-2 compressors. The 4-2 compressors constructed around the XOR-XNOR cell proposed in Chapter 3 exhibit superior power efficiency compared with other configurations of the same architecture. Section 4.3 scrutinizes the 5-2 compressor architectures and their underlying building modules. A new fast 5-2 compressor architecture is proposed, together with a new circuit for its carry generation module. This new architecture performs well with almost any configuration of logic styles and its overall performance is the best among the known 5-2 compressor architectures. Special circuits for the simulation of the 4-2 and 5-2 compressors are described in Section 4.4. These setups emulate a realistic application environment that reflects the actual operability of the cells in a tree-structured multiplier. The simulation results are given and analyzed. Finally, the summary in Section 4.5 concludes the chapter.

4.2 4-2 compressor

A 4-2 compressor has five inputs and three outputs, as shown in Fig. 4.1. The four inputs $x_1$, $x_2$, $x_3$ and $x_4$, and the output $sum$ have the same weight. The output $carry$ is weighted one binary bit order higher. The 4-2 compressor receives an input $c_{in}$ from the preceding module of one binary bit order lower in significance, and produces an output $c_{out}$ to the next compressor module of higher significance. Different structures of 4-2 compressors exist and they all have to abide by the fundamental equation (4.1):

\[
x_1 + x_2 + x_3 + x_4 + c_{in} = \text{sum} + 2 \cdot (\text{carry} + c_{out}) \tag{4.1}
\]

Besides, to accelerate the carry save summation of the partial products, it is imperative that the output, $c_{out}$, be independent of the input, $c_{in}$.

The conventional implementation of a 4-2 compressor is composed of two serially connected full adders, as shown in Fig. 4.2. At gate level, high input compressors are anatomized into XOR gates and carry generators normally implemented by multiplexers (MUX). Therefore, different designs can be classified based on the critical path delay in terms of the number of primitive gates. Let $\Delta_{\text{XOR}}$ denote the delay of an XOR gate and $\Delta_{\text{CGEN}}$ denote the delay of a carry generator. A compressor is said to have a delay of $(m\Delta_{\text{XOR}} + n\Delta_{\text{CGEN}})$ if its critical path consists of $m$ XOR gates and $n$ carry generators. Since the difference between the delays of widely used XOR gate and carry generator is trivial in an optimized design, the delay of the compressor is more commonly specified as $(m + n)\Delta$. Therefore, the straightforward implementation of a 4-2 compressor of Fig. 4.2 has a long critical path delay of $4\Delta$ [HSI98]. Additionally, due to the uneven delay profiles of the outputs arriving from different input paths, the carry save adder (CSA) tree for the partial product accumulation constructed from such cells tends to generate a lot of glitches.
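
The $(m + n)\Delta$ delay measure amounts to counting gates along the longest path. The Python sketch below illustrates this for the conventional 4-2 compressor, assuming each of the two cascaded full adders is realized as two XOR gates plus one multiplexer-style carry generator. The netlist and its node names (`p1`, `p2`, `s`) are hypothetical and serve only to illustrate the counting.

```python
from functools import lru_cache

# Assumed gate-level netlist of two cascaded full adders (hypothetical names).
# Each entry maps a gate output to the signals feeding that gate;
# primary inputs (x1..x4, cin) have no entry.
NETLIST = {
    "p1": ["x1", "x2"],            # XOR inside the first full adder
    "s": ["p1", "x3"],             # XOR: intermediate sum
    "cout": ["p1", "x3", "x1"],    # carry generator of the first full adder
    "p2": ["s", "x4"],             # XOR inside the second full adder
    "sum": ["p2", "cin"],          # XOR: final sum
    "carry": ["p2", "cin", "x4"],  # carry generator of the second full adder
}


@lru_cache(maxsize=None)
def depth(signal: str) -> int:
    """Number of gates (in units of Delta) on the longest path to `signal`."""
    fanin = NETLIST.get(signal)
    if fanin is None:  # primary input
        return 0
    return 1 + max(depth(src) for src in fanin)


critical = max(depth(out) for out in ("sum", "carry", "cout"))
print(f"critical path = {critical} Delta")
```

Under these assumptions the longest path passes through four gates, in agreement with the $4\Delta$ delay quoted for Fig. 4.2.
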

**Figure 4.1 4-2 compressor**

**Figure 4.2 Conventional 4-2 compressor - $4\Delta$**

A 4-2 compressor flattened and optimized at gate level to reduce the critical path delay is shown in Fig. 4.3 [RAD00]. It uses more than 80 transistors when implemented in conventional or complementary CMOS logic style. As this circuit is capable of functioning below 1V, it is used as a benchmark for evaluating the performance of other low voltage and low power 4-2 compressor circuits.

**Figure 4.3 Logic level optimized CMOS 4-2 compressor**

A recent design of 4-2 compressor [GU03(1), MAR99, PRA01, RAD00] is derived from the modified equations for the functions of Fig. 4.2. The three outputs of the design are described by (4.2), (4.4) and (4.5) as follows:

\[
c_{out} = (x_1 \oplus x_2) \cdot x_3 + x_1 \cdot x_2 = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \tag{4.2}
\]

\[
s = x_1 \oplus x_2 \oplus x_3 \tag{4.3}
\]

\[
\text{sum} = s \oplus x_4 \oplus c_{in} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus c_{in} \tag{4.4}
\]

\[
\text{carry} = (s \oplus x_4) \cdot c_{in} + s \cdot x_4 = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot c_{in} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 \tag{4.5}
\]
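
As a sanity check, the output functions (4.2), (4.4) and (4.5) can be evaluated exhaustively against the fundamental equation (4.1). The Python sketch below is a behavioural model only, not a circuit description. The function names are illustrative assumptions, and each multiplexer-style carry generator is modelled as a conditional selection controlled by the XOR signal.

```python
from itertools import product


def compressor_4_2(x1, x2, x3, x4, cin):
    """Behavioural model of the 4-2 compressor outputs (sum, carry, cout)."""
    p = x1 ^ x2
    cout = x3 if p else x1    # (4.2): select x3 when x1 XOR x2 = 1, else x1
    s = p ^ x3                # (4.3)
    q = s ^ x4
    total = q ^ cin           # (4.4)
    carry = cin if q else x4  # (4.5): select cin when s XOR x4 = 1, else x4
    return total, carry, cout


def check_4_2():
    """True if (4.1) holds for all inputs and cout never depends on cin."""
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
        if x1 + x2 + x3 + x4 + cin != total + 2 * (carry + cout):
            return False
        if cout != compressor_4_2(x1, x2, x3, x4, 1 - cin)[2]:
            return False
    return True


print("4-2 compressor satisfies (4.1):", check_4_2())
```

The second test inside `check_4_2` covers the requirement that $c_{out}$ be independent of $c_{in}$.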

The two carry signals, $carry$ and $c_{out}$, are generated from both the Exclusive OR (XOR) and Exclusive NOR (XNOR) functions of the input signals. The $sum$ output is generated by several two-input XOR circuits, some internal signals of which can be used to generate the two carries. Fig. 4.4 shows the logic decomposition of this 4-2 compressor architecture. It is mainly composed of six modules, four of which are XOR circuits and the other two are 2-1 multiplexers. Three special XOR-XNOR modules marked with "XOR*" generate both the XOR and XNOR signals simultaneously for the other modules driven by them. This design has a critical path delay of $3\Delta$, which is $1\Delta$ shorter than that of the conventional implementation. Besides, its outputs feature balanced signal arrival times from each of the data inputs ($x_1$ to $x_4$), thanks to the special "XOR*" modules.

As both the XOR and XNOR signals are required in the carry generation circuits of Fig. 4.4, circuits capable of co-generating these two signals, as depicted in Fig. 3.4 in Chapter 3, are beneficial to the implementation of the special XOR* modules. Although there is no need for the XOR* module with inputs of $x_3$ and $x_4$ to provide the XNOR output, the same structure as other XOR* modules is still preferred in order to avoid skewed delay paths.

The layout of the proposed new 4-2 compressor is shown in Fig. 4.5. It is based on the novel XOR/XNOR cell of Fig. 3.4(e), the XOR circuit of Fig. 3.6(b) for the generation of the sum, and the carry generation circuit of Fig. 3.7(c) from Chapter 3. It occupies a silicon area of $22\mu m \times 17\mu m$. 

**Figure 4.4** Logic decomposition of 4-2 compressor - $3\Delta$

4.3 5-2 compressor

4.3.1 5-2 compressor architectures

The 5-2 compressor is another widely used building block for high precision and high speed multipliers. The block diagram of a 5-2 compressor, which has seven inputs and four outputs, is shown in Fig. 4.6. Five of the inputs are the primary inputs $x_1, x_2, x_3, x_4$ and $x_5$, and the other two inputs, $c_{in1}$ and $c_{in2}$, receive their values from the neighboring compressor of one binary bit order lower in significance. All seven inputs have the same weight. The 5-2 compressor generates an output, $sum$, of the same weight as the inputs, and three outputs, $carry$, $c_{out1}$ and $c_{out2}$, weighted one binary bit order higher. The outputs $c_{out1}$ and $c_{out2}$ are fed to the neighboring compressor of higher significance. All 5-2 compressors of different designs abide by the generic equation (4.6):

$$x_1 + x_2 + x_3 + x_4 + x_5 + c_{in1} + c_{in2} = sum + 2 \cdot (carry + c_{out1} + c_{out2}) \quad (4.6)$$

Besides, to speed up the carry save summation of the partial products, the output $c_{out1}$ must be independent of the inputs $c_{in1}$ and $c_{in2}$, and the output $c_{out2}$ must be independent of the input $c_{in2}$.

**Figure 4.5** Layout of the new 4-2 compressor using the proposed XOR-XNOR cell

A simple implementation of the 5-2 compressor is to cascade three full adders in a hierarchical structure, as shown in Fig. 4.7, which has a critical path delay of $6\Delta$.

Fig. 4.8 shows a modified architecture of the 5-2 compressor [PRA01], which has a critical path delay of $5\Delta$. The XOR* modules in the figure generate both the XOR and XNOR signals simultaneously, as described in Chapter 3. The style and structure of the circuit share some common attributes with the recently published structural designs of full adders [PRA01, RAD01, SHA97, SHA00, SHA02] and 4-2 compressors [GU03(1), PRA01, RAD00].

In spite of the structural differences between the implementations of Fig. 4.7 and Fig. 4.8, the formulae to generate the output signals are essentially derived from the same basic architecture of Fig. 4.7. Each full adder can be logically expressed as:

\[ s_{FA} = a \oplus b \oplus c \tag{4.7} \]

\[ c_{FA} = (a \oplus b) \cdot c + \overline{(a \oplus b)} \cdot a \tag{4.8} \]

where $a$, $b$ and $c$ are the primary inputs, and $s_{FA}$ and $c_{FA}$ are the primary outputs of the full adder.

It follows that the outputs and the internal nodes of Fig. 4.7 can be expressed by the following set of equations.

\[ s_1 = x_1 \oplus x_2 \oplus x_3 \tag{4.9} \]

\[ c_{out1} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \tag{4.10} \]

\[
\begin{aligned}
s_2 &= s_1 \oplus x_4 \oplus c_{in1} \\
&= x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus c_{in1}
\end{aligned} \tag{4.11}
\]

\[
\begin{aligned}
c_{out2} &= (x_4 \oplus s_1) \cdot c_{in1} + \overline{(x_4 \oplus s_1)} \cdot x_4 \\
&= (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot c_{in1} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4
\end{aligned} \tag{4.12}
\]

\[
\begin{aligned}
\text{sum} &= s_2 \oplus x_5 \oplus c_{in2} \\
&= x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1} \oplus c_{in2}
\end{aligned} \tag{4.13}
\]

\[
\begin{aligned}
\text{carry} &= (x_5 \oplus s_2) \cdot c_{in2} + \overline{(x_5 \oplus s_2)} \cdot x_5 \\
&= (x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1}) \cdot c_{in2} \\
&\quad + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1})} \cdot x_5
\end{aligned} \tag{4.14}
\]

A faster 5-2 compressor proposed in [KWO00] is shown in Fig. 4.9. This architecture uses a different method to generate the outputs $c_{out1}$ and $c_{out2}$. Although this architecture produces different output bit patterns in $c_{out1}$ and $c_{out2}$ for the same input data, it still abides by (4.6). It is claimed to have a delay of $4\Delta$. (4.15) - (4.18) show the output functions:

\[ \text{sum} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1} \oplus c_{in2} \tag{4.15} \]

\[ c_{out1} = (x_1 + x_2) \cdot (x_3 + x_4) \tag{4.16} \]

\[
\begin{aligned}
c_{out2} &= \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot (x_1 \cdot x_2 + x_3 \cdot x_4) \\
&\quad + (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot c_{in1}
\end{aligned} \tag{4.17}
\]

\[
\begin{aligned}
\text{carry} &= \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1})} \cdot x_5 \\
&\quad + (x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1}) \cdot c_{in2}
\end{aligned} \tag{4.18}
\]

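
The claim that both the cascaded full adder structure of (4.9) - (4.14) and the architecture of (4.15) - (4.18) abide by (4.6), while producing different $c_{out1}$ and $c_{out2}$ patterns, can be checked exhaustively. The Python sketch below is a behavioural model under the assumption that the second product term of each carry function is gated by the complement (XNOR) of the controlling XOR signal. All function names are hypothetical.

```python
from itertools import product


def mux_carry(sel, on_one, on_zero):
    """Carry generator modelled as a 2-1 multiplexer controlled by `sel`."""
    return on_one if sel else on_zero


def cascaded_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Three cascaded full adders, following (4.9) - (4.14)."""
    p = x1 ^ x2
    s1 = p ^ x3                     # (4.9)
    cout1 = mux_carry(p, x3, x1)    # (4.10)
    q = s1 ^ x4
    s2 = q ^ cin1                   # (4.11)
    cout2 = mux_carry(q, cin1, x4)  # (4.12)
    r = s2 ^ x5
    total = r ^ cin2                # (4.13)
    carry = mux_carry(r, cin2, x5)  # (4.14)
    return total, carry, cout1, cout2


def kwo00_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Output functions (4.15) - (4.18)."""
    q = x1 ^ x2 ^ x3 ^ x4
    cout1 = (x1 | x2) & (x3 | x4)                      # (4.16)
    cout2 = mux_carry(q, cin1, (x1 & x2) | (x3 & x4))  # (4.17)
    r = q ^ x5 ^ cin1
    total = r ^ cin2                                   # (4.15)
    carry = mux_carry(r, cin2, x5)                     # (4.18)
    return total, carry, cout1, cout2


def satisfies_4_6(compressor):
    """True if the compressor obeys (4.6) for all 128 input combinations."""
    for bits in product((0, 1), repeat=7):
        total, carry, cout1, cout2 = compressor(*bits)
        if sum(bits) != total + 2 * (carry + cout1 + cout2):
            return False
    return True


print(satisfies_4_6(cascaded_5_2), satisfies_4_6(kwo00_5_2))
```

For instance, with $x_1 = x_2 = 1$ and all other inputs at 0, the cascaded structure gives $(c_{out1}, c_{out2}) = (1, 0)$ whereas the second model gives $(0, 1)$. Both satisfy (4.6).
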
We propose a new and faster architecture with the same theoretical critical path delay of $4\Delta$. The structure of our 5-2 compressor is shown in Fig. 4.10(a). It uses one module fewer than the 5-2 compressor of Fig. 4.9. The mechanisms for generating the carry output signals $carry$, $c_{out1}$ and $c_{out2}$ are fundamentally different from those of the existing architectures. The carry generator CGEN1 is used to produce the signal $c_{out1}$. In anticipation that this carry generator is most likely to be situated in the longest delay path in the tree structured multiplier, it is implemented as a complex gate fed only with the primary inputs. Although several more transistors are used, this carry generator is faster than one controlled by an XOR* module. The other two carry generators, CGEN2, which are used to produce $carry$ and $c_{out2}$, are still controlled by the corresponding XOR* modules. Unlike CGEN1, one of the inputs to CGEN2 is either $c_{in1}$ or $c_{in2}$, which comes from an output of the compressor of the preceding stage. Since these signals often arrive later than the primary inputs, implementing CGEN2 with a more costly complex gate like CGEN1 is not necessary, as it will not help to reduce the critical path delay of the tree structured multiplier anyway.

The carry generators CGEN2 in Fig. 4.10(a) can also be implemented by multiplexers, as shown in Fig. 4.10(b). CGEN2 of Fig. 4.10(a) is better implemented in complementary CMOS logic style, which offers low capacitive loading to the driving XOR* modules, while the MUX-based carry generator modules of Fig. 4.10(b) are more suitable for realization with other logic styles.

The output functions of our proposed architecture in Fig. 4.10 are described by (4.19) – (4.22).

\[
\begin{align*}
\text{sum} &= x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1} \oplus c_{in2} \tag{4.19} \\
c_{out1} &= (x_1 + x_2) \cdot x_3 + x_1 \cdot x_2 \tag{4.20} \\
c_{out2} &= (x_4 \oplus x_5) \cdot c_{in1} + x_4 \cdot x_5 \equiv (x_4 \oplus x_5) \cdot c_{in1} + \overline{(x_4 \oplus x_5)} \cdot x_4 \tag{4.21} \\
\text{carry} &= \left( (x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus c_{in1}) \right) \cdot c_{in2} \\
&\quad + (x_1 \oplus x_2 \oplus x_3) \cdot (x_4 \oplus x_5 \oplus c_{in1}) \\
&= \left( (x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus c_{in1}) \right) \cdot c_{in2} \\
&\quad + \overline{\left( (x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus c_{in1}) \right)} \cdot (x_1 \oplus x_2 \oplus x_3) \tag{4.22}
\end{align*}
\]
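
The output functions of the proposed architecture can be checked exhaustively at the logic level. The sketch below is only a behavioural model, not the transistor-level circuit: it reads $c_{out1}$, $c_{out2}$ and carry as the full-adder carries of $(x_1, x_2, x_3)$, $(x_4, x_5, c_{in1})$ and (the two partial XOR sums, $c_{in2}$), writes the two CGEN2 outputs in multiplexer form, and confirms over all 128 input combinations that the seven input bits equal the sum bit plus twice the three carry bits. The function names are illustrative and do not come from the thesis.

```python
from itertools import product


def maj(a, b, c):
    """Majority of three bits, i.e. the carry of a full adder."""
    return (a & b) | (b & c) | (a & c)


def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Logic-level model of the proposed 5-2 compressor (no timing)."""
    s1 = x1 ^ x2 ^ x3                       # XOR of the first three inputs
    s2 = x4 ^ x5 ^ cin1                     # XOR of x4, x5 and cin1
    cout1 = ((x1 | x2) & x3) | (x1 & x2)    # CGEN1: primary inputs only
    cout2 = cin1 if (x4 ^ x5) else x4       # CGEN2 in multiplexer form
    carry = cin2 if (s1 ^ s2) else s1       # CGEN2 in multiplexer form
    total = s1 ^ s2 ^ cin2                  # sum output
    return total, cout1, cout2, carry


def check_all_inputs():
    """Check the model against the arithmetic a 5-2 compressor must implement."""
    for bits in product((0, 1), repeat=7):
        x1, x2, x3, x4, x5, cin1, cin2 = bits
        total, cout1, cout2, carry = compressor_5_2(*bits)
        # Multiplexer forms must agree with the AND-OR (majority) forms.
        assert cout1 == maj(x1, x2, x3)
        assert cout2 == maj(x4, x5, cin1)
        assert carry == maj(x1 ^ x2 ^ x3, x4 ^ x5 ^ cin1, cin2)
        # Seven bits in; one sum bit plus three carries of double weight out.
        assert sum(bits) == total + 2 * (cout1 + cout2 + carry)
    return True
```

Because $c_{out1}$ depends only on $x_1$ to $x_3$, it is available without waiting for any XOR stage, which is the property the CGEN1 complex gate exploits.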

Based on the above formulae, we conjecture that lowering the critical path delay of the 5-2 compressor to 3Δ or below is almost impossible. However, different logic styles can still be explored at transistor level to design significantly improved low-power and high-speed 5-2 compressor cells for instantiation at architectural level. For example, a dual-rail 5-2 compressor [PRA01] has been proposed for the architecture of Fig. 4.8, in which the XOR modules are implemented as dual-rail multiplexers to improve the performance.

#### 4.3.2 Circuit of the proposed 5-2 compressor architectures

The design of Fig. 3.4(e) is recommended for the implementation of the XOR* modules to allow low-voltage operation. The XOR$^\wedge$ modules drive only internal modules and therefore do not need strong driving capability, so the design of Fig. 3.6(a) without the output buffer can be used for them. The sum output stage is implemented with the circuit of Fig. 3.6(b) to ensure adequate drivability. The new complementary CMOS logic style circuit of Fig. 3.8 from Chapter 3 is suggested for the implementation of the CGEN2 module.

The CGEN1 module, which generates $c_{out1}$, receives the primary inputs $x_1$ to $x_3$ directly. It can be implemented with the carry generator circuit of the full adder in complementary CMOS logic style, which is shown in Fig. 4.11(a) [NAG96, ZIM97]. This circuit uses two more transistors than the circuit of Fig. 3.8, but it does not need the XOR and XNOR signals from a preceding XOR* module, which prevents the $c_{out1}$ generation from being delayed by about $1\Delta$. Fig. 4.11(b) shows the circuit for the CGEN1 module implemented in CPL logic style [ZIM97].

![Figure 4.11 Implementations of the CGEN1 module](image)
The complete layout of the proposed 5-2 compressor of $4\Delta$ is shown in Fig. 4.12.

![Layout of the proposed 5-2 compressor](image)

**Figure 4.12 Layout of the proposed 5-2 compressor**

### 4.4 Simulation results

#### 4.4.1 Simulation environment

The simulations are performed with the Nassda HSIM 2.0 tool with the option "HSIMSPEED" set to "0". This option gives the slowest simulation but the highest accuracy, with results comparable to those of HSPICE simulation. All the circuits and layouts target the latest Chartered Semiconductor (CSM) 0.18μm CMOS technology, and the circuits are designed and optimized based on this process model.

The simulation environments for the 4-2 compressor and 5-2 compressor circuits are shown in Fig. 4.13. Each input and output pin is connected through buffers, which provides a realistic simulation environment reflecting the compressor operation in actual applications. The simulation environments for the 4-2 compressor and 5-2 compressor consist of two cascaded 4-2 compressors and three cascaded 5-2 compressors, respectively. These compressors run in parallel to simulate an actual compressor stage in the CSA tree. More than one compressor is used in the simulation because the critical paths of some data patterns may cross adjacent compressors in the same stage of the CSA tree. The dashed lines in Fig. 4.13 indicate such potential critical paths. The leftmost compressor of each simulation environment is inspected because it is the most likely to have the longest delay. For each input cycle, the delay is measured from the earliest input signal reaching 50% of the supply voltage to the latest output signal reaching 50% of the supply voltage. The worst-case delay is the largest delay among all input data.

For each circuit, 1024 input patterns are randomly generated by MATLAB and fed into the circuit as input stimuli. The circuits under test are simulated at various supply voltages, ranging from 0.6V to 3.3V. Two simulation frequencies (the rates at which data patterns are fed) are used: a computation rate of 100 MHz when the supply voltage is above 1.0V, and 10 MHz when the supply voltage is 1.0V or lower. It should be noted that the simulation frequency is not the maximum operating frequency of the compressors; the compressors simulated are capable of operating correctly at much higher frequencies. The average power consumption of the leftmost compressor is measured at every supply voltage, with the power consumed by the additional buffers excluded from the calculation.
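
The 50% delay rule described above can be illustrated on sampled waveforms. The following is a minimal sketch under stated assumptions: each signal is given as a pair of lists (time points, voltages) for a single input cycle, crossings are located by linear interpolation between samples, and the helper names `crossing_times`, `cycle_delay` and `worst_case_delay` are hypothetical. It shows the measurement rule only and is not how the simulator itself was configured.

```python
def crossing_times(t, v, level):
    """Times at which the sampled waveform v(t) crosses `level`."""
    times = []
    for i in range(1, len(t)):
        v0, v1 = v[i - 1], v[i]
        crosses = (v0 - level) * (v1 - level) < 0
        lands_on_level = (v1 == level and v0 != level)
        if crosses or lands_on_level:
            frac = (level - v0) / (v1 - v0)   # linear interpolation
            times.append(t[i - 1] + frac * (t[i] - t[i - 1]))
    return times


def cycle_delay(inputs, outputs, vdd):
    """Latest output 50% crossing minus earliest input 50% crossing."""
    half = 0.5 * vdd
    t_in = [tc for t, v in inputs for tc in crossing_times(t, v, half)]
    t_out = [tc for t, v in outputs for tc in crossing_times(t, v, half)]
    if not t_in or not t_out:
        return None                            # nothing switched this cycle
    return max(t_out) - min(t_in)


def worst_case_delay(cycles, vdd):
    """Largest per-cycle delay over all (inputs, outputs) cycles."""
    delays = [cycle_delay(ins, outs, vdd) for ins, outs in cycles]
    delays = [d for d in delays if d is not None]
    return max(delays) if delays else None
```

For example, with a 1.0V supply, an input that rises through 0.5V at 1.5 ns and an output that falls through 0.5V at 2.5 ns give a delay of 1.0 ns for that cycle.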

#### 4.4.2 Simulation results of 4-2 compressors

Nine different 4-2 compressor designs are simulated. The first circuit is a full complementary CMOS logic style implementation of Fig. 4.3, which is used as the benchmark for low voltage operation. The second to the sixth compressors use the same architecture of Fig. 4.4, with the XOR modules implemented by the circuit of Fig. 3.6(b) and the MUX modules implemented by the circuit of Fig. 3.7(b) from Chapter 3. They differ mainly in the circuit implementation of the XOR* modules. The XOR* modules of these five compressors are respectively implemented by the circuits of Fig. 3.4(a) to Fig. 3.4(e) from Chapter 3. The seventh compressor is a hybrid design employing the circuit of Fig. 3.4(e) for its XOR* modules, the circuit of Fig. 3.6(b) for its XOR modules, and the circuit of Fig. 3.7(c) as its MUX modules. The eighth and ninth compressors are implemented by CPL and DPL logic styles, using the circuits of Fig. 3.7(d) and 3.4(f), and Fig. 3.7(e) and 3.4(f), respectively as their building blocks. The configurations of the nine compressors are listed in Table 4.1. For brevity, only the figure numbers of the circuits, e.g., 3.4(a) instead of Fig. 3.4(a), are shown in the table. The last column in Table 4.1 shows the lowest operable voltage for each 4-2 compressor obtained from the simulation, below which the circuit fails to function correctly. The full simulation results for the performance of all the compressors at different supply voltages are tabulated in Table 4.2 to Table 4.4. Two best performances at each supply voltage are highlighted in italic and bold print for ease of comparison.
Chapter 4

Low Voltage, Low Power 4-2 and 5-2 Compressors for Fast Arithmetic Circuits

Table 4.1 Configurations of the simulated 4-2 compressors

<table>
<thead>
<tr><th>S/N</th><th>Design</th><th>XOR-XNOR</th><th>XOR</th><th>MUX</th><th>Min. VDD (V)</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>4-2_CMOS</td><td colspan="3">Full complementary CMOS implementation of Fig. 4.3</td><td>0.6</td></tr>
<tr><td>2</td><td>4-2_a</td><td>3.4(a)</td><td>3.6(b)</td><td>3.7(b)</td><td>0.9</td></tr>
<tr><td>3</td><td>4-2_b</td><td>3.4(b)</td><td>3.6(b)</td><td>3.7(b)</td><td>0.6</td></tr>
<tr><td>4</td><td>4-2_c</td><td>3.4(c)</td><td>3.6(b)</td><td>3.7(b)</td><td>2.1</td></tr>
<tr><td>5</td><td>4-2_d</td><td>3.4(d)</td><td>3.6(b)</td><td>3.7(b)</td><td>0.8</td></tr>
<tr><td>6</td><td>4-2_e</td><td>3.4(e)</td><td>3.6(b)</td><td>3.7(b)</td><td>0.6</td></tr>
<tr><td>7</td><td>4-2_hybrid</td><td>3.4(e)</td><td>3.6(b)</td><td>3.7(c)</td><td>0.6</td></tr>
<tr><td>8</td><td>4-2_CPL</td><td>3.7(d) + 3.4(f)</td><td>3.7(d) + 3.4(f)</td><td>3.7(d)</td><td>0.6</td></tr>
<tr><td>9</td><td>4-2_DPL</td><td>3.7(e) + 3.4(f)</td><td>3.7(e) + 3.4(f)</td><td>3.7(e)</td><td>0.6</td></tr>
</tbody>
</table>

Table 4.2 Comparison of delay (ns) of 4-2 compressors

<table>
<thead>
<tr><th>Delay (ns)</th><th>0.6V</th><th>0.7V</th><th>0.8V</th><th>0.9V</th><th>1.0V</th><th>1.2V</th><th>1.5V</th><th>1.8V</th><th>2.1V</th><th>2.5V</th><th>3.3V</th></tr>
</thead>
<tbody>
<tr><td>4-2_CMOS</td><td>6.289</td><td>3.184</td><td>2.051</td><td>1.484</td><td>1.16</td><td>0.811</td><td>0.574</td><td>0.457</td><td>0.394</td><td>0.36</td><td>0.31</td></tr>
<tr><td>4-2_a</td><td></td><td></td><td></td><td>12.69</td><td>5.989</td><td>1.912</td><td>0.745</td><td>0.458</td><td>0.36</td><td>0.3</td><td>0.26</td></tr>
<tr><td>4-2_b</td><td>8.032</td><td>4.05</td><td>2.538</td><td>1.798</td><td>1.364</td><td>0.898</td><td>0.59</td><td>0.46</td><td>0.401</td><td>0.357</td><td>0.305</td></tr>
<tr><td>4-2_c</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>101</td><td>0.75</td><td>0.52</td></tr>
<tr><td>4-2_d</td><td></td><td></td><td>34.66</td><td>7.043</td><td>3.696</td><td>1.54</td><td>0.789</td><td>0.548</td><td>0.432</td><td>0.352</td><td>0.33</td></tr>
<tr><td>4-2_e</td><td>6.825</td><td>3.43</td><td>2.2</td><td>1.56</td><td>1.19</td><td>0.804</td><td>0.554</td><td>0.455</td><td>0.403</td><td>0.365</td><td>0.326</td></tr>
<tr><td>4-2_hybrid</td><td>7.11</td><td>3.724</td><td>2.425</td><td>1.736</td><td>1.357</td><td>0.934</td><td>0.656</td><td>0.54</td><td>0.47</td><td>0.42</td><td>0.35</td></tr>
<tr><td>4-2_CPL</td><td>6.28</td><td>3.04</td><td>1.913</td><td>1.375</td><td>1.07</td><td>0.74</td><td>0.53</td><td>0.44</td><td>0.39</td><td>0.35</td><td>0.32</td></tr>
<tr><td>4-2_DPL</td><td>6.735</td><td>3.323</td><td>2.04</td><td>1.402</td><td>1.04</td><td>0.69</td><td>0.48</td><td>0.38</td><td>0.33</td><td>0.29</td><td>0.255</td></tr>
</tbody>
</table>

Table 4.3 Comparison of power (µW) of 4-2 compressors

<table>
<thead>
<tr><th>Power (µW)</th><th>0.6V</th><th>0.7V</th><th>0.8V</th><th>0.9V</th><th>1.0V</th><th>1.2V</th><th>1.5V</th><th>1.8V</th><th>2.1V</th><th>2.5V</th><th>3.3V</th></tr>
</thead>
<tbody>
<tr><td>4-2_CMOS</td><td>0.1669</td><td>0.2327</td><td>0.3161</td><td>0.4049</td><td>0.4801</td><td>0.7601</td><td>1.332</td><td>2.101</td><td>3.228</td><td>5.025</td><td>9.8</td></tr>
<tr><td>4-2_a</td><td></td><td></td><td></td><td>0.3299</td><td>0.409</td><td>0.6763</td><td>1.326</td><td>2.67</td><td>5.259</td><td>12.73</td><td>70.66</td></tr>
<tr><td>4-2_b</td><td>0.1142</td><td>0.1573</td><td>0.2147</td><td>0.2698</td><td>0.333</td><td>0.5098</td><td>0.9125</td><td>1.496</td><td>2.38</td><td>3.875</td><td>8.23</td></tr>
<tr><td>4-2_c</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>4-2_d</td><td></td><td></td><td>0.2614</td><td>0.3158</td><td>0.3681</td><td>0.5424</td><td>0.8916</td><td>1.325</td><td>1.984</td><td>3.002</td><td>6.028</td></tr>
<tr><td>4-2_e</td><td>0.1123</td><td>0.1529</td><td>0.2039</td><td>0.264</td><td>0.3102</td><td>0.4781</td><td>0.8193</td><td>1.258</td><td>1.885</td><td>2.944</td><td>6.023</td></tr>
<tr><td>4-2_hybrid</td><td></td><td>0.1517</td><td>0.2074</td><td>0.2649</td><td>0.3169</td><td>0.4936</td><td>0.8413</td><td>1.275</td><td>1.93</td><td>2.998</td><td>6.143</td></tr>
<tr><td>4-2_CPL</td><td>0.1266</td><td>0.1778</td><td>0.2323</td><td>0.2944</td><td>0.3565</td><td>0.5657</td><td>0.9617</td><td>1.497</td><td>2.338</td><td>3.617</td><td>7.863</td></tr>
<tr><td>4-2_DPL</td><td>0.1186</td><td>0.1606</td><td>0.2068</td><td>0.2526</td><td>0.3058</td><td>0.4839</td><td>0.8322</td><td>1.301</td><td>2.031</td><td>3.461</td><td>8.409</td></tr>
</tbody>
</table>

Table 4.4 Comparison of power efficiency (fJ) of 4-2 compressors

<table>
<thead>
<tr>
<th>PDP (fJ)</th>
<th>0.6V</th>
<th>0.7V</th>
<th>0.8V</th>
<th>0.9V</th>
<th>1.0V</th>
<th>1.2V</th>
<th>1.5V</th>
<th>1.8V</th>
<th>2.1V</th>
<th>2.5V</th>
<th>3.3V</th>
</tr>
</thead>
<tbody>
<tr>
<td>4-2_CMOS</td>
<td>1.050</td>
<td>0.7409</td>
<td>0.6483</td>
<td>0.6009</td>
<td>0.5569</td>
<td>0.6164</td>
<td>0.7646</td>
<td>0.9602</td>
<td>1.272</td>
<td>1.809</td>
<td>3.038</td>
</tr>
<tr>
<td>4-2_a</td>
<td></td>
<td></td>
<td></td>
<td>4.186</td>
<td>2.450</td>
<td>1.293</td>
<td>0.9879</td>
<td>1.223</td>
<td>1.893</td>
<td>3.819</td>
<td>18.37</td>
</tr>
<tr>
<td>4-2_b</td>
<td>0.9173</td>
<td>0.6371</td>
<td>0.5449</td>
<td>0.4851</td>
<td>0.4542</td>
<td>0.4578</td>
<td>0.5384</td>
<td>0.6882</td>
<td>0.9544</td>
<td>1.383</td>
<td>2.510</td>
</tr>
<tr>
<td>4-2_c</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4-2_d</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4-2_e</td>
<td>0.7664</td>
<td>0.5244</td>
<td>0.4486</td>
<td>0.4118</td>
<td>0.3691</td>
<td>0.3844</td>
<td>0.4539</td>
<td>0.5724</td>
<td>0.7597</td>
<td>1.057</td>
<td>1.989</td>
</tr>
<tr>
<td>4-2_hybrid</td>
<td>0.8084</td>
<td>0.5780</td>
<td>0.5029</td>
<td>0.4599</td>
<td>0.4300</td>
<td>0.4610</td>
<td>0.5519</td>
<td>0.6885</td>
<td>0.9071</td>
<td>1.259</td>
<td>2.150</td>
</tr>
<tr>
<td>4-2_CPL</td>
<td>0.7950</td>
<td>0.5405</td>
<td>0.4444</td>
<td>0.4048</td>
<td>0.3815</td>
<td>0.4186</td>
<td>0.5097</td>
<td>0.6587</td>
<td>0.9118</td>
<td>1.266</td>
<td>2.516</td>
</tr>
<tr>
<td>4-2_DPL</td>
<td>0.7988</td>
<td>0.5337</td>
<td>0.4219</td>
<td>0.3541</td>
<td>0.3180</td>
<td>0.3339</td>
<td>0.3995</td>
<td>0.4944</td>
<td>0.6702</td>
<td>1.004</td>
<td>2.144</td>
</tr>
</tbody>
</table>
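
The power efficiency figures are power-delay products, and the units work out directly: 1 ns × 1 µW = 10⁻⁹ s × 10⁻⁶ W = 1 fJ. The short sketch below reproduces the 4-2_CMOS entry at 0.6V from its tabulated delay (6.289 ns) and power (0.1669 µW), and locates the supply voltage at which a tabulated PDP row is smallest. The helper names are illustrative only.

```python
VOLTAGES = [0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5, 1.8, 2.1, 2.5, 3.3]

# PDP row of 4-2_CMOS as tabulated in Table 4.4 (fJ).
CMOS_PDP = [1.050, 0.7409, 0.6483, 0.6009, 0.5569, 0.6164,
            0.7646, 0.9602, 1.272, 1.809, 3.038]


def pdp_fj(delay_ns, power_uw):
    """Power-delay product in fJ from a delay in ns and a power in uW."""
    return delay_ns * power_uw


def min_pdp_voltage(pdp_row):
    """Return (voltage, PDP) at which a PDP row is smallest; None marks a gap."""
    pairs = [(p, v) for v, p in zip(VOLTAGES, pdp_row) if p is not None]
    best_pdp, best_v = min(pairs)
    return best_v, best_pdp


cmos_06v = pdp_fj(6.289, 0.1669)          # about 1.050 fJ, as tabulated
best_v, best_pdp = min_pdp_voltage(CMOS_PDP)
```

For the 4-2_CMOS row, the smallest tabulated PDP is 0.5569 fJ at 1.0V: below that voltage the growth in delay outweighs the savings in power, and above it the power grows faster than the delay shrinks.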

To investigate the performance variations of the same 4-2 architecture due to different implementations of the XOR-XNOR module, the worst-case delay, power dissipation and power-delay product of Designs 2 - 6 are charted in Fig. 4.14. It is evident that the worst-case delays of Designs 2 and 5 at low supply voltages are much longer than those of the other designs, and that Designs 2 and 4 consume more power. Consequently, Designs 3 and 6 perform significantly better than the other designs in terms of power efficiency. In fact, these are the only two designs of Fig. 4.14 that are able to operate down to 0.6V. The circuits used to implement the XOR* modules of Designs 3 and 6 are those of Fig. 3.4(b) and our proposed Fig. 3.4(e), respectively.
The worst-case delay, power dissipation and power-delay product of all designs capable of functioning down to the lowest supply voltage of 0.6V, including our proposed hybrid 4-2 compressor, are charted in Fig. 4.15. Designs 1, 8 and 9, featuring the CMOS, CPL and DPL logic styles respectively, have comparatively shorter worst-case delays. However, they and Design 3 dissipate notably higher power than Designs 6 and 7, which use our proposed XOR* cell of Fig. 3.4(e). For example, Design 8 (CPL) consumes 12.7% more power at 0.6V than Design 6 (4-2_e). This power surplus grows with increasing supply voltage, to 14.9% at 1.0V, 19.0% at 1.8V and 30.5% at 3.3V. The power efficiencies (PDP) of Designs 6 and 7 are comparable to those of the circuits implemented in CPL and DPL styles. Both CPL and DPL logic styles require the generation of dual-rail signals for each primary input and output, incurring almost twice as many interconnecting lines as the other designs. Taking into account the substantial capacitive load due to this wiring overhead, Designs 6 and 7 are expected to be more competitive for the implementation of 4-2 compressors in large parallel multipliers.

#### 4.4.3 Simulation results of 5-2 compressors

Thirteen different 5-2 compressor designs are simulated. These selected designs are all operable from 0.6V to 3.3V and their circuit configurations are tabulated in Table 4.5. Each design is named according to its base architecture, with a suffix indicating the types of circuits employed for the three constituent modules. For example, the name 5del_ebb of Design 2 implies that it is a 5Δ 5-2 compressor constructed from the circuits of Fig. 3.4(e), Fig. 3.6(b) and Fig. 3.7(b) from Chapter 3 for its XOR*, XOR and MUX modules, respectively. The first five designs are 5-2 compressors of 5Δ based on the architecture of Fig. 4.8. The four designs that follow are 5-2 compressors of 4Δ based on the architecture of Fig. 4.9 proposed by Kwon et al. [KWO00]. Designs 6 – 8 use the optimized complementary CMOS logic style to generate the OR-AND and AND-OR functions, while Design 9 (kwon_cpl) generates these two functions in CPL logic style. The last four designs are 5-2 compressors of 4Δ using the proposed architecture of Fig. 4.10. Designs 10, 11 and 13 are based on the architecture of Fig. 4.10(b). Designs 10 and 11 use the circuit of Fig. 4.11(a) for the CGEN1 module and the circuit of Fig. 3.7(b) for the MUX modules. Design 12 is based on the architecture of Fig. 4.10(a), with a hybrid composition of optimally designed circuits of different logic styles. It uses the proposed XOR* cell of Fig. 3.4(e), the pass-transistor style XOR gate of Fig. 3.6(b), the XOR\(^\wedge\) circuit of Fig. 3.6(a), the complementary CMOS styled CGEN1 circuit of Fig. 4.11(a), and the CGEN2 circuit of Fig. 3.8. Design 13 (CPL) is based on the same architecture as Designs 10 and 11, but uses circuits implemented in CPL logic style for the XOR\(^*\), XOR, XOR\(^\wedge\) and MUX modules. Its CGEN1 module is implemented with the CPL circuit of Fig. 4.11(b).
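
The naming convention can be made concrete with a small decoder. This is only an illustrative sketch: the function `decode_design` and its returned dictionary keys are hypothetical, and it covers just the rule stated above, in which a three-letter suffix gives the sub-figure letters of the XOR*, XOR and MUX circuits of Fig. 3.4, Fig. 3.6 and Fig. 3.7. Suffixes such as `hybrid`, `cpl` and `dpl` name a logic style rather than three letters and are returned as such.

```python
NAMED_STYLES = {"hybrid", "cpl", "dpl"}


def decode_design(name):
    """Split a design name such as '5del_ebb' into architecture and circuits."""
    arch, _, suffix = name.partition("_")
    if not arch or not suffix:
        raise ValueError(f"not a design name: {name!r}")
    if suffix in NAMED_STYLES:
        # Style suffixes are not letter codes; see Table 4.5 for their circuits.
        return {"architecture": arch, "style": suffix}
    if len(suffix) != 3 or any(ch not in "abcdef" for ch in suffix):
        raise ValueError(f"unexpected suffix: {suffix!r}")
    xor_star, xor, mux = suffix
    return {
        "architecture": arch,
        "XOR*": f"Fig. 3.4({xor_star})",
        "XOR": f"Fig. 3.6({xor})",
        "MUX": f"Fig. 3.7({mux})",
    }
```

For instance, `decode_design("5del_ebb")` yields the 5Δ architecture with Fig. 3.4(e), Fig. 3.6(b) and Fig. 3.7(b), matching the example of Design 2.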

**Table 4.5 Configurations of the simulated 5-2 compressors**

<table>
<thead>
<tr>
<th>S/N</th>
<th>Design</th>
<th>Architecture</th>
<th>XOR(^*)</th>
<th>XOR, XOR(^\wedge)</th>
<th>MUX, CGEN</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>5del_bbb</td>
<td>4.8</td>
<td>3.4(b)</td>
<td>3.6(b)</td>
<td>3.7(b)</td>
</tr>
<tr>
<td>2</td>
<td>5del_ebb</td>
<td>4.8</td>
<td>3.4(e)</td>
<td>3.6(b)</td>
<td>3.7(b)</td>
</tr>
<tr>
<td>3</td>
<td>5del_hybrid</td>
<td>4.8</td>
<td>3.4(e)</td>
<td>3.6(b)</td>
<td>3.7(c)</td>
</tr>
<tr>
<td>4</td>
<td>5del_cpl</td>
<td>4.8</td>
<td>3.7(d) + 3.4(f)</td>
<td>3.7(d) + 3.4(f)</td>
<td>3.7(d)</td>
</tr>
<tr>
<td>5</td>
<td>5del_dpl</td>
<td>4.8</td>
<td>3.7(e) + 3.4(f)</td>
<td>3.7(e) + 3.4(f)</td>
<td>3.7(e)</td>
</tr>
<tr>
<td>6</td>
<td>kwon_bbb</td>
<td>4.9</td>
<td>3.4(b)</td>
<td>3.6(b)</td>
<td>3.7(b)</td>
</tr>
<tr>
<td>7</td>
<td>kwon_ebb</td>
<td>4.9</td>
<td>3.4(e)</td>
<td>3.6(b)</td>
<td>3.7(b)</td>
</tr>
<tr>
<td>8</td>
<td>kwon_ebc</td>
<td>4.9</td>
<td>3.4(e)</td>
<td>3.6(b)</td>
<td>3.7(c)</td>
</tr>
<tr>
<td>9</td>
<td>kwon_cpl</td>
<td>4.9</td>
<td>3.7(d) + 3.4(f)</td>
<td>3.7(d) + 3.4(f)</td>
<td>3.7(d)</td>
</tr>
<tr>
<td>10</td>
<td>4del_bbb</td>
<td>4.10(b)</td>
<td>3.4(b)</td>
<td>3.6(b), 3.6(a)</td>
<td>4.11(a), 3.7(b)</td>
</tr>
<tr>
<td>11</td>
<td>4del_ebb</td>
<td>4.10(b)</td>
<td>3.4(e)</td>
<td>3.6(b), 3.6(a)</td>
<td>4.11(a), 3.7(b)</td>
</tr>
<tr>
<td>12</td>
<td>4del_hybrid</td>
<td>4.10(a)</td>
<td>3.4(e)</td>
<td>3.6(b), 3.6(a)</td>
<td>4.11(a), 3.8</td>
</tr>
<tr>
<td>13</td>
<td>4del_cpl</td>
<td>4.10(b)</td>
<td>3.7(d) + 3.4(f)</td>
<td>3.7(d) + 3.4(f)</td>
<td>4.11(b), 3.7(d)</td>
</tr>
</tbody>
</table>

The simulation results of the delay, power and power efficiency of all the compressors are tabulated in Table 4.6 to Table 4.8. The two best performances at each supply voltage are printed in bold and italic. The performances of all 5-2 compressors based on the 5Δ architecture are charted in Fig. 4.16 for comparison. The CPL (5del_cpl) and DPL (5del_dpl) designs have the shortest worst-case delay and the best power efficiency. Among the non dual-rail designs, 5del_bbb consumes the most power and 5del_hybrid has the longest delay. Design 5del_ebb provides the best trade-off between delay, power and power efficiency among the three non dual-rail compressors. Considering all aspects of the performance, this architecture is best implemented with either CPL or DPL circuits.
### Table 4.6 Comparison of delay (ns) of 5-2 compressors

<table>
<thead>
<tr>
<th>Delay (ns)</th>
<th>0.6V</th>
<th>0.7V</th>
<th>0.8V</th>
<th>0.9V</th>
<th>1.0V</th>
<th>1.2V</th>
<th>1.5V</th>
<th>1.8V</th>
<th>2.1V</th>
<th>2.5V</th>
<th>3.3V</th>
</tr>
</thead>
<tbody>
<tr>
<td>5del_bbb</td>
<td>11.42</td>
<td>5.722</td>
<td>3.583</td>
<td>2.52</td>
<td>1.909</td>
<td>1.262</td>
<td>0.839</td>
<td>0.68</td>
<td>0.588</td>
<td>0.518</td>
<td>0.45</td>
</tr>
<tr>
<td>5del_ebb</td>
<td>11.73</td>
<td>5.76</td>
<td>3.59</td>
<td>2.54</td>
<td>1.94</td>
<td>1.302</td>
<td>0.913</td>
<td>0.747</td>
<td>0.663</td>
<td>0.607</td>
<td>0.568</td>
</tr>
<tr>
<td>5del_hybrid</td>
<td>12.15</td>
<td>6.192</td>
<td>3.889</td>
<td>2.749</td>
<td>2.114</td>
<td>1.443</td>
<td>0.998</td>
<td>0.796</td>
<td>0.685</td>
<td>0.607</td>
<td>0.549</td>
</tr>
<tr>
<td>5del_cpl</td>
<td>8.11</td>
<td>4.35</td>
<td>2.65</td>
<td>1.83</td>
<td>1.42</td>
<td>0.99</td>
<td>0.69</td>
<td>0.57</td>
<td>0.49</td>
<td>0.45</td>
<td>0.41</td>
</tr>
<tr>
<td>5del_dpl</td>
<td>9.87</td>
<td>4.35</td>
<td>2.65</td>
<td>1.83</td>
<td>1.371</td>
<td>0.92</td>
<td>0.64</td>
<td>0.52</td>
<td>0.45</td>
<td>0.39</td>
<td>0.35</td>
</tr>
<tr>
<td>kwon_bbb</td>
<td>10.35</td>
<td>5.32</td>
<td>3.37</td>
<td>2.36</td>
<td>1.78</td>
<td>1.13</td>
<td>0.7978</td>
<td>0.6554</td>
<td>0.5668</td>
<td>0.4972</td>
<td>0.413</td>
</tr>
<tr>
<td>kwon_ebb</td>
<td>11.57</td>
<td>5.827</td>
<td>3.47</td>
<td>2.47</td>
<td>1.9</td>
<td>1.29</td>
<td>0.9087</td>
<td>0.7412</td>
<td>0.651</td>
<td>0.598</td>
<td>0.559</td>
</tr>
<tr>
<td>kwon_ebc</td>
<td>11.68</td>
<td>5.905</td>
<td>3.66</td>
<td>2.588</td>
<td>1.987</td>
<td>1.349</td>
<td>0.945</td>
<td>0.768</td>
<td>0.673</td>
<td>0.607</td>
<td>0.553</td>
</tr>
<tr>
<td>kwon_cpl</td>
<td>8.92</td>
<td>4.351</td>
<td>2.749</td>
<td>1.974</td>
<td>1.537</td>
<td>1.094</td>
<td>0.8109</td>
<td>0.6733</td>
<td>0.592</td>
<td>0.5437</td>
<td>0.4957</td>
</tr>
<tr>
<td>4del_bbb</td>
<td>12.33</td>
<td>6.11</td>
<td>3.78</td>
<td>2.63</td>
<td>2.01</td>
<td>1.34</td>
<td>0.91</td>
<td>0.73</td>
<td>0.62</td>
<td>0.55</td>
<td>0.46</td>
</tr>
<tr>
<td>4del_ebb</td>
<td>10.31</td>
<td>5.26</td>
<td>3.414</td>
<td>2.436</td>
<td>1.842</td>
<td>1.183</td>
<td>0.79</td>
<td>0.62</td>
<td>0.53</td>
<td>0.46</td>
<td>0.39</td>
</tr>
<tr>
<td>4del_hybrid</td>
<td>10.11</td>
<td>5.24</td>
<td>3.36</td>
<td>2.41</td>
<td>1.86</td>
<td>1.27</td>
<td>0.87</td>
<td>0.695</td>
<td>0.592</td>
<td>0.511</td>
<td>0.425</td>
</tr>
<tr>
<td>4del_cpl</td>
<td>10.14</td>
<td>4.66</td>
<td>2.92</td>
<td>2.07</td>
<td>1.62</td>
<td>1.11</td>
<td>0.8</td>
<td>0.66</td>
<td>0.57</td>
<td>0.51</td>
<td>0.47</td>
</tr>
</tbody>
</table>

### Table 4.7 Comparison of power (μW) of 5-2 compressors

<table>
<thead>
<tr>
<th>Power (μW)</th>
<th>0.6V</th>
<th>0.7V</th>
<th>0.8V</th>
<th>0.9V</th>
<th>1.0V</th>
<th>1.2V</th>
<th>1.5V</th>
<th>1.8V</th>
<th>2.1V</th>
<th>2.5V</th>
<th>3.3V</th>
</tr>
</thead>
<tbody>
<tr>
<td>5del_bbb</td>
<td>0.215</td>
<td>0.2947</td>
<td>0.398</td>
<td>0.51</td>
<td>0.6155</td>
<td>0.9545</td>
<td>1.706</td>
<td>2.734</td>
<td>4.247</td>
<td>6.932</td>
<td>15.15</td>
</tr>
<tr>
<td>5del_ebb</td>
<td>0.2102</td>
<td>0.2843</td>
<td>0.3797</td>
<td>0.4815</td>
<td>0.5752</td>
<td>0.8836</td>
<td>1.497</td>
<td>2.317</td>
<td>3.451</td>
<td>5.413</td>
<td>11.22</td>
</tr>
<tr>
<td>5del_hybrid</td>
<td>0.2128</td>
<td>0.2901</td>
<td>0.3848</td>
<td>0.4942</td>
<td>0.5938</td>
<td>0.9193</td>
<td>1.546</td>
<td>2.385</td>
<td>3.539</td>
<td>5.475</td>
<td>11.39</td>
</tr>
<tr>
<td>5del_cpl</td>
<td>0.2164</td>
<td>0.3015</td>
<td>0.3942</td>
<td>0.4974</td>
<td>0.596</td>
<td>0.9258</td>
<td>1.554</td>
<td>2.415</td>
<td>3.642</td>
<td>5.788</td>
<td>12.23</td>
</tr>
<tr>
<td>5del_dpl</td>
<td>0.1952</td>
<td>0.266</td>
<td>0.3396</td>
<td>0.4117</td>
<td>0.5007</td>
<td>0.7817</td>
<td>1.315</td>
<td>2.065</td>
<td>3.194</td>
<td>5.224</td>
<td>12.33</td>
</tr>
<tr>
<td>kwon_bbb</td>
<td>0.2106</td>
<td>0.2929</td>
<td>0.3929</td>
<td>0.5026</td>
<td>0.6037</td>
<td>0.9628</td>
<td>1.666</td>
<td>2.711</td>
<td>4.161</td>
<td>6.932</td>
<td>14.88</td>
</tr>
<tr>
<td>kwon_ebb</td>
<td>0.2094</td>
<td>0.2855</td>
<td>0.3817</td>
<td>0.4866</td>
<td>0.586</td>
<td>0.9048</td>
<td>1.519</td>
<td>2.4</td>
<td>3.66</td>
<td>5.827</td>
<td>12.21</td>
</tr>
<tr>
<td>kwon_ebc</td>
<td>0.2079</td>
<td>0.2844</td>
<td>0.3803</td>
<td>0.4852</td>
<td>0.584</td>
<td>0.91</td>
<td>1.539</td>
<td>2.44</td>
<td>3.65</td>
<td>5.788</td>
<td>12.12</td>
</tr>
<tr>
<td>kwon_cpl</td>
<td>0.2629</td>
<td>0.3667</td>
<td>0.4742</td>
<td>0.6046</td>
<td>0.7184</td>
<td>1.14</td>
<td>1.473</td>
<td>2.922</td>
<td>4.429</td>
<td>7.134</td>
<td>15</td>
</tr>
<tr>
<td>4del_bbb</td>
<td>0.1684</td>
<td>0.234</td>
<td>0.314</td>
<td>0.4039</td>
<td>0.4865</td>
<td>0.7566</td>
<td>1.332</td>
<td>2.132</td>
<td>3.301</td>
<td>5.363</td>
<td>11.33</td>
</tr>
<tr>
<td>4del_ebb</td>
<td>0.1549</td>
<td>0.2105</td>
<td>0.2845</td>
<td>0.3598</td>
<td>0.4291</td>
<td>0.6767</td>
<td>1.137</td>
<td>1.781</td>
<td>2.695</td>
<td>4.353</td>
<td>9.02</td>
</tr>
<tr>
<td>4del_hybrid</td>
<td>0.151</td>
<td>0.207</td>
<td>0.2805</td>
<td>0.3561</td>
<td>0.4271</td>
<td>0.6696</td>
<td>1.142</td>
<td>1.757</td>
<td>2.692</td>
<td>4.208</td>
<td>9.05</td>
</tr>
<tr>
<td>4del_cpl</td>
<td>0.2262</td>
<td>0.3158</td>
<td>0.4161</td>
<td>0.5268</td>
<td>0.6578</td>
<td>0.9911</td>
<td>1.701</td>
<td>2.684</td>
<td>4.031</td>
<td>6.338</td>
<td>13.67</td>
</tr>
</tbody>
</table>

ATTENTION: The Singapore Copyright Act applies to the use of this document. Nanyang Technological University Library
### Table 4.8 Comparison of power efficiency (fJ) of 5-2 compressors

<table>
<thead>
<tr><th>PDP (fJ)</th><th>0.6V</th><th>0.7V</th><th>0.8V</th><th>0.9V</th><th>1.0V</th><th>1.2V</th><th>1.5V</th><th>1.8V</th><th>2.1V</th><th>2.5V</th><th>3.3V</th></tr>
</thead>
<tbody>
<tr><td>5del_bbb</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>5del_ebb</td><td>2.466</td><td>1.638</td><td>1.363</td><td>1.223</td><td>1.116</td><td>1.150</td><td>1.367</td><td>1.731</td><td>2.288</td><td>3.286</td><td>6.373</td></tr>
<tr><td>5del_hybrid</td><td>2.586</td><td>1.796</td><td>1.496</td><td>1.359</td><td>1.255</td><td>1.327</td><td>1.543</td><td>1.898</td><td>2.424</td><td>3.323</td><td>6.253</td></tr>
<tr><td>5del_cpl</td><td>1.755</td><td>1.203</td><td>0.9894</td><td>0.9102</td><td>0.8463</td><td>0.9165</td><td>1.072</td><td>1.377</td><td>1.785</td><td>2.605</td><td>5.014</td></tr>
<tr><td>5del_dpl</td><td>1.751</td><td>1.157</td><td>0.8999</td><td>0.7534</td><td>0.6865</td><td>0.7192</td><td>0.8416</td><td>1.074</td><td>1.437</td><td>2.076</td><td>4.316</td></tr>
<tr><td>kwon_bbb</td><td>2.179</td><td>1.558</td><td>1.324</td><td>1.186</td><td>1.075</td><td>1.088</td><td>1.329</td><td>1.777</td><td>2.358</td><td>3.447</td><td>6.145</td></tr>
<tr><td>kwon_ebb</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>kwon_ebc</td><td>2.429</td><td>1.679</td><td>1.392</td><td>1.256</td><td>1.160</td><td>1.228</td><td>1.454</td><td>1.874</td><td>2.456</td><td>3.513</td><td>6.702</td></tr>
<tr><td>kwon_cpl</td><td>2.345</td><td>1.596</td><td>1.304</td><td>1.193</td><td>1.104</td><td>1.248</td><td>1.519</td><td>1.967</td><td>2.622</td><td>3.879</td><td>7.436</td></tr>
<tr><td>4del_bbb</td><td>2.076</td><td>1.430</td><td>1.187</td><td>1.062</td><td>0.9779</td><td>1.014</td><td>1.212</td><td>1.556</td><td>2.047</td><td>2.950</td><td>5.212</td></tr>
<tr><td>4del_ebb</td><td>1.597</td><td>1.107</td><td>0.9713</td><td>0.8765</td><td>0.7904</td><td>0.8005</td><td>0.8982</td><td>1.104</td><td>1.428</td><td>2.002</td><td>3.518</td></tr>
<tr><td>4del_hybrid</td><td>1.527</td><td>1.085</td><td>0.9425</td><td>0.8583</td><td>0.7945</td><td>0.8504</td><td>0.9935</td><td>1.221</td><td>1.594</td><td>2.201</td><td>3.846</td></tr>
<tr><td>4del_cpl</td><td>2.293</td><td>1.472</td><td>1.215</td><td>1.090</td><td>1.066</td><td>1.100</td><td>1.361</td><td>1.771</td><td>2.298</td><td>3.232</td><td>6.425</td></tr>
</tbody>
</table>

Fig. 4.17 compares the performances of various 4Δ 5-2 compressors built upon the architecture proposed by Kwon et al. [KWO00]. As in the previous comparison, the design implemented with CPL circuits has a shorter delay than the non dual-rail designs, but this time its overall performance is not necessarily better than that of the other designs. The problem lies in its high power dissipation. Designs kwon_ebb and kwon_ebc consume about 2% to 18% less power than the other two designs, but they are slower too. In terms of power efficiency, Design kwon_bbb outperforms the other two non-CPL designs, which have very similar performance.
Fig. 4.18 shows the performances of several circuits built around our proposed $4\Delta$ 5-2 compressor architecture. Both 4del_ebb and 4del_hybrid outperform 4del_bbb and 4del_cpl remarkably. Their speeds are comparable to that of the circuit implemented in CPL logic style. They consume 32% to 35% less power than 4del_cpl and 10% to 20% less power than 4del_bbb. At voltages higher than 1.0V, 4del_ebb has a slight edge over 4del_hybrid, whereas at voltages lower than 1.0V, 4del_hybrid performs slightly better.
Performance differences due to architectural differences are also studied by comparing different compressor architectures using identical configurations for the same constituent modules. The performances of the three 5-2 compressor architectures with the bbb, ebb, hybrid and CPL configurations are charted in Fig. 4.19 - 4.22, respectively. Design 5del_dpl is also added to Fig. 4.22 for comparison. Our proposed architecture has the best performance among all architectures implemented with non dual-rail logic styles. For modules implemented in CPL or DPL logic style, the architecture of Fig. 4.8 is better.
Fig. 4.19 compares the architectures with the bbb configuration. The results show that the average power of our proposed 4Δ architecture is 19% to 24% less than that of Kwon’s architecture of Fig. 4.9, and 20% to 25% less than that of the 5Δ architecture of Fig. 4.8. Although it is slower, its power-delay product is still lower than those of the other architectures.

Fig. 4.20 shows that, with the ebb configuration, both the average power and the worst-case delay of our proposed 4Δ architecture are superior to those of Kwon's architecture and the 5Δ architecture. As a result, its power efficiency is 27% to 48% better than that of Kwon's architecture and 28% to 45% better than that of the 5Δ architecture.

Figure 4.20 Performances of different 5-2 compressor architectures with ebb configuration (Designs 2, 7, 11)

Fig. 4.21 shows that, with the hybrid configuration, our proposed 4Δ architecture consumes 25% to 28% less power than Kwon's architecture, and 20% to 29% less power than the 5Δ architecture. It is also 6% to 23% faster than Kwon's architecture and 12% to 23% faster than the 5Δ architecture. Therefore, the hybrid configuration is best suited for the proposed architecture.
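
The quoted power savings can be recomputed from the rows of Table 4.7. The sketch below takes the tabulated average power of 4del_hybrid, kwon_ebc and 5del_hybrid at the eleven supply voltages and reports the smallest and largest percentage reduction; the function name `reduction_range` is illustrative.

```python
# Average power (uW) at 0.6V ... 3.3V, copied from Table 4.7.
P_4DEL_HYBRID = [0.151, 0.207, 0.2805, 0.3561, 0.4271, 0.6696,
                 1.142, 1.757, 2.692, 4.208, 9.05]
P_KWON_EBC = [0.2079, 0.2844, 0.3803, 0.4852, 0.584, 0.91,
              1.539, 2.44, 3.65, 5.788, 12.12]
P_5DEL_HYBRID = [0.2128, 0.2901, 0.3848, 0.4942, 0.5938, 0.9193,
                 1.546, 2.385, 3.539, 5.475, 11.39]


def reduction_range(ours, other):
    """Smallest and largest percentage by which `ours` is below `other`."""
    cuts = [100.0 * (1.0 - a / b) for a, b in zip(ours, other)]
    return min(cuts), max(cuts)


vs_kwon = reduction_range(P_4DEL_HYBRID, P_KWON_EBC)
vs_5delta = reduction_range(P_4DEL_HYBRID, P_5DEL_HYBRID)
```

With these rows the reductions come out at roughly 25% to 28% against kwon_ebc and roughly 20% to 29% against 5del_hybrid, in line with the figures quoted for the hybrid configuration.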

[Figure panels: (a) Worst case delay; (b) Average power; (c) Power-delay product]

Figure 4.21 Performances of different 5-2 compressor architectures with hybrid configuration (Designs 3, 8, 12)

Fig. 4.22 compares the architectures with CPL and DPL configurations. The designs based on the 5Δ architecture, 5del_cpl and 5del_dpl, perform appreciably better than those based on the other architectures. Therefore, the CPL and DPL logic styles are better suited to this architecture.

[Figure panel: (a) Worst case delay]

Figure 4.22 Performances of different 5-2 compressor architectures of dual-rail logic configuration (Designs 4, 5, 9, 13)

4.5 Summary

The architectures of 4-2 and 5-2 compressors are analyzed and different CMOS logic style circuit implementations of their constituent modules are explored. A novel 5-2 compressor architecture of $4\Delta$ delay is also proposed. In order to realistically assess and compare the figures of merit of different configurations of 4-2 and 5-2 compressors at various supply voltages, new simulation environments are established to ensure that the measured performances are still sustainable when these cells are integrated in a CSA tree. The simulation results show that the 4-2 and 5-2 compressors constructed with the novel XOR* cell are able to function down to 0.6V, and feature high speed and low power characteristics. Our proposed 5-2 compressor architecture outperforms all the other architectures over the range of voltages simulated, particularly when it is configured with the proposed circuits for the XOR* and the carry generator modules. Better performances against other architectures are also attained almost irrespective of the logic styles used for the circuit implementation of their constituent modules. In summary, a library of 4-2 and 5-2 compressor cells with excellent power efficiency, based on CMOS process technology, has been developed for implementing high speed and low power multipliers operable at ultra low supply voltages.
Chapter 5

An Area and Energy Efficient IP Core for Scalar Product Computation

5.1 Introduction

Contemporary digital signal processing algorithms for image processing and telecommunication applications are increasingly dominated by matrix- and vector-like arithmetic, where the same kind of simple or compound operations are carried out on different elements of two or more vectors [NAY99, OKA91, PAR00]. Examples of algorithms and applications that require such computations in their implementations are the discrete Fourier transform, the discrete cosine transform and the wavelet transform used widely in speech and image processing applications [GRG01, NAY99, TON95, YU02], and finite impulse response and infinite impulse response digital filtering and adaptive decision-directed least mean square algorithms used in signal identification, waveform shaping, channel equalization, and magnetic storage technology [HAS01, MUH01, WON95, SHA93, GUN99]. Scalar (or inner) product computation forms the basis of such special purpose complex arithmetic, which is inefficiently handled by software on the core central processing unit or general-purpose digital signal processor. Therefore, hardware-based implementation of scalar product computations [BRE98, DAD87, LIN01] is not only desirable, but also indispensable if we are to achieve the processing capacity required by applications such as real-time digital video and mobile communication systems. Widely used examples are digital cameras, mobile phones, PDAs and pocket PCs. Low voltage operation and low power dissipation are required to perform these computation functions since the power supply of mobile devices is often limited to between one and several battery cells. The trend in wireless communication systems has been to move the analog-to-digital converter closer to the antenna to perform more functions digitally at an increasingly higher data rate. Therefore, the processing speed and power requirements of the digital modem and channel equalization functions will not be achievable without some form of dedicated ASIC preprocessor [WON95, SHA93, GUN99]. Furthermore, the development of such a preprocessor or macrocell has been deterred by the excessive use of silicon area and by the numerous interconnections and interminable layout details to be managed and optimized manually. Because of the restricted VLSI area of system-on-chip (SoC) applications, the design of such macrocells has often been restricted to small word widths and limited input vector sizes.

In this chapter, we present a new scalar product macrocell that targets power, delay and area reduction for emerging process technologies, from the inception of algorithmic changes through architectural and physical design space exploration. Supported by the on-chip auxiliary circuits, the macrocell can be configured either as a normal vector multiplier or as a digital filter, making it versatile in a number of applications. The proposed algorithm has led to the creation of a low power, high performance bit parallel architecture for the computation of scalar products. While the number of transistors required to implement a macrocell with large input vector sizes and operand width is inevitably high, it is mitigated by a significant increase in VLSI area usage density in our method. Our architecture possesses high modularity and hierarchical regularity, enabling the use of short local interconnections and thus simplifying the global interconnects. Unlike in the conventional architecture, the signal arrival times at the inputs of each module are much easier to equalize, which reduces glitches and spurious transitions. Thanks to the hierarchical structure, the amount of nonrecurring engineering work is small, as only a one-time customization of one partial product accumulator and one vector accumulator is needed, with the other partial product accumulators being reused. We illustrate the layout regularity by showing the floor planning of a macrocell design which accepts two input vectors of 16 elements with a bit width of 16. Novel high performance low power building blocks such as the full adder and the 4-2 and 5-2 compressors presented in Chapters 3 and 4 are employed to further enhance the performance of the scalar product core. Theoretical estimation and simulation results based on the model established on the physically extracted parameters from these leaf cells indicate that the proposed architecture is able to reduce the overall power, area and delay requirements to a greater extent than a conventional architecture. Since it is essentially a combinational design, the proposed macrocell is suitable for implementation as either a self-timed core or a synchronous multicycle core with an appropriate bus interface to high-speed processors. With the issue of only a single instruction, the macrocell can execute vector operations autonomously, effectively reducing both the instruction count and the cycles per instruction of the processor. Very often, it is necessary to add circuits to perform self-testing, error detection and correction, etc. These circuit overheads cause the design to occupy more silicon area and incur a longer delay. It is reported that a self-checking design for a VLSI processor adder occupied 13% of the area and accounted for 7% of the worst case delay [SHI91]. To facilitate the evaluation of the real performance of the core submitted for chip fabrication, auxiliary on-chip circuits are also designed for testing the worst case delay under the restricted number of IO pins. A new built-in unique delay detection circuit with data-dependent delay error compensation is also proposed for this purpose.

This chapter is organized as follows: in Section 5.2, the new algorithm for scalar product computation is proposed. Based on the algorithm, the architecture, floor planning and delay estimation of scalar product IP core are detailed respectively in Sections 5.3, 5.4 and 5.5. The conventional architecture is compared with the proposed architecture in those sections. The layout of the proposed architecture and the design of the auxiliary circuit for the scalar product IP core in compliance with the CSM 0.18μm CMOS process technology are presented in Sections 5.6 and 5.7. This is followed by the comparison of the pre- and post- layout simulation results in Section 5.8. The chapter is concluded with a summary in Section 5.9.
5.2 Algorithm of scalar product

Let $A = [a_0, a_1, \ldots, a_{N-1}]$ and $X = [x_0, x_1, \ldots, x_{N-1}]^T$ be two $N$-element vectors. The scalar product of $AX$ is computed by

$$y = AX = [a_0 \ a_1 \ \ldots \ a_{N-1}] [x_0 \ x_1 \ \ldots \ x_{N-1}]^T = \sum_{n=0}^{N-1} a_n x_n$$

(5.1)

On a standard processor that does not have a vector multiplication instruction, the scalar product is computed by a series of programmed multiply-accumulate instructions issued sequentially. The computation of (5.1) can be sped up in hardware by streaming the flow of the operands for pipelined operation. In a synchronous array implementation, the clock distribution problem and clock skews due to different clock path lengths and loads need to be addressed, whereas in a self-timed design, the queue capacity has to be optimized to maintain the desired pipelining period and to avoid deadlock. We now develop an algorithm to accelerate the computation of (5.1) in a fully combinational structure in order to reduce the number of intermediate stages and avoid the clock distribution problem.

In what follows, we will consider the basic algorithm for VLSI implementation of (5.1) in a traditional approach. The elements of both the vectors $A$ and $X$ are represented in 2’s complement form. Their word lengths are assumed to be $R$ and $S$ bits, respectively.

$$a_n = -(a_n)_{R-1}2^{R-1} + \sum_{r=0}^{R-2} (a_n)_r 2^r$$

(5.2)

and

$$x_n = -(x_n)_{S-1}2^{S-1} + \sum_{s=0}^{S-2} (x_n)_s 2^s$$

(5.3)
where \((a_n)_r\) is the \(r\)th bit of \(a_n\) and \((x_n)_s\) is the \(s\)th bit of \(x_n\). \((a_n)_{R-1}\) and \((x_n)_{S-1}\) are the most significant bits (MSBs) of \(a_n\) and \(x_n\), respectively.
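
The weighting in (5.2) and (5.3) can be illustrated with a minimal Python sketch. The helper names `twos_complement_bits` and `from_bits` are hypothetical and are introduced here only for illustration; the sketch simply assumes that the MSB carries the negative weight $-2^{R-1}$ (or $-2^{S-1}$) and that all other bits carry positive weights.

```python
def twos_complement_bits(value, width):
    """Return the bits [b_0, ..., b_{width-1}] (LSB first) of a signed integer."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    # Python's >> on negative integers is an arithmetic shift, so masking
    # with 1 yields the two's complement bit at each position.
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Evaluate (5.2)/(5.3): positive weights for the lower bits, negative weight for the MSB."""
    width = len(bits)
    magnitude = sum(b << i for i, b in enumerate(bits[:-1]))
    return magnitude - (bits[-1] << (width - 1))
```

For example, the 4-bit pattern of -3 is 1101, whose lower bits contribute 5 and whose MSB contributes -8.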

Substituting (5.2) and (5.3) into (5.1), we obtain (5.4) and (5.5)

\[
\begin{aligned}
y &= \sum_{n=0}^{N-1} \left[ -(a_n)_{R-1} 2^{R-1} + \sum_{r=0}^{R-2} (a_n)_r 2^r \right] \left[ -(x_n)_{S-1} 2^{S-1} + \sum_{s=0}^{S-2} (x_n)_s 2^s \right] \\
&= \sum_{n=0}^{N-1} \left[ (a_n)_{R-1} (x_n)_{S-1} 2^{R+S-2} + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right. \\
&\qquad \left. - \sum_{s=0}^{S-2} (a_n)_{R-1} (x_n)_s 2^{s+R-1} - \sum_{r=0}^{R-2} (a_n)_r (x_n)_{S-1} 2^{r+S-1} \right]
\end{aligned}
\]

(5.4)

Writing each negative partial product bit \(-b\) as \((1-b)-1\), (5.4) becomes

\[
\begin{aligned}
y &= \sum_{n=0}^{N-1} \left[ (a_n)_{R-1} (x_n)_{S-1} 2^{R+S-2} + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right. \\
&\qquad + \sum_{s=0}^{S-2} \bigl( 1 - (a_n)_{R-1} (x_n)_s \bigr) 2^{s+R-1} + \sum_{r=0}^{R-2} \bigl( 1 - (a_n)_r (x_n)_{S-1} \bigr) 2^{r+S-1} \\
&\qquad \left. - \sum_{s=0}^{S-2} 2^{s+R-1} - \sum_{r=0}^{R-2} 2^{r+S-1} \right]
\end{aligned}
\]

(5.5)

Using the identity \(\sum_{r=0}^{k-1} 2^r = 2^k - 2^0\), the last two constant terms of (5.5) are given by:

\[
-\sum_{s=0}^{S-2} 2^{s+R-1} - \sum_{r=0}^{R-2} 2^{r+S-1} = -2^{R+S-1} + 2^{S-1} + 2^{R-1}
\]

(5.6)

The constant bits in (5.6) are the correction bits used to reduce the gate count that would otherwise be required for the sign extension bits of each partial product term. Collectively, they are termed the correction vector.

Thus, (5.5) can be expressed as follows:

\[
\begin{aligned}
y &= \sum_{n=0}^{N-1} \left[ (a_n)_{R-1} (x_n)_{S-1} 2^{R+S-2} + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right. \\
&\qquad + \sum_{s=0}^{S-2} \bigl( 1 - (a_n)_{R-1} (x_n)_s \bigr) 2^{s+R-1} + \sum_{r=0}^{R-2} \bigl( 1 - (a_n)_r (x_n)_{S-1} \bigr) 2^{r+S-1} \\
&\qquad \left. - 2^{R+S-1} + 2^{S-1} + 2^{R-1} \right]
\end{aligned}
\]

(5.7)
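
As a sanity check on the traditional formulation, the following Python sketch evaluates the sum of complemented partial product bits plus correction constants and compares it with the direct scalar product. It is only an illustrative model under stated assumptions: the function names are hypothetical, the cross terms involving exactly one MSB are taken as complemented bits $(1 - b)$, and the constant $-2^{R+S-1} + 2^{S-1} + 2^{R-1}$ is applied once per element pair.

```python
def bits(value, width):
    """Two's complement bits of a signed integer, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def scalar_product_traditional(A, X, R, S):
    """Evaluate the scalar product bit by bit with complemented cross terms."""
    total = 0
    for a, x in zip(A, X):
        ab, xb = bits(a, R), bits(x, S)
        # Product of the two sign bits (positive weight).
        term = (ab[R - 1] & xb[S - 1]) << (R + S - 2)
        # Products of the magnitude bits.
        term += sum((ab[r] & xb[s]) << (r + s)
                    for r in range(R - 1) for s in range(S - 1))
        # Complemented cross terms that involve exactly one sign bit.
        term += sum((1 - (ab[R - 1] & xb[s])) << (s + R - 1) for s in range(S - 1))
        term += sum((1 - (ab[r] & xb[S - 1])) << (r + S - 1) for r in range(R - 1))
        # Correction constant for this element pair.
        term += -(1 << (R + S - 1)) + (1 << (S - 1)) + (1 << (R - 1))
        total += term
    return total
```

Calling `scalar_product_traditional([3, -2], [-4, 1], 3, 3)` should agree with the direct computation `3*(-4) + (-2)*1`.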

A scalar product IP core as implied by (5.7) requires a series of integer multipliers to produce the intermediate results and an accumulator to add them up. As an example, a scalar product core for two 4-element vectors with a word length of four is shown in Fig. 5.1. Each symbol "*" denotes a partial product bit, and the "*"s in the shaded region denote the bits to be complemented. The symbols "p" enclosed in a rectangular box denote the intermediate result bits generated by each of the individual multipliers operating in parallel. They are further summed to produce the final result bits, denoted by a rectangular box of symbols "r".

Figure 5.1 Traditional algorithm for VLSI design of scalar product IP core

The layout irregularity of the scalar product IP core obtained using the traditional approach often leads to an excessively low efficiency in the VLSI area usage. Also, the interconnecting wires are long, which causes a longer delay in the critical path and a higher power dissipation.

We propose an alternative approach to the design of a highly modular vector multiplier. The method is based on the decomposition of the signed integer vector $A$ in (5.1) into $R$ weighted Boolean vectors, one for each bit position of its elements. Weighted products of the input array elements gated by each Boolean vector element are generated, followed by a carry-free accumulation. The advantage of this decomposition algorithm is that the carry-propagation delay occurs only once at the very end, instead of at each addition step. This enhances the opportunity to explore a high density design with a more granular layout.

From (5.4):

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} \left[ -(x_n)_{S-1} 2^{S-1} \sum_{r=0}^{R-2} (a_n)_r 2^r + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} + (a_n)_{R-1} (x_n)_{S-1} 2^{R+S-2} - (a_n)_{R-1} 2^{R-1} \sum_{s=0}^{S-2} (x_n)_s 2^s \right] \\
&= \sum_{n=0}^{N-1} \left\{ \sum_{r=0}^{R-2} \Bigl[ \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) - 1 \Bigr] 2^{r+S-1} + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right. \\
&\qquad \left. - \left[ \bigl(1 - (a_n)_{R-1} (x_n)_{S-1}\bigr) 2^{R+S-2} + (a_n)_{R-1} 2^{R-1} \sum_{s=0}^{S-2} (x_n)_s 2^s - 2^{R+S-2} \right] \right\} \\
&= \sum_{n=0}^{N-1} \left\{ \sum_{r=0}^{R-2} \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{r+S-1} + \sum_{r=0}^{R-2} \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right. \\
&\qquad \left. - \left[ \bigl(1 - (a_n)_{R-1} (x_n)_{S-1}\bigr) 2^{R+S-2} + (a_n)_{R-1} 2^{R-1} \sum_{s=0}^{S-2} (x_n)_s 2^s \right] + 2^{R+S-2} - \sum_{r=0}^{R-2} 2^{r+S-1} \right\}
\end{aligned}
$$

(5.8)

A regular array of accumulation units will not only utilize the silicon area more efficiently, but also reduce the glitches caused by the unequal delay paths of the input signals to the accumulation units. In order to obtain a regular layout of the accumulation units, we swap the accumulating order of \( r \) and \( n \).

\[
\begin{aligned}
y &= \sum_{r=0}^{R-2} \sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{r+S-1} + \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^{r+s} \right] \\
&\qquad - \sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_{R-1} (x_n)_{S-1}\bigr) 2^{R+S-2} + 2^{R-1} \sum_{s=0}^{S-2} (a_n)_{R-1} (x_n)_s 2^s \right] + \sum_{n=0}^{N-1} 2^{S-1} \\
&= \sum_{r=0}^{R-2} 2^r \sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{S-1} + \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^s \right] \\
&\qquad - 2^{R-1} \sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_{R-1} (x_n)_{S-1}\bigr) 2^{S-1} + \sum_{s=0}^{S-2} (a_n)_{R-1} (x_n)_s 2^s \right] + \sum_{n=0}^{N-1} 2^{S-1}
\end{aligned}
\]

(5.9)
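
The reordered accumulation can be checked numerically with a short Python sketch. It is a behavioral model only, with hypothetical function names, and it assumes that each bit plane $r$ accumulates, over all $n$, the complemented MSB product $\bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{S-1}$ plus the lower partial product bits, that the plane of weight $2^{R-1}$ is subtracted, and that the constant $N \cdot 2^{S-1}$ is added.

```python
def plane_sum(A, X, r, S):
    """Unsigned accumulation over n for bit plane r of the vector A."""
    acc = 0
    for a, x in zip(A, X):
        a_r = (a >> r) & 1
        x_msb = (x >> (S - 1)) & 1
        acc += (1 - (a_r & x_msb)) << (S - 1)          # complemented MSB product
        acc += sum((a_r & ((x >> s) & 1)) << s for s in range(S - 1))
    return acc


def scalar_product_bit_plane(A, X, R, S):
    """Weighted sum of the R bit planes, the sign plane being subtracted."""
    y = sum(plane_sum(A, X, r, S) << r for r in range(R - 1))
    y -= plane_sum(A, X, R - 1, S) << (R - 1)
    y += len(A) << (S - 1)                              # constant N * 2^(S-1)
    return y
```

Each value returned by `plane_sum` is non-negative and bounded by $N(2^S - 1)$, which is why it fits in $S + \lceil \log_2 N \rceil$ bits.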

The width of \( x_n \) is \( S \) bits. Let \( \xi = \lceil \log_2 N \rceil \) be the smallest integer larger than or equal to \( \log_2 N \). The expressions in both the positive and negative terms of (5.9):

\[
\sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{S-1} + \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^s \right] \quad r = 0, 1, \ldots, R-1
\]

(5.10)

are unsigned numbers, each of which can always be represented in exactly \((S + \xi)\) bits in normal unsigned binary format, where the \( r \)th expression for \( r = 0, 1, \ldots, R-1 \) is given by \( \sum_{i=0}^{S+\xi-1} (z_r)_i \cdot 2^i \). However, in order to avoid the long delay needed to generate the intermediate results, we use the stored carry format instead of the normal unsigned binary format for their representation. The stored carry format of the intermediate result always has the formation shown in Fig. 5.2, where one row is the sum and the other row is the stored carry. The MSBs of the sum and the stored carry are both located at the column of weight \( 2^{S+\xi-2} \). The binary value represented by the stored sum and carry is equal to the unsigned binary number representation given in the third row.
Thus, (5.10) can be expressed in stored carry format as follows.

\[
\sum_{n=0}^{N-1} \left[ \bigl(1 - (a_n)_r (x_n)_{S-1}\bigr) 2^{S-1} + \sum_{s=0}^{S-2} (a_n)_r (x_n)_s 2^s \right] = \sum_{i=0}^{S+\xi-2} (s_r)_i 2^i + \sum_{i=1}^{S+\xi-2} (c_r)_i 2^i \quad r = 0, 1, \ldots, R - 1
\]

(5.11)

where \((s_r)_i\) and \((c_r)_i\) are the \(i\)th bits of the stored sum and stored carry used to represent the \(r\)th expression of (5.10).

Substituting (5.11) into (5.9), we have

\[
\begin{aligned}
y &= \sum_{r=0}^{R-2} 2^r \left[ \sum_{i=0}^{S+\xi-2} (s_r)_i 2^i + \sum_{i=1}^{S+\xi-2} (c_r)_i 2^i \right] - 2^{R-1} \left[ \sum_{i=0}^{S+\xi-2} (s_{R-1})_i 2^i + \sum_{i=1}^{S+\xi-2} (c_{R-1})_i 2^i \right] + N 2^{S-1} \\
&= \sum_{r=0}^{R-2} 2^r \left[ \sum_{i=0}^{S+\xi-2} (s_r)_i 2^i + \sum_{i=1}^{S+\xi-2} (c_r)_i 2^i \right] + 2^{R-1} \left[ \sum_{i=0}^{S+\xi-2} \bigl(1-(s_{R-1})_i\bigr) 2^i + \sum_{i=1}^{S+\xi-2} \bigl(1-(c_{R-1})_i\bigr) 2^i \right] \\
&\qquad + N 2^{S-1} - 2^{R-1} \left[ \bigl(2^{S+\xi-1} - 1\bigr) + \bigl(2^{S+\xi-1} - 2\bigr) \right] \\
&= \sum_{r=0}^{R-2} 2^r \left[ \sum_{i=0}^{S+\xi-2} (s_r)_i 2^i + \sum_{i=1}^{S+\xi-2} (c_r)_i 2^i \right] + 2^{R-1} \left[ \sum_{i=0}^{S+\xi-2} \bigl(1-(s_{R-1})_i\bigr) 2^i + \sum_{i=1}^{S+\xi-2} \bigl(1-(c_{R-1})_i\bigr) 2^i \right] \\
&\qquad + N 2^{S-1} - 2^{R+S+\xi-1} + 2^{R} + 2^{R-1}
\end{aligned}
\]

(5.12)

The correction vector in (5.12) is \(N \cdot 2^{S-1} - 2^{R+S+\xi-1} + 2^{R} + 2^{R-1}\).
Comparing (5.7) and (5.12), we find that (5.7) needs \( N(S-1) + N(R-1) \) inverters to complement the bits while (5.12) needs \((S+\xi-1) + (S+\xi-2) + NR\) inverters. To illustrate the significance of this difference, consider the case of \( N = 16 \) and \( S = R = 16 \), for which \( \xi = \log_2 N = 4 \): (5.7) requires 480 inverters whereas (5.12) requires only 293 inverters, which represents a saving of 39% of the inverters and of the potential spurious transitions.
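
The inverter counts quoted above follow directly from the two expressions and can be reproduced with a few lines of Python. The function names are hypothetical; $\xi = \lceil \log_2 N \rceil$ is computed with integer arithmetic to avoid floating-point rounding.

```python
def inverters_traditional(N, R, S):
    """Complemented bits needed by the traditional formulation."""
    return N * (S - 1) + N * (R - 1)


def inverters_proposed(N, R, S):
    """Complemented bits needed by the proposed formulation."""
    xi = (N - 1).bit_length()          # ceil(log2(N)) for N >= 1
    return (S + xi - 1) + (S + xi - 2) + N * R


def inverter_saving(N, R, S):
    """Fractional saving of the proposed over the traditional formulation."""
    old = inverters_traditional(N, R, S)
    return (old - inverters_proposed(N, R, S)) / old
```

For $N = R = S = 16$ the two counts are 480 and 293, a saving of about 39%.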

In terms of VLSI layout, every arithmetic module is regular in shape. Using the same example and notations as in Fig. 5.1, the structural differences of the proposed method are shown in Fig. 5.3.

Figure 5.3 Proposed algorithm for VLSI design of vector multiplier
5.3 Architectural design of the scalar product core

Previous attempts to design a dedicated macrocell for accelerating the computation of scalar product have often been thwarted by the limitation of silicon area and restricted routing channels available for the interconnecting wires. This and the following sections address the unique advantages of our algorithm when it is mapped into hardware architecture, particularly when the design is to be implemented in deep-submicron technology where the signal propagations and transitions are dominated by the interconnect delay. We will consider the design example of a scalar product macrocell, operating on two input vectors. Each input vector consists of 16 signed integer elements in two’s complement form. The word length of each element is 16. In other words, \( N = 16 \), and \( R = S = 16 \) for (5.12).

The architecture of the scalar product core is mainly composed of three parts, namely the partial product generator, the partial product accumulator, and the vector accumulator, as shown in Fig. 5.4. For convenience, they are abbreviated as PPG, PPA and VA, respectively.

Figure 5.4 Proposed architecture for the scalar product core

The PPG generates the partial products from the input vector \( X \) according to the bit value of \( (a_n)_r \). It is composed of a row of AND gates, except for the most significant bit, which is generated by a NAND gate because the MSB needs to be complemented.

The PPA sums up all the partial products of the same weight and generates the intermediate result. This gives the accumulating circuit a rectangular structure, as shown in Fig. 5.5. Such a structure permits optimization of the routing channels, with short interconnects of equal length between neighboring arithmetic cells.

Figure 5.5 Rectangular structure of PPA

The accumulator is realized by a Wallace tree structure [PAR00, VAI01, WAL64, DAD65], which consists of three layers of (4,2) compressors [PAR00, VAI01]. Instead of summing to a final result, which often needs a time-consuming carry propagation adder (CPA), every accumulator just produces a stored carry intermediate result generated by the third layer of (4,2) compressors, as shown in Fig. 5.6. The elimination of the internal CPAs by keeping the intermediate sum in stored carry format reduces the computation time and glitches.
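
The idea of keeping intermediate results in stored carry form can be sketched with a generic bit-level model of a (4,2) compressor. The model below is an assumption made for illustration: it builds the compressor from two cascaded full adders, which is a textbook functional equivalent and not the transistor-level circuit proposed in Chapter 4, and all function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Functional (4,2) compressor: x1+x2+x3+x4+cin == s + 2*(carry + cout)."""
    s1, cout = full_adder(x1, x2, x3)      # cout does not depend on cin
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def compress_four_words(words, width):
    """Reduce four unsigned words to a (sum, carry) pair with one compressor row."""
    sum_word = carry_word = 0
    cin = 0
    # One extra column absorbs the horizontal carry out of the top bit.
    for i in range(width + 1):
        x = [(w >> i) & 1 for w in words]
        s, carry, cout = compressor_4_2(x[0], x[1], x[2], x[3], cin)
        sum_word |= s << i
        carry_word |= carry << (i + 1)     # stored carry has twice the weight
        cin = cout
    return sum_word, carry_word
```

The two returned words are only added together by a carry propagation adder at the very end; every intermediate layer can consume them directly as two more operands.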

Figure 5.6 Wallace tree structure of PPA
The VA sums up all the 16 stored carry intermediate results generated by the PPAs, which is equivalent to summing 32 binary numbers. These stored carry numbers are weighted differently and their dot notation forms a parallelogram as shown in Fig. 5.7. All the bits in the stored carry sum of the highest weight are negated before the accumulation. The letter "e" in the figure denotes the correction bits, which are $2^{35} + 2^{19} + 2^{16} + 2^{15}$. Compared with the 16 regular PPAs, this single irregularity of the VA is trivial. Moreover, the 16 PPAs cover most of the area of the chip, and the arithmetic circuits of the VA are interleaved uniformly over the PPAs (refer to the floor planning in Fig. 5.10 of Section 5.4), making the irregularity almost indiscernible.
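
The value of the correction bits can be cross-checked with a small Python sketch. It assumes the closed form $N \cdot 2^{S-1} - 2^{R+S+\xi-1} + 2^{R} + 2^{R-1}$ for the correction vector and assumes that the result is kept in $R + S + \xi$ bits, so that the negative term wraps around modulo $2^{R+S+\xi}$. The function name is hypothetical.

```python
def correction_vector(N, R, S):
    """Return (correction bits as an unsigned word, result width in bits)."""
    xi = (N - 1).bit_length()                      # ceil(log2(N))
    width = R + S + xi
    value = (N << (S - 1)) - (1 << (width - 1)) + (1 << R) + (1 << (R - 1))
    # In a width-bit two's complement result, -2^(width-1) and +2^(width-1)
    # are the same bit pattern, so the constant is reduced modulo 2^width.
    return value % (1 << width), width
```

For $N = R = S = 16$ this gives a 36-bit word with bits 35, 19, 16 and 15 set.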

The vector accumulator also employs the Wallace tree structure with four layers of (4,2) compressors, as shown in Fig. 5.8. The correction value is added to the last layer of (4,2) compressors to form the (5,2) compressors. The final 36-bit result is computed by a CPA.
5.4 Floor planning of the scalar product core

Owing to the large number of transistors involved in the design, it is extremely difficult to make a custom layout of the entire circuit. Fortunately, the architecture of the scalar product macrocell is highly modular, making it possible to handcraft the physical design hierarchically. We use the leaf cells presented in the previous chapters to assemble them into functional blocks. The functional blocks are abutted to form the complete circuit. The goal of our floor planning is to make the interconnecting wires as short as possible and the difference in the lengths of the interconnecting wires of adjacent layers of the Wallace tree as small as possible.

The main leaf cells of the core are (4,2) compressors, (5,2) compressors, full adders, 4-bit carry look ahead adders, and NAND and AND gates for the PPG. In order to make a quality floor planning, we list the sizes of these leaf cells prototyped on CSM 0.18μm process technology in Table 5.1, most of which have already been introduced in Chapters 3 and 4.
**Table 5.1** The sizes of different leaf cells (on CSM 0.18μm CMOS process)

<table>
<thead>
<tr>
<th>Leaf Cell</th>
<th>Width, w (μm)</th>
<th>Height, h (μm)</th>
<th>Area, a (μm²)</th>
</tr>
</thead>
<tbody>
<tr>
<td>(4,2) compressor</td>
<td>24.2</td>
<td>17</td>
<td>411.4</td>
</tr>
<tr>
<td>(5,2) compressor</td>
<td>31.4</td>
<td>17</td>
<td>533.8</td>
</tr>
<tr>
<td>full adder</td>
<td>12</td>
<td>10</td>
<td>120</td>
</tr>
<tr>
<td>CLA4</td>
<td>21.3</td>
<td>32.3</td>
<td>687.99</td>
</tr>
<tr>
<td>AND</td>
<td>4.6</td>
<td>4.2</td>
<td>19.32</td>
</tr>
</tbody>
</table>

The floor planning of the PPA with the PPG is shown in Fig. 5.9. Every rectangular block labeled with a number is a column of (4,2) compressors. The number refers to the layer number of the block in the Wallace tree. For example, the first layer of blocks labeled "1" comprises those blocks whose inputs are fed directly from the PPG. The second layer of blocks labeled "2" receives its inputs from the outputs of the nearest left and right blocks labeled "1". Thus, Block "1" has 16 (4,2) compressors, Block "2" 17 compressors, and Block "3" 18 compressors. Their areas are given by $S_{PPA1} = w_{42} \times 16h_{42}$, $S_{PPA2} = w_{42} \times 17h_{42}$ and $S_{PPA3} = w_{42} \times 18h_{42}$, respectively.

Figure 5.9 Floor planning of PPA with PPG

The PPGs (the thin strips marked with cross hatch lines in Fig. 5.9) are arranged along one side of Block "1" of the PPA. This is because the four input pads of each (4,2) compressor are located vertically along one side of Block "1" and the height of every four AND gates is almost equal to that of a (4,2) compressor. In this way, the length of the connecting wires from the PPG to PPA Block "1" is minimized. Every Block "1" needs four 16-bit partial product generators, which is equivalent to 64 elements, the size of which is \( S_{PPG64} = w_{AND} \times 64 h_{AND} \approx w_{AND} \times 16 h_{42} \).

As mentioned earlier, Block "2" is sandwiched between two adjacent banks of Block "1", and its outputs are connected to Block "3". Similarly, Block "3" is placed equidistant from the two nearest left and right banks of Block "2", in view of the large number of wires connecting Blocks "1" and "2", and Blocks "2" and "3". Such an arrangement reduces the wire lengths as much as possible, minimizing the switching power and glitches caused by the stray capacitance of the wires. The area efficient rectangular formation of the PPA is a result of mapping the proposed algorithm.

The PPA with PPG in Fig. 5.9 covers a rectangular area of \( S_{PPA} = w_{PPA} \times h_{PPA} = (7w_{42} + 4 w_{AND}) \times 18h_{42} = 187.8 \times 306 = 57466.8 \mu m^2 \). However, this rectangular area is not fully utilized. Only the area \( S_{PPA}' = 4S_{PPG64} + 4S_{PPA1} + 2S_{PPA2} + S_{PPA3} = 52727.2 \mu m^2 \) is covered by the leaf cells. The VLSI area usage ratio of the PPA is therefore:

\[
\rho_{PPA} = \frac{S_{PPA}'}{S_{PPA}} = \frac{52727.2}{57466.8} = 91.8\% \quad (5.13)
\]
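
The PPA area figures can be reproduced from the leaf cell dimensions of Table 5.1 with the following Python sketch. The function and variable names are hypothetical; the sketch assumes, as in the text, that the height of 64 AND gates is approximated by $16 h_{42}$.

```python
W42, H42 = 24.2, 17.0      # (4,2) compressor width and height in um
W_AND = 4.6                # AND gate width in um


def ppa_area_usage():
    """Return (bounding area, used area, usage ratio) of one PPA with PPG."""
    bounding = (7 * W42 + 4 * W_AND) * (18 * H42)
    s_ppg64 = W_AND * 16 * H42             # 64 AND gates, height ~ 16 * h42
    s_ppa1 = W42 * 16 * H42                # Block "1": 16 compressors
    s_ppa2 = W42 * 17 * H42                # Block "2": 17 compressors
    s_ppa3 = W42 * 18 * H42                # Block "3": 18 compressors
    used = 4 * s_ppg64 + 4 * s_ppa1 + 2 * s_ppa2 + s_ppa3
    return bounding, used, used / bounding
```

The usage ratio is below 100% only because Blocks "1" and "2" are shorter than Block "3", which sets the height of the bounding rectangle.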
The scalar product core needs 16 such PPAs with PPGs together with one VA. The complete floor planning of the macrocell is shown in Fig. 5.10. The blocks labelled with Roman numerals belong to the VA, whereas the Arabic numerals refer to the layer numbers of the PPA blocks in the Wallace tree.

The width of Layer “i” is 19 bits, Layer “ii” 21 bits and Layer “iii” 24 bits. Layer “iv” is of 32 bits, which is implemented by the (5,2) compressors. The CPA layer is of 32 bits.

Figure 5.10 Floor planning of the proposed bit-parallel scalar product core
Therefore, the areas occupied by the abovementioned layers are $S_{VAi} = w_{42} \times 19h_{42}$, $S_{VAii} = w_{42} \times 21h_{42}$, $S_{VAiii} = w_{42} \times 24h_{42}$, $S_{VAiv} = w_{52} \times 19h_{52}$, and $S_{CPA} = w_{CLA4} \times 8h_{CLA4}$.

The proposed scalar product core needs a total rectangular silicon area of:

$$S_{SPC-1} = w_{SPC-1} \times h_{SPC-1} = (4w_{PPA} + 4w_{42} + w_{52} + w_{CLA4}) \times (4 \times 21h_{42}) = 900.7 \times 1428 = 1286199.6 \, \mu m^2,$$  (5.14)

of which the physically used area is

$$S_{SPC-1}' = 16S_{PPA}' + 8S_{VAi} + 4S_{VAii} + 2S_{VAiii} + S_{VAiv} + S_{CPA} = 976118.92 \, \mu m^2.$$  (5.15)

Therefore, the overall area usage efficiency of the scalar product core is:

$$\rho_{SPC-1} = \frac{S_{SPC-1}'}{S_{SPC-1}} = \frac{976118.92}{1286199.6} = 75.9\%.$$  (5.16)
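
The core-level area figures can be cross-checked in the same way. The Python sketch below uses hypothetical names and takes the used PPA area of 52727.2 µm² and the PPA width of 187.8 µm from the preceding PPA calculation; following the area expression for Layer "iv", it assumes 19 (5,2) compressors in that layer and eight 4-bit CLAs in the CPA layer.

```python
S42 = 24.2 * 17.0          # (4,2) compressor area in um^2
S52 = 31.4 * 17.0          # (5,2) compressor area in um^2
S_CLA4 = 21.3 * 32.3       # 4-bit CLA area in um^2
S_PPA_USED = 52727.2       # used area of one PPA with PPG in um^2
W_PPA = 187.8              # width of one PPA with PPG in um


def proposed_core_area():
    """Return (bounding area, used area, usage ratio) of the proposed core."""
    width = 4 * W_PPA + 4 * 24.2 + 31.4 + 21.3
    height = 4 * 21 * 17.0
    va = 8 * 19 * S42 + 4 * 21 * S42 + 2 * 24 * S42 + 19 * S52
    used = 16 * S_PPA_USED + va + 8 * S_CLA4
    return width * height, used, used / (width * height)
```

About 86% of the used area belongs to the 16 regular PPAs, which is consistent with the observation that the VA irregularity has little effect on the overall layout.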

To make a fair comparison and to illustrate the advantage of our proposed method, the architecture offered by the conventional method is laid out using the same leaf cells with the same floor planning goal. A 16×16-bit standard multiplier is the main building block of the straightforward approach to implement the architecture of the scalar product macrocell. Its layout is shown in Fig. 5.11. The shaded blocks are PPG.

Figure 5.11 Floor-planning of normal 16×16-bit multiplier

Layer "1" is 17 bits long and is composed of 3 full adders and 14 (4,2) compressors. Layer "2" is 21 bits long and is composed of 8 full adders and 13 (4,2) compressors. Layer "3" is 29 bits long and is composed of 16 full adders and 13 (4,2) compressors. Layer "CPA" has 28 bits. Their areas are denoted as $S_1$, $S_2$, $S_3$ and $S_{CPA-1}$, respectively. It is evident that the layout of the multiplier is irregular if the aim of the design is to minimize the delay of the interconnecting wires. Regularizing the layout would make both the local and global interconnecting wires longer and their routing interlace irregularly, thus increasing the disparity of the signal arrival times and the glitch power.

The multiplier in Fig. 5.11 covers a rectangular area of:

$$S_{MUL16} = w_{MUL16} \times h_{MUL16} = (7w_{42} + 4w_{AND} + w_{CLA-4}) \times 30h_{42} = 209.1 \times 510 = 106641 \mu m^2.$$  \hspace{1cm} (5.17)

The physically used area is

$$S_{MUL16}' = 4S_{PPG} + 4S_1 + 2S_2 + S_3 + S_{CPA-1} = 4 \times 16 \times 4S_{AND} + 4(3S_{FA} + 14S_{42}) + 2(8S_{FA} + 13S_{42}) + (16S_{FA} + 13S_{42}) + 7S_{CLA-4} = 54124.85 \mu m^2.$$  \hspace{1cm} (5.18)

Therefore, the area usage efficiency of the multiplier is:

$$\rho_{MUL16} = \frac{S_{MUL16}'}{S_{MUL16}} = \frac{54124.85}{106641} = 50.8\%.$$  \hspace{1cm} (5.19)

The floor planning of the entire macrocell of the conventional architecture is shown in Fig. 5.12. The blocks labelled with Roman numerals perform the accumulation of all the intermediate products. Block "i" is of 32 bits, Block "ii" 33 bits and Block "iii" 34 bits, all of which are implemented using (4,2) compressors. The CPA block beside Block "iii" is of 36 bits. Their areas are denoted as $S_i$, $S_{ii}$, $S_{iii}$ and $S_{CPA-2}$, respectively.

The whole conventional scalar product macrocell covers a rectangular area of:

$$S_{SPC-2} = w_{SPC-2} \times h_{SPC-2} = (4w_{MUL16} + 3w_{42} + w_{CPA-2}) \times (4 \times 32h_{42}) = 930.3 \times 2176 = 2024332.8 \mu m^2.$$  \hspace{1cm} (5.20)

The physically used area is:

\[
S_{\text{SPC-2}}' = 16S_{\text{MUL16}}' + 4S_i + 2S_{\text{ii}} + S_{\text{iii}} + S_{\text{CPA-2}} = 16S_{\text{MUL16}}' + 4 \times 32S_{42} + 2 \times 33S_{42} + 34S_{42} + 9S_{\text{CLA4}} = 965988.71 \, \mu\text{m}^2. \tag{5.21}
\]

Therefore the VLSI area usage efficiency of the conventional scalar product macrocell is:

\[
\rho_{\text{SPC-2}} = \frac{S_{\text{SPC-2}}'}{S_{\text{SPC-2}}} = \frac{965988.71}{2024332.8} = 47.7\%. \tag{5.22}
\]

Figure 5.12 Floor planning of the conventional scalar product macrocell

The layout characteristics of the conventional and proposed architectures are summarized in Table 5.2. It is evident that the proposed architecture has advantages over the conventional one. It saves 36.5% of the silicon area and improves the area usage efficiency by a factor of 1.59, i.e., an increase of up to 59%.

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Proposed</th>
<th>Conventional</th>
</tr>
</thead>
<tbody>
<tr>
<td>Layout Area Width (µm)</td>
<td>900.7</td>
<td>930.3</td>
</tr>
<tr>
<td>Layout Area Height (µm)</td>
<td>1428</td>
<td>2176</td>
</tr>
<tr>
<td>Total Area (µm$^2$)</td>
<td>1286199.6</td>
<td>2024332.8</td>
</tr>
<tr>
<td>Area Efficiency</td>
<td>75.9 %</td>
<td>47.7 %</td>
</tr>
<tr>
<td>Regularity of Blocks</td>
<td>Regular</td>
<td>Irregular</td>
</tr>
</tbody>
</table>
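
The two headline figures quoted above can be re-derived from the entries of Table 5.2. The following is a minimal sketch of that arithmetic; the dictionary layout and the function names are illustrative assumptions, and only the numbers come from the table.

```python
# Values copied from Table 5.2; names are illustrative, not from the thesis.
proposed = {"width_um": 900.7, "height_um": 1428.0, "efficiency": 0.759}
conventional = {"width_um": 930.3, "height_um": 2176.0, "efficiency": 0.477}


def total_area(layout):
    """Bounding-box area of the macrocell in square micrometres."""
    return layout["width_um"] * layout["height_um"]


def area_saving(new, old):
    """Fractional reduction in bounding-box area of `new` relative to `old`."""
    return 1.0 - total_area(new) / total_area(old)


def efficiency_gain(new, old):
    """Relative increase in area usage efficiency of `new` over `old`."""
    return new["efficiency"] / old["efficiency"] - 1.0


saving = area_saving(proposed, conventional)
gain = efficiency_gain(proposed, conventional)
print(f"area saving: {saving:.1%}, efficiency gain: {gain:.1%}")
```

Note that the 59% figure is a relative gain (75.9% divided by 47.7%), not a difference in percentage points.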

5.5 Delay estimation based on floor planning

The computation time or the delay is mainly caused by the gate delay and the interconnection delay [KAT00, SYL98, NAN99]. For both architectures, there is not much difference in the gate delay. However, the interconnection delay, especially the one in the vector accumulator, varies significantly. In what follows, we model theoretically the critical delay from the inputs of the PPG to the generation of the final result.

The gate delay of the proposed architecture is composed of three parts, namely the delays of the PPG, PPA and VA. The gate delay of the PPG is given by $t_{PPG} = t_{AND}$. Since the PPA has $[(\log_2 n) - 1]$ layers of (4,2) compressors, its gate delay is $t_{PPA} = [(\log_2 n) - 1]t_{42}$, where $n$ is the word length of the elements of the input vector and $t_{42}$ is the longest delay of the (4,2) compressor. Therefore, for $n = 16$,

$t_{PPA} = 3t_{42}. \quad (5.23)$

The critical path of VA is composed of three (4,2) compressors in the first three layers, one (5,2) compressor in the fourth adder and two 32-bit length CLAs in the CPA layer. Therefore, the delay of the vector accumulator is given by:

$t_{VA} = 3t_{42} + t_{52} + t_{CLA32}. \quad (5.24)$

The interconnection delay is technology dependent. It is proportional to the stray capacitance of the connecting wires. Since the width of the wire $\delta$ is normally fixed by the feature size of the process, the capacitance is proportional to the length $\Lambda$ of the wire. Therefore the interconnection delay is also proportional to $\Lambda$. For a first-order estimation, let $k$ be the constant of proportionality for a given technology; the delay of a wire is then given by:

$$t_{\text{wire}} = k\delta\Lambda.$$  \hfill (5.25)

The maximum length of the connecting wire in the critical path can be expressed as:

$$\Lambda_{\text{maxim-1}} = \Lambda_{\text{PPG-1}} + \Lambda_{1-2} + \Lambda_{2-3} + \Lambda_{3-i} + \Lambda_{i-ii} + \Lambda_{ii-iii} + \Lambda_{iii-iv} + \Lambda_{iv-CPA}$$  \hfill (5.26)

where the Arabic and Roman numerals refer to the layer numbers in the Wallace tree of the PPA and VA, respectively. For convenience, we assume that the interconnecting lines are linked between the centers of two connected blocks. From Fig. 5.10, $\Lambda_{\text{PPG-1}} = 0.5(w_{\text{AND}} + w_{42})$, $\Lambda_{1-2} = w_{42}$, $\Lambda_{2-3} = 2w_{42} + w_{\text{AND}}$, $\Lambda_{3-i} = 4w_{42} + 2w_{\text{AND}}$, $\Lambda_{i-ii} = w_{42} + 0.5h_{\text{VAii}}$, $\Lambda_{ii-iii} = 0.5(w_{42} + w_{52}) + h_{\text{VAii}}$ and $\Lambda_{iv-CPA} = 0.5(w_{52} + w_{\text{CLA4}})$.

Therefore $\Lambda_{\text{maxim-1}} = 14.4 + 24.2 + 53 + 106 + 212 + 202.7 + 384.8 + 52.7 = 1049.8 \mu m$.

The worst case delay of the proposed architecture is given by:

$$T_{SPC-1} = t_{\text{gate-1}} + t_{\text{wire-1}} = t_{\text{PPG}} + t_{\text{PPA}} + t_{VA} + k\delta \Lambda_{\text{maxim-1}}$$

$$= t_{\text{AND}} + 6t_{42} + t_{52} + t_{\text{CLA32}} + k\delta \times 1049.8\mu m.$$  \hfill (5.27)

In the conventional architecture, the critical path of a 16-bit multiplier is composed of one partial product generator, one (4,2) compressor each in the first, second and third layers, and one carry look ahead adder in the CPA layer of the multiplier.

The delay of a multiplier is given by:

$$t_{\text{mul}} = t_{\text{AND}} + 3t_{42} + t_{\text{CLA4}}.$$  \hfill (5.28)
The critical path in the intermediate product accumulator includes three layers of (4,2) compressors and 28-bit carry look ahead adders in the CPA layer. Its delay is
\[ t_{PA} = 3t_{42} + t_{CLA28}. \tag{5.29} \]

From Fig. 5.12, the interconnection delay can be estimated from the maximum wire length in the critical path:
\[ \Lambda_{\text{maxim-2}} = \Lambda_{\text{PPG-1}} + \Lambda_{1-2} + \Lambda_{2-3} + \Lambda_{3-\text{CPA1}} + \Lambda_{\text{CPA1-i}} + \Lambda_{\text{i-ii}} + \Lambda_{\text{ii-CPA2}} \tag{5.30} \]
where
\[ \Lambda_{\text{PPG-1}} = 0.5(w_{\text{AND}} + w_{42}), \quad \Lambda_{1-2} = w_{42}, \quad \Lambda_{2-3} = 2w_{42} + w_{\text{AND}}, \quad \Lambda_{3-\text{CPA1}} = 0.5(w_{42} + w_{\text{CLA4}}), \quad \Lambda_{\text{CPA1-i}} = w_{\text{MUL16}} + 3w_{42} + 2w_{\text{AND}} + 0.5w_{\text{CLA4}}, \quad \Lambda_{\text{i-ii}} = w_{42} + 0.5h_{i}, \quad \Lambda_{\text{ii-CPA2}} = 0.5(w_{42} + w_{\text{CLA4}}). \]

Thus, we have
\[ \Lambda_{\text{maxim-2}} = 14.4 + 24.2 + 53 + 52.7 + 301.55 + 296.2 + 568.2 + 52.7 = 1362.95 \mu m. \]

Therefore, the worst case delay of the conventional architecture is given by:
\[ T_{\text{SPC-2}} = t_{\text{mul}} + t_{PA} + t_{\text{wire-2}} = t_{\text{AND}} + 6t_{42} + t_{\text{CLA4}} + t_{\text{CLA28}} + k\delta \times 1362.95 \mu m. \tag{5.31} \]

The estimated worst case delay is shown in Table 5.3. The interconnect delay ratio of the proposed architecture to the conventional architecture is given by \[ \frac{k\delta \times 1049.8 \mu m}{k\delta \times 1362.95 \mu m} = 77\%. \]

A saving of 23% in the delay of the interconnecting wires is achieved. Although the gate delays of the two architectures are almost the same, their interconnect delays differ significantly. It should be noted that the favorable floor planning and architecture in this case are a direct outcome of the novel algorithm introduced in Section 5.2.
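
The wire-length sums and the 77% ratio can be checked numerically. The sketch below is only a restatement of the first-order model of Eq. (5.25) under the assumption that $k$ and $\delta$ are identical for both architectures, so that they cancel in the ratio; the segment lengths are copied from the two sums in the text, and all function and variable names are illustrative.

```python
import math

# Wire segment lengths in micrometres, copied from the sums in the text.
SEGMENTS_PROPOSED = [14.4, 24.2, 53.0, 106.0, 212.0, 202.7, 384.8, 52.7]
SEGMENTS_CONVENTIONAL = [14.4, 24.2, 53.0, 52.7, 301.55, 296.2, 568.2, 52.7]


def ppa_layers(n):
    """Number of (4,2) compressor layers in the PPA for an n-bit word."""
    return int(math.log2(n)) - 1


def critical_wire_length(segments):
    """Total wire length along the critical path, in micrometres."""
    return sum(segments)


def wire_delay(length_um, k, delta):
    """First-order wire delay t_wire = k * delta * length (Eq. 5.25)."""
    return k * delta * length_um


def interconnect_delay_ratio(segments_a, segments_b):
    """Ratio of wire delays; k and delta cancel when they are shared."""
    return critical_wire_length(segments_a) / critical_wire_length(segments_b)


ratio = interconnect_delay_ratio(SEGMENTS_PROPOSED, SEGMENTS_CONVENTIONAL)
print(f"PPA layers for n = 16: {ppa_layers(16)}")
print(f"interconnect delay ratio: {ratio:.0%}")
```

Because the ratio is independent of $k$ and $\delta$, the comparison holds for any technology in which the first-order proportionality is a reasonable approximation.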

Table 5.3 Estimation of the worst case delay

<table>
<thead>
<tr>
<th></th>
<th>Proposed</th>
<th>Conventional</th>
</tr>
</thead>
<tbody>
<tr>
<td>Gate Delay</td>
<td>$t_{\text{AND}} + 6t_{42} + t_{52} + t_{\text{CLA32}}$</td>
<td>$t_{\text{AND}} + 6t_{42} + t_{\text{CLA4}} + t_{\text{CLA28}}$</td>
</tr>
<tr>
<td>Interconnect Delay</td>
<td>$k\delta \times 1049.8 \mu m$</td>
<td>$k\delta \times 1362.95 \mu m$</td>
</tr>
</table>
5.6 The scalar product core layout in CSM 0.18µm technology

Based on the optimal floor planning of the proposed scalar product core, the entire chip is implemented in the Chartered CSM 0.18µm 6-Metal 1-Poly CMOS 1.8V technology\(^1\). Fig. 5.13 shows the layout of one PPA, which validates the succinctness of the floor planning of the proposed architecture. The layout size of the PPA is 295µm in width by 308µm in height, while the size estimated earlier from the floor planning is 187.8µm by 306µm. The height is almost the same, but the error in the width estimation is notable. The discrepancy is due to the insertion of the data bus lines and the local power and ground lines vertically between Blocks "1" and "3", instead of overlapping them with the PPA blocks. This reduces the stray capacitance and minimizes the interference with the data bus lines.

Fig. 5.14 shows the complete chip layout of the proposed scalar product computation unit. It consists of the core for the scalar product calculation and some auxiliary circuits, such as the on-chip registers to hold the input operands and the output results, the testing circuit for delay measurement, and the IO pads. The blocks from VA-i to VA-iv and the peripheral circuits are labeled in the figure. The physical layout resembles the floor planning sketch of Fig. 5.10. The core circuit, containing 226560 transistors, occupies a rectangular silicon area of approximately 1430µm × 1560µm. The whole chip covers an area of 2600µm × 2900µm.

---

\(^1\) The Chartered CSM 0.18µm CMOS process is closely compatible with the TSMC 0.18µm CMOS process, both in design rules and in parasitic extraction. The Chartered CSM 0.18µm CMOS process was used mainly because of foundry accessibility for geographic reasons at the time the layout was prepared.

Figure 5.14 Layout of the scalar product macrocell
The six layers of metal and one layer of poly silicon are carefully planned for the interconnections. Metal-1, Metal-2 and Poly are used by the basic cells such as the compressors. For better conductivity, the only layer of Poly is never used for the inter-cell or global connecting lines. Metal-3 and Metal-4 are used for the interconnecting lines of the PPAs. These two layers of metals are also used for the interconnecting lines of the VAs, where they do not cross over the PPAs. Metal-5 is used to connect the PPA and the VA because these lines cross over the area where Metal-3 and Metal-4 are used. Almost all of the interconnecting lines are run horizontally.

The top metal, Metal-6, is reserved for the power and ground lines linking every block of the internal core to the power ring lying around the chip. In blocks such as PPAs and VAs, the local power and ground lines, usually occupying Metal-1 and Metal-2, are arranged vertically along the columns of the compressors, such as "vdd" and "gnd" shown in Fig. 5.15. The global power and ground lines in Metal-6 are laid horizontally across the chip because the data bus lines are laid vertically. Thus, the stray capacitance between the data lines and the global power and ground lines is minimized. The horizontal interconnecting lines and the global power and ground lines are also carefully arranged to avoid overlapping. Both the horizontal global power lines and the global ground lines are laid in the shape of a pair of combs, with their fingers nearly clasped together as shown in Fig. 5.15. The comb structure of the global power line connects the local power (vdd) line segment enclosed in the ellipse “A” of Fig. 5.15 in parallel with the local power line segment enclosed in the ellipse “B”. Otherwise, they would have been connected in series. The same parallel connection applies to the local ground (gnd) line segments enclosed in the ellipses “A” and “B”. Therefore, the current density requirement on the local power and ground lines is lowered and their widths can thus be minimized to reduce the area of the chip.

**Figure 5.15 Local and global power ground lines arrangement**

5.7 Design of the on-chip auxiliary circuit

The scalar product core has a large number of inputs and outputs. The number of input bits is \(N \times R + N \times R = 16 \times 16 + 16 \times 16 = 512\), and the number of output bits is 36. It is impractical to use an IC package with more than 500 pins, so a 32-pin package is used instead, and registers multiplex the large number of inputs and outputs onto the limited I/O pins. To eliminate spurious computations while the register data are being prepared, each register output is gated by a D-latch controlled by a signal “compute”. Fig. 5.16 shows the structure of the input data registers. When “compute” is inactive, the D-latches retain the previous data fed into the scalar product core. The core is at rest and no extra power is dissipated before all the input data are written to the registers. When all data are registered, “compute” is activated to release these input bits through the D-latches to the core for a new computation. Since “compute” triggers a new cycle of calculation, it is also used in the delay detection circuits. The registers operate in two different modes. In the memory mode, each register can be accessed with a unique address. This mode is useful for configuring the core as a normal vector multiplier. In the FIFO mode, the data can be moved from one register to the next. This mode allows the core to be used as an FIR filter.
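
The gating behaviour described above can be illustrated with a small behavioural model. This is a sketch only: the class name, the method names and the FIFO shift direction are assumptions made for illustration and are not taken from the circuit of Fig. 5.16.

```python
class InputRegisterBank:
    """Behavioural model of the gated input registers (names are illustrative)."""

    def __init__(self, depth=16, word_bits=16):
        self.mask = (1 << word_bits) - 1
        self.registers = [0] * depth  # loaded through the limited I/O pins
        self.latched = [0] * depth    # D-latch outputs seen by the core

    def write(self, address, value):
        """Memory mode: access one register by its unique address."""
        self.registers[address] = value & self.mask

    def shift_in(self, value):
        """FIFO mode: move each word to the next register and load a new one."""
        self.registers = [value & self.mask] + self.registers[:-1]

    def compute(self):
        """Activate "compute": release the registered data to the core."""
        self.latched = list(self.registers)
        return self.latched


bank = InputRegisterBank(depth=4)
bank.write(2, 7)              # the core inputs are unchanged at this point
held = list(bank.latched)
released = bank.compute()     # the core now sees the new operands
```

While data are being written, `latched` keeps its previous contents, which mirrors how the D-latches keep the core idle until “compute” is activated.
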
To find the worst case delay of a circuit, the delay of each computation must be measured. Several methods can be used. One is to monitor every output and find the time elapsed from the application of the input to the last stable output. Another is to feed the circuit with input data at a progressively higher rate until the circuit fails to function [MAK96]; the data period at the verge of malfunction is the worst case delay. Both methods require reading the outputs at the correct time. Because of the pin limitation, not all the outputs of our design can be measured simultaneously, so neither method can be applied. A dedicated built-in circuit to measure the delay is therefore indispensable.

The delay detection circuit compares all the outputs of the scalar product core with the expected outputs stored in the result compare registers. For our design, a row of 36 XNOR gates followed by a 36-input AND gate is used to compare the outputs and signify the completion of the computation. The 36-input AND gate is implemented as a multi-level logic circuit as shown in Fig. 5.17. The structure is composed of seven 6-input AND gates, each realized with a two-level NAND-NOR implementation comprising two 3-input NAND gates and one 2-input NOR gate. The XNOR gates and the 36-input AND gate introduce an extra delay into the detection circuit. This delay surplus is not easy to measure on chip. Besides, it is also data dependent, i.e., the delay of the transition from a previous result $Y_1$ to the current result $Y_2$ is different from that of the transition from $Y_3$ to $Y_2$. Therefore, this surplus should be compensated from data to data, instead of being treated as a constant value.
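
The logic function of this comparison tree can be checked with a short gate-level model. The sketch below models only the Boolean behaviour, not the timing; the function names and the grouping of the 36 inputs into consecutive groups of six are illustrative assumptions.

```python
def nand(*bits):
    """NAND of any number of single-bit inputs."""
    return 0 if all(bits) else 1


def nor(*bits):
    """NOR of any number of single-bit inputs."""
    return 0 if any(bits) else 1


def xnor(a, b):
    """1 when the two bits are equal."""
    return 1 if a == b else 0


def and6(bits):
    """6-input AND built from two 3-input NANDs feeding one 2-input NOR."""
    assert len(bits) == 6
    return nor(nand(*bits[:3]), nand(*bits[3:]))


def and36(bits):
    """36-input AND: six first-level and6 gates combined by a seventh."""
    assert len(bits) == 36
    first_level = [and6(bits[i:i + 6]) for i in range(0, 36, 6)]
    return and6(first_level)


def match(core_output, expected, width=36):
    """1 when every bit of the core output equals the expected result."""
    bits = [xnor((core_output >> i) & 1, (expected >> i) & 1)
            for i in range(width)]
    return and36(bits)
```

The NOR of two NAND outputs is 1 only when both NAND outputs are 0, that is, when all six inputs are 1, which is why the two-level NAND-NOR pair realizes a 6-input AND.
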

Figure 5.17 The 36-input AND gate

Fig. 5.18 shows the proposed delay detection circuit for 1 bit. It consists of two parts. The part enclosed in the dashed rectangle labeled “d” on the left is the core result comparison circuit, which generates an aggregate delay of the core and the detection circuit; the part enclosed in the right dashed rectangle labeled “c” is the compensation circuit used to compensate for the detection circuit’s delay. Prior to the computation, when the signal “compute” is inactive, the expected result data should be set in the compare register. The multiplexer in the dashed rectangle “c” feeds the previously completed output \( Y \), which comes from the output of the scalar product core, into the matching detection circuit, consisting of an XNOR gate followed by a 36-input AND gate. When a new computation is activated by the “compute” signal, the multiplexer of the compensation circuit switches from the core output to the expected result \( Y' \). Thus the compensation circuit takes \( t_{TG} + t_{XNOR} + t_{NAND36} \) to compare the change from \( Y \) to \( Y' \) until the output “Tc” indicates a match. The core result comparison circuit in the dashed rectangle “d” is composed of an XNOR gate and a 36-input AND gate. The XNOR gate is constantly fed with the expected data \( Y' \) and the core’s output. The latter is fed through a multiplexer on the left. The multiplexer connects both inputs to the core’s output in order to insert a corresponding delay of \( t_{TG} \). Therefore the core result comparison circuit takes \( t_{CORE} + t_{TG} + t_{XNOR} + t_{NAND36} \) to generate a match indication in “Td”. The timing difference of “Tc” and “Td”, denoted as \( t_{CORE} \), is the computational delay of the core. With this delay detection circuit, only two pins, “Tc” and “Td”, need to be monitored.
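
The cancellation of the detection overhead can be written out explicitly. In the short restatement below, $t_c$ and $t_d$ are symbols introduced here for illustration (they are not used elsewhere in this chapter); they denote the times, measured from the activation of “compute”, at which the outputs of the circuits in rectangles “c” and “d” indicate a match:

$$t_c = t_{TG} + t_{XNOR} + t_{NAND36}$$

$$t_d = t_{CORE} + t_{TG} + t_{XNOR} + t_{NAND36}$$

$$t_d - t_c = t_{CORE}$$

The subtraction removes the overhead only to the extent that both branches experience the same data-dependent delay, which is presumably why both are driven by the same transition from \( Y \) to \( Y' \).
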
The layout of the delay detection circuit is shown in Fig. 5.19. Every pair of 1-bit core result comparison circuit and compensation circuit should be placed as close as possible in the layout to minimize the wire length difference to the 36-input AND gates for a more accurate delay measurement.

For the power measurement, only the power dissipation of the scalar product core is of interest. The power consumed by the auxiliary circuits should be excluded. We arranged two power supply pins, one for the core, the other for the auxiliary circuits, to facilitate the power measurement of the core.

The layout of the chip, complete with the auxiliary test circuitry, has passed all design rule checks and verification tests. At the time of writing this thesis, it has just been submitted for a multi-project wafer (MPW) fabrication run.

5.8 Simulation Results

Three types of simulation of the two circuit architectures are performed, namely the pre-layout circuit simulations of the conventional and the proposed architectures, and the post-layout simulation of the proposed circuit. The circuit simulator used is Synopsys Powermill Version 5.3. All the circuits are simulated at supply voltages ranging from 0.7V to 3.3V under the latest Chartered CSM 0.18µm CMOS technology. For the pre-layout simulations, the architecture (high level structure) of the circuit is described in the Verilog Hardware Description Language (Verilog HDL) and its gate level and transistor level (low level circuit) descriptions are written in SPICE format. For the post-layout simulation, the netlists, together with their parasitic parameters, are extracted from the layout of the circuit. For a fair comparison, both architectures are constructed from the same elementary cells and simulated with the same model file. The transistor models are from the CSM 0.18µm process library file revision 1G for Star-HSPICE Level 53. The resolution of the simulation time is set to 0.01ns. The frequency of computation (the rate at which data are input for simulation) is 50MHz for supply voltages higher than 1V, and 10MHz for supply voltages below or equal to 1V. The 1024 input data patterns are generated randomly using MATLAB. All circuits are fed with the same input data stimuli. The average power dissipation at each voltage is determined from the measured power supply current averaged over the entire sequence of data inputs. The worst-case propagation delay is taken as the longest input-to-output delay among all computations. The power efficiency is defined as the product of the average power dissipation and the worst case delay.
The final implementations of the conventional and proposed vector multiplier circuits use 189736 and 182990 transistors, respectively. The difference is due to the algorithmic and architectural improvement, as both circuits use the same fundamental adder and compressor cells. The power dissipation, worst-case delay, and power efficiency at different voltages are measured and shown in Table 5.4. The pre-layout simulation results of the conventional circuit and the proposed circuit are denoted as I and II respectively, and the post-layout simulation results of the proposed circuit are denoted as III.

To highlight the differences between the pre- and post-layout results of our proposed circuit and its performance gain over the conventional implementation, the figures of merits of the three simulations are plotted against supply voltages in Fig. 5.20 to 5.22. Since power dissipation is proportional to the simulation frequency and the worst-case delay is unaffected by the data rate, to account for the different simulation frequencies used for the two supply voltage ranges, the average power and power efficiency are proportionally scaled down by 5 times when the supply voltage is higher than 1.0V.
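
The power efficiency metric and the frequency normalization can be sketched as follows. The function names are illustrative assumptions; the example operating point (65.0mW at a 50MHz data rate and a 6.92ns worst case delay at 1.8V) is the post-layout result quoted in the summary of this chapter.

```python
def power_efficiency_pj(avg_power_mw, worst_delay_ns):
    """Power-delay product; milliwatts times nanoseconds gives picojoules."""
    return avg_power_mw * worst_delay_ns


def normalise_to_10mhz(value, vdd):
    """Scale a power or power-efficiency figure to the 10 MHz data rate.

    Supply voltages above 1.0 V were simulated at 50 MHz, so their figures
    are divided by 5; figures at or below 1.0 V are already at 10 MHz.
    """
    return value / 5.0 if vdd > 1.0 else value


pdp = power_efficiency_pj(65.0, 6.92)            # about 449.8 pJ at 50 MHz
power_10mhz = normalise_to_10mhz(65.0, vdd=1.8)  # about 13.0 mW per 10 MHz
pdp_10mhz = normalise_to_10mhz(pdp, vdd=1.8)     # about 89.96 pJ per 10 MHz
```

The delay itself is not rescaled, since the worst-case delay does not depend on the data rate.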

The degradation in power dissipation and critical delay of Simulation III compared to those of Simulation II is expected because of the dominance of interconnect coupling capacitances and RC parasitics in advanced submicron technology. However, the severity of the degradation is far lower than anticipated due to the improved layout offered by our interconnect-centric design methodology. In fact, the post-layout circuit of our proposed design has remarkably outperformed the pre-layout conventional circuit in terms of power dissipation and power efficiency. The worst-case delay of our post-layout circuit is comparable to that of the pre-layout circuit of the conventional architecture, which omits the wire capacitances. Based on the layout experience of our proposed circuit, it would probably take more than 4 man-months to complete the full custom layout of the conventional design with its irregular interconnections. Since the post-layout results of our actual circuit have already exceeded the pre-layout results of the conventional design, the physical layout of the conventional design was not carried out. If the parasitic parameters of the actual circuit of the conventional design were available, it is believed that the performance gain of our circuit would be even more salient.

Figure 5.20 Comparison of power dissipation (mW/10MHz)
Figure 5.21 Comparison of worst-case delay (ns)

Figure 5.22 Comparison of power efficiency (pJ/10MHz)

5.9 Summary

A new algorithm for the design of a VLSI circuit for scalar product evaluation has been presented. The algorithm produces a novel fully bit-parallel architecture of the scalar product macrocell featuring low interconnect complexity, improved power efficiency and highly efficient VLSI area utilization. More importantly, its layout regularity and scalability increase its performance advantage in the deep submicron regime over the conventional VLSI design of a vector processing unit for scalar product computation. The arithmetic core of the macrocell consists of a partial product generator, a partial product accumulator and a vector accumulator. Some auxiliary circuits, including a unique delay detection circuit, are proposed to enable accurate delay measurement of the core under the constraint of limited IO pins. The floor planning of the proposed architecture exploits the binary data locality across the border between the multiplication and accumulation operations, based on a fully combinational logic implementation. A comparison with the layout of the conventional vector multiplier shows that our proposed decomposition algorithm has led to a more compact, regular and modular physical design. A theoretical model for estimating the area and delay has been formulated. Compared with the conventional architecture of the same capacity, the estimation shows that our design of a 16-bit scalar product multiplier on input vectors of 16 elements achieves a saving of 36.5% of silicon area, up to 59% increase in area usage efficiency and a 23% decrease in interconnecting wire delay. In terms of average power consumption and power efficiency, our post-layout circuit surpasses even the pre-layout circuit of the conventional architecture, and its worst case delay is comparable, when these circuits are simulated using Synopsys Nanosim over supply voltages from 0.7V to 3.3V based on Chartered CSM 0.18μm CMOS technology.
The post-layout simulation shows that the worst case delay of the core at 1.8V is 6.92ns. At 50MHz input data rate, the power dissipation is 65.0mW. The relatively small deviations between the pre- and post-layout simulation results validate the inference of the theoretical estimation that the key contributors to the delay and power reduction of our proposed architecture are the shorter and balanced global and local interconnecting wires, a dominant factor in design consideration for VLSI circuits fabricated in the deep submicron technology.
Chapter 6

Covalent Redundant Binary Booth Encoded Multiplier

6.1 Introduction

Multiplication is widely used in digital signal processing applications, digital filtering for image, video and audio processing, and thousands of general purpose programs running on all computer architectures. Therefore, the digital multiplier is an essential and critical arithmetic unit in general purpose processors, digital signal processors and embedded systems. Its speed often determines how fast the processors can run.

A typical two's complement multiplier, also known as a normal binary (NB) multiplier, consists of four parts, namely the Booth encoder, the partial product generator, the partial product accumulator and the carry propagate adder (CPA). The modified Booth algorithm is adopted to reduce the number of partial products in order to speed up the partial product accumulation. The Booth encoder first recodes the multiplier into signed digit coefficients. The signed digit coefficients are then used to generate the partial products in the partial product generator. The partial products are essentially the integer multiples of the multiplicand, where some hard multiples that cannot be realized by simple shifting and complementation operations need to be generated in advance by one or more carry propagate adders. The partial product accumulator, usually implemented as a tree structure in high speed multipliers, sums up all the partial products into one carry stored number. Finally, a CPA is used to convert the carry stored number to the normal two's complement result.

The redundant binary (RB) number system was first introduced by Avizienis [AVI61] in 1961. Takagi [TAK85] proposed applying this arithmetic to fast multiplication and Edamatsu [EDA88] implemented it in VLSI. The RB system provides carry propagation free addition of two RB numbers, making it a promising substitute for the NB number system in implementing the digital multiplier in the new DSM design paradigm. Besides, communications among redundant binary adders (RBA) within and across different layers of the RBA tree are simpler than those of the full and half adders of the carry save adder (CSA) tree of the NB multiplier. The use of an RBA tree for the accumulation of partial products provides better opportunities for optimization towards a highly modular and regular layout. The Booth encoding and the RB partial product generator affect the efficiency of the partial product generation, and the number of partial products that can be saved by this stage impacts the hardware cost and the RB multiplier structure. In this chapter, we scrutinize the existing methods for the RB Booth encoder and partial product generator and propose some new algorithms for the design of low power, high performance RB multipliers.

The chapter is organized as follows: Section 6.2 introduces the redundant binary systems followed by the structure of the RB multiplier, which can be decomposed into several constituent modules. Section 6.3 explores the existing algorithms and implementations of Booth encoders and partial product generators. An ingenious covalent redundant binary Booth encoding algorithm is proposed in Section 6.4. Its circuit implementations for different radix encoders and accompanying partial product generators are described in Section 6.5. A 54×54-bit multiplier is constructed based on the proposed covalent redundant binary Booth encoder and RB partial product generator, together with the existing RBA circuit for the RBA summing tree and RB-NB converter. The structure of the RB multiplier using the proposed algorithm is compared with the contender structures of other methods. Finally, the simulation results are analyzed and discussed in Section 6.6 and the chapter is closed with a summary in Section 6.7.
6.2 Redundant binary coding and RB multiplier structures

A redundant binary (RB) number belongs to the signed digit number representation [AVI61], where the value of each digit is either $-1$ (written $\bar{1}$), 0, or 1. Therefore at least two binary bits are needed to represent each RB digit. According to the different mapping methods, there are three coding formats in RB number representation. One is the sign-magnitude coding shown in Table 6.1, where one bit, $X^M$, represents the magnitude "0" or "1" while the other, $X^S$, represents the sign, "+" or "-". This coding format can be expressed as $(X^S, X^M)$. Negating an RB digit is equivalent to complementing the sign bit, i.e., $-(X^S, X^M) = (\overline{X^S}, X^M)$.

Table 6.1 Sign-magnitude coding

<table>
<thead>
<tr>
<th>Coding</th>
<th>RB digit</th>
</tr>
</thead>
<tbody>
<tr>
<td>$(X^S, X^M)$</td>
<td>$d$</td>
</tr>
<tr>
<td>(0, 0)</td>
<td>0</td>
</tr>
<tr>
<td>(0, 1)</td>
<td>1</td>
</tr>
<tr>
<td>(1, 0)</td>
<td>0</td>
</tr>
<tr>
<td>(1, 1)</td>
<td>$\bar{1}$</td>
</tr>
</tbody>
</table>

The second format is the positive-negative coding. The relationship of the represented value $d$ and its coding bits $X^+, X^-$ is given in (6.1).

$$d = X^+ - X^-$$  \hspace{1cm} (6.1)

The positive-negative coding is widely used in recently published papers because it is easy to generate from normal binary numbers. Table 6.2 shows the positive-negative coding in $(X^+, X^-)$. A digit can be negated simply by exchanging the two coding bits, i.e., $-(X^+, X^-) = (X^-, X^+)$.

Table 6.2 Positive-negative coding

<table>
<thead>
<tr>
<th>Coding</th>
<th>RB digit</th>
</tr>
</thead>
<tbody>
<tr>
<td>$(X^+, X^-)$</td>
<td>$d$</td>
</tr>
<tr>
<td>(0, 0)</td>
<td>0</td>
</tr>
<tr>
<td>(0, 1)</td>
<td>$\bar{1}$</td>
</tr>
<tr>
<td>(1, 0)</td>
<td>1</td>
</tr>
<tr>
<td>(1, 1)</td>
<td>0</td>
</tr>
</tbody>
</table>
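
The two codings introduced so far, together with their negation rules, can be captured in a few lines. The following is a minimal sketch; the function names are illustrative assumptions, and `rb_value` assumes that the digits are listed most significant first.

```python
def decode_sign_magnitude(xs, xm):
    """Sign-magnitude coding: xs is the sign bit, xm the magnitude bit."""
    return -xm if xs else xm


def negate_sign_magnitude(xs, xm):
    """Negation complements the sign bit only."""
    return (1 - xs, xm)


def decode_pos_neg(xp, xn):
    """Positive-negative coding, Eq. (6.1): d = X+ - X-."""
    return xp - xn


def negate_pos_neg(xp, xn):
    """Negation exchanges the two coding bits."""
    return (xn, xp)


def rb_value(digits):
    """Value of an RB number given as (X+, X-) pairs, most significant first."""
    value = 0
    for xp, xn in digits:
        value = 2 * value + decode_pos_neg(xp, xn)
    return value
```

For example, `rb_value([(1, 0), (0, 1), (1, 0)])` and `rb_value([(0, 0), (1, 0), (1, 0)])` both evaluate to 3, which illustrates the redundancy of the representation.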

The remaining format is the positive-negative complement coding. The relationship between the value of the digit and its coding bits is given by (6.2)

\[ d = X^+ - X^- \]  

(6.2)

Table 6.3 shows the signed digits represented by this coding. Negating an RB digit is equivalent to complementing both coding bits, i.e., \( -(X^+, X^-) = (\overline{X^+}, \overline{X^-}) \).

<table>
<thead>
<tr>
<th>Coding $(X^+, X^-)$</th>
<th>RB digit $d$</th>
</tr>
</thead>
<tbody>
<tr>
<td>(0, 0)</td>
<td>1</td>
</tr>
<tr>
<td>(0, 1)</td>
<td>0</td>
</tr>
<tr>
<td>(1, 0)</td>
<td>0</td>
</tr>
<tr>
<td>(1, 1)</td>
<td>1</td>
</tr>
</tbody>
</table>

Since positive-negative coding is the most frequently used coding format for RB multiplication, the multipliers and their algorithms and architectures discussed in this chapter are designed based on this coding format.

As the peripheral interfaces of most digital systems are based on the normal binary number system, the RB multiplier needs to convert the normal binary multiplicand to redundant binary number. After obtaining the product in the redundant binary format, it is necessary to convert it back to the normal binary format. Therefore a RB Booth’s multiplier is composed of four parts, which are the Booth encoder to generate the signed digit coefficients for the RB multiples, the RB partial product generator to convert the NB multiplicand to RB partial products according to the signed digit coefficients, the RB partial product summing tree to compress the partial products to one RB result, and the RB to NB converter to output the result in normal binary format for standard bus interface. The structure of the RB multiplier is shown in Fig. 6.1(b). For comparison, the structure of the NB multiplier is shown in Fig. 6.1(a).
Although the RB summing tree is in general more costly to implement, the Booth encoder and the RB partial product generator determine how efficient these RB partial products are generated, and therefore indirectly dictate the amount of hardware needed to construct the multiplier. Before explaining our proposed algorithm, it is necessary to explore the existing methods for Booth encoding and RB partial product generation.

6.3 Existing Booth encoding and partial product generation methods for RB multiplication

The Booth algorithm [BOO51] is an efficient way to reduce the number of partial products in a fast signed digital multiplier because it groups any series of “1”s in one of the operands called the multiplier (the other operand is called the multiplicand). The modified Booth encoding algorithm [MAC61] is a parallel counterpart of the serial Booth encoding. Therefore, it is more suitable for the design and implementation of high-speed digital multipliers in hardware. For convenience, we refer to the modified Booth encoding as Booth encoding. As the radix \(2^N\) \((N = 1, 2, 3, \ldots)\) of the Booth-\(N\) encoding increases, the number of recoded Booth digits for the multiplier decreases to \(1/N\) of the number of multiplier bits, and so does the number of partial products.
6.3.1 Normal Booth encoding

In Booth-N encoding, a Booth-N digit $d_i$ is equivalent to $N$ normal binary bits. Besides the $N$ bits to be encoded, the generation of a Booth-N digit requires one additional bit to be borrowed from the bit to the right of the LSB of this $N$-bit string. If $b_{i+N-1} b_{i+N-2} \ldots b_i b_{i-1}$ are the multiplier bits to be encoded, then $b_{i-1}$ will be the borrowed bit, which is also called the overlapping bit. The encoded digit has the value given by (6.3).

$$d_i = -2^{N-1} b_{i+N-1} + \sum_{n=0}^{N-2} 2^n b_{i+n} + b_{i-1}$$  \hspace{1cm} (6.3)

Tables 6.4 to 6.7 show the Booth encoded digits and their corresponding binary bits with the overlapping bit in bracket. The multiples represented by the Booth encoded digits are also shown in the tables, where $M$ denotes the multiplicand.
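
As an illustration of (6.3), the following minimal Python sketch recodes a two's complement multiplier into Booth-$N$ digits and rebuilds its value from them. The function names `to_bits`, `booth_digits` and `booth_value` are hypothetical helpers introduced here, and the sign extension used to pad the word length to a multiple of $N$ is an assumption of this sketch.

```python
def to_bits(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]


def booth_digits(value, width, n):
    """Booth-n digits (least significant first) of a width-bit multiplier, per (6.3)."""
    bits = to_bits(value, width)
    # Assumption of this sketch: sign-extend until the length is a multiple of n.
    while len(bits) % n:
        bits.append(bits[-1])
    digits = []
    for i in range(0, len(bits), n):
        overlap = bits[i - 1] if i > 0 else 0           # b_{i-1}, zero below the LSB
        d = -(1 << (n - 1)) * bits[i + n - 1]           # -2^(N-1) * b_{i+N-1}
        d += sum((1 << k) * bits[i + k] for k in range(n - 1))
        d += overlap
        digits.append(d)
    return digits


def booth_value(digits, n):
    """Value represented by radix-2^n Booth digits."""
    return sum(d * (1 << (n * j)) for j, d in enumerate(digits))


print(booth_digits(6, 4, 2))   # [-2, 2], i.e. 6 = 2*4 - 2
```

Each digit stays within $-2^{N-1}$ to $2^{N-1}$, which is the range of multiples listed in the tables.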

When $N = 1$, the Booth-1 digit $d_i$ is converted from $b_i(b_{i-1})$. The encoded digit has the value given by:

$$d_i = -b_i + b_{i-1}$$  \hspace{1cm} (6.4)

The possible digits are $1$, $0$ and $\overline{1}$. Their mapping is listed in Table 6.4.

<table>
<thead>
<tr>
<th>Normal Binary $b_i(b_{i-1})$</th>
<th>Booth-1 Digit $d_i$</th>
<th>Multiple $xM$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0 1</td>
<td>1</td>
<td>$1M$</td>
</tr>
<tr>
<td>1 0</td>
<td>$\overline{1}$</td>
<td>$-1M$</td>
</tr>
<tr>
<td>1 1</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

Booth-1 encoding is seldom used in normal Booth multiplier because it does not help to reduce the number of partial products compared with non-Booth encoded multiplier.

In Booth-2 encoding, $N = 2$, and every Booth-2 digit, $D_i$ is mapped from the bits $b_{i+1} b_i(b_{i-1})$. Therefore, the value of the encoded digit $D_i$ is given by:

$$D_i = -2b_{i+1} + b_i + b_{i-1}$$  \hspace{1cm} (6.5)

The legitimate digits in a Booth-2 encoded number are $2$, $1$, $0$, $\overline{1}$ and $\overline{2}$. The mapping from the normal binary bits to the Booth-1 digits and the Booth-2 digit is shown in Table 6.5.

<table>
<thead>
<tr>
<th>Normal Binary $b_{i+1} b_i (b_{i-1})$</th>
<th>Booth-1 Digits $d_{i+1} d_i$</th>
<th>Booth-2 Digit $D_i$</th>
<th>Multiple $xM$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>0 0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0 0 1</td>
<td>0 1</td>
<td>1</td>
<td>1M</td>
</tr>
<tr>
<td>0 1 0</td>
<td>1 $\overline{1}$</td>
<td>1</td>
<td>1M</td>
</tr>
<tr>
<td>0 1 1</td>
<td>1 0</td>
<td>2</td>
<td>2M</td>
</tr>
<tr>
<td>1 0 0</td>
<td>$\overline{1}$ 0</td>
<td>$\overline{2}$</td>
<td>-2M</td>
</tr>
<tr>
<td>1 0 1</td>
<td>$\overline{1}$ 1</td>
<td>$\overline{1}$</td>
<td>-1M</td>
</tr>
<tr>
<td>1 1 0</td>
<td>0 $\overline{1}$</td>
<td>$\overline{1}$</td>
<td>-1M</td>
</tr>
<tr>
<td>1 1 1</td>
<td>0 0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

In Booth-3 and Booth-4 encoding, the values of the encoded digits are expressed by (6.6) and (6.7), respectively.

\[
d_i = -2^2 b_{i+2} + 2 b_{i+1} + b_i + b_{i-1}
\] (6.6)

\[
d_i = -2^3 b_{i+3} + 2^2 b_{i+2} + 2 b_{i+1} + b_i + b_{i-1}
\] (6.7)

Tables 6.6 and 6.7 show the normal binary bits and their equivalent multiples for Booth-3 and Booth-4 encoding, respectively, where the hard multiples are marked with "*".

<table>
<thead>
<tr>
<th>Normal Binary $b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th>Multiple $xM$</th>
<th>Normal Binary $b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th>Multiple $xM$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 (0)</td>
<td>0</td>
<td>1 0 0 (0)</td>
<td>-4M</td>
</tr>
<tr>
<td>0 0 0 (1)</td>
<td>1M</td>
<td>1 0 0 (1)</td>
<td>-3M*</td>
</tr>
<tr>
<td>0 0 1 (0)</td>
<td>1M</td>
<td>1 0 1 (0)</td>
<td>-3M*</td>
</tr>
<tr>
<td>0 0 1 (1)</td>
<td>2M</td>
<td>1 0 1 (1)</td>
<td>-2M</td>
</tr>
<tr>
<td>0 1 0 (0)</td>
<td>2M</td>
<td>1 1 0 (0)</td>
<td>-2M</td>
</tr>
<tr>
<td>0 1 0 (1)</td>
<td>3M*</td>
<td>1 1 0 (1)</td>
<td>-M</td>
</tr>
<tr>
<td>0 1 1 (0)</td>
<td>3M*</td>
<td>1 1 1 (0)</td>
<td>-M</td>
</tr>
<tr>
<td>0 1 1 (1)</td>
<td>4M</td>
<td>1 1 1 (1)</td>
<td>-0</td>
</tr>
</tbody>
</table>
Table 6.7 Booth-4 encoding

<table>
<thead>
<tr>
<th>Normal Binary $b_{i+3} b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th>Multiple $xM$</th>
<th>Normal Binary $b_{i+3} b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th>Multiple $xM$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 0 (0)</td>
<td>0</td>
<td>1 0 0 0 (0)</td>
<td>$-8M$</td>
</tr>
<tr>
<td>0 0 0 0 (1)</td>
<td>$M$</td>
<td>1 0 0 0 (1)</td>
<td>$-7M^*$</td>
</tr>
<tr>
<td>0 0 0 1 (0)</td>
<td>$M$</td>
<td>1 0 0 1 (0)</td>
<td>$-7M^*$</td>
</tr>
<tr>
<td>0 0 0 1 (1)</td>
<td>$2M$</td>
<td>1 0 0 1 (1)</td>
<td>$-6M^*$</td>
</tr>
<tr>
<td>0 0 1 0 (0)</td>
<td>$2M$</td>
<td>1 0 1 0 (0)</td>
<td>$-6M^*$</td>
</tr>
<tr>
<td>0 0 1 0 (1)</td>
<td>$3M^*$</td>
<td>1 0 1 0 (1)</td>
<td>$-5M^*$</td>
</tr>
<tr>
<td>0 0 1 1 (0)</td>
<td>$3M^*$</td>
<td>1 0 1 1 (0)</td>
<td>$-5M^*$</td>
</tr>
<tr>
<td>0 0 1 1 (1)</td>
<td>$4M$</td>
<td>1 0 1 1 (1)</td>
<td>$-4M$</td>
</tr>
<tr>
<td>0 1 0 0 (0)</td>
<td>$4M$</td>
<td>1 1 0 0 (0)</td>
<td>$-4M$</td>
</tr>
<tr>
<td>0 1 0 0 (1)</td>
<td>$5M^*$</td>
<td>1 1 0 0 (1)</td>
<td>$-3M^*$</td>
</tr>
<tr>
<td>0 1 0 1 (0)</td>
<td>$5M^*$</td>
<td>1 1 0 1 (0)</td>
<td>$-3M^*$</td>
</tr>
<tr>
<td>0 1 0 1 (1)</td>
<td>$6M^*$</td>
<td>1 1 0 1 (1)</td>
<td>$-2M$</td>
</tr>
<tr>
<td>0 1 1 0 (0)</td>
<td>$6M^*$</td>
<td>1 1 1 0 (0)</td>
<td>$-2M$</td>
</tr>
<tr>
<td>0 1 1 0 (1)</td>
<td>$7M^*$</td>
<td>1 1 1 0 (1)</td>
<td>$-M$</td>
</tr>
<tr>
<td>0 1 1 1 (0)</td>
<td>$7M^*$</td>
<td>1 1 1 1 (0)</td>
<td>$-M$</td>
</tr>
<tr>
<td>0 1 1 1 (1)</td>
<td>$8M$</td>
<td>1 1 1 1 (1)</td>
<td>$-0$</td>
</tr>
</tbody>
</table>

Intuitively, selecting a higher radix Booth algorithm to encode the multiplier would reduce more partial products for a faster multiplier to be constructed. However, a close examination of Tables 6.6 and 6.7 will reveal that the number of multiples increases commensurately with the radix to $2^N + 1$. Besides, the number of hard multiples which are not the powers of two also increases. For example, in Booth-3 encoding, there are two hard multiples, $3M$ and $-3M$ out of a total of nine distinct multiples while in Booth-4 encoding, there are eight hard multiples, which are $±3M$, $±5M$, $±6M$, $±7M$ out of seventeen distinct multiples. All these hard multiples cannot be obtained by simple shifting and/or complementation operations on the multiplicand. Additional time consuming carry propagate adders are required to generate them before the partial products are generated. The advantage of Booth-3 and higher radix Booth encoding is compromised due to the long delay and complex decoding logic required for the generation of the hard multiples.
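
The counts quoted above can be reproduced with a short Python sketch. It assumes, in line with the examples in the text, that a multiple is "hard" when its magnitude is neither zero nor a power of two; `booth_multiples` and `is_hard` are hypothetical names introduced for this illustration.

```python
def booth_multiples(n):
    """Distinct multiples of M that Booth-n encoding can request, from (6.3)."""
    top = 1 << (n - 1)
    return list(range(-top, top + 1))


def is_hard(m):
    """Assumed criterion: magnitude is neither 0 nor a power of two."""
    a = abs(m)
    return a != 0 and (a & (a - 1)) != 0


for n in (2, 3, 4):
    multiples = booth_multiples(n)
    hard = [m for m in multiples if is_hard(m)]
    print(n, len(multiples), hard)
```

For Booth-3 this lists the two hard multiples $\pm 3M$ out of nine multiples, and for Booth-4 the eight hard multiples $\pm 3M$, $\pm 5M$, $\pm 6M$ and $\pm 7M$ out of seventeen.
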
6.3.2 NBE-RBPPG: RB partial product generation with normal Booth encoding

An RB partial product, $R$, can be obtained by two NB partial products, $A$ and $B$ with the definition of the positive-negative coding format. This method of generating RB partial products is abbreviated as NBE-RBPPG.

$$ R = A + B = A - (-B) $$

(6.8)

Since $-B = \overline{B} + 1$, where $\overline{B}$ is the bitwise complement of $B$, substituting it into (6.8) gives:

$$ R = A - (\overline{B} + 1) = A - \overline{B} - 1 $$

(6.9)

If $A$ and $B$ are $K$-bit two’s complement numbers, then

$$ A = -2^{K-1}a_{K-1} + \sum_{i=0}^{K-2}2^i a_i $$

(6.10)

and

$$ B = -2^{K-1}b_{K-1} + \sum_{i=0}^{K-2}2^i b_i $$

(6.11)

The number of RB digits of the RB partial product is also $K$. With $\overline{b}_i$ denoting the complement of the bit $b_i$,

$$ R = \left( -2^{K-1}a_{K-1} + \sum_{i=0}^{K-2}2^i a_i \right) - \left( -2^{K-1}\overline{b}_{K-1} + \sum_{i=0}^{K-2}2^i \overline{b}_i \right) - 1 $$

$$ = 2^{K-1}(\overline{b}_{K-1} - a_{K-1}) + \sum_{i=0}^{K-2}2^i(a_i - \overline{b}_i) - 1 $$

(6.12)

It should be noted that the most significant digit has been negated. The RB digit string of the partial product, $R$, expressed in positive-negative coding format is:

$$ (\overline{b}_{K-1}, a_{K-1}) (a_{K-2}, \overline{b}_{K-2}) \cdots (a_1, \overline{b}_1)(a_0, \overline{b}_0) + (0, 1) $$

(6.13)

If $A$ and $B$ are unsigned numbers used as the mantissa of the floating point number in compliance with the IEEE 754 standard, then the RB partial product is

$$ R = \left( \sum_{i=0}^{K-1}2^i a_i \right) - \left( \sum_{i=0}^{K-1}2^i \overline{b}_i \right) - 1 = \sum_{i=0}^{K-1}2^i(a_i - \overline{b}_i) - 1 $$

(6.14)

The RB number, expressed in positive-negative format is

$$ (a_{K-1}, \overline{b}_{K-1})(a_{K-2}, \overline{b}_{K-2}) \cdots (a_1, \overline{b}_1)(a_0, \overline{b}_0) + (0, 1) $$

(6.15)
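
The two's complement case can be checked numerically with the following Python sketch. It assumes that each RB digit pairs a bit of $A$ with the complemented bit of $B$, that the most significant digit is negated, and that the constant $-1$ (the RB digit $(0, 1)$) is added separately; `nbe_rb_digits` and `rb_value` are hypothetical helper names.

```python
def bits(value, k):
    """k-bit two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(k)]


def nbe_rb_digits(a, b, k):
    """RB digits (plus, minus) combining k-bit two's complement A and B.

    Digits are listed LSB first. The correction of -1 is NOT included here.
    """
    abits = bits(a, k)
    nb = [1 - x for x in bits(b, k)]                 # complemented bits of B
    digits = [(abits[i], nb[i]) for i in range(k - 1)]
    digits.append((nb[k - 1], abits[k - 1]))         # negated most significant digit
    return digits


def rb_value(digits):
    """Value of a positive-negative coded RB digit string."""
    return sum((p - m) * (1 << i) for i, (p, m) in enumerate(digits))


digits = nbe_rb_digits(3, 2, 4)
print(digits, rb_value(digits) - 1)   # the corrected value equals 3 + 2
```

The uncorrected digit string always evaluates to $A + B + 1$ in this sketch, which is why every RB partial product formed this way needs the constant correction digit.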

[MAK96] used this method and the normal Booth encoder to implement a 54×54-bit RB multiplier. The two NB partial products generated by two adjacent Booth encoders compose one RB partial product. Due to the limitation of the hard multiples, only encodings up to Booth-2 are used. The algorithm is straightforward, but every RB partial product requires one constant RB digit (0,1) to correct its value, as indicated in (6.13) and (6.15). All the correction constants of the RB partial products can be accumulated to form a new RB partial product, called the RB correction vector. For the Booth-N multiplier of word length K, the number of RB partial products generated is:

\[ \left\lfloor \frac{K}{N} \right\rfloor + 1. \]

The RB correction vector incurs additional hardware for its accumulation. It can even increase the number of stages of the summing tree, if the word length of the multiplier is 2^n, such as the multipliers in single extended and double extended floating point number formats, whose effective mantissa are 32 and 64, respectively. Consequently, the power dissipation and worst case delay are also degraded by the inclusion of the correction vector.

Fig. 6.2 shows a pair of normal Booth-2 encoders used to generate the coefficients of the RB partial products. As the encoder outputs the sign and magnitude signals, the NB partial products are generated from the non-negative multiples of the multiplicand. To obtain the negative NB multiples, the positive multiples need to be complemented followed by an addition of 1. Therefore, it is necessary to send both sign signals of the Booth encoders to correct the RB partial product.
Fig. 6.3 shows one digit of the RB partial product generator, which is composed of two one bit NB partial product generators.

**Figure 6.3 One digit redundant binary partial product generator for NBE-RBPPG**

### 6.3.3 RBSD Booth encoding and partial product generation

To address the problem of generating hard multiples in high-radix Booth encoding, Besli and Deshmukh [BES02(1), BES02(2)] noticed that some multiples can be obtained by subtracting one non-hard multiple from another. The partial products generated in this manner fit the format of the positive-negative RB coding. This is realized by a different Booth encoding logic, called the RB Signed Digit (RBSD) Booth encoding.

Table 6.8 shows the RBSD Booth-3 encoding, where the original hard multiples \( \pm 3M \) are replaced by \( \pm (4M - M) \). Table 6.9 shows the RBSD Booth-4 encoding.

**Table 6.8 RBSD Booth-3 encoding**

<table>
<thead>
<tr>
<th rowspan="2">Normal Binary $b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th colspan="2">Multiple</th>
<th rowspan="2">Normal Binary $b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th colspan="2">Multiple</th>
</tr>
<tr>
<th>$+M$</th>
<th>$-M$</th>
<th>$+M$</th>
<th>$-M$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 (0)</td>
<td>0</td>
<td>0</td>
<td>1 0 0 (0)</td>
<td>0</td>
<td>4M</td>
</tr>
<tr>
<td>0 0 0 (1)</td>
<td>M</td>
<td>0</td>
<td>1 0 0 (1)</td>
<td>M</td>
<td>4M</td>
</tr>
<tr>
<td>0 0 1 (0)</td>
<td>M</td>
<td>0</td>
<td>1 0 1 (0)</td>
<td>M</td>
<td>4M</td>
</tr>
<tr>
<td>0 0 1 (1)</td>
<td>2M</td>
<td>0</td>
<td>1 0 1 (1)</td>
<td>0</td>
<td>2M</td>
</tr>
<tr>
<td>0 1 0 (0)</td>
<td>2M</td>
<td>0</td>
<td>1 1 0 (0)</td>
<td>0</td>
<td>2M</td>
</tr>
<tr>
<td>0 1 0 (1)</td>
<td>4M</td>
<td>M</td>
<td>1 1 0 (1)</td>
<td>0</td>
<td>M</td>
</tr>
<tr>
<td>0 1 1 (0)</td>
<td>4M</td>
<td>M</td>
<td>1 1 1 (0)</td>
<td>0</td>
<td>M</td>
</tr>
<tr>
<td>0 1 1 (1)</td>
<td>4M</td>
<td>0</td>
<td>1 1 1 (1)</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

**Table 6.9 RBSD Booth-4 encoding**

<table>
<thead>
<tr>
<th rowspan="2">Normal Binary $b_{i+3} b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th colspan="2">Multiple</th>
<th rowspan="2">Normal Binary $b_{i+3} b_{i+2} b_{i+1} b_i (b_{i-1})$</th>
<th colspan="2">Multiple</th>
</tr>
<tr>
<th>$+M$</th>
<th>$-M$</th>
<th>$+M$</th>
<th>$-M$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 0 (0)</td>
<td>0</td>
<td>0</td>
<td>1 0 0 0 (0)</td>
<td>0</td>
<td>8M</td>
</tr>
<tr>
<td>0 0 0 0 (1)</td>
<td>M</td>
<td>0</td>
<td>1 0 0 0 (1)</td>
<td>M</td>
<td>8M</td>
</tr>
<tr>
<td>0 0 0 1 (0)</td>
<td>M</td>
<td>0</td>
<td>1 0 0 1 (0)</td>
<td>M</td>
<td>8M</td>
</tr>
<tr>
<td>0 0 0 1 (1)</td>
<td>2M</td>
<td>0</td>
<td>1 0 0 1 (1)</td>
<td>2M</td>
<td>8M</td>
</tr>
<tr>
<td>0 0 1 0 (0)</td>
<td>2M</td>
<td>0</td>
<td>1 0 1 0 (0)</td>
<td>2M</td>
<td>8M</td>
</tr>
<tr>
<td>0 0 1 0 (1)</td>
<td>4M</td>
<td>M</td>
<td>1 0 1 0 (1)</td>
<td>0</td>
<td>5M*</td>
</tr>
<tr>
<td>0 0 1 1 (0)</td>
<td>4M</td>
<td>M</td>
<td>1 0 1 1 (0)</td>
<td>0</td>
<td>5M*</td>
</tr>
<tr>
<td>0 0 1 1 (1)</td>
<td>4M</td>
<td>0</td>
<td>1 0 1 1 (1)</td>
<td>0</td>
<td>4M</td>
</tr>
<tr>
<td>0 1 0 0 (0)</td>
<td>4M</td>
<td>0</td>
<td>1 1 0 0 (0)</td>
<td>0</td>
<td>4M</td>
</tr>
<tr>
<td>0 1 0 0 (1)</td>
<td>5M*</td>
<td>0</td>
<td>1 1 0 0 (1)</td>
<td>M</td>
<td>4M</td>
</tr>
<tr>
<td>0 1 0 1 (0)</td>
<td>5M*</td>
<td>0</td>
<td>1 1 0 1 (0)</td>
<td>M</td>
<td>4M</td>
</tr>
<tr>
<td>0 1 0 1 (1)</td>
<td>8M</td>
<td>2M</td>
<td>1 1 0 1 (1)</td>
<td>0</td>
<td>2M</td>
</tr>
<tr>
<td>0 1 1 0 (0)</td>
<td>8M</td>
<td>2M</td>
<td>1 1 1 0 (0)</td>
<td>0</td>
<td>2M</td>
</tr>
<tr>
<td>0 1 1 0 (1)</td>
<td>8M</td>
<td>M</td>
<td>1 1 1 0 (1)</td>
<td>0</td>
<td>M</td>
</tr>
<tr>
<td>0 1 1 1 (0)</td>
<td>8M</td>
<td>M</td>
<td>1 1 1 1 (0)</td>
<td>0</td>
<td>M</td>
</tr>
<tr>
<td>0 1 1 1 (1)</td>
<td>8M</td>
<td>0</td>
<td>1 1 1 1 (1)</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Chapter 6  Covalent Redundant Binary Booth Encoded Multiplier

Among the four hard multiples in the original Booth-4 encoding, 3M, 6M and 7M are easily obtained by the combination of two non-hard multiples in positive-negative pair. The only exception is the hard multiple 5M (marked by "*" in Table 6.9), which cannot be generated in this manner. Therefore, additional hardware is necessary to generate the 5M multiple. A simple RB adder can be used to form the 5M multiple by adding 4M and M as shown in Fig. 6.4 [BES02(2)]. Fortunately, this RB addition is carry-free and it does not lie in the critical path of the RBSD Booth encoder and RB partial product generator circuit.
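
Why $5M$ is the exception can be seen from a brute-force search over differences of non-hard multiples. This is a minimal Python sketch, not the encoder logic of [BES02(2)]; the helper name `rbsd_pairs` is hypothetical.

```python
from itertools import product


def rbsd_pairs(m, n):
    """All (plus, minus) pairs of non-hard multiples with plus - minus == m."""
    easy = [0] + [1 << k for k in range(n)]     # 0, 1, 2, ..., 2^(n-1)
    return [(p, q) for p, q in product(easy, repeat=2) if p - q == m]


for m in (3, 5, 6, 7):
    print(m, rbsd_pairs(m, 4))
```

For Booth-4 the search finds $3M = 4M - M$, $6M = 8M - 2M$ and $7M = 8M - M$, but returns an empty list for $5M$.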

![Figure 6.4 RB adder for generating the k-th RB digit of the 5M hard multiple](image)

Fig. 6.5 to 6.7 show the RBSD Booth-2 to Booth-4 encoders and the RB partial product (RBPP) generators, where pp$^+$ and pp$^-$ are the two coding bits of the k-th RB digit of the partial product.

![Figure 6.5 RBSD Booth-2 encoder and RBPP generator](image)
The RBSD Booth-3 and Booth-4 encoders are clearly much more complex and use high fan-in gates, especially in the partial product generator circuit. Since the circuit for each digit of the RB partial product is duplicated many times, more hardware is incurred.

At first glance, the RBSD encoding appears to be a "perfect" solution, as no correction vector is needed for the RB partial products, and the area and time consuming hardware for generating the hard multiples in high radix Booth encoding has been eliminated. However, the cost of achieving these merits is an increase (almost a doubling) in the number of RB partial products. This is because only one Booth encoded coefficient is used to generate an RB partial product, as opposed to the earlier method where two Booth encoded coefficients are used to generate one RB partial product. Half of the binary bits representing the RB partial product generated from a non-hard multiple in the RBSD encoding are filled with "0"s, which is very inefficient.

For the RBSD Booth-$N$ multiplier of word length $K$, the total number of RB partial products generated is:
which is almost twice that of the previous NBE method. One extra layer of RB partial product summing tree is needed, which increases both the power dissipation and the worst case delay.

6.4 CRBBE: Proposed Booth encoding and partial product generation algorithms for RB multiplication

In this section, we propose a novel Covalent Redundant Binary Booth Encoding (CRBBE) algorithm to overcome the shortcoming of the RBSD Booth encoding method presented in the last section. Our algorithm utilizes the characteristics of the Booth encoded numbers to generate a reduced number of RB partial products without inducing any RB correction vector.

6.4.1 CRBBE-1: the Booth-1 encoding for RB multiplication

From (6.4), any pair of contiguous encoded digits $d_{i+1}d_i$ of Booth-1 encoding, which is mapped from two adjacent bits of the normal binary number $b_{i+1}(b_{(i+1)-1})$ and $b_i(b_{i-1})$ (refer to Table 6.4 for the mapping) can never exhaust all possible combinations of two digits from the set \{0, 1, $\overline{1}$\}. This is because of the constraint that the borrowed bit from which the digit $d_{i+1}$ is mapped must come from the MSB of the binary bits from which its right contiguous digit $d_i$ is mapped, i.e. $b_{(i+1)-1} = b_i$. Table 6.10 shows the permissible combinations of contiguous digit pairs in Booth-1 encoded number. They are grouped into four categories according to the left digit $d_{i+1}$.
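
The constraint can be checked by enumerating the three bits involved. The Python sketch below applies (6.4) to both digits with the shared bit $b_i$; `booth1_pairs` is a hypothetical name.

```python
from itertools import product


def booth1_pairs():
    """Set of (d_{i+1}, d_i) reachable from the bits b_{i+1} b_i (b_{i-1})."""
    pairs = set()
    for b_hi, b_mid, b_lo in product((0, 1), repeat=3):
        pairs.add((-b_hi + b_mid, -b_mid + b_lo))   # (6.4) applied twice
    return pairs


print(sorted(booth1_pairs()))
```

The pairs $(1, 1)$ and $(\overline{1}, \overline{1})$ never occur, so two nonzero contiguous Booth-1 digits always have opposite signs.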

Table 6.10 Permissible pairs of contiguous digits ($d_{i+1}d_i$) in Booth-1 encoded number

| $d_{i+1}=1$ | $d_{i+1}=0$ | $d_{i+1}=\overline{0}$ | $d_{i+1}=\overline{1}$ |
|---|---|---|---|
| $1\overline{0}$ | $00$ | $\overline{0}\,\overline{0}$ | $\overline{1}0$ |
| $1\overline{1}$ | $01$ | $\overline{0}\,\overline{1}$ | $\overline{1}1$ |

In Table 6.10, a zero digit is written as \( 0 \) when it is encoded from the bits 00 and as \( \overline{0} \) when it is encoded from the bits 11.

In a redundant binary multiplier, an RB partial product can be formed straightforwardly by grouping the same weighted bits from two normal binary partial products generated from a pair of adjacent Booth encoders using any of the coding formats presented in Section 6.2. Thus, if two adjacent normal binary Booth encoders always generate signed digit coefficients of opposite signs, their corresponding NB partial products can easily be combined to form a single positive-negative coded RB partial product without any correction digit. This is possible if the contiguous digits of the Booth encoded multiplier alternate in sign. We call this algorithm, which generates a compound RB partial product from two Booth encoded digits, the Covalent Redundant Binary Booth Encoding (CRBBE), by analogy with the way a covalent compound is formed from charge sharing. From Table 6.10, the criterion of having contiguous encoded digits of opposite polarity can be fulfilled by rewriting the duplets such that one digit of the pair is positive and the other digit is negative without changing the value of the compound multiple, i.e., \( 2d_{i+1} + d_i \) remains unchanged. The duplets in the columns \( d_{i+1}=0 \) and \( d_{i+1}=\overline{0} \) of Table 6.10 need to be rewritten. Using the fact that the digit 0 is neutral and is equivalent to \( \overline{0} \), we can reformat the duplets \( 01 \) to \( \overline{0}1 \), \( 00 \) to \( \overline{0}0 \), \( \overline{0}\,\overline{0} \) to \( 0\overline{0} \), and \( \overline{0}\,\overline{1} \) to \( 0\overline{1} \) to obtain the required positive-negative coefficient pair without changing the values of their compound multiples. Table 6.11 shows the resultant contiguous digit pairs for the proposed Covalent RB Booth-1 (CRBBE-1) encoding.

<table>
<thead>
<tr>
<th>\( d_{i+1}=1 \)</th>
<th>\( d_{i+1}=0 \)</th>
<th>\( d_{i+1}=\overline{0} \)</th>
<th>\( d_{i+1}=\overline{1} \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>\( 1\overline{0} = (2, 0) \)</td>
<td>\( 00 = \overline{0}0 = (0, 0) \)</td>
<td>\( \overline{0}\,\overline{0} = 0\overline{0} = (0, 0) \)</td>
<td>\( \overline{1}0 = (0, 2) \)</td>
</tr>
<tr>
<td>\( 1\overline{1} = (2, 1) \)</td>
<td>\( 01 = \overline{0}1 = (1, 0) \)</td>
<td>\( \overline{0}\,\overline{1} = 0\overline{1} = (0, 1) \)</td>
<td>\( \overline{1}1 = (1, 2) \)</td>
</tr>
</tbody>
</table>

In Table 6.11, the compound signed digit coefficient, \( r_i = (r_i^+, r_i^-) = r_i^+ - r_i^- \), generated by each duplet, \( d_{i+1}d_i \), is also shown in brackets. The shaded cells in Table 6.11 represent the positive-negative pairs while the unshaded cells represent the negative-positive pairs. For every positive-negative digit pair in the shaded cells, there is a corresponding negative-positive digit pair in the unshaded cells whose resultant multiple differs only in sign. This property can be used to greatly simplify the hardware for the CRBBE circuit so that only the absolute values of the coefficients need to be generated.

Thus, the values of \( r^+ \) and \( r^- \) are non-negative as obtained from the outputs of the adjacent Booth encoders. A negative compound coefficient can be obtained from its positive counterpart by simply swapping the values of \( r^+ \) and \( r^- \). To simplify the design of the Booth encoder, the signs of the digits \( d_{i+1} \) and \( d_i \) are ignored at the outset. The sign of the least significant digit \( d_i \) in the duplet is then used to determine if the swapping of the coefficients \( r^+_i \) and \( r^-_i \) is necessary. The criterion for the ordering of the coefficients is given by (6.16):

\[
(r^+_i, r^-_i) = \begin{cases} 
(2|d_{i+1}|, |d_i|) & \text{if } d_i \text{ is negative} \\
(|d_i|, 2|d_{i+1}|) & \text{if } d_i \text{ is positive}
\end{cases}
\] (6.16)
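
A minimal Python sketch of the CRBBE-1 ordering rule follows. It assumes that the sign of \( d_i \) is read from the shared bit \( b_i \) (so a zero digit counts as negative when \( b_i = 1 \)) and that the weight 2 always belongs to \( |d_{i+1}| \); the function name `crbbe1` is hypothetical.

```python
from itertools import product


def crbbe1(b_hi, b_mid, b_lo):
    """(r_plus, r_minus) for the bits b_{i+1}, b_i and the overlapping bit b_{i-1}."""
    mag_hi = b_hi ^ b_mid          # |d_{i+1}|
    mag_lo = b_mid ^ b_lo          # |d_i|
    if b_mid == 1:                 # d_i is negative (or a negative zero): no swap
        return (2 * mag_hi, mag_lo)
    return (mag_lo, 2 * mag_hi)    # d_i is positive: coefficients are swapped


for bits in product((0, 1), repeat=3):
    r_plus, r_minus = crbbe1(*bits)
    print(bits, (r_plus, r_minus), r_plus - r_minus)
```

For all eight bit patterns both coefficients are non-negative and their difference equals \( 2d_{i+1} + d_i \), so no correction digit is needed in this sketch.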

**6.4.2 CRBBE-2: the Booth-2 encoding for RB multiplication**

Following the same rationale as in the case of the Booth-1 encoding, two contiguous digits, \( D_{i+1}D_i \), of Booth-2 encoding, which are mapped from the contiguous bits \( b_{2(i+1)+1} b_{2(i+1)} (b_{2(i+1)-1}) \) and \( b_{2i+1} b_{2i} (b_{2i-1}) \) of the normal binary multiplier, are also restricted in their legal combinations, which are shown in Table 6.12.

**Table 6.12 Permissible pairs of contiguous digits \((D_{i+1}, D_i)\) in Booth-2 encoded number**

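<table>
<thead>
<tr>
<th>\( D_{i+1}=2 \)</th>
<th colspan="2">\( D_{i+1}=1 \)</th>
<th>\( D_{i+1}=0 \)</th>
<th>\( D_{i+1}=\overline{0} \)</th>
<th colspan="2">\( D_{i+1}=\overline{1} \)</th>
<th>\( D_{i+1}=\overline{2} \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>\( 2\overline{0} \)</td>
<td>\( 1\overline{0} \)</td>
<td>\( 10 \)</td>
<td>\( 00 \)</td>
<td>\( \overline{0}\,\overline{0} \)</td>
<td>\( \overline{1}0 \)</td>
<td>\( \overline{1}\,\overline{0} \)</td>
<td>\( \overline{2}0 \)</td>
</tr>
<tr>
<td>\( 2\overline{1} \)</td>
<td>\( 1\overline{1} \)</td>
<td>\( 11 \)</td>
<td>\( 01 \)</td>
<td>\( \overline{0}\,\overline{1} \)</td>
<td>\( \overline{1}1 \)</td>
<td>\( \overline{1}\,\overline{1} \)</td>
<td>\( \overline{2}1 \)</td>
</tr>
<tr>
<td>\( 2\overline{2} \)</td>
<td>\( 1\overline{2} \)</td>
<td>\( 12 \)</td>
<td>\( 02 \)</td>
<td>\( \overline{0}\,\overline{2} \)</td>
<td>\( \overline{1}2 \)</td>
<td>\( \overline{1}\,\overline{2} \)</td>
<td>\( \overline{2}2 \)</td>
</tr>
</tbody>
</table>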

Most digit pairs in Table 6.12 have a positive-negative or negative-positive format. Those that do not can be easily changed to one of these formats, except for the duplets $12$, $11$ and $10$ in the column \( D_{i+1}=1 \), and the duplets $\overline{1}\,\overline{0}$, $\overline{1}\,\overline{1}$ and $\overline{1}\,\overline{2}$ in the column \( D_{i+1}=\overline{1} \). The duplets $12$ and $\overline{1}\,\overline{2}$ can be rewritten as $2\overline{2}$ and $\overline{2}2$, respectively, to form the required format while maintaining the same compound coefficient value. Similarly, the duplets $10$ and $\overline{1}\,\overline{0}$ can be rewritten as $1\overline{0}$ and $\overline{1}0$, respectively. However, the duplets $11$ and $\overline{1}\,\overline{1}$, which represent the compound coefficients for the RB multiples of $5M$ and $-5M$, respectively, cannot be easily reformatted to any of the desirable positive-negative pairs. To facilitate high speed partial product generation in hardware, the hard RB multiple of $5M$ needs to be calculated in advance by a dedicated circuit.

The duplets in columns $D_{i+1}=0$ and $D_{i+1}=\overline{0}$ do not have the desirable positive-negative format either. These cases are trivial as the required format can be obtained by replacing $0$ with $\overline{0}$ and $\overline{0}$ with $0$ in the $D_{i+1}$ position. It does not make the circuit design more complex if we treat the outputs of the Booth encoders as the absolute value of the compound coefficient ($r^+, r^-$) and use the sign of $D_{i+1}$ to determine if it belongs to the positive-negative or negative-positive format, similar to what has been done for CRBBE-1 encoding. Table 6.13 shows the reformatted duplets for the proposed Covalent Redundant Binary Booth-2 Encoding (CRBBE-2). The duplets $11$ and $\overline{1}\,\overline{1}$, which correspond to the two hard multiples, are marked by "*" to indicate that they need to be preprocessed independently.

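Table 6.13 Proposed Covalent RB Booth-2 encoded duplets ($D_{i+1}D_i$)

| $D_{i+1}=2$ | $D_{i+1}=1$ | $D_{i+1}=1$ | $D_{i+1}=0$ | $D_{i+1}=\overline{0}$ | $D_{i+1}=\overline{1}$ | $D_{i+1}=\overline{1}$ | $D_{i+1}=\overline{2}$ |
|---|---|---|---|---|---|---|---|
| $2\overline{0}$ | $1\overline{0}$ | $10 = 1\overline{0}$ | $00 = \overline{0}0$ | $\overline{0}\,\overline{0} = 0\overline{0}$ | $\overline{1}0$ | $\overline{1}\,\overline{0} = \overline{1}0$ | $\overline{2}0$ |
| $2\overline{1}$ | $1\overline{1}$ | $11$* | $01 = \overline{0}1$ | $\overline{0}\,\overline{1} = 0\overline{1}$ | $\overline{1}1$ | $\overline{1}\,\overline{1}$* | $\overline{2}1$ |
| $2\overline{2}$ | $1\overline{2}$ | $12 = 2\overline{2}$ | $02 = \overline{0}2$ | $\overline{0}\,\overline{2} = 0\overline{2}$ | $\overline{1}2$ | $\overline{1}\,\overline{2} = \overline{2}2$ | $\overline{2}2$ |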

From Table 6.13, the coefficient reordering criterion is given by (6.17):

$$\left(r^+ , r^- \right) = \begin{cases} 
(4|D_{i+1}|, |D_i|) & \text{if } D_{i+1} \text{ is positive, except } 11 \\
(|D_i|, 4|D_{i+1}|) & \text{if } D_{i+1} \text{ is negative, except } \overline{1}\,\overline{1}
\end{cases}$$  \hspace{1cm} (6.17)

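
The reformatting rules and the ordering criterion for CRBBE-2 can be exercised at the arithmetic level with the Python sketch below. It assumes that the hard multiple is delivered as the pair $(5, 0)$ or $(0, 5)$ by a separate circuit, and that a zero in the $D_{i+1}$ position takes the sign opposite to $D_i$; the names `booth2` and `crbbe2` are hypothetical.

```python
from itertools import product


def booth2(msb, mid, overlap):
    """Booth-2 digit, per (6.5)."""
    return -2 * msb + mid + overlap


def crbbe2(b3, b2, b1, b0, bm1):
    """(r_plus, r_minus) for D_{i+1} from b3 b2 (b1) and D_i from b1 b0 (bm1)."""
    hi, lo = booth2(b3, b2, b1), booth2(b1, b0, bm1)
    if abs(hi) == 1 and hi == lo:
        # Hard multiple +5M or -5M: precomputed separately (assumption of this sketch).
        return (5, 0) if hi > 0 else (0, 5)
    if abs(hi) == 1 and abs(lo) == 2 and (hi > 0) == (lo > 0):
        hi *= 2                                   # rewrite 12 as 2(-2), since 4 + 2 = 8 - 2
    # A zero upper digit is given the sign opposite to D_i (b1 == 0 means D_i >= 0).
    hi_negative = hi < 0 or (hi == 0 and b1 == 0)
    if hi_negative:
        return (abs(lo), 4 * abs(hi))
    return (4 * abs(hi), abs(lo))


for bits in product((0, 1), repeat=5):
    b3, b2, b1, b0, bm1 = bits
    r_plus, r_minus = crbbe2(*bits)
    assert r_plus - r_minus == -8 * b3 + 4 * b2 + 2 * b1 + b0 + bm1
```

The loop confirms that, for all 32 bit patterns, the compound coefficient equals the Booth-4 digit of (6.7) while using only the magnitudes 0, 1, 2, 4, 5 and 8.
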
6.4.3 CRBBE-1.5: A mixture of CRBBE-1 and CRBBE-2

The proposed CRBBE-1 and CRBBE-2 algorithms use either two Booth-1 or two Booth-2 encoders to generate one compound RB partial product. To avoid the hardware needed to generate the hard RB multiples of $5M$ and $-5M$, it is possible to use one Booth-1 encoder and one Booth-2 encoder to generate the compound coefficient. This compromise is called CRBBE-1.5. Suppose the binary bits are recoded to a compound coefficient $d_{i+1}d_i$ in a pair of adjacent Booth-1 and Booth-2 encoders, with the Booth-1 encoder generating the more significant digit $d_{i+1}$ and the Booth-2 encoder generating $d_i$. The legal duplets are shown in Table 6.14.

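<table>
<thead>
<tr>
<th>$d_{i+1} = 1$</th>
<th>$d_{i+1} = 0$</th>
<th>$d_{i+1} = \overline{0}$</th>
<th>$d_{i+1} = \overline{1}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$1\overline{0}$</td>
<td>$00$</td>
<td>$\overline{0}\,\overline{0}$</td>
<td>$\overline{1}0$</td>
</tr>
<tr>
<td>$1\overline{1}$</td>
<td>$01$</td>
<td>$\overline{0}\,\overline{1}$</td>
<td>$\overline{1}1$</td>
</tr>
<tr>
<td>$1\overline{2}$</td>
<td>$02$</td>
<td>$\overline{0}\,\overline{2}$</td>
<td>$\overline{1}2$</td>
</tr>
</tbody>
</table>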

Both digits of the duplets in columns $d_{i+1} = 0$ and $d_{i+1} = \overline{0}$ have the same sign. They can be easily converted to the desirable format by rewriting the most significant digit, $d_{i+1}$, from 0 to $\overline{0}$, or vice versa, as was done in CRBBE-1. After this, all duplets have the desirable positive-negative or negative-positive coding, as shown in Table 6.15.

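<table>
<thead>
<tr>
<th>$d_{i+1} = 1$</th>
<th>$d_{i+1} = 0$</th>
<th>$d_{i+1} = \overline{0}$</th>
<th>$d_{i+1} = \overline{1}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$1\overline{0}$</td>
<td>$00 = \overline{0}0$</td>
<td>$\overline{0}\,\overline{0} = 0\overline{0}$</td>
<td>$\overline{1}0$</td>
</tr>
<tr>
<td>$1\overline{1}$</td>
<td>$01 = \overline{0}1$</td>
<td>$\overline{0}\,\overline{1} = 0\overline{1}$</td>
<td>$\overline{1}1$</td>
</tr>
<tr>
<td>$1\overline{2}$</td>
<td>$02 = \overline{0}2$</td>
<td>$\overline{0}\,\overline{2} = 0\overline{2}$</td>
<td>$\overline{1}2$</td>
</tr>
</tbody>
</table>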

Therefore, the swapping criterion of the compound coefficient is given by (6.18), where the sign of the least significant digit $d_i$ is used to identify the positive-negative or negative-positive format.

\[ (r^+, r^-) = \begin{cases} (4|d_{i+1}|, |d_i|) & \text{if } d_i \text{ is negative} \\ (|d_i|, 4|d_{i+1}|) & \text{if } d_i \text{ is positive} \end{cases} \]  
(6.18)
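
The CRBBE-1.5 criterion can be sketched in Python as follows, assuming that the Booth-1 digit is encoded from $b_{i+2}(b_{i+1})$, the Booth-2 digit from $b_{i+1} b_i (b_{i-1})$, and that the sign of $d_i$ is read from the shared bit $b_{i+1}$. The function name `crbbe15` is hypothetical.

```python
from itertools import product


def crbbe15(b2, b1, b0, bm1):
    """(r_plus, r_minus): Booth-1 digit from b2 (b1), Booth-2 digit from b1 b0 (bm1)."""
    mag_hi = b2 ^ b1                  # |d_{i+1}| of the Booth-1 digit
    lo = -2 * b1 + b0 + bm1           # Booth-2 digit d_i, per (6.5)
    if b1 == 1:                       # d_i negative (or a negative zero): no swap
        return (4 * mag_hi, abs(lo))
    return (abs(lo), 4 * mag_hi)      # d_i positive: coefficients are swapped


used = set()
for bits in product((0, 1), repeat=4):
    b2, b1, b0, bm1 = bits
    r_plus, r_minus = crbbe15(*bits)
    assert r_plus - r_minus == -4 * b2 + 2 * b1 + b0 + bm1   # Booth-3 digit of (6.6)
    used.update((r_plus, r_minus))
print(sorted(used))
```

In this sketch the pair of encoders covers every Booth-3 digit using only the multiples 0, $1M$, $2M$ and $4M$, so the hard multiple $3M$ is obtained as $4M - M$.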

Like the NBE-RBPPG method (normal Booth encoding with RB partial product generation), the proposed Covalent Redundant Binary Booth Encoding (CRBBE) algorithm also combines every two NB partial products into one RB partial product, and it generates the RB partial products more efficiently than the NBE-RBPPG method. For a Booth-N multiplier of word length \( K \), the number of RB partial products generated using CRBBE is:

\[ \left\lceil \frac{K}{N} \right\rceil \]

which is the best of all the methods discussed so far in this chapter. In addition, the proposed algorithm does not introduce any correction vector into the summing tree. Therefore, less hardware and a shorter worst case delay are achieved.

6.5 Porting of CRBBE algorithm to RB multiplier architecture

The RB multipliers in this chapter are implemented with the positive-negative coding, which is easier for hardware design.

6.5.1 Circuit implementation of CRBBE encoders and RB PPGs

According to the proposed Covalent Booth encoding algorithm, the RB partial product generators are only responsible for generating the absolute values of the needed multiples, such as \( 1M \) or \( 0M \) in Booth-1 encoding, according to the compound coefficients \( r = (r^+, r^-) \). Therefore, no additional hardware is required to realize the reformatting from 0 to \( \overline{0} \), or vice versa. Tables 6.11, 6.13 and 6.15, and Equations (6.16), (6.17) and (6.18) provide the hints on how to design the CRBBE encoders and their RB partial product generators.

In CRBBE-1 algorithm, the sign of the lower significance digit \( d_i \) is responsible for whether or not to negate the absolute value of the compound coefficient. If it is negative, the generated RB partial product is kept in its original format. If it is positive, the digits \( r^+ \) and \( r^- \) of the compound coefficient \((r^+, r^-)\) are swapped to obtain the negative RB partial product. Fig. 6.8 shows the circuit of a slice of CRBBE-1 which consists of two normal Booth-1 encoders to cogenerate the Booth encoded digits, \( d_{i+1}d_i \) to form the compound coefficient \((r^+, r^-)\). The circuit in each dash rectangle is an independent Booth-1 encoder. The active low swap flag from the common input bits of the two encoders signals the corresponding partial product generator circuit to negate the resultant digit of the RB partial product by swapping the coefficients if it is 0. No swapping is required if it is 1.

**Figure 6.8 CRBBE-1 encoder** (inputs \( b_{i+1} \), \( b_i \), \( b_{i-1} \); outputs \( 1M_{i+1} \), \( 1M_i \) and the swap flag)

Fig. 6.9 shows one digit slice of the partial product generator working together with the CRBBE-1 encoder of Fig. 6.8, where the control signals \( 1M_{i+1} \) and \( 1M_i \) come from \( d_{i+1} \) and \( d_i \), respectively, and \( y_{k-1}, y_k \) are the multiplicand bits. The swapping logic is implemented in CMOS logic style. It is also possible to be implemented in pass transistor logic.
Fig. 6.10 shows the CRBBE-1.5 encoder, which includes a Booth-1 encoder in the upper dash box to generate the coefficient, $r_i^+$, and a Booth-2 encoder in the lower dash box to generate the coefficient, $r_i^-$. The active low swap flag from the MSB input of the Booth-2 encoder signals to the corresponding partial product generator on when to swap the generated coding bits for the RB digit of the partial product.

Fig. 6.11 shows one digit slice of the RB partial product generator for CRBBE-1.5 encoder.
The algorithm of CRBBE-2 requires the recoding of some Booth encoded digit pairs to the desirable positive-negative pairs. The conversion function is implemented in the CRBBE-2 encoders, making the encoder more complex. However, the number of Booth encoders in a multiplier is limited, occupying only a small fraction of the total area. The excess hardware in CRBBE-2 encoders is negligible compared with the amount of hardware required in the partial product summing tree. Thus the same RB partial product generators that are used repetitively can be simplified as much as possible. Fig. 6.12(a) shows the CRBBE encoder circuit which is composed of two adjacent Booth-2 encoders. The lower encoder is fed from the binary bits $b_{2i+1}b_{2i}b_{2i-1}$ of the multiplier, and generates the signals $1m_i, 2m_i$ and a sign bit, $sgn_i$ taken directly from the MSB, $b_{2i+1}$ while the upper encoder is fed from the bits $b_{2i+3}b_{2i+2}b_{2i+1}$, and generates the signals $1m_{i+1}, 2m_{i+1}$ and a sign bit, $sgn_{i+1} = b_{2i+3}$. All these output signals will be reformatted according to Table 6.13 before they are passed to the RB partial product generators.

The reformatting circuit is shown in Fig. 6.12(b). From Table 6.13, when the most significant digit is zero, its sign bit is complemented before it is used as an active high swap flag. Otherwise, the original sign is used as an active high swap flag. An active low swap flag is also generated for the swapping circuit of the RB partial product generator. Therefore, we have (6.18)

\[
\text{swap}_i = (1m_{i+1} + 2m_{i+1}) \cdot \text{sgn}_{i+1} + \overline{(1m_{i+1} + 2m_{i+1})} \cdot \overline{\text{sgn}_{i+1}}
\]

\[
= \overline{(1m_{i+1} + 2m_{i+1}) \oplus \text{sgn}_{i+1}}
\]

\[
= \overline{(1m_{i+1} + 2m_{i+1})} \oplus \text{sgn}_{i+1}
\] (6.18)

The $5M$ signal is generated by (6.19)

\[
5M = \overline{(\text{sgn}_i \oplus \text{sgn}_{i+1})} \cdot 1m_{i+1} \cdot 1m_i
\] (6.19)

It is noted that for each encoder the output signals $2m$ and $1m$ are mutually exclusive. The signal $5M$ is also mutually exclusive to $2m$ but not to $1m$. Such relationship can be exploited to simplify the hardware complexity of the reformatting circuits.

(a) Booth-2 encoders for CRBBE-2
To convert the duplets 12 to 22 or 12 to 22, the output signals of 2m+i and 1m+i from the upper encoder are to be complemented in order to convert 1|1| to 2|2| when the contiguous digits are of the same sign and when both signals 1m+i and 2m+i are active. For all other duplets, the output signals 2m+i and 1m+i retain their original values. The converted signals, 2M+i and 1M+i of the upper encoder are given by (6.20) and (6.21):

\[
2M_{i+1} = \overline{(1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1}))} \cdot 2m_{i+1} + (1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1})) \cdot \overline{2m_{i+1}}
\]

\[
= (1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1})) \oplus 2m_{i+1} \tag{6.20}
\]

\[
1M_{i+1} = \overline{(1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1}))} \cdot 1m_{i+1} + (1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1})) \cdot \overline{1m_{i+1}}
\]

\[
= (1m_{i+1} \cdot 2m_i \cdot (\text{sgn}_i \odot \text{sgn}_{i+1})) \oplus 1m_{i+1} \tag{6.21}
\]
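The encoder and reformatting equations can be exercised exhaustively in software. The following is a minimal sketch, not the circuit of Fig. 6.12: the function names are hypothetical, the 5M condition is read as "both digits have magnitude 1 and the same sign", and the way the two halves of the RB partial product are filled (upper digit weighted by 4, 5M overriding 1M, swap exchanging the halves) is an assumed behavioural model of the partial product generator. Under these assumptions, the sketch checks that the positive part minus the negative part always equals the value of the Booth digit pair.

```python
from itertools import product

def booth2(b_hi, b_mid, b_lo):
    """Radix-4 Booth digit and its (1m, 2m, sgn) control signals."""
    digit = -2 * b_hi + b_mid + b_lo
    return digit, int(abs(digit) == 1), int(abs(digit) == 2), b_hi

def crbbe2_pair(b4, b3, b2, b1, b0):
    """Encode bits b_{2i+3}..b_{2i-1}; return (positive, negative, value).

    positive/negative are the multiples of the multiplicand assigned to the
    two halves of the RB partial product; value is 4*d_{i+1} + d_i.
    """
    d_lo, m1_lo, m2_lo, sgn_lo = booth2(b2, b1, b0)
    d_hi, m1_hi, m2_hi, sgn_hi = booth2(b4, b3, b2)   # shares bit b_{2i+1}

    same = 1 - (sgn_lo ^ sgn_hi)                # XNOR of the two sign bits
    swap = 1 - ((m1_hi | m2_hi) ^ sgn_hi)       # active high swap flag, (6.18)
    five_m = same & m1_hi & m1_lo               # hard multiple select, (6.19)
    x = m1_hi & m2_lo & same                    # common term of (6.20), (6.21)
    M2_hi, M1_hi = x ^ m2_hi, x ^ m1_hi         # reformatted upper signals

    # Assumed PPG model: 5M overrides 1M; the upper digit has weight 4.
    if five_m:
        upper, lower = 5, 0
    else:
        upper, lower = 4 * (2 * M2_hi + M1_hi), 2 * m2_lo + m1_lo
    positive, negative = (lower, upper) if swap else (upper, lower)
    return positive, negative, 4 * d_hi + d_lo

def check_all():
    """True if every 5-bit window reproduces its Booth digit pair value."""
    return all(p - n == v for p, n, v in
               (crbbe2_pair(*bits) for bits in product((0, 1), repeat=5)))

print(check_all())
```

For example, the window `0 1 0 1 1` encodes the duplet 12 (value 6), which the sketch reformats into a positive multiple of 8 and a negative multiple of 2.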

In the partial product generation circuit, the hard multiple 5M is to be generated separately from the other multiples using the circuit of Fig. 6.4. Fig. 6.13 shows one digit of the RB partial product generator for the CRBBE-2 algorithm. The precomputed 5M
multiple bits are gated by the control signal, $5M$ from the encoders. The $1M$ multiplicand bits are controlled by the output $1M$ (the control signals $1M_i$ and $1M_{i+1}$ are collectively called the $1M$ signal) of both encoders as well as the control signal, $5M$ because the control signals $1M$ and $5M$ are not mutually exclusive. Therefore, the RB partial product is $1M$ only when the control signal, $5M$ is inactive. Control signal, $2M$ (comprising the control signals, $2M_i$ and $2M_{i+1}$) is responsible for generating the multiple $2M$ since it is mutually exclusive to $1M$ and $5M$. The logic gates in the input stage of the CRBBE-2 RB partial product generator are realized with complementary CMOS logic style to reduce their loading to the CRBBE-2 encoders because a large number of such input stages are connected in parallel and driven by the outputs of the CRBBE-2 encoders. Hence, strong driving capability is required from the CRBBE-2 encoders. The output stage of the RB partial product generator is implemented in transmission gate style with unity fanout. It drives the CMOS input stage of the redundant binary adder (RBA) in the partial product summing tree.

The CRBBE encoder and its corresponding RB PPG are much simpler than those of the RBSD multiplier. A legitimate comparison for the same number of RB partial products generated should be Booth-2 CRBBE and RB PPG versus Booth-4 RBSD encoder and RB PPG, and Booth-1.5 CRBBE and RB PPG versus Booth-3 RBSD encoder and RB PPG. The CRBBE encoder and the RB PPG circuits are only a little more complex than the NBE-RBPPG encoder of the same radix. This small overhead is compensated by the simplified RB summing tree due to the elimination of the correction vector. In some cases, it can even save a layer of RBAs.

6.5.2 Redundant Binary Adders

The redundant binary adder (RBA) is the most frequently used component in an RB multiplier. It adds two RB digits and outputs one RB digit, so its compression ratio is 2:1. Thus, the RBA acts like a 4-2 compressor in an NB multiplier. Besides, the RBA summing tree has simpler interconnects than the conventional NB Wallace tree. This structural regularity can be translated into pronounced power-delay and area efficiency in deep-submicron design.

There are six possible input combinations for an RBA, which adds two RB digits, \(a_i\) and \(b_i\), and generates the intermediate sum \(s_i\) and intermediate carry \(c_i\) before it outputs the final sum \(d_i\). The carry-free adding rule for the RBA is summarized as follows. In an array of RBAs that adds two RB numbers into one, the \(i\)-th RBA computes the \(i\)-th digits, \(a_i\) and \(b_i\), of the two input numbers. To avoid carry propagation, the RBA prompts its adjacent RBA at the next significant digit position through the signal \(h_i\). The hinting flag, \(h_i\), outputs 0 when the two input digits are both non-negative; otherwise it outputs 1. A hint of 0 indicates that the current addition will probably produce a carry of 1 while a hint of 1 means that the current addition will probably produce a carry of \(\bar{1}\). The hinting flag is generated according to (6.22). The cases of \(0 + 0\) and \(1 + \bar{1}\), whose sum is zero, can be classified under either \(h_i = 0\) or \(h_i = 1\). Their classification in (6.22) is merely for the ease of hardware implementation.

\[
h_i = \begin{cases}
0 & \text{if addition } \in \{0 + 0,\ 0 + 1,\ 1 + 1\} \\
1 & \text{if addition } \in \{1 + \bar{1},\ 0 + \bar{1},\ \bar{1} + \bar{1}\}
\end{cases} \tag{6.22}
\]

The \( i \)-th RBA also receives \( h_{i-1} \) from its neighboring RBA, i.e., the \((i-1)\)-th RBA, to determine its intermediate sum \( s_i \) and carry \( c_i \). If the \( i \)-th RBA receives \( h_{i-1} = 0 \), hinting that the \((i-1)\)-th addition may produce a carry of 1, and the current addition causes the sum to be 1, it generates an intermediate sum \( s_i \) of \( \bar{1} \) and an intermediate carry \( c_i \) of 1. Similarly, if it receives the hint \( h_{i-1} = 1 \) and the current addition causes the sum to be \( \bar{1} \), an intermediate sum \( s_i \) of 1 and an intermediate carry \( c_i \) of \( \bar{1} \) are generated. If the current addition causes the sum to be 0, the hint \( h_{i-1} \) is immaterial. The final sum \( d_i \) is obtained by adding the current intermediate sum \( s_i \) and the intermediate carry \( c_{i-1} \) from the \((i-1)\)-th RBA. As the carry \( c_i \) is independent of \( c_{i-1} \), the addition is carry free. The adding rules are listed in Table 6.16. The inputs are \( h_{i-1} \), \( a_i \) and \( b_i \) while the outputs are \( h_i \), \( c_i \) and \( s_i \).

<table>
<thead>
<tr>
<th>\( a_i \) and \( b_i \)</th>
<th>\( h_{i-1} \)</th>
<th>\( h_i \)</th>
<th>\( c_i \)</th>
<th>\( s_i \)</th>
</tr>
</thead>
<tbody>
<tr><td>0 and 0</td><td>Any</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>1 and \( \bar{1} \)</td><td>Any</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>0 and \( \bar{1} \)</td><td>0</td><td>1</td><td>0</td><td>\( \bar{1} \)</td></tr>
<tr><td>0 and \( \bar{1} \)</td><td>1</td><td>1</td><td>\( \bar{1} \)</td><td>1</td></tr>
<tr><td>0 and 1</td><td>0</td><td>0</td><td>1</td><td>\( \bar{1} \)</td></tr>
<tr><td>0 and 1</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1 and 1</td><td>Any</td><td>0</td><td>1</td><td>0</td></tr>
<tr><td>\( \bar{1} \) and \( \bar{1} \)</td><td>Any</td><td>1</td><td>\( \bar{1} \)</td><td>0</td></tr>
</tbody>
</table>
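The adding rules can be checked with a short behavioural model. The sketch below is illustrative only and is not the circuit of any RBA cell in Fig. 6.14: the function names are hypothetical, digits are held as the integers -1, 0 and 1 rather than in a two-bit coding, and the hint and carry entering the least significant position are assumed to be 0. It applies the hinting flag of (6.22) and the intermediate sum and carry rules, then verifies that the result is numerically correct and that every output digit stays within the RB digit set.

```python
from itertools import product

def rb_add(a, b):
    """Add two RB numbers given as digit lists (LSB first, digits in {-1,0,1}).

    Returns the sum as a digit list one position longer than the inputs.
    """
    h_prev, c_prev = 0, 0          # assumed boundary values below digit 0
    d = []
    for a_i, b_i in zip(a, b):
        t = a_i + b_i
        h = 1 if (a_i < 0 or b_i < 0) else 0        # hinting flag, (6.22)
        if t == 1:
            c, s = (1, -1) if h_prev == 0 else (0, 1)
        elif t == -1:
            c, s = (0, -1) if h_prev == 0 else (-1, 1)
        else:                                        # t in {-2, 0, 2}
            c, s = t // 2, 0
        d.append(s + c_prev)       # final sum digit d_i = s_i + c_{i-1}
        h_prev, c_prev = h, c
    d.append(c_prev)               # carry out of the top position
    return d

def rb_value(digits):
    """Numerical value of an RB digit list (LSB first)."""
    return sum(x * (1 << k) for k, x in enumerate(digits))

def check(n=3):
    """Exhaustively test all pairs of n-digit RB numbers."""
    digs = (-1, 0, 1)
    for a in product(digs, repeat=n):
        for b in product(digs, repeat=n):
            d = rb_add(list(a), list(b))
            if rb_value(d) != rb_value(a) + rb_value(b):
                return False
            if any(x not in digs for x in d):
                return False
    return True

print(check())
```

Because each carry depends only on the digits of its own position and the hint from the position below, the loop body could be evaluated for all positions in parallel, which is the property the RBA tree relies on.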

The RBA cells shown in Fig. 6.14 are all designed to be in compliance with the above redundant binary adding rules. The differences are that they use different coding formats and circuit optimization techniques. RBA2 uses the sign-magnitude coding for adding the two RB numbers expressed in the sign values, \( a_i^s \) and \( b_i^s \), and the absolute values \( a_i^m \) and \( b_i^m \) to produce the sum bits, \( d_i^s \) and \( d_i^m \). The positive-negative complement coding is adopted in RBA5. It is worth noting that this circuit is easier to optimize due to its symmetry. RBA1, RBA3 and RBA4 use the positive-negative coding. RBA1 and RBA3 use the real carry-in digit as one of the input signals while RBA4 uses the hinting flag \( h_{i-1} \) and other signals to communicate the carry-in information from the adjacent RBA of one
digit lower in significance. Since RBA4 is an improved version with positive-negative coding, it is used in our RB multiplier.

To simplify the design of the positive-negative coding RBA, the input code (1,1) is not fed into the RBA directly. It is changed to (0,0) using the circuit in Fig. 6.15. This circuit is used for every digit input of the RBA summing tree to eliminate the (1,1) input.

Figure 6.15 Circuit to inhibit (1,1) input for positive-negative coding

The circuit implementation of RBA4 is shown in Fig. 6.16.

Figure 6.16 Circuit implementation of RBA4

6.5.3 RB to NB Conversion

The RB-to-NB conversion algorithm is derived from the basic RB number definition. Since we use the positive-negative coding, the converted value $Z$ is shown in (6.23):

$$Z = F^+ - F^-$$  \hspace{1cm} (6.23)

If it is to be converted to the two’s complement format, as in most multipliers, the subtraction is replaced by adding the bitwise complement of $F^-$ plus one, and we have (6.24):

\[ Z = F^+ + \overline{F^-} + 1 \]  \hspace{1cm} (6.24)

If it is to be converted to an unsigned number, as required by the IEEE floating point standard, the final result cannot be a negative number. It is equivalent to treating \( F^+ \) and \( F^- \) as two's complement numbers. Therefore, (6.24) is still valid.

The reverse conversion is optimally implemented by a carry select method [MAK96], which comprises a simple and high-speed carry propagate circuit. We denote each digit in the final RB partial product \((F^+, F^-)\) as \((f^+_k, f^-_k)\), and the carry-in from the next lower order digit as \(c_{k-1}\). The \(k\)-th bit of the final sum \(z_k\) and carry-out \(c_k\) in NB format are derived in (6.25):

\[
\begin{aligned}
    z_k &= f^+_k \oplus \overline{f^-_k} \oplus c_{k-1} \\
    c_k &= f^+_k \cdot \overline{f^-_k} + f^+_k \cdot c_{k-1} + \overline{f^-_k} \cdot c_{k-1} \\
    c_0 &= 1
\end{aligned}
\]  \hspace{1cm} (6.25)

In the positive-negative RB coding, \(f^+_k\) and \(f^-_k\) can never become "1" simultaneously. Therefore, \(c_k\) can be rewritten as:

\[
    c_k = f^+_k \cdot \overline{c_{k-1}} + \overline{f^-_k} \cdot c_{k-1}
\]  \hspace{1cm} (6.26)

Using this carry select method, an \(N\)-bit converter can be implemented as shown in Fig. 6.17. Starting from one digit per group, the digits beyond the second digit are grouped in such a way that the number of digits in the group is incremented by one progressively.
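The conversion can be illustrated with a bit-level model. The following sketch is a plain ripple evaluation of the sum and carry recurrences, not the grouped carry select circuit of Fig. 6.17; the function names are hypothetical, and the complemented negative bit and the initial carry-in of 1 follow the usual two's complement subtraction \(F^+ + \overline{F^-} + 1\), which is the reading assumed here for (6.24) to (6.26). The carry is written in multiplexer form: the incoming carry selects between \(f^+_k\) and the complement of \(f^-_k\).

```python
from itertools import product

def rb_to_nb(f_plus, f_minus):
    """Convert positive-negative coded RB digits (LSB first) to NB bits.

    Returns the N-bit two's complement result as a list of bits, LSB first.
    """
    carry = 1                          # initial carry-in, the "+1" of (6.24)
    z = []
    for fp, fm in zip(f_plus, f_minus):
        assert not (fp and fm), "(1,1) is not a valid digit code"
        nfm = 1 - fm                   # complemented negative bit
        z.append(fp ^ nfm ^ carry)     # sum bit, as in (6.25)
        carry = nfm if carry else fp   # carry select form, as in (6.26)
    return z

def check(n=4):
    """Exhaustively compare against F+ - F- modulo 2**n."""
    codes = ((0, 0), (1, 0), (0, 1))   # digits 0, 1 and -1
    for digits in product(codes, repeat=n):
        fp = [d[0] for d in digits]
        fm = [d[1] for d in digits]
        want = sum((p - m) << k for k, (p, m) in enumerate(digits)) % (1 << n)
        got = sum(bit << k for k, bit in enumerate(rb_to_nb(fp, fm)))
        if got != want:
            return False
    return True

print(check())
```

In a carry select implementation, each group would evaluate this recurrence twice, once for each possible incoming carry, and the true carry would then select the valid result.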

Figure 6.17 Circuit implementation of RB-NB converter
6.5.4 Proposed CRBBE 54 x 54-bit multiplier

Fig. 6.18 shows the block diagram of the proposed 54 x 54-bit RB multiplier architecture using the novel CRBBE-2 as the multiplier encoder. The partial product generation stage using the CRBBE-2 encoding algorithm is shown on the left hand side of the figure. In the top right hand corner is the RBA summing tree stage, while the RB-NB converter stage is in the bottom right hand corner. As mentioned before, this architecture is more hardware efficient than the other architectures to be shown later. There are 14 partial products generated directly with no correction vector. The tree structured RBA array has four levels for adding the 14 partial products into one RB number. In the RB-NB conversion stage, the final RB number with 108 digits is converted to an NB number with the same number of bits. The conversion is accomplished in segments with different timings.
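The number of RBA levels follows from the 2:1 compression ratio of the RBA. As a small illustration (the function name is hypothetical and the model ignores any wiring detail of the actual tree), each level halves the number of RB numbers, rounding up when one operand is left unpaired:

```python
import math

def rba_tree_levels(n_partial_products):
    """Number of 2:1 RBA levels needed to reduce n RB numbers to one."""
    levels = 0
    n = n_partial_products
    while n > 1:
        n = math.ceil(n / 2)     # each level pairs up the remaining RB numbers
        levels += 1
    return levels

print(rba_tree_levels(14))       # 14 -> 7 -> 4 -> 2 -> 1
```

For 14 RB partial products this gives the four levels stated above.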

![Block Diagram of Proposed CRBBE 54 x 54-bit RB Multiplier](image)

**Figure 6.18 Proposed CRBBE 54 x 54-bit RB multiplier**

The other architectures of 54 x 54-bit multipliers using the NBE-RBPPG and RBSD encoders are presented for comparison. To be fair, these architectures are chosen with the same or almost the same number of RB partial products from the RB PPG. In the NBE-RBPPG architecture, the partial product generation stage uses the NB Booth-2 encoder. The other configurations are the same as in the CRBBE architecture. Fig. 6.19 shows the block diagram of a referenced 54×54-bit NBE-RBPPG RB multiplier architecture.

![Diagram of NBE-RBPPG 54×54-bit RB multiplier](image)

**Figure 6.19** NBE-RBPPG 54×54-bit RB multiplier

The RBSD architecture, in which the Booth-4 encoder is used, is shown in Fig. 6.20. Altogether 14 partial products are generated directly and no correction vector is needed. It is also noted that RBPP14 is generated by one RB Booth-2 circuit instead of a Booth-4 circuit to reduce the amount of hardware.
6.6 Simulation results

All the multiplier circuits stated in Section 6.5.4 are simulated in HSIM 3.0 using TSMC 0.18μm technology under a supply voltage of 1.8V. The simulation speed is set as "HSIMSPEED=0", which is the most accurate mode. A set of 4096 randomly generated input vectors is applied to the circuits under simulation at an input rate of 100MHz. All inputs of the multiplier are driven through buffers and all outputs are loaded with inverters to ensure that the circuits work in the same environment as in actual application. For each multiplier, the delay is calculated from the earliest transition of the 108 inputs to the latest transition of the 108 outputs. The worst case delay is the longest of all the delays computed in the simulation process. The simulation results are shown in Table 6.17. The proposed CRBBE multiplier outperforms the other two RB multipliers in power, delay and power-delay product. Although the transistor count of the CRBBE multiplier is a little higher than that of the NBE-RBPPG multiplier, almost all the additional transistors are located in the input stage of the PPG circuit. This part of the circuit consumes a relatively small fraction of the power of the entire multiplier because there are nearly no glitches in the PPG input stage to dissipate extra power. Moreover, the RBA tree of the CRBBE multiplier has one row of RBAs fewer than that of the NBE-RBPPG multiplier, which accounts for an overall power saving of 3.8%. The RBSD multiplier uses many more transistors than the other two multipliers due to the inefficient composition of its PPG. The delay of the CRBBE multiplier is also the shortest because its PPG stage is faster than that of the RBSD multiplier and its RB-NB converter has a shorter critical path than that of the NBE-RBPPG multiplier. Therefore, its power-delay product is improved by 6.3% and 10.4% over the NBE-RBPPG and RBSD multipliers, respectively.

Table 6.17 Simulation results of RB 54×54-bit multipliers

| Multiplier | Transistor count | Power (mW) @ 100MHz | Delay (ns) | PDP (pJ) |
|------------|------------------|---------------------|------------|----------|
| NBE-RBPPG | 82k | 26.2 | 4.62 | 121.0 |
| RBSD | 112k | 27.8 | 4.55 | 126.5 |
| CRBBE | 86k | 25.2 | 4.50 | 113.4 |
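The power-delay products and the relative improvements can be recomputed directly from the power and delay columns of Table 6.17. The short sketch below does so; the dictionary layout and function names are illustrative only. Since the tabulated PDP values are rounded to one decimal place, percentages derived from unrounded products may differ from the quoted figures in the last digit.

```python
results = {
    # name: (power in mW at 100 MHz, worst case delay in ns), from Table 6.17
    "NBE-RBPPG": (26.2, 4.62),
    "RBSD": (27.8, 4.55),
    "CRBBE": (25.2, 4.50),
}

def pdp(name):
    """Power-delay product in pJ (mW * ns = pJ)."""
    power_mw, delay_ns = results[name]
    return power_mw * delay_ns

def improvement(reference, proposed="CRBBE"):
    """Relative PDP reduction of `proposed` over `reference`, in percent."""
    return 100.0 * (pdp(reference) - pdp(proposed)) / pdp(reference)

for name in results:
    print(f"{name}: PDP = {pdp(name):.1f} pJ")
print(f"PDP improvement over NBE-RBPPG: {improvement('NBE-RBPPG'):.2f} %")
print(f"PDP improvement over RBSD: {improvement('RBSD'):.2f} %")
```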

6.7 Summary

A new redundant binary multiplier is proposed based on the newly developed Covalent Redundant Binary Booth Encoding (CRBBE) algorithm. The proposed algorithm fully exploits the characteristics of the Booth encoded numbers to overcome the problem of generating hard multiples and achieves a compatible reduction of RB partial products without inducing any correction vector. Traditionally, each Booth encoded digit serves as a signed digit coefficient for the generation of a row of NB partial products and two rows of NB partial products are used to form a single RB partial product. In our proposed algorithm, adjacent coefficients form a positive-negative pair or the disqualified pair can be readily reformatted into a legal pair. Thus the RB partial products using the positive-negative coding are spontaneously generated with our CRBBE encoding. In fact, two RBSD partial products blend nicely into a RB partial product by bonding two Booth encoders covalently with an overlapping input bit to form a compound coefficient for its generation. Besides, it has the advantage that no additional RB correction vector is needed compared with the NBE-RBPPG multiplier. Thus the RBA tree is simplified with lower power dissipation and shorter worst case delay. Compared with the RBSD multiplier, which is also used to address the hard multiple problem, the proposed CRBBE
multiplier has only half the number of partial products to be summed if the same radix Booth encoders are used. Conversely, for the same number of RB partial products, the CRBBE encoder, say Booth-2 CRBBE, and its corresponding RB PPG are much simpler than those of the RBSD multiplier, say the Booth-4 RBSD encoder and RB PPG, which use many high fan-in gates. The simulation results show that the CRBBE multiplier consumes less power and computes faster than the NBE-RBPPG and RBSD multipliers.
Chapter 7

Conclusions and Future Research

7.1 Conclusions

In this thesis, we present several new high performance digital arithmetic cells and macrocells that focus on low power and low voltage applications, where the novelty spans from the circuit level to the algorithmic level.

In the topic of full adder cell design, two novel 1-bit adder cells consisting of the XOR/XNOR, sum and carry out subcircuits are proposed. The pass logic design style is used to efficiently generate the XOR and XNOR functions simultaneously, and a carry output with good drivability is generated by a novel complementary CMOS style circuit with regular layout. In addition, the last-stage inverter decouples the output and input to improve the driving capability. Despite having a higher transistor count than the recently reported designs, the two cells have been shown to be highly power efficient over a wide supply voltage range. The simulation results also demonstrate that the improved Module 1 of the Hybrid-2 cell has made it the most power-efficient cell among a number of state-of-the-art 1-bit adder cells over a wide range of supply voltages. The energy efficiency of the Hybrid-2 cell is most pronounced at sub-1V operation. The layout of the Hybrid-2 cell shows that it is also area efficient.

An optimization procedure is proposed to size the transistors of the full adder cells in order to allow a fair comparison of different designs obtained from the published literature, because the transistor sizing for optimal performance is technology dependent. The optimization procedure adopts a greedy sweeping strategy to search for the optimized transistor sizes over a number of iterations until a predefined termination criterion based on the metric being optimized is reached. We have proven that the proposed algorithm is convergent. Besides, we have also proposed a reasonably simple architecture to simulate the adder cell in an environment realistic to its actual deployment in the most frequently used parallel multiplier structure.

More complicated arithmetic building blocks have also been investigated. The architectures of 4-2 and 5-2 compressors are analyzed and different CMOS logic style circuit implementations of their constituent modules are explored. A novel 5-2 compressor architecture of 4Δ delay is proposed. In order to realistically assess and compare the figures of merit of different configurations of 4-2 and 5-2 compressors at various supply voltages, new simulation environments are established to ensure that the measured performances are still sustainable when these cells are integrated in a carry save adder (CSA) tree. The simulation results show that the 4-2 and 5-2 compressors constructed with the novel XOR* cell are able to function down to 0.6V, and feature high speed and low power characteristics. Our proposed 5-2 compressor architecture outperforms all the other architectures over the range of voltages simulated, particularly when it is configured with the proposed circuits for the XOR-XNOR and the carry generator modules. Better performance against other architectures is also attained almost irrespective of the logic styles used for the circuit implementation of their constituent modules. In summary, a library of highly power efficient 4-2 and 5-2 compressor cells based on the advanced CMOS process technology has been developed for implementing high speed and low power multipliers operable at ultra low supply voltage.

Moving up the hierarchy of complex arithmetic circuits, a new algorithm for the design of a VLSI circuit for scalar product evaluation has been developed. The algorithm has been ported to a novel full bit parallel architecture of a scalar product macrocell featuring low interconnect complexity, improved power efficiency and highly efficient VLSI area utilization. More importantly, the layout regularity and scalability enhance its performance superiority in the deep submicron regime well above the conventional VLSI design of a vector processing unit for scalar product computation. Some auxiliary circuits, including a unique delay detection circuit, are proposed to enable accurate delay measurement of the core under the constraint of limited IO pins. The floor-planning of the proposed architecture exploits the binary data locality through the border between the multiplication and accumulation operations based on a full combinational logic implementation. A comparison with the layout of the conventional vector multiplier shows that our proposed decomposition algorithm has led to a more compact, regular and modular physical design. A theoretical model for estimating the area and delay has been formulated. Compared with the conventional architecture of the same capacity, the estimation shows that our design of a 16-bit scalar product multiplier on input vectors of 16 elements achieves a saving of 36.5% of silicon area, up to 59% increase in area usage efficiency and 23% decrease in interconnect wire delay. The average power consumption, worst case delay and power efficiency of our post-layout circuit surpass even those of the pre-layout circuit of the conventional architecture when these circuits are simulated using Synopsys Nanosim over supply voltages from 0.7V to 3.3V based on Chartered CSM 0.18μm CMOS technology. The post-layout simulation shows that the worst case delay of the core at 1.8V is 6.92ns. At 50MHz input data rate, the power dissipation is 65.0mW. The relatively small deviations between the pre- and post-layout simulation results validate the inference of the theoretical estimation that the key contributors to the delay and power reduction of our proposed architecture are the shorter and balanced global and local interconnect wires, a dominant factor in design consideration for VLSI circuits fabricated in the deep submicron technology.

In the topic of redundant binary multiplication, a new RB multiplier is proposed based on our newly developed Covalent Redundant Binary Booth Encoding (CRBBE) algorithm. The proposed algorithm fully exploits the characteristics of the Booth encoded numbers to overcome the problem of generating hard multiples and achieves a compatible reduction of RB partial products without inducing any correction vector. Traditionally, each Booth encoded digit serves as a signed digit coefficient for the generation of a row of NB partial products and two rows of NB partial products are used to form a single RB partial product. In our proposed algorithm, adjacent coefficients form a positive-negative
pair or the disqualified pair can be readily reformatted into a legal pair. Thus the RB partial products using the positive-negative coding are spontaneously generated with our CRBBE encoding. In fact, two RBSD partial products blend nicely into an RB partial product by bonding two Booth encoders covalently with an overlapping input bit to form a compound coefficient for its generation. Besides, it has the advantage that no additional RB correction vector is needed compared with the NBE-RBPPG multiplier. Thus the RBA tree is simplified with lower power dissipation and shorter worst case delay. Compared with the RBSD multiplier, which is also used to address the hard multiple problem, the proposed CRBBE multiplier has only half the number of partial products to be summed if the same radix Booth encoders are used. Conversely, for the same number of RB partial products, the CRBBE encoder and its corresponding RB PPG are much simpler than the RBSD encoder and its RB PPG, which use many high fan-in gates. The simulation results show that the CRBBE multiplier consumes less power and computes faster than the NBE-RBPPG and RBSD multipliers.

In summary, we have reported the major contributions from various levels of abstraction in VLSI design. Besides, we have also presented important contributions on the algorithms and circuits for optimization, simulation and testing methods, which we envision will arouse further interest in the research of low power design methodologies.

7.2 Future research

Based on the research presented in this thesis, we will now suggest several potential relevant topics for further research.

7.2.1 Exploitation of the proposed full adders in fixed multiplierless digital filter implementations

Since full adder cells are the basic building blocks of many arithmetic function units, the power-efficient Hybrid-2 cell can be used to construct full-adder based digital systems, such as the merged arithmetic based FIR architecture proposed by our research team members in [YE03, YE04]. In this architecture, the multiplication operations are decomposed into partial product accumulations, which are merged with the additions and optimized together. If the carry-save adders that sum all the partial products and the final carry propagate adder are implemented with our proposed Hybrid-2 cell, better performance can be achieved, and it would be of interest to study how much the performance can be boosted by our proposed full adder.

7.2.2 Exploitation of 4-2 and 5-2 compressors in the design of multiply-accumulator

A multiply-accumulator (MAC) essentially performs a normal multiplication of two numbers followed by the addition of a third number to the result. In digital image and audio signal processing, the data widths are usually integer powers of two. The most common operand widths encountered in these multimedia applications are 8, 16 and 32. If the partial product reduction is accomplished entirely with 4-2 compressors, an additional layer of full adders or 4-2 compressors has to be used, which increases the delay and power consumption. With our optimized 5-2 compressor, more input terminals are available to contain the excess input bits at the last layer while still maintaining the same regularity as an architecture composed of only 4-2 compressors.

7.2.3 Integration of the scalar product macrocell in special purpose digital signal processors

As stated in Chapter 5, our optimized scalar product macrocell is capable of processing 16 multiplications and accumulations with the operands of 16-bit word length. It can be configured to work in either filter mode or normal scalar product multiplier mode. Therefore it is very powerful in parallel processing of digital image pixels, video and audio data stream. To maximize this capability, it can be integrated as a dedicated arithmetic unit into a single instruction multiple data (SIMD) processor or new stream based processor architecture for enhanced performance in multimedia digital processing.

7.2.4 Design, floor planning and layout of double precision floating point multiplier with the proposed CRBBE algorithm

In IEEE 754 standard, a double precision floating point multiplication requires two 54-bit operands as multiplicand and multiplier, where 53 bits are visible by the computer users and another bit is required to ensure the rounding error is small enough not to affect the user visible bits in the final result. Thus, many proposed methods for multiplication are applied to the design of 54x54-bit multiplier. Our proposed CRBBE algorithm performs better from the prelayout simulation than other methods to realize this multiplication. It is envisaged that similar performance on real silicon can be achieved with careful design, floor planning and layout of the CRBBE double precision floating point multiplier. It would be of significant commercial value to create a fully characterized floating point RB multiplier IP core.

7.2.5 Scalable integer multiplier using the proposed CRBBE algorithm

Normally the bit-width of a digital multiplier is fixed. Therefore, the multiplier is often designed for the maximum operand width. The entire n×n-bit multiplier circuit is activated to perform the operation even if the operands are less than n bits wide. This severely limits the versatility of the multiplier in some applications. For example, in an adaptive filter, the precision of the operands and the coefficients may change dynamically to suit the resolution of the front-end circuitry. It would be interesting to develop a scalable multiplier, which can be reconfigured to adapt to varying input data lengths. Since our proposed CRBBE algorithm generates the partial products without any additional correction vector, it is a viable candidate for implementing a scalable integer multiplier based on RB arithmetic. An important criterion for scalability is the ease of composing a larger word length multiplier from several smaller word length multipliers. If it is implemented with other RB multipliers, the extension of bit width is hampered by the co-generated correction vectors. If it is implemented with NB multipliers, the processing of signed numbers will use more multiplexers and connecting wires to select and route the partial products. Therefore, the elimination of correction vectors due to the CRBBE algorithm appears to be a promising feature in fulfilling the succinctness of inter-module connectivity for the scalable multiplier.

7.2.6 Digital filters and correlators using the proposed CRBBE algorithm

One good way to realize more versatile digital filters and correlators is to use the CRBBE algorithm for scalable tap multipliers as suggested in Section 7.2.5. Due to the scalability offered by the algorithm, a change in filter coefficients or input operand width can be addressed by conveniently reconfiguring the sizes of the filter taps. Thus the design turnaround period for a similar filter order under different application environments can be greatly shortened.

7.3 Design challenges in future ultra deep submicron technologies

As technology scaling continues to advance with shrinking feature sizes, such as 90nm, 65nm, 45nm and below, it will be met by the perpetual quest for much higher performance of systems on integrated circuits. This is practical not only because a higher operating frequency is easier to achieve but also because of the denser integration. As a result, low power design techniques will be much more pressing than ever. Nevertheless, there will be unprecedented challenges in leakage power and noise as the supply voltage Vdd scales down to below 1V in 65nm technology. It is estimated that at the end of 2010, the supply voltage will reach 0.6V [ITRS03]. Although the threshold voltage Vth will be proportionally scaled down to prevent the performance from degrading excessively, the narrower margin between Vdd and Vth will cause more standby power and less leeway for compensation at the circuit design level [MOR04]. At the same time, the interconnect wire problem arises when designers pack more functions on a single chip with increasing bus width and a higher number of global interconnect wires. The ratio of wire performance to gate performance will worsen [HO01]. Therefore, in the sub-100nm technology era, designers will need more effort to fulfill the demands for both high performance and low power. The low power and low voltage design techniques in this dissertation will continue to play an important role in the near future. To continue to benefit from the economy of scale of CMOS fabrication technology, it is also important to explore new research directions to deal with the different types of noise in future advanced deep submicron technologies, such as cross talk, leakage, supply noise and process variation, which are obstacles to achieving the desired level of noise immunity without severely compromising the improvement already achieved in performance and energy efficiency.
Author’s Publications

Journal Papers


Conference Papers


Bibliography


[BUI00(1)] H. T. Bui, A. K. Al-Sheraidah, and Y. Wang, “New 4-transistor XOR and XNOR designs,” in Proc. of the Second IEEE Asia Pacific Conf. on ASICs, pp. 25-28, Aug. 2000



[FAN00] L. Fanucci and M. Forliti, “Interlaced diagonal-wise pipelined serial

S. Katkrori, and S. Alupoaei, "RT-level interconnect optimization in DSM

D. Sylvester, and K. Kuetzer, “Getting to the bottom of deep submicron,” in...


