In-memory computing by RRAM for machine learning
Date of Issue: 2018-01-11
School of Electrical and Electronic Engineering
Centre for Integrated Circuits and Systems
Internet data has reached exa-scale (10^18 bytes), which creates a need to re-examine state-of-the-art hardware architectures for data-oriented computing. Current developments in artificial intelligence and machine learning add to this demand, and a traditional, purely software-based platform cannot satisfy the growing need for data analytics. Analyzing such a huge volume of data requires a scalable hardware platform for both data storage and processing, and the main limitation is the well-known memory bottleneck. One therefore needs to design a novel energy-efficient hardware architecture capable of supporting future big-data-driven applications. Traditional semiconductor memory technologies face critical challenges: with scaling down to the nano-scale, problems such as process variation, leakage current and I/O bandwidth limitations appear. The recently emerging resistive random-access memory (RRAM) has shown great potential as a solution for data-intensive applications. Besides minimizing leakage power thanks to its non-volatility, RRAM in a crossbar structure has been exploited as a computational element. As such, both memory and logic components can be realized in a power- and area-efficient manner. More importantly, it can provide a true in-memory logic-memory integration architecture without using I/Os. Therefore, the RRAM-based architecture is considered a potential universal memory for future big-data applications.
In this PhD thesis, we explore the development of RRAM-based logic operations and a non-volatile memory (NVM) in-memory architecture, as well as the mapping details of various machine learning algorithms on such an architecture.
Firstly, for the RRAM-based in-memory computing implementation, NVM-SPICE is used for circuit-level simulation of RRAM-based circuits. Measurement results of the forming, SET and RESET processes are also shown. For the diffusive-type RRAM, a sub-circuit model is used for simulation. A binary RRAM-crossbar for matrix-vector multiplication is developed. Because RRAM devices are non-uniform, the proposed binary RRAM-crossbar can perform the computation more accurately than a traditional RRAM-crossbar operated in analogue fashion. In addition, three kinds of crossbar structures are discussed: passive array, 1T1R and 1S1R. An RRAM-based coupled oscillator network for L2-norm calculation is also shown. A simple oscillator can be built based on the ON-OFF switching of diffusive-RRAM or the forming process of drift-RRAM. When the basic oscillators form a network, it can perform an L2-norm calculation based on the fitted simulation results.
Secondly, for the in-memory accelerator, a distributed in-memory computing architecture (XIMA) is developed. Apart from the traditional store and load commands, we define two more: start and wait. In addition, all logic blocks and data blocks are formed in pairs so that the delay of data-processing communication is minimized. CMOS logic such as the instruction queue and decoders is also designed. In addition, a 3D CMOS-RRAM architecture is investigated. For the single-layer 3D architecture, RRAM devices serve as vias connecting top-layer wordlines and bottom-layer bitlines, and all other CMOS logic is implemented in the bottom layer. For the multi-layer 3D architecture, we use two RRAM layers for data buffering and logic implementation, respectively, and through-silicon vias (TSVs) connect the different layers.
Thirdly, for data analytics mapping on the RRAM-based architecture, we have accelerated three machine learning algorithms on XIMA. The learning and inference procedures of a single-layer feed-forward neural network (SLFN) have been optimized and partially mapped on the passive binary RRAM-crossbar. In addition, we mapped the binary convolutional neural network (BCNN) on both the passive array and the 1S1R array with different mapping schemes. Moreover, L2-norm gradient-based learning and inference are also implemented on an RRAM network with both crossbar and coupled oscillators. For the 3D CMOS-RRAM architecture, we also mapped the optimized SLFN algorithm. All operations in the learning and inference stages are implemented so that on-line learning can be achieved. In addition, a tensorized neural network (TNN) is mapped on both single-layer and multi-layer accelerators with different mapping schemes.
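The binary crossbar matrix-vector multiplication mentioned in the abstract can be illustrated with a small behavioural sketch. This is not the circuit developed in the thesis: the function names, the conductance values `g_on` and `g_off`, and the rounding-based readout are all assumptions chosen for illustration. The sketch only shows the general idea that binary wordline inputs and binary cell states produce a bitline current that can be digitized back into an integer count.

```python
def bitline_currents(inputs, cells, g_on=1.0, g_off=0.0):
    """Sum the current on each bitline (column) of a binary crossbar.

    inputs: list of 0/1 wordline drives (hypothetical unit read voltage).
    cells:  list of rows, each a list of 0/1 cell states (1 = ON, low resistance).
    g_on, g_off: assumed ON/OFF conductances in arbitrary units.
    """
    n_cols = len(cells[0])
    currents = [0.0] * n_cols
    for drive, row in zip(inputs, cells):
        if drive:  # only driven wordlines contribute current
            for j, state in enumerate(row):
                currents[j] += g_on if state else g_off
    return currents


def binary_matvec(inputs, cells, g_on=1.0, g_off=0.0):
    """Digitize each bitline current into the count of ON cells on driven rows.

    With m driven rows and k ON cells among them, the current is
    k * g_on + (m - k) * g_off, so k is recovered by removing the OFF
    offset and rounding (an assumed, idealized readout).
    """
    m = sum(inputs)
    return [
        round((c - m * g_off) / (g_on - g_off))
        for c in bitline_currents(inputs, cells, g_on, g_off)
    ]


cells = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
print(binary_matvec([1, 1, 0, 1], cells))  # [3, 1, 1]
```

Because the readout only has to distinguish integer levels, modest deviations in the assumed conductances are absorbed by the rounding step, which is the intuition behind preferring a binary readout over an analogue one when devices are non-uniform.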