The iDEA architecture-focused FPGA soft processor
Cheah, Hui Yan
Date of Issue: 2016
School of Computer Engineering
Centre for High Performance Embedded Systems
The performance and power benefits of FPGAs have remained accessible primarily to designers with strong hardware skills. Yet as FPGAs have evolved, they have gained capabilities that make them suitable for a wide range of domains and more complex systems. However, the low-level, time-consuming hardware design process remains an obstacle to much wider adoption. An idea gaining some traction recently is the use of soft programmable architectures built on top of the FPGA as overlays, with compilers translating code to be executed on these architectures. This allows the use of strong compiler frameworks and also avoids the bit-level, cycle-level design required in RTL design. A key issue with soft overlay architectures is that when designed without consideration for the underlying FPGA architecture, they suffer from significant performance and area overheads. This thesis presents an FPGA architecture-focused soft processor built to demonstrate the benefits of leveraging detailed architecture capabilities. It uses the highly capable DSP blocks on modern Xilinx devices to enable a general-purpose processor that is small and fast. We show that the DSP48E1 blocks in Xilinx Virtex-6 and 7-Series devices support a wide range of standard processor instructions that can be designed into the core of a processor we call iDEA. On recent devices it can run close to the limit of 500 MHz, while consuming considerably less area than other soft processors. We conduct a detailed design space exploration to identify the optimal pipeline depth for iDEA. We then propose the use of composite instructions to improve performance through better use of the DSP block, and show a speedup of up to 1.2× over a processor without composite instructions.
Finally, we show how a restricted forwarding scheme that uses an internal DSP block accumulation path can eliminate some of the dependency overheads in executing programs, achieving a 25% improvement in execution time, compared to an alternative forwarding path implemented in the logic fabric, which offers only a 5% improvement. We benchmark our processor with a range of representative benchmarks and analyse it at the compiler, instruction, and cycle levels.
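As a rough illustration of the composite-instruction idea mentioned in the abstract, the sketch below fuses a multiply that is immediately followed by a dependent add into a single multiply-add, the kind of operation a DSP block can perform in one pass. This is only a toy model under stated assumptions: the tuple-based instruction format, the opcode names (`mul`, `add`, `muladd`), the register names, and the function `fuse_mul_add` are all hypothetical and are not taken from the iDEA instruction set or its compiler.

```python
from typing import List, Tuple

# Hypothetical three-address instruction: (opcode, dest, src1, src2[, src3]).
Instr = Tuple[str, ...]


def fuse_mul_add(program: List[Instr]) -> List[Instr]:
    """Fuse 'mul t, a, b' directly followed by 'add d, t, c' into 'muladd d, a, b, c'.

    Simplifying assumption: the temporary t is not read by any later instruction,
    so dropping the separate 'mul' is safe. A real compiler pass would check this.
    """
    fused: List[Instr] = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if cur[0] == "mul" and nxt is not None and nxt[0] == "add":
            uses_first = nxt[2] == cur[1]
            uses_second = nxt[3] == cur[1]
            # Fuse only when exactly one add operand is the multiply result.
            if uses_first != uses_second:
                other = nxt[3] if uses_first else nxt[2]
                fused.append(("muladd", nxt[1], cur[2], cur[3], other))
                i += 2
                continue
        fused.append(cur)
        i += 1
    return fused


program: List[Instr] = [
    ("mul", "t0", "r1", "r2"),  # t0 = r1 * r2
    ("add", "r3", "t0", "r4"),  # r3 = t0 + r4 (depends on the mul)
    ("sub", "r5", "r3", "r1"),  # unrelated to the fusion
]
fused = fuse_mul_add(program)
print(len(program), "->", len(fused), "instructions")
```

Fusing the pair removes one instruction and also removes the dependency between the multiply and the add, which is the general mechanism by which composite instructions can shorten execution. The instruction counts here say nothing about the speedup figures reported in the thesis.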