Rapid design exploration framework for realizing custom computing systems on FPGAs
Aung, Yan Lin
Date of Issue: 2016
School of Computer Engineering
Centre for High Performance Embedded Systems
Field Programmable Gate Arrays (FPGAs) have become one of the most preferred computing platforms for implementing configurable systems-on-chip, despite the challenges in meeting the cost, performance and energy requirements of embedded systems. The main driver for the proliferation of FPGAs is the demand for shorter time-to-market and lower non-recurring engineering cost. However, the lack of new design methodologies and techniques that can effectively leverage the various hardware (reconfigurable logic and digital signal processing blocks) and software (soft and hard processor cores) computational resources in modern FPGAs remains the bottleneck. The main aim of this research is to develop a constraint-aware design exploration framework for modern FPGA systems for the rapid realization of custom computing solutions. An efficient technique for high-level software performance estimation has been proposed that does not require executing the application on the target processor or on an instruction set simulator. The proposed technique incorporates dynamic characteristics of the target processor, such as branch penalty and data dependency, in order to achieve high estimation accuracy. In addition, a novel control-flow mapping strategy has been introduced to realize the rapid estimation of compiler-optimized software. Experimental results based on the widely used CHStone benchmark suite show that the proposed technique can be reliably used to estimate software performance on a PowerPC processor with an average estimation error of only 6%. In order to rapidly estimate the look-up table (LUT) utilization of custom hardware data-paths, a technology-mapping-aware clustering technique has been proposed. Unlike existing work, the proposed technique takes into account the synthesis optimizations and technology mapping that are relied upon by commercial FPGA synthesis tools.
Experimental results show that the proposed area estimation technique is able to estimate LUT utilization of the data-paths with an average estimation error of 9%, which outperforms an existing technique by 29%, for Altera Cyclone II and 7% for Stratix IV FPGAs. A regression-based technique to estimate the LUT utilization of finite state machine (FSM) based controllers has also been proposed. Multiple linear curve fitting was applied to obtain the parameters of the proposed regression model. Experimental results show that the regression-based technique is able to estimate LUT utilization of the FSMs with an average estimation error of 9% and achieves a 24% improvement over an existing analytical technique. In addition, a strategy to estimate the utilization of on-chip digital signal processing (DSP) blocks for different types of multiply operators, based on synthesis inference models, has been developed. This allows DSP blocks to be incorporated efficiently into the estimation process. It has been successfully demonstrated that the proposed technique is capable of estimating DSP and LUT utilization for various multiply operators with an accuracy of 100% in almost all cases. In order to estimate the critical path delay and cycle counts of custom hardware accelerators, a high-level estimation technique that relies on the technology-mapping-aware clustering algorithm has been proposed. The proposed technique takes into consideration the synthesis optimizations employed by the commercial FPGA design tool in order to increase the estimation accuracy. Evaluations based on the hardware accelerators from the widely used CHStone benchmark suite show that the proposed technique is able to estimate the critical path delays with an average estimation error of 8% and 14% for Altera Cyclone II and Stratix IV FPGAs, respectively.
It is noteworthy that the run-time of the proposed area-time estimation technique is on the order of milliseconds, a speed-up of three orders of magnitude over the commercial FPGA synthesis process, while still providing reasonably accurate area-time estimates. A communication-aware hardware-software partitioning algorithm has been devised for identifying the profitable candidate blocks for hardware acceleration. A hybrid technique based on 0-1 Knapsack and modified Simulated Annealing has been proposed. The KnapSim algorithm can achieve near-optimal solutions at significantly lower run-time than an existing state-of-the-art genetic algorithm based approach. The proposed partitioning algorithm is used to realize a design exploration framework for constraint-aware (i.e. FPGA LUTs and DSP blocks) performance optimization. A case study using a widely used application demonstrates that the proposed framework is capable of rapid design exploration without executing compiled code or running FPGA implementation. Finally, the proposed framework can be readily integrated with commercial FPGA toolchains in order to cope with the design exploration challenges associated with complex embedded computing applications.
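
As an illustration of the 0-1 Knapsack component of the partitioning step described above, the following is a minimal sketch and not the thesis's KnapSim algorithm. It assumes a single resource constraint (a LUT budget) and ignores communication cost, DSP blocks and the Simulated Annealing refinement. The function name `select_blocks`, its parameters and the example figures are hypothetical.

```python
from typing import List, Tuple


def select_blocks(gains: List[int], lut_costs: List[int],
                  lut_budget: int) -> Tuple[int, List[int]]:
    """Classic 0-1 knapsack by dynamic programming.

    gains[i]     -- estimated cycles saved by moving block i to hardware
    lut_costs[i] -- estimated LUTs needed by block i's accelerator
    lut_budget   -- LUTs available on the device (assumed single constraint)
    Returns (total gain, indices of the blocks chosen for hardware).
    """
    n = len(gains)
    # best[i][b]: maximum gain using the first i blocks and at most b LUTs
    best = [[0] * (lut_budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, cost = gains[i - 1], lut_costs[i - 1]
        for b in range(lut_budget + 1):
            best[i][b] = best[i - 1][b]  # keep block i-1 in software
            if cost <= b and best[i - 1][b - cost] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + gain  # move it to hardware

    # Walk the table backwards to recover which blocks were selected
    chosen: List[int] = []
    b = lut_budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lut_costs[i - 1]
    chosen.reverse()
    return best[n][lut_budget], chosen


if __name__ == "__main__":
    # Hypothetical estimates for three candidate blocks and a 50-LUT budget
    print(select_blocks([60, 100, 120], [10, 20, 30], 50))
```

In a flow like the one the abstract describes, the gains and LUT costs would come from the high-level estimators rather than from synthesis or execution, which is what keeps each exploration step fast.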