Hardware-efficient techniques for FAST corner detector
Lim, Teck Chuan
Date of Issue: 2017-04-12
School of Computer Science and Engineering
Corner detection is the foundation of many computer vision applications, such as high-speed object recognition and object analysis. These applications are beginning to find their way into battery-operated devices such as smartphones, drones and mobile robots. The corner detection algorithm for these applications must therefore meet real-time constraints while keeping power consumption low. Software implementations on embedded platforms often fail to meet these conflicting requirements simultaneously. For example, low-end embedded microcontrollers consume less power but are slow, whereas high-end processors achieve high speed at the cost of higher power consumption. To meet the real-time and low-power requirements, several well-known hardware acceleration methods for corner detection algorithms such as FAST, Shi-Tomasi, SUSAN and Harris have been presented in the literature. This thesis focuses on hardware implementation of the FAST corner detector, which has been reported to have the lowest execution time. Six hardware designs have been proposed for the FAST corner detection architecture, aiming to reduce resource usage, computation time and power dissipation. The unrolled hardware design was proposed to eliminate the 7x7 convolution buffer used in the baseline architecture, reducing the resources utilized by 5.3%. The merged hardware design shares the scoring units of the unrolled implementation by means of multi-pumping. Subsequently, another hardware design, Two's Complement Merged (TCM), was proposed to remove the redundant multiplexers by introducing a 2's complement operation at the end of the computation. With these optimizations, the total resources were reduced further by 27%. A simpler design, XNOR TCM, was then proposed, which uses XNOR logic to simplify the complex pixel scoring module in the TCM approach.
The simplification of the design reduced the amount of switching activity, which in turn reduced dynamic power dissipation by 47.5% compared to the baseline architecture. To reduce the computation time, the delayed TCM design employed pipelining in the critical path. This approach exploited the reduced resource utilization achieved in the previous designs. It led to an 18.5% reduction in total thermal power dissipation and a 14.9% drop in total resource usage, while the minimum period differed from that of the baseline architecture by only 6.3%. Finally, the heuristics design was proposed to reduce the resources utilized in the non-maximal suppression module by introducing three scoring units. These hardware designs were implemented and demonstrated on the TERASIC DE2i-150 FPGA development kit.
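For readers unfamiliar with the detector itself, the following is a minimal software sketch of the standard FAST segment test, which inspects 16 pixels on a radius-3 circle inside a 7x7 window. It is not one of the hardware designs described above. The function names, the threshold of 20 and the run length of 9 are assumptions chosen for illustration; the abstract does not state which values the thesis uses.

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 Bresenham circle,
# listed clockwise starting from the pixel directly above the centre.
CIRCLE = [
    (0, -3), (1, -3), (2, -2), (3, -1),
    (3, 0), (3, 1), (2, 2), (1, 3),
    (0, 3), (-1, 3), (-2, 2), (-3, 1),
    (-3, 0), (-3, -1), (-2, -2), (-1, -3),
]


def has_contiguous_run(flags, n):
    """Return True if `flags` holds n consecutive True values, treating it as circular."""
    run = 0
    # Appending the first n-1 flags lets a run wrap past the end of the list.
    for flag in flags + flags[:n - 1]:
        run = run + 1 if flag else 0
        if run >= n:
            return True
    return False


def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test: (x, y) is a corner if n contiguous circle pixels are all
    brighter than centre + threshold or all darker than centre - threshold.

    `img` is a list of rows of integer intensities; (x, y) must lie at least
    3 pixels away from every image border. threshold and n are illustrative.
    """
    centre = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [value > centre + threshold for value in ring]
    darker = [value < centre - threshold for value in ring]
    return has_contiguous_run(brighter, n) or has_contiguous_run(darker, n)
```

A typical use would call `is_fast_corner(img, x, y)` for every pixel at least three pixels from the image border, then score and filter the candidates with non-maximal suppression.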
Final Year Project (FYP)
Nanyang Technological University