The rapid growth of camera and storage capabilities, over the past decade, has resulted in an exponential growth in the size of video repositories, such as YouTube. In 2015, 400 hours of videos are uploaded to YouTube every minute. At the same time, massive amount of images/videos are generated from monitoring cameras for elderly, sick assistance, satellites for earth science research, and telescopes for space exploration. Human annotation and manual manipulation of such videos are infeasible. Computer vision technology plays an essential role in automating the indexing, sorting, tagging, searching and analyzing huge amount of video data. Object detection and activity recognition in general are some of the most challenging topics in computer vision today. While the detection/recognition accuracy has increased dramatically over the past few years, it has not kept up with the complexity of detection/recognition tasks nor with the increased resolution of the video/image sources. As a result, the computation speed, and power consumption, of computer vision applications have become a major impediment to their wider use. Thus applications relying on real-time monitoring/feedback are not possible under current speeds.
This thesis focuses on the use of Field Programmable Gate Arrays (FPGAs) to accelerate computer vision applications for embedded/real time applications while maintaining similar detection/recognition accuracy as the original processing. FPGAs are electronic devices on which an arbitrary digital circuit can be (re) configured under software control. To leverage the computational parallelism on FPGAs, fixed-point arithmetic is used for all implementations. The benefit of using fixed-point representation over floating point is the reduced bit-width, but the range and sometimes the precision are limited. Comprehensive studies are performed in this study to show that the classification system has some degree of tolerance to the reduced precision data representation. Hence FPGA programs are implemented accordingly in low bit-width fixed-point to achieve high computation throughput, low power consumption, and accurate classification.
As a first step, the impact of reduced precision is studied for Viola-Jones face detection algorithm: whereas the reference OpenCV code uses double precision floating-point values, by using only five decimal digit (17 bits) fixed-point representation, the detection can achieve the same rates of false positives and false negatives as the reference OpenCV code. By reducing the necessary precision by a factor of 3X to 4X, the size of the circuit on FPGA is reduced by a factor of 12X; hence increasing the number of feature classifiers that can be fit on a single FPGA. A hybrid CPU-FPGA processing pipeline is proposed to reduce CPU work-load.
As a second step, Histogram of Oriented Gradients (HOG), one of the most popular object detection algorithms, is evaluated by using the full-image evaluation methodology to explore the FPGA implementation of HOG using reduced bit-width. This approach lessens the required area resources on the FPGA and increases the clock frequency and hence the throughput per device through increased parallelism. Detection accuracy of the fixed-point HOG is evaluated by applying state-of-the-art computer vision pedestrian detection evaluation metrics. The reduced precision detection performs as well as the original floating-point code from OpenCV. This work then shows the single FPGA implementation achieves a 68.7x higher throughput than a high-end CPU, 5.1x higher than a high-end GPU, and 7.8x higher than the same implementation using floating-point on the same FPGA. A power consumption comparison for different platforms shows our fixed-point FPGA implementation uses 130x less power than CPU, and 31x less energy than GPU to process one image.
In addition to object detection algorithms, this thesis also investigates the acceleration of action recognition, specifically a human action recognition (HAR) algorithm. In HAR, pedestrian detection is normally used as a pre-processing step to locate human in stream video. In this work, the possibility to perform feature extraction under reduced precision fixed-point arithmetic is evaluated to ease hardware resource requirements.
The Histogram of Oriented Gradient in 3D (HOG3D) feature extraction is then compared with state-of-the-art Convolutional Neural Networks (CNNs) methods and result shows that the later is 75X slower than the former. The experiment shows that by re-training the classifier with reduced data precision, the classification performs as well as the original double-precision floating-point.
Based on this result, an FPGA-based HAR feature extraction is implemented for near camera processing using fixed-point data representation and arithmetic.
This implementation, using a single Xilinx Virtex 6 FPGA, achieves about 70x speedup over multicore CPU.
Furthermore, a GPU implementation of HAR is introduced with 80x speedup over CPU (on an Nvidia Tesla K20).