Search

Scholarly Works (8 results)

Sort By:

Article
Peer Reviewed

Methods for multitasking among real‐time embedded compute tasks running on the GPU

UC Davis Previously Published Works (2017)

In this study, we provide an extensive survey on wide spectrum of scheduling methods for multitasking among graphics processing unit (GPU) computing tasks. We then design several schedulers and explain in detail the selected methods we have developed to implement our scheduling strategies. Next, we compare the performance of schedulers on various workloads running on Fermi and Kepler architectures and arrive at the following major conclusions: (1) Small kernels benefit from running kernels concurrently. (2) The combination of small kernels, high-priority kernels with longer runtimes, and lower-priority kernels with shorter runtimes benefits from a CPU scheduler that dynamically changes kernel order on the Fermi architecture. (3) Because of limitations of existing GPU architectures, currently CPU schedulers outperform their GPU counterparts. We also provide results and observations obtained from implementing and evaluating our schedulers on the NVIDIA Jetson TX1 system-on-chip architecture. We observe that although TX1 has the newer Maxwell architecture, the mechanism used for scheduler timings behaves differently on TX1 compared to Kepler leading to incorrect timings. In this paper, we describe our methods that allow us to report correct timings for CPU schedulers running on TX1. Finally, we propose new research directions involving the investigation of additional scheduling strategies.

Cover page: Methods for multitasking among real‐time embedded compute tasks running on the GPU

Article
Peer Reviewed

Multitasking Real-time Embedded GPU Computing Tasks

UC Davis Previously Published Works (2016)

Article

FPGA versus GPU for Speed-Limit-Sign Recognition

UC Davis Previously Published Works (2018)

Article
Peer Reviewed

Feature-Based Speed Limit Sign Detection Using a Graphics Processing Unit

IDAV Publications (2011)

In this study we test the idea of using a graphics processing unit (GPU) as an embedded co-processor for realtime detection of European Union (EU) speed-limit signs. The input to the system is a set of grayscale videos recorded from a forward-facing camera mounted in a vehicle. We introduce a new technique for implementing the radial symmetry detector (RSD) efficiently using the native rendering capabilities of a GPU. The technique maps the algorithms to the hardware such that the detection of speed-limit sign candidates is significantly accelerated. The system reaches up to 88% detection rate and runs at 33 frames per second on video sequences with VGA (640x480) resolution on an embedded system with an Intel Atom 230 @ 1.67 GHz CPU and a NVIDIA GeForce 9400M GS GPU.

Cover page: Feature-Based Speed Limit Sign Detection Using a Graphics Processing Unit

Article
Peer Reviewed

Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset

Electrical & Computer Engineering (2018)

We benchmark several widely used deep-learning frameworks for performing deep-learning-related automotive tasks (e.g., traffic sign recognition) that need to achieve realtime and high accuracy results with limited resources available on embedded platforms such as FPGAs. In our benchmarks, we use various input image sizes on models that are suitable for FPGA deployment, and investigate the training speed and inference accuracy of selected frameworks for these different sizes on a popular traffic sign recognition dataset. We report results by running the frameworks solely on the CPU as well as by turning on GPU acceleration. We also provide optimizations we apply to fine-tune the performance of the frameworks. We discover that Neon and MXNet deliver the best training speed and inference accuracy in general for all our test cases, while Tensorflow is always among the frameworks with the highest inference accuracies. We also observe that on the particular dataset we tested on (i.e., GTSRB), the image size of the region of interest does not necessarily affect the inference accuracy, and that using deep models, e.g., ResNet-32, which have longer training times, might not provide improvements to inference accuracy.

Cover page: Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset

Article
Peer Reviewed

A Template-Based Approach for Real-Time Speed-Limit-Sign Recognition on an Embedded System using GPU Computing

IDAV Publications (2010)

We present a template-based pipeline that performs real-time speed-limit-sign recognition using an embedded system with a low-end GPU as the main processing element. Our pipeline operates in the frequency domain, and uses nonlinear composite filters and a contrast-enhancing preprocessing step to improve its accuracy. Running at interactive rates, our system achieves 90% accuracy over 120 EU speed-limit signs on 45 minutes of video footage, superior to the 75% accuracy of a non-real-time GPU-based SIFT pipeline.

Cover page: A Template-Based Approach for Real-Time Speed-Limit-Sign Recognition on an Embedded System using GPU Computing

Article
Peer Reviewed

Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection

Electrical & Computer Engineering (2019)

We benchmark several widely-used deep learning frameworks and investigate the FPGA deployment for performing traffic sign classification and detection. We evaluate the training speed and inference accuracy of these frameworks on the GPU by training FPGA-deployment-suitable models with various input sizes on GTSRB, a traffic sign classification dataset. Then, selected trained classification models and various object detection models that we train on GTSRB's detection counterpart (i.e., GTSDB) are evaluated with inference speed, accuracy, and FPGA power efficiency by varying different parameters such as floating-point precisions, batch sizes, etc. We discover that Neon and MXNet deliver the best training speed and classification accuracy on the GPU in general for all test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We observe that with the current OpenVINO release, the performance of lightweight models (e.g., MobileNet-v1-SSD, etc) usually exceeds the requirement of real-time detection without losing much accuracy, while other models (e.g., VGG-SSD, ResNet-50-SSD) generally fail to do so. We also demonstrate that we can adjust the precision of bitstreams and the batch sizes to balance inference speed and accuracy of the applications deployed on the FPGA. Finally, we show that for all test cases, the FPGA always achieves higher power efficiency than the GPU.

Cover page: Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection

Article
Peer Reviewed

Fast Deformable Registration on the GPU: A CUDA Implementation of Demons

IDAV Publications (2008)

In the medical imaging field, we need fast deformable registration methods especially in intra-operative settings characterized by their time-critical applications. Image registration studies which are based on Graphics Processing Units (GPUs) provide fast implementations. However, only a small number of these GPU-based studies concentrate on deformable registration. We implemented Demons, a widely used deformable image registration algorithm, on NVIDIA's Quadro FX 5600 GPU with the Compute Unified Device Architecture (CUDA) programming environment. Using our code, we registered 3D CT lung images of patients. Our results show that we achieved the fastest runtime among the available GPU-based Demons implementations. Additionally, regardless of the given dataset size, we provided a factor of 55 speedup over an optimized CPU-based implementation. Hence, this study addresses the need for on-line deformable registration methods in intra-operative settings by providing the fastest and most scalable Demons implementation available to date. In addition, it provides an implementation of a deformable registration algorithm on a GPU, an understudied type of registration in the general-purpose computation on graphics processors (GPGPU) community.

Cover page: Fast Deformable Registration on the GPU: A CUDA Implementation of Demons