Scholarly Works (2 results)

Thesis
Peer Reviewed

Online Learning for Orchestrating Deep Learning Inference at Edge

Shahhosseini, Sina
Advisor(s): Dutt, Nikil

UC Irvine Electronic Theses and Dissertations (2023)

Deep-learning-based intelligent services have become prevalent in cyber-physical applications, including smart cities and healthcare. Resource-constrained end devices must be carefully managed to meet the latency and energy requirements of computationally intensive deep learning services. Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency that can address application requirements through computation offloading. The decision to offload computation is a communication-computation co-optimization problem that varies with both system parameters (e.g., network condition) and workload characteristics. In addition, deep learning model optimization provides another source of tradeoff, between latency and model accuracy. An end-to-end decision-making solution that considers this computation-communication co-optimization problem is required to jointly find the optimal offloading policy and model for deep learning services. To this end, we propose a reinforcement-learning-based computation offloading solution that learns an optimal offloading policy, taking deep learning model selection techniques into account, to minimize response time while providing sufficient accuracy. We demonstrate the efficacy of our strategies through experimental comparison with state-of-the-art RL-based inference orchestration. In addition, we investigate applying an intelligent orchestration strategy in eHealth monitoring systems as a case study.
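
The joint choice of offloading target and model described in the abstract can be illustrated with a toy learner. The sketch below is not the thesis's algorithm: it is a simplified one-step (contextual-bandit style) epsilon-greedy value update, and every name and number in it (the placements, model accuracies, latencies, and the penalty of 1000) is a hypothetical value chosen only for illustration.

```python
import random

# Hypothetical action space: where to run inference and which model variant to use.
PLACEMENTS = ["device", "edge", "cloud"]
MODELS = {"small": 0.80, "large": 0.92}  # made-up accuracies
ACTIONS = [(p, m) for p in PLACEMENTS for m in MODELS]
STATES = ["good_network", "poor_network"]

# Made-up compute latency (ms) per (placement, model).
COMPUTE_MS = {
    ("device", "small"): 120, ("device", "large"): 400,
    ("edge", "small"): 40, ("edge", "large"): 110,
    ("cloud", "small"): 15, ("cloud", "large"): 30,
}
# Made-up transfer latency (ms) per network state and placement.
NETWORK_MS = {
    "good_network": {"device": 0, "edge": 20, "cloud": 60},
    "poor_network": {"device": 0, "edge": 400, "cloud": 900},
}


def response_time(state, action):
    """Total latency: compute time plus network transfer time."""
    placement, _model = action
    return COMPUTE_MS[action] + NETWORK_MS[state][placement]


def reward(state, action, min_accuracy):
    """Negative latency, with a large penalty if the model is not accurate enough."""
    _placement, model = action
    penalty = 1000 if MODELS[model] < min_accuracy else 0
    return -(response_time(state, action) + penalty)


def train(min_accuracy=0.9, episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Learn a value for each (state, action) pair with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice(STATES)  # observe the current network condition
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        r = reward(state, action, min_accuracy)
        q[(state, action)] += alpha * (r - q[(state, action)])
    return q


def best_action(q, state):
    """Return the highest-valued (placement, model) pair for a state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    q = train(min_accuracy=0.9)
    for s in STATES:
        print(s, best_action(q, s))
```

Under these assumed numbers, raising or lowering `min_accuracy` changes which model variants are acceptable, which in turn changes the preferred placement for each network condition.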


Article
Peer Reviewed

Partition Pruning: Parallelization-Aware Pruning for Dense Neural Networks

UC Irvine Previously Published Works (2020)

As recent neural networks are improved to be more accurate, their model sizes are growing exponentially. Thus, a huge number of parameters must be loaded from and stored in the memory hierarchy and processed to perform the training or inference phase of neural network processing. Increasing the number of parameters poses a major challenge for real-time deployment, since the trend in memory bandwidth improvement cannot keep up with the trend in model complexity growth. Although some operations in neural network processing, such as convolutional layer computation, are computationally intensive, computing dense layers faces a memory bandwidth bottleneck. To address this issue, the paper proposes Partition Pruning for dense layers to reduce the required parameters while taking parallelization into consideration. We evaluated the performance and energy consumption of parallel inference of partitioned models, which showed a 7.72x speedup and a 2.73x reduction in the energy used for computing pruned fully connected layers in the TinyVGG16 model, in comparison to running the unpruned model on a single accelerator. In addition, our method showed a limited reduction in accuracy while partitioning fully connected layers.
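
The abstract does not spell out the pruning rule, so the sketch below shows only one possible reading of pruning a dense layer with parallelization in mind, and should not be taken as the paper's method. It assumes neurons are split into contiguous partitions and that only weights whose input and output neurons share a partition are kept, so each partition can be computed from its own slice of inputs and weights. All function names here are hypothetical.

```python
def partition_of(index, size, num_partitions):
    """Map a neuron index to a contiguous partition id (assumed scheme)."""
    return index * num_partitions // size


def block_diagonal_mask(n_out, n_in, num_partitions):
    """Mask that is 1 where input and output neurons share a partition, else 0."""
    return [
        [
            1 if partition_of(o, n_out, num_partitions)
            == partition_of(i, n_in, num_partitions) else 0
            for i in range(n_in)
        ]
        for o in range(n_out)
    ]


def prune(weights, mask):
    """Zero out every weight that crosses a partition boundary."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def dense(weights, x):
    """Reference dense layer: y = W x (no bias, no activation)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def partitioned_dense(weights, x, num_partitions):
    """Compute each partition's outputs using only that partition's inputs and weights."""
    n_out, n_in = len(weights), len(x)
    y = [0] * n_out
    for p in range(num_partitions):  # each iteration could run on its own accelerator
        ins = [i for i in range(n_in) if partition_of(i, n_in, num_partitions) == p]
        outs = [o for o in range(n_out) if partition_of(o, n_out, num_partitions) == p]
        for o in outs:
            y[o] = sum(weights[o][i] * x[i] for i in ins)
    return y


if __name__ == "__main__":
    weights = [[1, 2, 3, 4, 5, 6],
               [6, 5, 4, 3, 2, 1],
               [1, 0, 1, 0, 1, 0],
               [2, 2, 2, 2, 2, 2]]
    x = [1, 2, 3, 4, 5, 6]
    mask = block_diagonal_mask(4, 6, 2)
    print(sum(map(sum, mask)), "of", 4 * 6, "weights kept")
    print(dense(prune(weights, mask), x) == partitioned_dense(weights, x, 2))
```

In this reading, the pruned layer and the independently computed partitions give the same output, which is what would let each partition load only its own share of the parameters.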


Creative Commons 'BY' version 4.0 license