A Study Of Small Evolution of Vision Transformers For Low Power Devices
UC Davis Electronic Theses and Dissertations

No data is associated with this publication.
Abstract

The emergence of Transformers revolutionized the landscape of Natural Language Processing (NLP), supplanting traditional approaches such as LSTMs and RNNs with Transformer-based models such as BERT and GPT. Despite their superior performance, the substantial memory requirements and computational complexity of Transformer-based algorithms pose challenges for deployment on embedded devices. For instance, GPT-3 alone requires storing and fetching 175 billion parameters, leading to memory bottlenecks and heightened power consumption due to frequent parameter transfers between computational units and memory.

The success of Transformer-based models in NLP has spurred their application to computer vision tasks as well. Despite their origins in language processing, Transformer-based models have been adapted for image classification, segmentation, and object detection tasks due to their remarkable performance. However, the associated challenges of memory footprint and computational complexity persist, hindering their effective deployment on embedded devices in the vision domain.

This study tackles these challenges by introducing an approach to crafting energy-efficient, dynamically prunable Vision Transformers tailored for edge applications. Termed the Incremental Resolution Enhancing Transformer (IRET), our method revolves around sequentially sampling the input image. Notably, our solution uses smaller embedding sizes for input tokens than previous approaches. These compact embeddings are used in the initial layers of the IRET Vision Transformer until a robust attention matrix is established. This attention matrix then guides the sampling of additional information via a learnable 2D lifting scheme, focusing solely on important tokens while dropping those with low attention scores. As the model concentrates on a smaller subset of tokens, its attention to, and the resolution of, those tokens naturally increase. This incremental, attention-driven input sampling and token-dropping mechanism enables IRET to prune its computation tree significantly as needed. By adjusting the threshold for discarding unattended tokens and increasing the focus on attended ones, we can train a model that dynamically balances complexity against accuracy. Additionally, we explore the application of dynamic inference techniques that allow the model to predict outcomes early. This feature is particularly advantageous for edge devices, where the trade-off between accuracy and complexity can be adjusted at run time based on factors such as battery life and reliability requirements.
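To make the token-dropping idea concrete, the sketch below shows one way attention scores can prune patch tokens between transformer blocks. It is a minimal illustration, not the IRET implementation: the class names, the thresholding rule (class-token attention averaged over heads), and the tensor shapes are assumptions, and the learnable 2D lifting step that re-samples higher-resolution information for the kept tokens is omitted.

```python
import torch
import torch.nn as nn


class AttentionDrivenTokenDrop(nn.Module):
    # Drops patch tokens whose attention score from the class token falls
    # below a fixed threshold; the class token itself is always kept.
    def __init__(self, threshold: float = 0.01):
        super().__init__()
        self.threshold = threshold

    def forward(self, tokens: torch.Tensor, attn: torch.Tensor):
        # tokens: (B, N, D) embeddings, with the class token at index 0
        # attn:   (B, H, N, N) attention weights from the preceding block
        scores = attn.mean(dim=1)[:, 0, 1:]   # (B, N-1): CLS-to-patch attention, averaged over heads
        keep = scores >= self.threshold       # boolean mask over patch tokens

        pruned = []
        for b in range(tokens.size(0)):
            cls_token = tokens[b, :1]             # (1, D), always retained
            patches = tokens[b, 1:][keep[b]]      # (N_kept, D), varies per image
            pruned.append(torch.cat([cls_token, patches], dim=0))
        return pruned                             # ragged list: one tensor per image


# Minimal usage with random tensors standing in for a transformer block's output.
if __name__ == "__main__":
    B, H, N, D = 2, 4, 197, 64                    # batch, heads, tokens (1 CLS + 196 patches), dim
    tokens = torch.randn(B, N, D)
    attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    dropper = AttentionDrivenTokenDrop(threshold=1.0 / N)
    for out in dropper(tokens, attn):
        print(out.shape)                          # fewer than N tokens remain
```

Raising the threshold discards more tokens and shrinks the computation in subsequent layers, which is the knob the abstract describes for trading accuracy against complexity.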

Main Content

This item is under embargo until May 15, 2025.