The emergence of Transformers revolutionized the landscape of Natural Language Processing (NLP), supplanting traditional approaches such as RNNs and LSTMs with Transformer-based models like BERT and GPT. Despite their superior performance, the substantial memory storage demands and computational complexity of Transformer-based models pose challenges for their deployment on embedded devices. For instance, GPT-3 alone requires storing and fetching 175 billion parameters, leading to memory bottlenecks and heightened power consumption due to frequent parameter transfers between computational units and memory. The success of Transformer-based models in NLP has spurred their application to computer vision tasks as well: despite their origins in language processing, Transformer architectures have been adapted for image classification, segmentation, and object detection thanks to their remarkable performance. However, the associated memory storage requirements and computational complexity persist, hindering effective deployment on embedded devices in the vision domain.
This study tackles these challenges by introducing an approach to crafting energy-efficient, dynamically prunable Vision Transformers tailored for edge applications. Termed the Incremental Resolution Enhancing Transformer (IRET), our method revolves around sequentially sampling the input image. Notably, our solution uses smaller embedding sizes for input tokens than previous approaches. These compact embeddings are employed in the initial layers of IRET until a robust attention matrix is established. Subsequently, this attention matrix guides the sampling of additional information via a learnable 2D lifting scheme, focusing solely on important tokens while dropping those with low attention scores. As the model concentrates on a shrinking subset of tokens, its attention and effective resolution naturally increase.
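To make this mechanism concrete, the following PyTorch-style sketch illustrates attention-guided token dropping and refinement. It is a minimal sketch under stated assumptions: the module name AttentionGuidedRefiner, the fuse projection standing in for the learnable 2D lifting scheme, and the threshold value are all illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class AttentionGuidedRefiner(nn.Module):
    """Illustrative sketch: keep tokens whose received attention exceeds a
    threshold, then enrich the survivors with extra input samples. The fuse
    projection is a stand-in for the learnable 2D lifting scheme."""

    def __init__(self, dim: int, detail_dim: int):
        super().__init__()
        # Hypothetical fusion of newly sampled detail into surviving tokens.
        self.fuse = nn.Linear(dim + detail_dim, dim)

    def forward(self, tokens, attn, detail, threshold):
        # tokens: (B, N, dim)        token embeddings
        # attn:   (B, H, N, N)       attention weights from the previous layer
        # detail: (B, N, detail_dim) additional samples drawn from the input
        scores = attn.mean(dim=1).mean(dim=1)   # avg attention each token receives
        keep = scores >= threshold              # boolean mask of surviving tokens
        refined = []
        for b in range(tokens.size(0)):
            kept = tokens[b][keep[b]]           # drop low-attention tokens
            extra = detail[b][keep[b]]          # fetch more detail for survivors
            refined.append(self.fuse(torch.cat([kept, extra], dim=-1)))
        return refined, keep

# Toy usage: with softmax attention the mean received score is 1/N, so a
# threshold of 1/N keeps roughly the more-attended half of the tokens.
refiner = AttentionGuidedRefiner(dim=64, detail_dim=32)
tokens = torch.randn(2, 16, 64)
attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
detail = torch.randn(2, 16, 32)
refined, keep = refiner(tokens, attn, detail, threshold=1.0 / 16)
```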
This incremental, attention-driven input sampling and token-dropping mechanism enables IRET to prune its computation tree significantly as needed. By adjusting the threshold for discarding unattended tokens and augmenting the focus on attended ones, we can train a model that dynamically balances complexity with accuracy.
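The sketch below illustrates this knob on synthetic scores: raising the threshold keeps fewer tokens, and because self-attention cost grows roughly quadratically with token count, the savings compound. The scores and token grid here are made up purely for illustration.

```python
import torch

def surviving_tokens(scores: torch.Tensor, threshold: float) -> int:
    """Count tokens whose attention score clears the drop threshold."""
    return int((scores >= threshold).sum())

# Hypothetical per-token scores for a 14x14 patch grid (196 tokens).
scores = torch.rand(196)
for t in (0.0, 0.25, 0.5, 0.75):
    n = surviving_tokens(scores, t)
    # Attention FLOPs scale ~quadratically in the number of kept tokens.
    print(f"threshold={t:.2f}: {n} tokens kept, ~{(n / 196) ** 2:.0%} attention cost")
```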
Additionally, we explore dynamic inference techniques that enable the model to produce predictions early. This capability is particularly advantageous for edge devices, where the trade-off between accuracy and complexity can be adjusted at run time based on factors such as battery life and reliability.
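A minimal sketch of the early-exit idea follows, assuming one lightweight classifier head per Transformer block and a confidence-based stopping rule; the class EarlyExitTransformer, the mean-pooled heads, and the confidence threshold are illustrative assumptions, not our exact design.

```python
import torch
import torch.nn as nn

class EarlyExitTransformer(nn.Module):
    """Illustrative early-exit wrapper: one classifier per block; inference
    stops as soon as the prediction is confident enough."""

    def __init__(self, blocks: nn.ModuleList, heads: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.heads = heads

    @torch.no_grad()
    def forward(self, x: torch.Tensor, confidence: float = 0.9) -> torch.Tensor:
        # x: (1, N, dim); batch size 1 assumed so one confidence check suffices.
        probs = None
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            probs = head(x.mean(dim=1)).softmax(dim=-1)  # pool tokens, classify
            if probs.max() >= confidence:                # confident: exit early
                break
        return probs

# Toy usage with standard encoder layers standing in for IRET blocks.
dim, classes, depth = 64, 10, 6
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    for _ in range(depth)
)
heads = nn.ModuleList(nn.Linear(dim, classes) for _ in range(depth))
model = EarlyExitTransformer(blocks, heads)
probs = model(torch.randn(1, 16, dim), confidence=0.5)
```

Lowering the exit confidence at run time (for instance, when the battery is low) trades a small amount of accuracy for fewer executed blocks.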