Efficiency in Computer Vision: From Compute and Memory to Robustness
eScholarship
Open Access Publications from the University of California

UC Davis Electronic Theses and Dissertations

Efficiency in Computer Vision: From Compute and Memory to Robustness

Abstract

Deep learning has been the biggest success story of the past decade. It is now part of nearly every facet of our lives. Along with its growth in popularity, deep learning has also seen growth in model and data sizes and in the scale of training. The introduction of the transformer architecture a few years ago proved to be an inflection point, and foundation models have taken off since then. The large vision and language models of today contain hundreds of billions of parameters and are trained on trillions of tokens across thousands of GPUs. They are also being deployed at a scale never seen before, including in real-time and safety-critical applications. Very soon, each person could have their own customized LLM to act as their virtual self.

The big improvements in the performance of these models also come with bigger demands in terms of energy, compute, memory, and resources at both the training and inference stages. The general-purpose nature of foundation models also implies that they need to be continuously updated on new data. Thus, 'training is a one-and-done process' cannot be a justification for their huge demands. The training and use of these models also comes with a huge carbon footprint and can have an adverse impact on the environment. Thus, there is both a huge opportunity and a need to develop more efficient models.

There are multiple perspectives on efficiency in deep learning, with compute/energy, speed, memory, data, and hardware/resources being the most important ones. Their fates are often intertwined: for instance, smaller models require less data and computation, and are thus faster and can run on less expensive hardware. The scale of models can also be a limiting factor in many real-time applications. Making them nimble can unlock a host of new applications.

My goal here is to design and develop such efficient deep learning systems, primarily for computer vision applications, increasing their positive impact and accessibility. My focus is on the compute and memory efficiency of models, with an eye on their robustness and reliability. I present solutions that reduce the training time and hardware requirements of self-supervised representation learning methods for vision, along with an easy way to distill them into smaller networks. On the memory front, my work includes a way to effectively fine-tune large vision and language models for a specific downstream task using only a tiny fraction of the original model as additional parameters. Ideas from fine-tuning can also be readily adopted in active learning, which is part of my future work. Similarly, I target a recent groundbreaking approach to novel view synthesis, reducing its memory bottleneck through compression. I intend to extend these works to similarly improve the training and inference times of diffusion models for image generation. While most of my research focuses on making methods more efficient, it is also necessary to stop and analyze their robustness. My work on adversarial attacks on efficient transformers opens up new avenues of research toward better attacks and defenses that target the efficiency of these models. I hope that my work contributes to democratizing AI and has a net positive impact on the planet.
