Deep learning has been one of the biggest success stories of the past decade, and it is now part of nearly every facet of our lives. Along with this growth in popularity, deep learning has also seen growth in model and data sizes and in the scale of
their training. The introduction of the transformer architecture a few years ago proved to be an inflection point,
and foundation models have taken off since then. The large vision and language models of today contain hundreds of
billions of parameters and are trained on trillions of tokens across thousands of GPUs. They are also being deployed at a
scale never seen before, including in real-time and safety-critical applications. Very soon, each person could
have their own customized LLM to act as their virtual self.
The big improvements in the performance of these models come with growing demands for energy, compute, memory, and hardware resources at both the training and inference stages. The general-purpose nature of foundation models also implies
that they need to be continuously updated on new data, so `training is a one-and-done process' cannot be a
justification for these demands. The training and use of these models also carries a large carbon footprint and
can have an adverse impact on the environment. There is therefore both a huge opportunity and a need to develop more efficient
models.
There are multiple perspectives on efficiency in deep learning, with compute/energy, speed, memory, data, and hardware/resources being the most important ones. These dimensions are often intertwined: for instance, smaller models
require less data and computation, and are thus faster and can be run on less expensive hardware (the back-of-the-envelope
sketch below makes this concrete). The scale of models can also be a limiting factor in many real-time applications, and
making them nimble can unlock a host of new applications.
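To illustrate how model size translates into hardware requirements, here is a simple back-of-the-envelope sketch. The model sizes are illustrative round numbers, not figures from any specific published model, and the estimate covers only the memory needed to hold the weights:

```python
# Back-of-the-envelope estimate: memory needed just to hold a model's
# weights at different numeric precisions. (During training, gradients,
# optimizer state, and activations add substantially more.)

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """GiB required to store `num_params` weights at `precision`."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

# Illustrative model sizes, not any specific published model.
for label, n in [("100M params", 100e6), ("7B params", 7e9), ("175B params", 175e9)]:
    sizes = ", ".join(f"{p}: {weight_memory_gib(n, p):.1f} GiB" for p in BYTES_PER_PARAM)
    print(f"{label:>12} -> {sizes}")
```

Even this crude estimate shows why a model with hundreds of billions of parameters cannot fit on a single commodity GPU, and why shrinking models simultaneously relieves memory, compute, and hardware constraints.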
My goal is to design and develop such efficient deep learning systems, primarily for computer vision applications, thereby increasing their positive impact and accessibility. My focus is on the compute and memory efficiency of models, with
an eye on their robustness and reliability. I present solutions that reduce the training time and hardware requirements
of self-supervised representation learning methods for vision, along with an easy way to distill them into smaller networks. On the
memory front, my work includes a way to effectively fine-tune large vision and language models for a specific downstream
task with only a tiny fraction of the original model as additional parameters (a toy sketch of this adapter idea appears at
the end of this section). Ideas from fine-tuning can also be adapted to active learning, which is part of my future work.
I also target a recent groundbreaking approach for novel view synthesis, reducing its memory bottleneck by compressing it.
I intend to extend these works to similarly improve the training and inference times of diffusion models for image
generation. While most of my research is on making methods more efficient, it is also necessary to stop and analyze their
robustness. My work on adversarial attacks on efficient transformers opens up new avenues for research into better attacks
and defenses that target the efficiency of these models. I hope that my work contributes to democratizing AI and has a net
positive impact on the planet.
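To make the parameter-efficient fine-tuning idea concrete, here is a minimal, self-contained sketch of a low-rank adapter in the spirit of LoRA. It is a toy illustration with assumed shapes and hyperparameters (`rank`, `scale`), not the method from my work: the pretrained weights are frozen, and only a small low-rank update receives gradients.

```python
import torch
import torch.nn as nn

class LowRankAdapterLinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update W + B @ A.

    Only A and B (rank * (in + out) parameters) are trained -- a tiny
    fraction of the frozen layer's in * out weight parameters.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        # B starts at zero so the adapted layer initially matches the base layer.
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Toy usage: wrap a "pretrained" layer and count trainable parameters.
layer = LowRankAdapterLinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

Here roughly 1.5% of the parameters are trainable; after fine-tuning, the low-rank update can be folded into the base weight, leaving inference cost unchanged.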