Deep learning models have become the backbone of modern computer vision systems, achieving striking success in tasks ranging from image to video understanding. Yet while the benchmark performance of deep neural networks appears never to saturate as they are fed more data and compute, real-world applications of these models often fall short of expectations. This shortfall is largely due to biases in the data, which misrepresent the underlying distribution and lead to poor generalization and discriminatory outcomes. In this thesis, we investigate the problem of bias in vision and multimodal learning systems, proposing methods to identify, measure, and mitigate biases in both the data and the models.
The first part of the dissertation introduces a new form of bias, representation bias, which measures the extent to which ground-truth labels can be inferred from spurious features in the data. We apply this concept to study static bias in video action recognition, and propose a procedure, RESOUND, to guide the collection of datasets free of representation bias. We further develop a debiasing method, REPAIR, that mitigates the bias of existing datasets, and show that debiasing the training data improves the generalization of video models to new datasets.
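To make the notion concrete, the following is a minimal formalization in the spirit of RESOUND and REPAIR; the notation ($\mathcal{F}$, $\gamma^*$, $w$) is illustrative and may differ from the exact definitions used in the thesis. Representation bias can be measured by how far the best classifier restricted to a family $\mathcal{F}$ of spurious features exceeds chance on dataset $\mathcal{D}$, and a REPAIR-style resampling then learns per-example weights that minimize this quantity:

\[
B(\mathcal{F}, \mathcal{D}) = \log\!\big( M \, \gamma^*(\mathcal{F}, \mathcal{D}) \big),
\qquad
\gamma^*(\mathcal{F}, \mathcal{D}) = \max_{h \in \mathcal{F}} \; \Pr_{(x,y) \sim \mathcal{D}}\!\big[ h(x) = y \big],
\]
\[
\min_{w \in [0,1]^n} \; \max_{h \in \mathcal{F}} \;
\frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\!\left[ h(x_i) = y_i \right]}{\sum_{i=1}^{n} w_i},
\]

where $M$ is the number of classes, so that $B = 0$ when spurious features predict labels only at chance ($\gamma^* = 1/M$) and grows as they become more predictive.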
The second part shifts the focus to model-centric approaches that reduce the vulnerability of models to data biases. We propose dynamic representation learning, a novel framework for quantifying and minimizing the static bias of video classification models, and use it to study the impact of bias removal on the transferability of video representations. We then extend debiasing to multimodal learning, proposing a sparse video-text transformer for efficient modeling of long clips, and a training curriculum that enables temporal learning beyond static visual features.
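One illustrative way to quantify a video classifier's reliance on static cues is sketched below; this is an assumption-laden sketch, not the thesis's exact protocol. It compares a PyTorch model's accuracy on full clips against its accuracy when temporal information is destroyed by repeating a single frame across the clip (the model and data loader are assumed to exist):

    # Minimal sketch (assumed PyTorch interfaces; illustrative only):
    # estimate static bias as the gap between full-clip accuracy and
    # accuracy on clips with all motion removed.
    import torch

    @torch.no_grad()
    def static_bias_gap(model, loader, device="cuda"):
        model.eval().to(device)
        full_correct = static_correct = total = 0
        for clips, labels in loader:          # clips: (B, C, T, H, W)
            clips, labels = clips.to(device), labels.to(device)
            # Full-clip predictions can use both motion and appearance.
            full_pred = model(clips).argmax(dim=1)
            # Repeat the middle frame over time: appearance only, no motion.
            mid = clips.shape[2] // 2
            static = clips[:, :, mid : mid + 1].expand_as(clips).contiguous()
            static_pred = model(static).argmax(dim=1)
            full_correct += (full_pred == labels).sum().item()
            static_correct += (static_pred == labels).sum().item()
            total += labels.numel()
        # A small gap means predictions rest mostly on static cues.
        return full_correct / total - static_correct / total

A model whose accuracy barely drops on the frozen clips is, in this sense, statically biased: its predictions do not depend on temporal dynamics.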
In the final part of the dissertation, we advocate a holistic view of bias mitigation that considers dataset and model biases jointly, and apply this strategy to improve the fairness of vision-language foundation models using generated counterfactuals. We demonstrate the benefit of this approach in unifying bias mitigation across diverse tasks and domains, and discuss the potential of holistic debiasing for future research.