Scalable Representations for Vision and Robotics
- Xiao, Tete
- Advisor(s): Darrell, Trevor
Abstract
Artificial intelligence systems have advanced remarkably in recent years, yet scaling them and generalizing them to real-world problems remains a significant challenge. In this thesis, we explore three key components of building scalable artificial intelligence systems for computer vision (model optimizability, learning objectives, and large-scale datasets) and apply the resulting insights to robotics.
We begin by examining the optimizability of vision transformers, proposing a new set of optimizability metrics and an alternative design for their patchify stem. Next, we introduce a contrastive self-supervised learning objective that reduces inductive biases in self-supervised learning, resulting in superior performance across various datasets. We then show that self-supervised visual pre-training on real-world images is effective for learning motor control tasks from pixels, outperforming supervised baselines and matching oracle state performance.
Expanding on this, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks, demonstrating the effectiveness of pre-trained representations across a range of tasks and embodiments. In addition, we present a sim-to-real learning-based approach for real-world humanoid locomotion using a causal Transformer, marking the first fully learning-based method for real-world full-sized humanoid locomotion. Finally, we conclude the thesis and discuss directions for future research.