Towards Human-Aligned Vision-Language Models
- Lu, Yujie
- Advisor(s): Wang, William
Abstract
The integration of vision and language stands as a central challenge in advancing artificial intelligence, offering the promise of systems capable of reasoning, planning, and evaluation in a manner akin to human cognition. Despite significant progress, existing models often struggle with fundamental limitations, such as low-resource generalization, procedural reasoning, and alignment with human preferences in real-world settings.
This dissertation addresses these challenges by introducing innovative frameworks that enhance the synergy between visual and linguistic modalities. Central to this work is the concept of augmenting language understanding with visual imagination, enabling models to leverage external generative knowledge for robust inference, particularly in resource-constrained scenarios. Complementary to this, a novel approach to temporal reasoning is presented, equipping systems with the ability to distill and sequence relevant evidence across long-form video content, facilitating deeper comprehension of complex visual narratives.
In the domain of procedural planning, we explore how neuro-symbolic reasoning and multimodal prompting can advance goal-directed reasoning. These methods bridge causal gaps in procedural tasks, offering both textual and visual guidance that is temporally coherent and contextually grounded. To support these advancements, we introduce and benchmark against new datasets designed to challenge existing paradigms.
Finally, this work redefines evaluation methodologies in generative AI by aligning metrics with human judgments. By incorporating large language models and leveraging human preference benchmarks, we uncover critical gaps in current vision-language systems, revealing avenues for improving robustness, reliability, and safety in real-world applications.
The insights and methodologies presented in this dissertation contribute to the development of vision-and-language models that are more aligned with human reasoning and societal needs, advancing the field toward more trustworthy and capable AI systems.