Grounding natural language in real-world perception is a fundamental challenge in enabling practical applications that require human-machine communication.
This Ph.D. dissertation presents our research thrusts on developing intelligent embodied agents that connect language, vision, and actions. It comprises two major directions toward human-machine communication.
First, moving from visual recognition to cognition, we study how to use natural language to describe visual information and express an understanding of the visual world.
We show that scalable learning approaches based on semantics and external knowledge can generate coherent, fine-grained, and generalizable visual descriptions.
We also investigate automatic evaluation metrics for language generation and propose an adversarial reward learning method that overcomes their limitations in optimizing the policy for human-like visual storytelling. In addition, we introduce a multilingual dataset for video-and-language research, which goes beyond monolingual language grounding in vision.
Second, to connect language and vision to actions, we situate natural language in interactive environments where communication often occurs.
We systematically study the task of vision-language navigation, which aims to associate human commands with visual perception and to navigate the 3D world.
We then present novel methods to tackle generalization and data-scarcity issues from various perspectives, such as counterfactual thinking, transfer learning, multitask learning, and agnostic learning. Our efforts shed light on generalizing embodied navigation agents to more challenging and practical scenarios.
Finally, we summarize the strengths, weaknesses, and implications of our work, and discuss future research plans for pushing embodied AI further.