Language is a powerful representation for capturing knowledge and information about our world. Owing to its high level of abstraction, it excels at expressing discrete concepts, such as objects, their attributes, and the relationships between them, in a remarkably compact manner. Language is the primary means by which we communicate, comprehend, and express our thoughts and ideas, and it lies at the very core of human intelligence. With the advent of powerful generative models, machines have also begun to comprehend and generate natural language with notable fluency and creativity. However, they lack “grounding”: a direct tie to the visual world. Vision plays a pivotal role in our comprehension and production of language. When we describe a scene, understand instructions, or engage in a dialogue, visual context significantly aids our interpretation and generation of language. This highlights the need to integrate vision into generative modeling.
Chapters 1 and 2 delve into the image-to-text domain, spotlighting the importance of a multimodal approach to text generation. In Chapter 1, we explore how generating textual rationales with attention visualizations can enhance model transparency for visual question answering. In Chapter 2, we build generative models that abandon traditional left-to-right sequencing in favor of an unsupervised technique for determining optimal generation orders. Chapters 3 and 4 shift the focus to text-to-image generation. In Chapter 3, we introduce a training-free framework that combines linguistic cues with reference images, allowing for controllable image synthesis with denoising diffusion probabilistic models. Lastly, Chapter 4 emphasizes the importance of preserving object shapes in text-based image editing, proposing a unique mechanism that augments text-to-image models to be more faithful to input masks and text prompts.