Scholarly Works (15 results)

Thesis
Peer Reviewed

Generative Models for Image and Long Video Synthesis

Brooks, Tim
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2023)

In this thesis, I present essential ingredients for making image and video generative models useful for general visual content creation through three contributions. First, I will present research on long video generation. This work proposes a network architecture and training paradigm that enables learning long-term temporal patterns from videos, a key challenge to advancing video generation from short clips to longer-form coherent videos. Next, I will present research on generating images of scenes conditioned on human poses. This work showcases the ability of generative models to represent relationships between humans and their environments, and emphasizes the importance of learning from large and complex datasets of daily human activity. Lastly, I will present a method for teaching generative models to follow image editing instructions by combining the abilities of large language models and text-to-image models to create supervised training data. Following instructions is an important step that will allow generative models of visual data to become more helpful to people. Together these works advance the capabilities of generative models for synthesizing images and long videos.

Thesis
Peer Reviewed

Visual Learning Beyond Direct Supervision

Zhou, Tinghui
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2018)

Deep learning has made great progress in solving many computer vision tasks for which labeled data is plentiful. But progress has been limited for tasks where labels are difficult or impossible to obtain. In this thesis, we propose alternative methods of supervised learning that do not require direct labels. Intuitively, although we do not know what the labels are, we might know various properties they should satisfy. The key idea is to formulate these properties as objectives for supervising the target task. We show that this kind of “meta-supervision” on how the output behaves, rather than what it is, turns out to be surprisingly effective in learning a variety of vision tasks.

The thesis is organized as follows. Part I proposes to use the concept of cycle-consistency as supervision for learning dense semantic correspondence. Part II proposes to use the task of view synthesis as supervision for learning different representations of scene geometry. Part III proposes to use adversarial supervision for learning gradual image transformations. Finally, we discuss the general concept of meta-supervision and how it can be applied to tasks beyond those presented in this thesis.

Thesis
Peer Reviewed

Learning to Synthesize and Manipulate Natural Images

Zhu, Junyan
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2017)

Humans are avid consumers of visual content. Every day, people watch videos, play digital games and share photos on social media. However, there is an asymmetry -- while everybody is able to consume visual data, only a chosen few are talented enough to effectively express themselves visually. For the rest of us, most attempts at creating or manipulating realistic visual content end up quickly "falling off" the manifold of natural images. In this thesis, we investigate a number of data-driven approaches for preserving visual realism while creating and manipulating photographs. We use these methods as training wheels for visual content creation. We first propose to model visual realism directly from large-scale natural images. We then define a class of image synthesis and manipulation operations, constraining their outputs to look realistic according to the learned models. The presented methods not only help users easily synthesize more visually appealing photos but also enable new visual effects not possible before this work.

Part I describes discriminative methods for modeling visual realism and photograph aesthetics. Directly training these models requires expensive human judgments. To address this, we adopt active and unsupervised learning methods to reduce annotation costs. We then apply the learned model to various graphics tasks, such as automatically generating image composites and choosing the best-looking portraits from a photo album.

Part II presents approaches that directly model the natural image manifold via generative models and constrain the output of a photo editing tool to lie on this manifold. We build real-time data-driven exploration and editing interfaces based on both simpler image averaging models and more recent deep models.

Part III combines the discriminative learning and generative modeling into an end-to-end image-to-image translation framework, where a network is trained to map inputs (such as user sketches) directly to natural looking results. We present a new algorithm that can learn the translation in the absence of paired training data, as well as a method for producing diverse outputs given the same input image. These methods enable many new applications, such as turning user sketches into photos, season transfer, object transfiguration, photo style transfer, and generating real photographs from painting and computer graphics renderings.

Thesis
Peer Reviewed

Machine Learning for Deep Image Manipulation

Park, Taesung
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2021)

Common types of image editing methods focus on low-level characteristics. In this thesis, I leverage machine learning to enable image editing that operates at a higher conceptual level. Fundamentally, the proposed methods aim to factor out the visual information that must be maintained in the editing process from the information that may be edited by incorporating the generic visual knowledge. As a result, the new methods can transform images in human-interpretable ways, such as turning one object into another, stylizing photographs into a specific artist's paintings, or adding sunset to a photo taken in daylight. We explore designing such methods in different settings with varying amounts of supervision: per-pixel labels, per-image labels, and no labels. First, using per-pixel supervision, I propose a new deep neural network architecture that can synthesize realistic images from scene layouts and optional target styles. Second, using per-image supervision, I explore the task of domain translation, where an input image of one class is transformed into another. Lastly, I design a framework that can still discover disentangled manipulation of structure and texture from a collection of unlabeled images. We present convincing visuals in a wide range of applications including interactive photo drawing tools, object transfiguration, domain gap reduction between virtual and real environment, and realistic manipulation of image textures.

Thesis
Peer Reviewed

Image Synthesis for Self-Supervised Visual Representation Learning

Zhang, Richard
Advisor(s): Efros, Alexei A.

UC Berkeley Electronic Theses and Dissertations (2018)

Deep networks are extremely adept at mapping a noisy, high-dimensional signal to a clean, low-dimensional target output (e.g., image classification). By solving this heavy compression task, the network also learns about natural image priors. However, this process requires the curation of large, labeled datasets. Meanwhile, the world provides massive amounts of raw, unlabeled pixels for free. This thesis investigates learning representations of high-dimensional input signals by mapping them to high-dimensional output targets. While more difficult, it is not only possible to learn a strong feature representation, but also to synthesize realistic images.

Part I describes the use of deep networks for conditional image synthesis. The section begins by exploring the problem of image colorization, proposing both automatic and user-guided approaches. This section then proposes a system for general image-to-image translation problems, BicycleGAN, with the specific aim of capturing the multimodal nature of the output space.

Part II explores the visual representations learned within deep networks. Colorization, as well as cross-channel prediction in general, is a simple but powerful pretext task for self-supervised learning. The representations from cross-channel prediction networks transfer strongly to high-level semantic tasks, such as image classification, and to low-level human perceptual similarity judgments. For the latter, a large-scale dataset of human perceptual similarity judgments is collected. The proposed cross-channel network method outperforms traditional metrics such as PSNR and SSIM. In fact, many unsupervised and self-supervised methods transfer strongly, even comparably to fully-supervised methods.

Thesis
Peer Reviewed

Disentangled Visual Generative Models

Epstein, Dave
Advisor(s): Efros, Alexei A.

UC Berkeley Electronic Theses and Dissertations (2024)

Generative modeling promises an elegant solution to learning about high-dimensional data distributions such as images and videos, but how can we expose and utilize the rich structure these models discover? Rather than just drawing new samples, how can an agent actually harness p(x) as a source of knowledge about how our world works? This thesis explores scalable inductive biases that unlock a generative model's understanding of the entities latent in visual data, enabling much richer interaction with the model as a result.

Thesis
Peer Reviewed

Scalable Binding

Jabri, Allan Anwar
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2023)

Any useful agent will face many tasks and must rely on transfer of prior knowledge acquired in a scalable manner. This thesis explores inductive biases that enable scalable pre-training of representations -- and algorithms that bind them -- from the design of architectures capable of adaptive computation for scalable generative modeling, to self-supervised objectives that prepare embodied agents with mechanisms for state representation and reward maximization.

First, I consider the challenge of gracefully scaling generative models to high-dimensional data, motivating the importance of adaptive computation, a property missing from predominant architectures. This leads to a simple attention-based architecture for diffusion models capable of dedicating computation adaptively across its input and output, attaining superior performance in image and video generation despite being more domain-agnostic and efficient. Visualizations of read attention demonstrate how the model learns to dedicate computation to more complex parts of samples; e.g. in cases of high redundancy such as video prediction, it learns to simply copy information when appropriate and focus computation on more complex dynamics.

Next, I show how self-supervised objectives that exploit more domain knowledge can be used to efficiently solve related downstream tasks. In the domain of perception, I show how a simple self-supervised objective for space-time attention can be used to solve a range of tasks involving temporal correspondence and object permanence, central challenges in state representation for embodied agents. In the domain of reinforcement learning, I motivate the importance of scalable construction of task distributions and demonstrate how meta-reinforcement learners -- and underlying exploration and stimulus-reward binding mechanisms -- can be pre-trained with self-supervised reward models.

Finally, I conclude with a perspective on open problems in scalable pre-training, with a focus on the interplay between transfer across modalities, universal generative modeling objectives for discrete and continuous data, and adaptive computation.

Thesis
Peer Reviewed

Generative Models of Images and Neural Networks

Peebles, William Smith
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2023)

Large-scale generative models have fueled recent progress in artificial intelligence. Armed with scaling laws that accurately predict model performance as invested compute increases, NLP has become the gold standard for all disciplines of AI. Given a new task, pre-trained generative models can either solve it zero-shot or be efficiently fine-tuned on a small amount of task-specific training examples. However, the widespread adoption of generative models has lagged in other domains, such as vision and meta-learning. In this thesis, we study ways to train improved, scalable generative models of two modalities: images and neural network parameters. We also examine how pre-trained generative models can be leveraged to tackle additional downstream tasks.

We begin by introducing a new, powerful class of generative models: Diffusion Transformers (DiTs). We show that transformers, with one small yet critically important modification, retain their excellent scaling properties for diffusion-based image generation and outperform convolutional neural networks that have previously dominated the area. DiT outperforms all prior generative models on the class-conditional ImageNet generation benchmark.

Next, we introduce a novel framework for learning to learn based on building generative models of a new data source: neural network checkpoints. We create datasets containing hundreds of thousands of deep learning training runs and use them to train generative models of neural network checkpoints. Given a starting parameter vector and a target loss, error or reward, loss-conditional diffusion models trained on this data can sample parameter updates that achieve a desired metric. We apply our framework to problems in vision and reinforcement learning.

Finally, we explore how pre-trained image-level generative models can be used to tackle downstream tasks in vision without requiring task-specific training data. We show that pre-trained GAN generators can be used to create an infinite data stream to train networks for the dense visual correspondence problem, without requiring any human-annotated supervision like keypoints. Networks trained on this completely GAN-generated data generalize zero-shot to real images, and they outperform previous self-supervised and keypoint-supervised approaches that train on real data.

Thesis
Peer Reviewed

Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns

Ginosar, Shiry Sara
Advisor(s): Efros, Alexei A

UC Berkeley Electronic Theses and Dissertations (2020)

The human visual system is highly adept at making use of the rich subtleties of the visual world such as non-verbal communication signals, style, emotion, and the fine-grained details of individuals. Computer vision systems, by contrast, excel in categorical tasks, such as classification and detection, where training often relies on single-word or simple bounding-box annotations. These simple annotations do not capture the richness of the visual world which is often hard to describe in words or localize in an image. Our current systems are thus left to only make use of the obvious, easily describable parts of the visual input. This dissertation investigates several initial directions toward modeling visual minutiae and endowing computer vision systems with rich perception.

Part I describes methods for learning directly from video data without the need for human-provided annotations. The section begins by discussing the use of multi-modal correlations between audio and motion for modeling conversational gestures, an essential part of human communication that is currently ignored by machine perception. The section then proposes a simple method for capturing the appearance details of individual people in motion, which can be used to implement a "do-as-I-do" motion-transfer application.

Part II explores ways to discover temporal visual patterns in historical data. The section begins by discussing data-mining methods in a dataset of historical high school yearbook portraits where fashion and behavioral styles change over time. The rest of the section proposes an unsupervised method to learn to disentangle the time-varying visual factors from the permanent ones in a large dataset of urban scenes.

Part III discusses one possible avenue for testing whether our man-made systems have achieved human-like rich perception by comparing their performance to that of humans on a unique dataset of abstract art.

Thesis
Peer Reviewed

Learning to Generalize via Self-Supervised Prediction

UC Berkeley Electronic Theses and Dissertations (2019)

Generalization, i.e., the ability to adapt to novel scenarios, is the hallmark of human intelligence. While we have systems that excel at recognizing objects, cleaning floors, playing complex games and occasionally beating humans, they are incredibly specific in that they only perform the tasks they are trained for and are miserable at generalization. Could optimizing towards fixed external goals be hindering the generalization instead of aiding it? In this thesis, we present our initial efforts toward endowing artificial agents with a human-like ability to generalize in diverse scenarios. The main insight is to first allow the agent to learn general-purpose skills in a completely self-directed manner, without optimizing for any external goal.

To be able to learn on its own, the claim is that an artificial agent must be embodied in the world, develop an understanding of its sensory input (e.g., image stream) and simultaneously learn to map this understanding to its motor outputs (e.g., torques) in an unsupervised manner. All these considerations lead to two fundamental questions: how can an agent learn rich representations of the world similar to what humans learn, and how can it re-use such a representation of past knowledge to incrementally adapt and learn more about the world, as humans do? We believe prediction is the key to answering both. We propose generic mechanisms that employ prediction as a self-supervisory signal, allowing agents to learn sensory representations as well as motor control. These two abilities equip an embodied agent with a basic set of general-purpose skills, which are later repurposed to perform complex tasks.

We discuss how this framework can be instantiated to develop curiosity-driven agents (virtual as well as real) that can learn to play games, learn to walk, and learn to perform real-world object manipulation without any rewards or supervision. These self-directed robotic agents, after exploring the environment, can generalize to find their way in office environments, tie knots using rope, rearrange object configuration, and compose their skills in a modular fashion.
