Controllable and Efficient Visual Generation
- Ding, Zheng
- Advisor(s): Tu, Zhuowen
Abstract
This dissertation presents several contributions aimed at enhancing visual generation, with a focus on controllability and efficiency in computer vision systems. Before turning to visual generation proper, we first introduce MaskCLIP, which efficiently leverages pretrained vision-language models for open-vocabulary image segmentation. We then present DiffusionRig, Patch-DM, and Gen2Res, which showcase our advances in controllable and efficient image generation. Collectively, the works in this dissertation strive to build vision systems that are both controllable and efficient, improving visual generation as well as visual understanding.
Chapter 2 introduces a novel task, open-vocabulary universal image segmentation, which aims to perform semantic, instance, and panoptic segmentation on arbitrary text-described categories at inference time. We first build a baseline using pre-trained CLIP models and then propose MaskCLIP, a Transformer-based approach featuring a MaskCLIP Visual Encoder that integrates mask tokens with a pre-trained ViT CLIP model. This design enables segmentation and class prediction while efficiently leveraging CLIP's dense features, without resource-intensive student-teacher training.
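To make the mask-token idea concrete, the following is a minimal PyTorch sketch, not the dissertation's implementation: learnable mask tokens are appended to the frozen CLIP ViT's patch-token sequence, and each mask token's output embedding is scored against CLIP text embeddings of arbitrary category names. The `nn.TransformerEncoder` here is only a stand-in for the pretrained CLIP blocks, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskClipHeadSketch(nn.Module):
    """Appends learnable mask tokens to frozen CLIP patch tokens (sketch)."""
    def __init__(self, dim: int = 768, num_masks: int = 100, depth: int = 2):
        super().__init__()
        # Stand-in for the pretrained ViT CLIP blocks (frozen, as in the text).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.blocks.parameters():
            p.requires_grad_(False)
        self.mask_tokens = nn.Parameter(torch.randn(num_masks, dim) * 0.02)
        self.num_masks = num_masks

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) dense CLIP features of the image patches.
        b = patch_tokens.size(0)
        masks = self.mask_tokens.unsqueeze(0).expand(b, -1, -1)
        out = self.blocks(torch.cat([masks, patch_tokens], dim=1))
        return out[:, : self.num_masks]  # one embedding per predicted mask

def open_vocab_logits(mask_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Cosine similarity against CLIP text embeddings of arbitrary class names,
    # which is what makes the class set open-vocabulary at inference time.
    m = F.normalize(mask_emb, dim=-1)   # (B, M, D)
    t = F.normalize(text_emb, dim=-1)   # (C, D)
    return m @ t.t()                    # (B, M, C) class logits per mask
```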
Chapter 3 introduces DiffusionRig for personalized facial appearance editing. Employing a diffusion model conditioned on rough 3D face models derived from in-the-wild images, it maps simple CGI renderings to realistic images of an individual. DiffusionRig is trained in two stages: it first learns generic facial priors from a large-scale dataset, then is fine-tuned on a small set of person-specific photos. This strategy enables robust editing of facial appearance while preserving identity and high-frequency details.
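As a rough illustration of the conditional objective behind this setup, here is one training step sketched under standard DDPM assumptions; this is not DiffusionRig's actual code, and `denoiser` with its `cond` argument is a hypothetical interface.

```python
import torch
import torch.nn.functional as F

def conditional_diffusion_step(denoiser, x0, render, t, alphas_cumprod):
    """One DDPM-style training step conditioned on a rough 3D face rendering.

    x0:             real photo batch, (B, 3, H, W)
    render:         CGI rendering of the same face (hypothetical conditioning)
    t:              sampled timestep indices, (B,)
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products
    """
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward diffusion
    pred = denoiser(x_t, t, cond=render)                    # condition on the render
    return F.mse_loss(pred, noise)                          # standard noise-prediction loss
```

In the two-stage recipe described above, the same objective would first run over a large-scale face dataset and then, typically with a smaller learning rate, over the handful of photos of the target person.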
Chapter 4 describes Patch-DM, a denoising diffusion model that generates high-resolution images (e.g., 1024×512) while training only on small image patches (e.g., 64×64). It alleviates boundary artifacts in patch-based synthesis through a novel feature collage strategy that crops and combines overlapping features from neighboring patches to seamlessly predict shifted patches.
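A minimal sketch of what such a feature collage might look like for one interior patch, assuming feature maps of its four diagonal neighbors are available; the function and variable names are illustrative, not Patch-DM's API.

```python
import torch

def feature_collage(f_tl, f_tr, f_bl, f_br):
    """Collage quadrants of four neighboring patches' feature maps.

    Each input is (B, C, H, W). The output covers the patch shifted by half
    a patch in both directions, so the decoder denoises across the original
    patch boundaries and seams are trained away rather than blended away.
    """
    h, w = f_tl.shape[-2:]
    top = torch.cat([f_tl[..., h // 2:, w // 2:],          # bottom-right of top-left
                     f_tr[..., h // 2:, :w // 2]], dim=-1)  # bottom-left of top-right
    bottom = torch.cat([f_bl[..., :h // 2, w // 2:],        # top-right of bottom-left
                        f_br[..., :h // 2, :w // 2]], dim=-1)  # top-left of bottom-right
    return torch.cat([top, bottom], dim=-2)                 # (B, C, H, W) shifted view
```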
Chapter 5 proposes Gen2Res, a method for adapting pretrained denoising diffusion models to image restoration tasks. The approach restores an image by adding noise to the degraded input and then denoising it with the pretrained model. Fine-tuning the model on selected anchor images that preserve the input's characteristics constrains its generative space, yielding high-quality restorations that remain faithful to the original identity and overall appearance.
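Sketched with a generic DDPM-style sampler, the noise-and-denoise restoration loop could look like the following; the `reverse_step` callable is a placeholder for any reverse-diffusion sampler step, not the dissertation's interface.

```python
import torch

@torch.no_grad()
def restore(model, degraded, t_start, alphas_cumprod, reverse_step):
    """Restore an image by partial noising plus pretrained denoising.

    degraded:      low-quality input image, (B, 3, H, W)
    t_start:       intermediate noise level; larger values allow stronger edits
    reverse_step:  one step of a reverse-diffusion sampler (placeholder)
    """
    a_bar = alphas_cumprod[t_start]
    # Push the degraded input part-way up the noise schedule...
    x = a_bar.sqrt() * degraded + (1.0 - a_bar).sqrt() * torch.randn_like(degraded)
    # ...then let the (anchor-finetuned) generative prior denoise it back to t=0.
    for t in range(t_start, 0, -1):
        x = reverse_step(model, x, t)
    return x
```

Because fine-tuning on anchor images narrows what the prior can produce, the loop lands on a restoration close to the input rather than an arbitrary clean image.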