Towards Efficient Deep Learning for Human-Centric Visual Understanding and Generation
- Ma, Haoyu
- Advisor(s): Xie, Xiaohui
Abstract
Human-centric visual understanding and generation are pivotal in many real-world scenarios, such as augmented/virtual reality, human-computer interaction, and the movie industry. Over the past several years, deep learning has become the dominant approach for many human-centric visual tasks, such as pose estimation, avatar reconstruction, and character animation. Despite this progress, these tasks remain challenging under occlusion and motion blur. Moreover, most current models are computationally expensive, which hinders real-world deployment. In this dissertation, we propose approaches that address these challenges to achieve better accuracy and higher efficiency. In the first part, we investigate the task of pose estimation (a.k.a. keypoint detection), which serves as the foundation of many human-related applications. In multi-view settings, we propose a novel transformer-based network, named TransFusion, which effectively and efficiently fuses global, long-range visual cues across different views and incorporates 3D geometry constraints through a newly proposed epipolar field. In addition, we propose the token-pruned pose transformer, named PPT, which reduces computation in both monocular and multi-view pose estimation by pruning less important tokens using cues from a human skeleton prior, without requiring foreground mask annotations.
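To give a flavor of the token-pruning idea, below is a minimal sketch in PyTorch. The function name `prune_tokens`, the `keep_ratio` parameter, and the use of precomputed per-token importance scores (e.g., pooled keypoint-heatmap responses standing in for the skeleton prior) are illustrative assumptions, not the interface of PPT itself.

```python
# Minimal sketch of score-based token pruning in a vision transformer,
# in the spirit of PPT. The scoring source and keep ratio are assumptions.
import torch


def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    """Keep only the top-scoring tokens and drop the rest.

    tokens: (B, N, D) patch embeddings from a transformer layer.
    scores: (B, N) per-token importance; here assumed to come from
            keypoint-heatmap responses acting as a skeleton prior.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    topk = torch.topk(scores, k, dim=1).indices      # (B, k) indices of kept tokens
    idx = topk.unsqueeze(-1).expand(-1, -1, D)       # (B, k, D) gather index
    return torch.gather(tokens, 1, idx)              # (B, k, D) pruned token set


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)   # e.g., 14x14 patches, 256-d features
    scores = torch.rand(2, 196)         # stand-in importance scores
    kept = prune_tokens(tokens, scores, keep_ratio=0.5)
    print(kept.shape)                   # torch.Size([2, 98, 256])
```

Because subsequent attention layers operate only on the kept tokens, the cost of self-attention drops roughly quadratically with the keep ratio, which is the source of the efficiency gain.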
In the second part, we investigate human-centric visual editing and generation under diverse conditions. We propose a novel framework, named CVTHead, for reconstructing head avatars from a single image, allowing human heads to be rendered with various expressions, head shapes, and camera views under the control of an explicit head mesh model. CVTHead employs transformers for robust learning of appearance features and enables efficient generation through point-based rendering. We further propose a model that enables appearance editing in video from text instructions. This model, built on novel non-autoregressive transformers, achieves performance comparable to previous state-of-the-art methods while running significantly faster.
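As a rough illustration of point-based rendering, the sketch below projects per-vertex features onto the image plane with a pinhole camera and a nearest-pixel splat. The name `splat_points`, the intrinsics argument, and the absence of z-buffering or soft rasterization are simplifying assumptions; CVTHead's actual renderer is more sophisticated than this toy version.

```python
# Toy point-based rendering: project camera-space points and splat their
# features into a feature image. No depth test; duplicates overwrite.
import torch


def splat_points(verts: torch.Tensor, feats: torch.Tensor,
                 K: torch.Tensor, H: int = 256, W: int = 256) -> torch.Tensor:
    """verts: (N, 3) camera-space mesh vertices; feats: (N, C) per-vertex
    features; K: (3, 3) pinhole intrinsics. Returns a (C, H, W) feature image."""
    proj = (K @ verts.T).T                          # (N, 3) homogeneous pixels
    xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    u = xy[:, 0].round().long().clamp(0, W - 1)      # pixel columns
    v = xy[:, 1].round().long().clamp(0, H - 1)      # pixel rows
    img = torch.zeros(feats.shape[1], H, W)
    img[:, v, u] = feats.T                           # nearest-pixel splat
    return img
```

The resulting feature image would then be fed to a small decoder network to synthesize the final RGB frame; rendering cost scales with the number of points rather than with scene complexity, which is why the point-based pipeline is efficient.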