Reconstructing scenes or objects from observed images has long been a central problem in the graphics and vision community. Traditional methods solve this inverse problem by capturing a large number of images to resolve the ambiguity between geometry and appearance. However, capturing and storing such dense data demands extensive compute and storage resources, which is infeasible for consumer-grade hardware.
This dissertation presents several algorithms that reconstruct 3D geometry and appearance from a handful of input views, enabling efficient data capture, compact storage, and generalization to unseen scenes.
Starting with scene reconstruction, we first target data captured by 360° cameras. We introduce multi depth panoramas, a compact representation that enables translational and rotational movement within the 3D scene. We leverage multi-view stereo (MVS) techniques and deep neural networks to promote 16 input views into a layered panoramic representation that can efficiently render convincing visual results with a small storage footprint.
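As a rough illustration of how a layered panoramic representation can be rendered, the sketch below composites RGBA panorama layers back to front with the standard over operator; the layer count, resolution, and compositing rule here are assumptions for illustration, not the exact formulation developed in the chapter.

```python
import numpy as np

def composite_layers(layers):
    """Back-to-front over-compositing of RGBA panorama layers.

    layers: (L, H, W, 4) array ordered from nearest to farthest depth,
            with RGB and alpha values in [0, 1].
    Returns an (H, W, 3) rendered panorama.
    """
    out = np.zeros(layers.shape[1:3] + (3,), dtype=np.float32)
    # Iterate from the farthest layer to the nearest one.
    for layer in layers[::-1]:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out

# Toy example: 16 layers of a 64x128 panorama with random content.
layers = np.random.rand(16, 64, 128, 4).astype(np.float32)
panorama = composite_layers(layers)
print(panorama.shape)  # (64, 128, 3)
```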
Furthermore, we explore a harder problem by reducing the input to only two views and capturing scenes with dynamic components. We present the deep 3D mask volume, a novel representation that ensures temporally stable renderings for view extrapolation. Our network aggregates information across video frames to infer the disocclusions caused by moving objects. It then produces a 3D mask volume that replaces the disoccluded regions with temporally stable background content, yielding flicker-free visual results.
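A minimal sketch of this blending idea follows, under the assumption that both the per-frame scene and the stable background are represented as RGBA volumes of the same shape (e.g., multiplane-image-style layers); the simple linear blending rule is illustrative, not the exact network output.

```python
import numpy as np

def blend_with_mask(frame_vol, background_vol, mask_vol):
    """Blend a per-frame RGBA volume with a temporally stable background.

    frame_vol, background_vol: (D, H, W, 4) RGBA volumes.
    mask_vol: (D, H, W, 1) values in [0, 1]; 1 keeps the (possibly
              flickering) per-frame content, 0 falls back to the
              stable background content.
    """
    return mask_vol * frame_vol + (1.0 - mask_vol) * background_vol

D, H, W = 32, 64, 64
frame = np.random.rand(D, H, W, 4).astype(np.float32)
background = np.random.rand(D, H, W, 4).astype(np.float32)
mask = np.random.rand(D, H, W, 1).astype(np.float32)
stable = blend_with_mask(frame, background, mask)
print(stable.shape)  # (32, 64, 64, 4)
```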
Next, we focus on human portraits and seek to change the viewpoint and the lighting at the same time. We develop the neural light-transport field (NeLF), a representation trained on synthetic human portraits to generate novel views under novel lighting from only five input images.
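To make the interface concrete, here is a hedged sketch of what a light-transport field might look like as a network: it maps a 3D sample position, a viewing direction, and a lighting code to RGB radiance. The layer sizes and the 16-dimensional lighting code standing in for an environment-map embedding are assumptions for this sketch, not the actual NeLF architecture.

```python
import torch
import torch.nn as nn

class LightTransportField(nn.Module):
    """Toy field: (position, direction, lighting code) -> RGB radiance."""

    def __init__(self, light_dim=16, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + light_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, xyz, direction, light_code):
        return self.mlp(torch.cat([xyz, direction, light_code], dim=-1))

field = LightTransportField()
xyz = torch.rand(1024, 3)     # sample positions along camera rays
view = torch.rand(1024, 3)    # viewing directions
light = torch.rand(1024, 16)  # target lighting condition
rgb = field(xyz, view, light)
print(rgb.shape)  # torch.Size([1024, 3])
```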
Finally, we investigate the 3D reconstruction problem where only a single image is given. To this end, we present VisionNeRF, an algorithm that combines the expressiveness and capacity of vision transformers with the high-fidelity rendering of volumetric representations to synthesize unseen views of a given object.
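A hedged sketch of the feature-conditioned volumetric query that single-image methods of this kind typically use: each 3D sample is projected into the input image, a feature is sampled at that pixel, and an MLP predicts density and color. The random feature map standing in for transformer-encoded image features and the layer sizes are placeholders, not the actual VisionNeRF architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedRadianceField(nn.Module):
    """Toy query: (3D point, image feature at its projection) -> (sigma, rgb)."""

    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # density + RGB
        )

    def forward(self, xyz, uv, feat_map):
        # Sample per-pixel features at normalized image coordinates
        # uv in [-1, 1]; feat_map: (1, C, H, W), uv: (N, 2).
        grid = uv.view(1, 1, -1, 2)
        feats = F.grid_sample(feat_map, grid, align_corners=True)
        feats = feats.view(feat_map.shape[1], -1).t()  # (N, C)
        out = self.mlp(torch.cat([xyz, feats], dim=-1))
        sigma = F.relu(out[..., :1])        # non-negative density
        rgb = torch.sigmoid(out[..., 1:])   # color in [0, 1]
        return sigma, rgb

# A random (1, 64, 32, 32) map stands in for transformer image features.
feat_map = torch.rand(1, 64, 32, 32)
xyz = torch.rand(2048, 3)             # volume samples along rays
uv = torch.rand(2048, 2) * 2.0 - 1.0  # their projections into the image
sigma, rgb = ConditionedRadianceField()(xyz, uv, feat_map)
print(sigma.shape, rgb.shape)  # (2048, 1) (2048, 3)
```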