Finding local feature correspondences is one of the fundamental tasks in computer vision. It is an indispensable component of many applications in both the 2D and 3D regimes, including structure from motion, image registration, image retrieval, 3D scene registration, and SLAM. In this dissertation, we study approaches to improving the overall performance of establishing local correspondences in various contexts. These contexts can be categorized along multiple dimensions, for example, sparse vs. dense features and 2D vs. 3D features. Specifically, the contexts we address in this thesis are sparse 2D local features, dense 2D local features, and sparse 3D local features.
There has been abundant research on these topics, and learning-based approaches have recently been taking over from handcrafted ones. While these learned methods showcase exceptional matching performance and additional robustness against both geometric and photometric distortions, we notice that some merits of the traditional handcrafted methods are partially compromised or not inherited at all. We investigate the impact of incorporating such merits into each context. For sparse 2D local features, we first propose an explicit scheme that trains a network to estimate an orientation for each feature. Encouraged by the initial results, we further propose a novel pipeline that accurately estimates the relative affine transformation between feature pairs during the matching process and corrects the geometric distortion of the corresponding pairs. For dense 2D local features, we propose a pyramid-structured network and a novel concept called the motion cue to compute optical flow for video data. For sparse 3D local features, we propose an occlusion-aware voxelization method with a multi-resolution convolutional network to deal with the occlusion challenge.
In the last two chapters, we include two works on related applications: view synthesis and stereo panoramic imaging. Our view synthesis algorithm addresses the challenging occlusion filling problem with a hierarchical clustering approach. Stereo panoramas are a popular form of content in virtual reality. We present a systematic solution that captures the surrounding environment with a camera array and provides users with an immersive viewing experience, with the freedom to select the preferred perspective and baseline.