Real-world scenes are typically complex and dynamic. Intelligent systems must consistently analyze their surroundings using sensory data to develop situational awareness. Scene analysis based on static images and videos has achieved great success in tasks such as object detection and classification. Wireless sensing research has also sought to extract more knowledge about the scene, for example by recognizing human activities. However, a substantial amount of information about objects in the scene remains difficult for standard cameras or existing RF sensing systems to perceive. Moreover, although multimodal sensors have been applied to scene analysis, such systems often focus on improving performance on a single sensing task rather than combining the sensors' complementary capabilities to fully leverage their strengths. Furthermore, although various deep learning architectures have been proposed for multimodal sensor data processing, they often suffer from robustness issues, such as degraded performance in adverse sensing conditions. To address these challenges, we investigate the sensing capabilities of radio-frequency (RF) sensors and their fusion with other sensing modalities, such as camera and depth sensors, for deeper scene analysis. Here, "deep" refers to both the richness and the robustness of inferences about the scene. Our key insights can be summarized in three parts. First, RF sensors reveal physical phenomena about objects that are otherwise concealed, such as vibrations.
The dissertation introduces UWHear, an ultra-wideband (UWB) radar-based wireless vibrometry sensing system that accurately recovers and separates vibrations from multiple sources and is resilient to non-target noise. Second, fusing wireless vibrometry with video scene analysis yields a more comprehensive understanding of the scene. We propose Capricorn, a real-time RF-vision sensor fusion system that efficiently builds a cross-modal correspondence between visual pixels and RF time series. Capricorn leverages the complementary sensing capabilities of the RF and vision sensors. It not only captures the extrinsic properties of objects in a scene, such as their shape, type, and location, but also reveals the objects' latent properties, including vibrations and vital signs, providing insights into the objects' internal physical or biological states. Third, a scene analysis system can be made resilient to adverse sensing conditions using a carefully designed deep-learning-based multimodal sensor fusion architecture. We created a multi-node multimodal sensor network system, collected a 9-hour dataset (GDTM) for multimodal object localization, and designed neural network architectures (FlexLoc) for robust multimodal sensor information fusion. Specifically, in FlexLoc, we investigated how to use conditional neural networks to make the deep learning architecture robust to sensor perspective shifts. Through this dissertation, we contribute to deeper environment perception by combining signal-processing algorithms, deep-learning strategies, and system design.