Smart IoT devices, smartphones, and wearables are penetrating every aspect of our daily lives. These devices are equipped with various sensing modalities, including video, audio, inertial sensors, and lidar, that enable a wide range of sensing applications. Research has shown that, rather than operating each sensor in isolation, combining information from multiple sensing streams boosts performance. This approach is known as multimodal sensor fusion, and Human Activity Recognition (HAR) is one of the applications that benefit from using multiple sensors. In recent years, deep learning algorithms have been shown to achieve high accuracy in HAR using multimodal sensor data. However, to design a reliable HAR system, the following challenges still need to be addressed. The first challenge is the heterogeneity of the sensing devices: the set of devices monitoring a person may vary over time, and the devices may differ in sampling frequency. The second challenge is that deep neural networks (DNNs) are considered black boxes, because studying their structure often provides little to no insight into their underlying mechanics. It is hard to look ``into'' the network and ascertain why the model selects specific features over others during training, which makes DNN predictions untrustworthy to end-users. This lack of trust prevents the adoption of DNN models in health-related and other high-stakes applications, where sensitive decisions mandate a sufficient accompanying explanation. Therefore, this dissertation proposes methods that generate accurate predictions robust to device heterogeneity by making opportunistic use of information from the available devices, and that provide end-users with human-understandable explanations accompanying each prediction.
First, we propose a solution that addresses the challenges of sensor-device heterogeneity for activity recognition in our work \emph{SenseHAR}. We design a scalable deep learning-based solution in which each device learns its own sensor fusion model that maps raw sensor values to a shared low-dimensional latent space, which we call the \emph{SenseHAR} virtual activity sensor. The virtual sensor has the same format and behavior regardless of the subset of devices, sensor availability, sampling rate, or device location. \emph{SenseHAR} helps machine learning engineers develop their application-specific models (e.g., from gesture recognition to activities of daily life) in a hardware-agnostic manner on top of this virtual activity sensor.
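To make the idea concrete, the following is a minimal PyTorch sketch of per-device encoders that project heterogeneous sensor streams into a shared latent space on which a hardware-agnostic classifier operates. The layer choices, latent dimensionality, and the simple averaging fusion are illustrative assumptions, not the actual \emph{SenseHAR} architecture.
\begin{verbatim}
# Illustrative sketch only: per-device fusion models emitting a shared
# "virtual sensor" representation. Sizes and the GRU encoder are assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 16  # dimensionality of the shared virtual-sensor space (assumed)

class DeviceEncoder(nn.Module):
    """Maps one device's raw sensor window to the shared latent space."""
    def __init__(self, num_channels: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.GRU(num_channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, LATENT_DIM)

    def forward(self, x):          # x: (batch, time, channels)
        _, h = self.rnn(x)         # h: (layers, batch, hidden)
        return self.proj(h[-1])    # (batch, LATENT_DIM)

class ActivityHead(nn.Module):
    """Application-specific classifier built on the virtual sensor only."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(LATENT_DIM, num_classes)

    def forward(self, z):
        return self.fc(z)

# Each device gets its own encoder (different channel counts, window lengths,
# sampling rates), but all emit the same LATENT_DIM-dimensional virtual sensor.
phone_enc = DeviceEncoder(num_channels=6)   # e.g., accelerometer + gyroscope
watch_enc = DeviceEncoder(num_channels=3)   # e.g., accelerometer only
head = ActivityHead(num_classes=5)

phone_window = torch.randn(8, 100, 6)       # 8 windows, 100 samples, 6 channels
watch_window = torch.randn(8, 50, 3)        # different length and channel count

z = (phone_enc(phone_window) + watch_enc(watch_window)) / 2  # fuse available devices
logits = head(z)                            # hardware-agnostic prediction
\end{verbatim}
Because the application model consumes only the fixed-format latent vector, it does not need to change when devices appear, disappear, or differ in sampling rate.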
Next, we address the issue of explainability of deep learning models for activity recognition. We first identify the most preferred post-hoc explanation technique for classification tasks across different modalities from an end-user perspective. To this end, we conducted a large-scale Amazon Mechanical Turk study comparing popular state-of-the-art explanation methods to determine empirically which ones better explain model decisions. Our results show that explanation by examples was the most preferred type of explanation. We also offer an open-source library, \emph{ExMatchina}, providing a readily available and widely applicable implementation of explanation by examples. Then, we focus on interpretable DNN models, especially models that provide concept-based explanations. We propose \emph{CoDEx}, an automatic Concept Discovery and Extraction module that identifies a rich set of complex concepts from natural language explanations of videos, obviating the need to predefine the amorphous set of concepts. Finally, we introduce \emph{XCHAR}, an Explainable Complex Human Activity Recognition model that accurately predicts complex activities and provides explanations in the form of human-understandable temporal concepts.
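The sketch below illustrates the general explanation-by-examples idea: for a given test input, retrieve the training samples whose learned feature representations lie closest to it and present them as the explanation. The feature dimensions and the use of scikit-learn's \texttt{NearestNeighbors} are assumptions for illustration; this is not the \emph{ExMatchina} API.
\begin{verbatim}
# Minimal sketch of explanation by examples (not the ExMatchina implementation):
# return the training samples nearest to a test input in feature space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def explain_by_examples(train_features, test_feature, k=3):
    """Indices of the k training examples nearest to the test input
    in the model's (e.g., penultimate-layer) feature space."""
    index = NearestNeighbors(n_neighbors=k).fit(train_features)
    _, idx = index.kneighbors(test_feature.reshape(1, -1))
    return idx[0]

# Hypothetical usage with random stand-ins for extracted features.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 64))  # features of 1000 training windows
test_feature = rng.normal(size=64)            # feature of one test window
print(explain_by_examples(train_features, test_feature))  # 3 nearest training indices
\end{verbatim}
The retrieved training examples are then shown to the end-user alongside the prediction, which is the form of explanation our study found most preferred.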