Towards Better and Privacy-Preserving Speech Modeling for Depression Detection
- Wang, Jinhan
- Advisor(s): Alwan, Abeer
Abstract
Automatic depression detection systems based on speech signals have recently garnered significant attention. Modeling depression from speech, however, faces three challenges. The first is data scarcity. The second is the risk of privacy exposure in depression detection systems. The third is the lack of consideration of non-uniformly distributed depression patterns within speech signals. In this dissertation, we address these challenges to build better, privacy-preserving speech-based depression detection systems.
To address the data scarcity issue, we propose a modified Instance Discriminative Learning (IDL) pre-training method that enables the model to extract augment-invariant and instance-spread-out embeddings from pre-training tasks using out-of-domain unlabeled data. The pre-trained model is then used to initialize the downstream model, DepAudioNet, which is fine-tuned for depression detection. We investigate different augmentation techniques and instance sampling strategies in the pre-training stage. Specifically, we propose a novel sampling strategy, Pseudo-Instance-based Sampling (PIS), to further reveal the correlation between depression characteristics and the underlying acoustic units.

Second, to address the privacy-preservation issue, we propose a novel non-uniform speaker disentanglement (NUSD) adversarial learning framework that disentangles speaker-identity information from depression characteristics. The approach exploits the idiosyncratic behaviors of different layers of detection models by varying the adversarial disentanglement strength across model components. The method shows that depression detection can be performed without over-reliance on speaker-identity features. More importantly, we found that attenuating more speaker information in the Feature Extraction (FE) module yields better performance than assigning the disentanglement weights uniformly.
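The non-uniform adversarial idea can be sketched with a gradient-reversal layer whose strength differs per module. This is a minimal PyTorch illustration, not the dissertation's implementation: the layer sizes, the speaker-class count, and the reversal weights (0.7 at the FE module, 0.1 at the temporal module) are illustrative assumptions chosen only to show a stronger reversal at feature extraction.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; the gradient is negated and scaled in the
    backward pass, so an adversarial speaker classifier pushes the encoder to
    remove speaker-identity information."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Non-uniform disentanglement: heavier reversal weight on the Feature
# Extraction (FE) output than on the later temporal module (values are
# illustrative, not the dissertation's settings).
LAM_FE, LAM_TEMPORAL = 0.7, 0.1

fe = torch.nn.Conv1d(40, 64, kernel_size=3, padding=1)   # FE module (sketch)
temporal = torch.nn.LSTM(64, 32, batch_first=True)       # temporal module (sketch)
spk_head_fe = torch.nn.Linear(64, 10)                    # speaker head on FE output
spk_head_tm = torch.nn.Linear(32, 10)                    # speaker head on LSTM output
dep_head = torch.nn.Linear(32, 1)                        # depression classifier

x = torch.randn(4, 40, 100)                              # (batch, mel bins, frames)
h_fe = fe(x)                                             # (4, 64, 100)
h_tm, _ = temporal(h_fe.transpose(1, 2))                 # (4, 100, 32)

dep_logit = dep_head(h_tm.mean(dim=1))                           # main task
spk_fe = spk_head_fe(grad_reverse(h_fe.mean(dim=2), LAM_FE))     # strong reversal
spk_tm = spk_head_tm(grad_reverse(h_tm.mean(dim=1), LAM_TEMPORAL))  # weak reversal
```

Training would then sum the depression loss with the two speaker-classification losses; because of the reversal layer, minimizing the total loss trains the speaker heads to identify speakers while training the shared encoder to defeat them, more aggressively at the FE stage.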
Third, to address the non-uniformity of depression patterns in speech signals, we propose a novel framework, Speechformer-CTC, which models dynamically varying depression characteristics within speech segments using a Connectionist Temporal Classification (CTC) objective function, without requiring input-output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated into objectives at various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns, and that the proposed framework can potentially be used to determine significant depressive regions in speech utterances.
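The alignment-free property of the CTC objective can be illustrated with PyTorch's built-in `nn.CTCLoss`. This is a sketch of the general idea only: the three-symbol vocabulary (blank / "non-depressed" / "depressed") and the hand-written pseudo-label sequences are stand-ins for a label-generation policy, not the dissertation's Expectation-One-Hot or HuBERT policies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative vocabulary: 0 = CTC blank, 1 = "non-depressed", 2 = "depressed".
ctc = nn.CTCLoss(blank=0)

T, N, C = 50, 2, 3                                    # frames, batch, classes
log_probs = torch.randn(T, N, C).log_softmax(dim=2)   # frame-level model outputs

# Per-utterance pseudo-label sequences (stand-ins for a labeling policy).
# CTC needs only these unaligned sequences; it marginalizes over all
# frame-level alignments, so no frame-by-frame labels are required.
targets = torch.tensor([[1, 2, 1],
                        [2, 2, 1]])
input_lengths = torch.full((N,), T, dtype=torch.long)   # 50 frames per utterance
target_lengths = torch.full((N,), 3, dtype=torch.long)  # 3 labels per utterance

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because the loss sums over every valid alignment of the short label sequence to the long frame sequence, the model is free to place "depressed" symbols wherever the evidence lies, which is what lets such a framework localize depressive regions within an utterance.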
Experiments show that the proposed techniques achieve state-of-the-art performance and are validated for both English and Mandarin Chinese.