This thesis concerns the task of turning silently mouthed words into audible speech. Using sensors that measure the electrical activity of muscles (electromyography, or EMG), we can capture articulatory information about speech from the face and neck. From these signals, we aim to train a machine learning model that generates audio in the original speaker's voice corresponding to the words that were silently mouthed. We call this task voicing silent speech.
Voicing silent speech has a wide array of potential real-world applications. For example, it could enable phone or video conversations that are inaudible to other people nearby, and it could be useful in clinical settings for people who cannot speak normally but retain use of most of their speech articulators.
Several prior papers have studied the problem of converting EMG signals to speech. However, these prior EMG-to-speech works have focused on the artificial task of recovering audio from EMG that was recorded during normal vocalized speech. In this work, we instead generate speech from recordings where no sound was actually produced. Models trained only on vocalized speech perform poorly when applied to silent speech because of signal differences between the two speaking modes. Our work is the first to train a model on EMG from silent speech, allowing us to overcome these differences.
Training with EMG from silent speech is more challenging than training with EMG from vocalized speech: vocalized EMG data comes with time-aligned speech targets, whereas silent EMG data has no simultaneous audio at all. Our solution is a target-transfer approach, in which audio output targets are transferred from vocalized recordings to silent recordings of the same utterances. Because the two recordings are not time-aligned, a core component of our work is finding the best way to align the vocalized speech targets with the silent utterances.
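To make the target-transfer idea concrete, the sketch below shows one way such an alignment could be computed with dynamic time warping (DTW): each frame of the silent recording is matched to a frame of the vocalized recording of the same utterance, and the vocalized audio targets are then copied across that alignment. The feature representations, distance metric, and frame-level copying here are simplifying assumptions for illustration, not a description of the exact method developed in this thesis.

```python
import numpy as np


def dtw_alignment(silent_feats, vocalized_feats):
    """Return a monotonic alignment path between two feature sequences.

    silent_feats:    (T_s, D) features derived from the silent EMG recording
    vocalized_feats: (T_v, D) features derived from the vocalized recording
    """
    T_s, T_v = len(silent_feats), len(vocalized_feats)
    # Pairwise Euclidean distances between frames of the two recordings.
    dists = np.linalg.norm(
        silent_feats[:, None, :] - vocalized_feats[None, :, :], axis=-1
    )
    # Accumulated-cost matrix for the standard DTW recurrence.
    cost = np.full((T_s + 1, T_v + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T_s + 1):
        for j in range(1, T_v + 1):
            cost[i, j] = dists[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrace from the end to recover the frame-to-frame alignment path.
    path = []
    i, j = T_s, T_v
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def transfer_targets(silent_feats, vocalized_feats, vocalized_audio_targets):
    """Give each silent-EMG frame an audio target copied from the vocalized utterance."""
    path = dtw_alignment(silent_feats, vocalized_feats)
    targets = np.zeros((len(silent_feats),) + vocalized_audio_targets.shape[1:])
    for i, j in path:
        targets[i] = vocalized_audio_targets[j]
    return targets
```

With targets transferred this way, the silent EMG recording can be trained against audio features frame by frame, even though no audio was produced at recording time.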
To enable development on this task, we collect and release a dataset of nearly twenty hours of EMG recordings of speech, nearly ten times larger than previous publicly available datasets. We then demonstrate a method for training a speech synthesis model on silent EMG and propose a range of further modeling improvements that make the synthesized outputs more intelligible. We validate our methods with both human and automatic metrics, demonstrating large improvements in the intelligibility of the generated outputs.
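As an illustration of how intelligibility can be scored automatically, the sketch below transcribes synthesized outputs with a speech recognizer and computes word error rate (WER) against the reference transcripts. The transcribe argument is a hypothetical placeholder for whatever ASR system is used; jiwer is one common open-source library for computing WER, and this snippet is not a description of the specific evaluation pipeline used in this thesis.

```python
import jiwer  # open-source library providing a word error rate implementation


def intelligibility_wer(reference_texts, synthesized_wavs, transcribe):
    """Average WER of ASR transcriptions of synthesized audio (lower is better).

    transcribe: hypothetical callable mapping a waveform to a text transcription.
    """
    hypotheses = [transcribe(wav) for wav in synthesized_wavs]
    return jiwer.wer(reference_texts, hypotheses)
```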