This webpage presents the results of a study on training our EMG-to-Speech model with data from different speakers. Our dataset includes six speakers: three with larger amounts of data (speakers 001, 002, and 005) and three with limited data (speakers 003, 004, and 006). Speaker 001 has the most data.
We trained a mono-speaker model for each of the three speakers with more data, a multi-speaker model using all six speakers, and a multi-speaker model that receives a speaker embedding vector alongside the EMG signal, to assess whether the embedding improves performance. These models were evaluated on speakers 001, 002, and 005.
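One simple way to supply a speaker embedding alongside the EMG signal is to append the speaker's vector to every EMG feature frame before it enters the model. The sketch below illustrates that idea only; it is an assumption and not necessarily how our model combines the two inputs. The function name `condition_on_speaker`, the embedding values, and the feature sizes are all hypothetical.

```python
def condition_on_speaker(emg_frames, speaker_id, embedding_table):
    """Append the speaker's embedding vector to every EMG feature frame.

    emg_frames: list of frames, each a list of floats.
    speaker_id: key into embedding_table (e.g. "002").
    embedding_table: dict mapping speaker id -> embedding vector (list of floats).
    """
    speaker_vec = embedding_table[speaker_id]
    # Each output frame is the original EMG features followed by the embedding.
    return [list(frame) + list(speaker_vec) for frame in emg_frames]


# Hypothetical values: 2-dimensional embeddings and 4 EMG features per frame.
embedding_table = {
    "001": [0.1, -0.3],
    "002": [0.7, 0.2],
    "005": [-0.5, 0.9],
}
emg_frames = [
    [0.01, 0.02, 0.03, 0.04],
    [0.05, 0.06, 0.07, 0.08],
    [0.09, 0.10, 0.11, 0.12],
]

conditioned = condition_on_speaker(emg_frames, "002", embedding_table)
print(len(conditioned), len(conditioned[0]))  # 3 6
```

With this kind of conditioning, the same network weights are shared across speakers and only the appended vector changes from one speaker to another.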
The results show that for speaker 001, the speaker-dependent model performs best, likely due to the larger amount of available data. For speakers 002 and 005, the multi-speaker model generates more accurate mel spectrogram predictions, but the speaker-dependent models better preserve the speaker’s vocal identity. Because the speech synthesized for the speakers with limited data was of poor quality, objective evaluation metrics were not applied to those speakers. However, you can still listen to their results on this webpage.
For the examples on this page, we used EMG signals from each speaker silently articulating the same five sentences. The resulting audio files were generated by processing these EMG signals through each of the trained models. Speaker-dependent models were evaluated only on their corresponding speakers. The transcriptions of the five examples are provided at the bottom of the page.
The table below presents the amount of data from each speaker included in the training and validation sets. In the column that specifies the number of utterances used for each speaker, V refers to utterances pronounced audibly and S refers to those pronounced silently.
Speaker | Set | Audio duration | EMG duration | Utterances |
---|---|---|---|---|
001 | Train | 2:21:19 | 4:00:55 | 2151 V + 1093 S |
001 | Val | 0:00:51 | | 15 S |
002 | Train | 2:35:06 | 2:57:38 | 1735 V + 255 S |
002 | Val | 0:01:16 | | 15 S |
003 | Train | 0:47:42 | 0:52:29 | 875 V + 85 S |
003 | Val | 0:00:16 | | 5 S |
004 | Train | 0:54:08 | 0:59:15 | 875 V + 85 S |
004 | Val | 0:00:17 | | 5 S |
005 | Train | 1:54:32 | 2:11:26 | 1735 V + 255 S |
005 | Val | 0:00:58 | | 15 S |
006 | Train | 1:02:11 | 1:09:10 | 874 V + 84 S |
006 | Val | 0:00:24 | | 5 S |
Total | Train | 9:34:58 | 12:10:53 | 8245 V + 1857 S |
Total | Val | 0:04:02 | | 60 S |
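As a quick consistency check, the per-speaker durations in the table can be summed and compared with the Total rows. The following is a minimal Python sketch of that arithmetic; the helper names `to_seconds` and `to_hms` are illustrative and not part of the study's code.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an 'H:MM:SS' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Training set: speaker -> (audio duration, EMG duration), copied from the table.
train = {
    "001": ("2:21:19", "4:00:55"),
    "002": ("2:35:06", "2:57:38"),
    "003": ("0:47:42", "0:52:29"),
    "004": ("0:54:08", "0:59:15"),
    "005": ("1:54:32", "2:11:26"),
    "006": ("1:02:11", "1:09:10"),
}
# Validation set: speaker -> duration, copied from the table.
val = {
    "001": "0:00:51",
    "002": "0:01:16",
    "003": "0:00:16",
    "004": "0:00:17",
    "005": "0:00:58",
    "006": "0:00:24",
}

audio_total = to_hms(sum(to_seconds(audio) for audio, _ in train.values()))
emg_total = to_hms(sum(to_seconds(emg) for _, emg in train.values()))
val_total = to_hms(sum(to_seconds(d) for d in val.values()))

print(audio_total, emg_total, val_total)  # 9:34:58 12:10:53 0:04:02
```

The computed sums match the Total rows of the table.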
Mono-speaker model
Example 1
Example 2
Example 3
Example 4
Example 5
Multi-speaker model
Example 1
Example 2
Example 3
Example 4
Example 5
Multi-speaker with speaker embeddings
Example 1
Example 2
Example 3
Example 4
Example 5