Dr. David Gaddy, from the University of California, Berkeley, has made significant contributions to EMG-to-speech synthesis, and we use his work as a benchmark. The key distinction between his dataset and ours lies in the nature of the recordings: his is a single-speaker dataset spanning approximately 11 hours, whereas ours comprises multiple speakers with less data per speaker.

To fairly compare the results achievable with his English dataset against ours in a single-speaker setting, we equalized the data volumes. We reduced his dataset to match the amount of data available for Speaker 001 of our dataset, the speaker with the most data. We then trained one model per dataset with the same architecture and compared the quality of the results.

| | Original dataset | | Reference dataset (Speaker 001) | | David Gaddy’s reduced dataset | |
| --- | --- | --- | --- | --- | --- | --- |
| | Audio duration | EMG duration | Audio duration | EMG duration | Audio duration | EMG duration |
| Trainset | 14:29:06 | 17:41:04 | 2:10:36 | 3:33:22 | 2:10:36 | 3:33:21 |
| Devset | — | 0:04:32 | — | 0:15:36 | — | 0:15:36 |
| Testset | — | 0:11:18 | — | 0:03:48 | — | 0:11:18 |
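The duration-matched reduction described above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes utterances are available as `(id, duration_seconds)` pairs and greedily subsamples them until the cumulative duration reaches a target (e.g. Speaker 001's trainset duration). All names here are illustrative.

```python
import random

def reduce_to_target(utterances, target_seconds, seed=0):
    """Randomly subsample utterances so their total duration
    approximately matches target_seconds (never exceeding it).

    utterances: list of (utterance_id, duration_seconds) pairs.
    Returns (selected_ids, total_duration_seconds).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = utterances[:]
    rng.shuffle(shuffled)

    selected, total = [], 0.0
    for utt_id, duration in shuffled:
        if total + duration > target_seconds:
            continue  # skip utterances that would overshoot the target
        selected.append(utt_id)
        total += duration
    return selected, total
```

A fixed random seed keeps the reduced split reproducible, so the same subset is used across training runs.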

Trained with David Gaddy’s whole dataset:

- example_output_0 (1).wav
- example_output_1.wav
- example_output_2.wav
- example_output_3.wav

Trained with David Gaddy’s reduced dataset:

- example_output_0.wav
- example_output_1.wav
- example_output_2.wav
- example_output_3.wav

Trained with Speaker 001 of the ReSSInt dataset:

- example_output_23.wav
- example_output_24.wav
- example_output_32.wav
- example_output_38.wav