Dr. David Gaddy, from the University of California, Berkeley, has made significant contributions to EMG-to-speech synthesis, and we use his work as a benchmark. The key distinction between his dataset and ours lies in the nature of the recordings: he worked with a single-speaker dataset spanning approximately 11 hours, whereas ours comprises multiple speakers with less data per speaker.
To fairly compare the results achievable on his English dataset against ours in a single-speaker setting, we first equalized the data volumes: we reduced his dataset to match the amount of data available for Speaker 001, the speaker with the most data in our dataset. We then trained a model on each dataset with the same architecture and compared the quality of the results.
| | Original dataset | | Reference dataset (Speaker 001) | | David Gaddy’s reduced dataset | |
| --- | --- | --- | --- | --- | --- | --- |
| | Audio duration | EMG duration | Audio duration | EMG duration | Audio duration | EMG duration |
| Trainset | 14:29:06 | 17:41:04 | 2:10:36 | 3:33:22 | 2:10:36 | 3:33:21 |
| Devset | — | 0:04:32 | — | 0:15:36 | — | 0:15:36 |
| Testset | — | 0:11:18 | — | 0:03:48 | — | 0:11:18 |
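The dataset reduction described above can be sketched as a simple duration-matching selection. This is a minimal illustration, not the exact procedure used: the function name `reduce_to_target` and the utterance list format `(id, duration_seconds)` are assumptions for the example.

```python
import random

def reduce_to_target(utterances, target_seconds, seed=0):
    """Randomly select utterances until their cumulative duration
    reaches the target duration (hypothetical sketch of the
    data-volume matching step)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    pool = list(utterances)
    rng.shuffle(pool)
    selected, total = [], 0.0
    for utt_id, duration in pool:
        if total >= target_seconds:    # stop once the target is met
            break
        selected.append(utt_id)
        total += duration
    return selected, total

# Example: shrink a large train set to ~2:10:36 (7836 s), the
# audio duration of Speaker 001's train set in the table above.
# The utterance list here is synthetic (6 s clips).
utts = [(f"utt_{i:04d}", 6.0) for i in range(8700)]
subset, total = reduce_to_target(utts, target_seconds=7836)
```

Selecting whole utterances means the reduced set can only approximate the target to within one utterance's duration, which is why the reduced EMG total (3:33:21) differs from the reference (3:33:22) by a second in the table.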
Trained with David Gaddy’s whole dataset
Trained with David Gaddy’s reduced dataset
Trained with Speaker 001 of ReSSInt dataset