Summary

This page presents the results of developing a multimodal silent speech interface that uses both EMG (electromyographic) signals and video as input. In this experiment, we compare the performance of an EMG-to-speech model with a multimodal model that combines EMG and video.

The EMG-to-Speech model takes eight channels of raw EMG signals sampled at 2048 Hz and feeds them into a convolutional encoder based on ResBlocks, producing 768-dimensional feature vectors every 11.6 ms. These vectors are then processed by a Transformer encoder with self-attention and positional encoding, which outputs vectors of the same dimensionality. Finally, the Transformer output is passed to two parallel linear layers: one generates mel-spectrogram frames with 80 frequency bins (the main output), while the other performs phoneme classification as an auxiliary task over 29 Spanish phonemes plus a silence token. This architecture and the training methodology were described in [1] (Gaddy, D. M.; 2022). The mel-spectrogram is then converted into a waveform using a pretrained, frozen HiFTNet vocoder [2] (Li, Y. A. et al.; 2023), applied offline after training.

[Figure: architecture of the EMG-to-Speech model (path10-2.png)]
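The sketch below illustrates the structure described above: a convolutional front end over raw EMG, a Transformer encoder, and two parallel linear heads. It is a minimal, hypothetical rendering, not the implementation from [1]: the convolutional stand-in, its stride, the number of Transformer layers and heads, and the omission of positional encoding are all assumptions made here for brevity.

```python
# Minimal sketch of the EMG-to-speech architecture described in the text.
# The ResBlock encoder is replaced by a single Conv1d stand-in, and the
# Transformer hyperparameters (layers, heads) are illustrative assumptions.
import torch
import torch.nn as nn

class EMGToSpeech(nn.Module):
    def __init__(self, d_model=768, n_mels=80, n_phones=30,  # 29 phonemes + silence
                 n_layers=6, n_heads=8):
        super().__init__()
        # Stand-in for the ResBlock-based convolutional encoder mapping
        # 8-channel raw EMG (2048 Hz) to roughly one 768-dim vector per 11.6 ms.
        # The kernel size and stride here are hypothetical.
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(8, d_model, kernel_size=24, stride=24),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Two parallel linear heads: mel-spectrogram frames (main output)
        # and phoneme posteriors (auxiliary task).
        self.mel_head = nn.Linear(d_model, n_mels)
        self.phone_head = nn.Linear(d_model, n_phones)

    def forward(self, emg):            # emg: (batch, 8, samples)
        x = self.conv_encoder(emg)     # (batch, 768, frames)
        x = x.transpose(1, 2)          # (batch, frames, 768)
        x = self.transformer(x)        # positional encoding omitted in this sketch
        return self.mel_head(x), self.phone_head(x)
```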

The multimodal silent speech interface is a two-branch model, where each branch encodes one of the input modalities. The EMG branch uses the same convolutional encoder as the unimodal model, followed by a Branchformer encoder [3] (Peng, Y. et al.; 2022). The video branch uses a convolutional encoder and a Branchformer encoder as described in [4] (Gimeno-Gómez & Martínez-Hinarejos, 2025). To handle the greater complexity of the visual features, both modules in the video branch are initialized with pretrained weights, which are then allowed to update during end-to-end training of the full multimodal model. The output of the video Branchformer (256 dimensions) is projected through a linear layer to match the EMG Branchformer output size (768 dimensions). Both representations are then summed and passed to the same two parallel linear layers described above. Since the expected audio frame rate is 86.133 Hz (one frame every ≈11.6 ms) and the video runs at 30 fps, video frames are aligned with the EMG features by repeating each frame as needed before being fed into the model. As in the unimodal model, waveform synthesis is performed with the pretrained, frozen HiFTNet vocoder.

[Figure: architecture of the multimodal EMG + video model (g13.png)]
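The following sketch shows only the fusion step of the multimodal model: 30 fps video features (256 dimensions) are repeated to the EMG feature rate, projected to 768 dimensions, and summed with the EMG features before the two output heads. The Branchformer encoders are not shown, and the nearest-frame repetition scheme and all names are assumptions for illustration, not the exact alignment procedure used in the experiment.

```python
# Hypothetical sketch of the multimodal fusion described above.
import torch
import torch.nn as nn

def align_video_to_emg(video_feats: torch.Tensor, n_emg_frames: int) -> torch.Tensor:
    """Repeat 30 fps video features so they match the number of EMG frames.

    video_feats: (batch, n_video_frames, dim) -> (batch, n_emg_frames, dim)
    """
    n_video_frames = video_feats.size(1)
    # Map each EMG frame index to the video frame covering the same time span.
    idx = torch.linspace(0, n_video_frames - 1, n_emg_frames).long()
    return video_feats[:, idx, :]

class Fusion(nn.Module):
    def __init__(self, emg_dim=768, video_dim=256, n_mels=80, n_phones=30):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, emg_dim)  # 256 -> 768
        self.mel_head = nn.Linear(emg_dim, n_mels)
        self.phone_head = nn.Linear(emg_dim, n_phones)

    def forward(self, emg_feats, video_feats):
        # emg_feats:   (batch, n_emg_frames, 768)  from the EMG Branchformer
        # video_feats: (batch, n_video_frames, 256) from the video Branchformer
        video_feats = align_video_to_emg(video_feats, emg_feats.size(1))
        fused = emg_feats + self.video_proj(video_feats)  # element-wise sum
        return self.mel_head(fused), self.phone_head(fused)
```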

For this experiment, the model was trained on data from a single speaker. The test set is text-independent (no test utterance shares textual content with the training set) but session-dependent (some training utterances were recorded in the same sessions as the test samples).


Evaluation

The following metrics have been used for evaluation:

| Model | CER (%) | WER (%) | L1 distance | Phone Accuracy (%) | SSIM |
|---|---|---|---|---|---|
| Only EMG | 39.95 | 72.08 | 3.199 $\pm$ 0.276 | 68.37 $\pm$ 5.84 | 0.618 $\pm$ 0.032 |
| EMG + Video | 17.72 | 41.67 | 2.915 $\pm$ 0.248 | 81.18 $\pm$ 6.90 | 0.656 $\pm$ 0.026 |
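As a rough illustration of how such metrics could be computed, the sketch below uses jiwer for CER/WER, plain tensor arithmetic for the L1 distance, torchmetrics for SSIM, and frame-level argmax agreement for phone accuracy. These tool choices, the assumption that CER/WER are measured on ASR transcripts of the synthesized audio, and all function names are assumptions made here; they are not stated in the text above.

```python
# Hedged sketch of metric computation; libraries and conventions are assumed.
import torch
import jiwer
from torchmetrics.image import StructuralSimilarityIndexMeasure

def text_metrics(reference: str, hypothesis: str) -> dict:
    # CER/WER between a reference transcript and a (e.g. ASR-produced) hypothesis.
    return {
        "cer": jiwer.cer(reference, hypothesis) * 100,  # %
        "wer": jiwer.wer(reference, hypothesis) * 100,  # %
    }

def spectrogram_metrics(pred_mel: torch.Tensor, ref_mel: torch.Tensor) -> dict:
    """pred_mel, ref_mel: (frames, 80) mel-spectrograms of one utterance."""
    l1 = (pred_mel - ref_mel).abs().mean().item()
    ssim = StructuralSimilarityIndexMeasure(
        data_range=float(ref_mel.max() - ref_mel.min()))
    # SSIM expects image-like tensors: (batch, channels, height, width).
    ssim_val = ssim(pred_mel.T[None, None], ref_mel.T[None, None]).item()
    return {"l1": l1, "ssim": ssim_val}

def phone_accuracy(phone_logits: torch.Tensor, phone_targets: torch.Tensor) -> float:
    # Frame-level accuracy of the auxiliary phoneme classifier (assumed definition).
    # phone_logits: (frames, n_phones), phone_targets: (frames,)
    pred = phone_logits.argmax(dim=-1)
    return (pred == phone_targets).float().mean().item() * 100
```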