Summary

This page presents the results of developing a multimodal silent speech interface that uses both EMG (electromyographic) signals and video as input. In this experiment, we compare the performance of an EMG-to-speech model with a multimodal model that combines EMG and video.

The EMG-to-Speech model takes eight channels of raw EMG signals sampled at 2048 Hz and feeds them into a convolutional encoder based on ResBlocks, producing 768-dimensional feature vectors every 11.6 ms. These vectors are then processed by a Transformer encoder with self-attention and positional encoding, which outputs vectors of the same dimensionality. Finally, the Transformer output is passed to two parallel linear layers: one generates mel-spectrogram frames with 80 frequency bins (the main output), while the other performs phoneme classification as an auxiliary task over 29 Spanish phonemes plus a silence token. This architecture and the training methodology were described in [1] (Gaddy, D. M.; 2022). The mel-spectrogram is then converted into a waveform using a pretrained, frozen HiFTNet vocoder [2] (Li, Y. A. et al.; 2023), applied offline after training.

[Figure: architecture of the EMG-to-Speech model (path10-2.png)]
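The sketch below illustrates the structure described above: a convolutional front end over raw EMG, a Transformer encoder, and two parallel linear heads. It is a minimal, hypothetical rendering, not the implementation from [1]: the convolutional stand-in, its stride, the number of Transformer layers and heads, and the omission of positional encoding are all assumptions made here for brevity.

```python
# Minimal sketch of the EMG-to-speech architecture described in the text.
# The ResBlock encoder is replaced by a single Conv1d stand-in, and the
# Transformer hyperparameters (layers, heads) are illustrative assumptions.
import torch
import torch.nn as nn

class EMGToSpeech(nn.Module):
    def __init__(self, d_model=768, n_mels=80, n_phones=30,  # 29 phonemes + silence
                 n_layers=6, n_heads=8):
        super().__init__()
        # Stand-in for the ResBlock-based convolutional encoder mapping
        # 8-channel raw EMG (2048 Hz) to roughly one 768-dim vector per 11.6 ms.
        # The kernel size and stride here are hypothetical.
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(8, d_model, kernel_size=24, stride=24),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Two parallel linear heads: mel-spectrogram frames (main output)
        # and phoneme posteriors (auxiliary task).
        self.mel_head = nn.Linear(d_model, n_mels)
        self.phone_head = nn.Linear(d_model, n_phones)

    def forward(self, emg):            # emg: (batch, 8, samples)
        x = self.conv_encoder(emg)     # (batch, 768, frames)
        x = x.transpose(1, 2)          # (batch, frames, 768)
        x = self.transformer(x)        # positional encoding omitted in this sketch
        return self.mel_head(x), self.phone_head(x)
```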

The multimodal silent speech interface is a two-branch model, where each branch encodes one of the input modalities. The EMG branch uses the same convolutional encoder as the unimodal model, followed by a Branchformer encoder [3] (Peng, Y. et al.; 2022). The video branch uses a convolutional encoder and a Branchformer encoder as described in [4] (Gimeno-Gómez & Martínez-Hinarejos, 2025). To handle the greater complexity of the visual features, both modules in the video branch are initialized with pretrained weights, which are then allowed to update during end-to-end training of the full multimodal model. The output of the video Branchformer (256 dimensions) is projected through a linear layer to match the EMG Branchformer output size (768 dimensions). Both representations are then summed and passed to the same two parallel linear layers described above. Since the expected audio frame rate is 86.133 Hz (one frame every ≈11.6 ms) and the video runs at 30 fps, video frames are aligned with the EMG features by repeating each frame as needed before being fed into the model. As in the unimodal model, waveform synthesis is performed with the pretrained, frozen HiFTNet vocoder.

[Figure: architecture of the multimodal EMG + video model (g13.png)]
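The following sketch shows only the fusion step of the multimodal model: 30 fps video features (256 dimensions) are repeated to the EMG feature rate, projected to 768 dimensions, and summed with the EMG features before the two output heads. The Branchformer encoders are not shown, and the nearest-frame repetition scheme and all names are assumptions for illustration, not the exact alignment procedure used in the experiment.

```python
# Hypothetical sketch of the multimodal fusion described above.
import torch
import torch.nn as nn

def align_video_to_emg(video_feats: torch.Tensor, n_emg_frames: int) -> torch.Tensor:
    """Repeat 30 fps video features so they match the number of EMG frames.

    video_feats: (batch, n_video_frames, dim) -> (batch, n_emg_frames, dim)
    """
    n_video_frames = video_feats.size(1)
    # Map each EMG frame index to the video frame covering the same time span.
    idx = torch.linspace(0, n_video_frames - 1, n_emg_frames).long()
    return video_feats[:, idx, :]

class Fusion(nn.Module):
    def __init__(self, emg_dim=768, video_dim=256, n_mels=80, n_phones=30):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, emg_dim)  # 256 -> 768
        self.mel_head = nn.Linear(emg_dim, n_mels)
        self.phone_head = nn.Linear(emg_dim, n_phones)

    def forward(self, emg_feats, video_feats):
        # emg_feats:   (batch, n_emg_frames, 768)  from the EMG Branchformer
        # video_feats: (batch, n_video_frames, 256) from the video Branchformer
        video_feats = align_video_to_emg(video_feats, emg_feats.size(1))
        fused = emg_feats + self.video_proj(video_feats)  # element-wise sum
        return self.mel_head(fused), self.phone_head(fused)
```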

For this experiment, the model was trained on data from a single speaker. The test set is text-independent (no test utterance shares textual content with the training set) but session-dependent (some training utterances were recorded in the same sessions as the test samples).


Evaluation

The following metrics have been used for evaluation:

| Model | CER (%) | WER (%) | L1 distance | Phone Accuracy (%) | SSIM |
|---|---|---|---|---|---|
| Only EMG | 39.95 | 72.08 | 3.199 $\pm$ 0.276 | 68.37 $\pm$ 5.84 | 0.618 $\pm$ 0.032 |
| EMG + Video | 17.72 | 41.67 | 2.915 $\pm$ 0.248 | 81.18 $\pm$ 6.90 | 0.656 $\pm$ 0.026 |
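As a rough illustration of how such metrics could be computed, the sketch below uses jiwer for CER/WER, plain tensor arithmetic for the L1 distance, torchmetrics for SSIM, and frame-level argmax agreement for phone accuracy. These tool choices, the assumption that CER/WER are measured on ASR transcripts of the synthesized audio, and all function names are assumptions made here; they are not stated in the text above.

```python
# Hedged sketch of metric computation; libraries and conventions are assumed.
import torch
import jiwer
from torchmetrics.image import StructuralSimilarityIndexMeasure

def text_metrics(reference: str, hypothesis: str) -> dict:
    # CER/WER between a reference transcript and a (e.g. ASR-produced) hypothesis.
    return {
        "cer": jiwer.cer(reference, hypothesis) * 100,  # %
        "wer": jiwer.wer(reference, hypothesis) * 100,  # %
    }

def spectrogram_metrics(pred_mel: torch.Tensor, ref_mel: torch.Tensor) -> dict:
    """pred_mel, ref_mel: (frames, 80) mel-spectrograms of one utterance."""
    l1 = (pred_mel - ref_mel).abs().mean().item()
    ssim = StructuralSimilarityIndexMeasure(
        data_range=float(ref_mel.max() - ref_mel.min()))
    # SSIM expects image-like tensors: (batch, channels, height, width).
    ssim_val = ssim(pred_mel.T[None, None], ref_mel.T[None, None]).item()
    return {"l1": l1, "ssim": ssim_val}

def phone_accuracy(phone_logits: torch.Tensor, phone_targets: torch.Tensor) -> float:
    # Frame-level accuracy of the auxiliary phoneme classifier (assumed definition).
    # phone_logits: (frames, n_phones), phone_targets: (frames,)
    pred = phone_logits.argmax(dim=-1)
    return (pred == phone_targets).float().mean().item() * 100
```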