# Noisy Whisper Small for Romanian AVSR
This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness.
It is a fine-tuned version of alexandradiaconu/whisper-small-echo-34, adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
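The MUSAN augmentation above amounts to mixing a noise clip into the speech waveform at a target signal-to-noise ratio. A minimal sketch of SNR-controlled mixing (the function name and the length-matching strategy are our illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    # Tile/truncate the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

The same routine covers both conditions in the tables below: Gaussian noise is just `noise` drawn from a normal distribution, while babble is a recorded multi-speaker clip (e.g. from MUSAN).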
The model is paired with our Romanian VSR models through shallow fusion at decoding time, combining acoustic and visual probabilities during beam search.
For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the GitHub repository.
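Shallow fusion, as used here, combines the two decoders' token distributions at every beam-search step. A minimal NumPy sketch of the per-step score (the interpolation weight `lam` and the function names are illustrative assumptions; the repository contains the actual implementation):

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def fused_log_probs(audio_logits: np.ndarray,
                    visual_logits: np.ndarray,
                    lam: float = 0.5) -> np.ndarray:
    """Shallow fusion: per-token score = audio log-prob + lam * visual log-prob.
    `lam` is an illustrative visual weight, not the paper's tuned value."""
    return log_softmax(audio_logits) + lam * log_softmax(visual_logits)
```

At each decoding step the beam is ranked by this fused score instead of the audio score alone, which is what lets the visual stream override unreliable acoustic evidence in noise.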
## Results
Models were evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) at varying signal-to-noise ratios (SNR). All values are word error rate (WER, %); lower is better.
### Gaussian noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | 38.87 |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | 26.76 |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | 17.63 |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | 14.55 |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | 12.11 |
### Babble noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | 38.59 |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | 34.90 |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | 18.85 |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | 13.36 |
| 15 | 22.36 | 12.11 | 47.80 | 17.31 | 12.11 |
At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream forces the decoder to rely on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
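WER is a word-level edit distance normalized by the reference length, which is why it can exceed 100% when the model hallucinates many insertions, as in the -5 dB babble row above. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level Levenshtein distance
    (substitutions + insertions + deletions) over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)
```

For example, `wer("a b", "a x y z")` yields 150%: one substitution plus two insertions against a two-word reference.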
## Training hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
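The settings above map roughly onto Hugging Face `Seq2SeqTrainingArguments` as follows. This is a sketch, not the authors' training script: `output_dir` is a placeholder, and the dataset/`Trainer` wiring is omitted.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the reported hyperparameters.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-vsro200",   # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,          # effective batch size 8 * 4 = 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",              # AdamW with torch fused kernels
    fp16=True,                              # native AMP mixed precision
    seed=42,
)
```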
## Framework versions
Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}

@article{diaconu2026ron3ws,
  title   = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author  = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year    = {2026}
}
```