# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness.

It is a fine-tuned version of alexandradiaconu/whisper-small-echo-34, adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
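The exact augmentation recipe is part of the training code; below is a minimal sketch of how a MUSAN noise clip can be mixed into a speech waveform at a target SNR (the function name and details are illustrative, not the published implementation):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the mixture has the requested SNR in dB."""
    # Loop/trim the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```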

The model is paired with our Romanian VSR models through shallow fusion at decoding time, combining acoustic and visual probabilities during beam search.
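The fusion weight and the full beam-search integration live in the repository; the following is only a minimal sketch of the per-step score combination, assuming both decoders share the same token vocabulary (the weight `lam` and variable names are illustrative):

```python
import torch

def fused_log_probs(audio_logits: torch.Tensor,
                    visual_logits: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    """Combine per-step token distributions from the audio (Whisper) and
    visual (VSR) decoders for shallow fusion during beam search."""
    audio_lp = torch.log_softmax(audio_logits, dim=-1)
    visual_lp = torch.log_softmax(visual_logits, dim=-1)
    # Weighted sum of log-probabilities; beam candidates are then ranked
    # by this fused score instead of the audio score alone.
    return (1.0 - lam) * audio_lp + lam * visual_lp
```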

For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the GitHub repository.
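The audio backbone can also be used on its own. A minimal sketch with the standard `transformers` Whisper API, assuming `audio` is a 16 kHz mono waveform you have already loaded (e.g. with soundfile or librosa):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "vsro200/whisper-small-vsro200"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# `audio`: 1-D float array sampled at 16 kHz.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, language="ro", task="transcribe")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```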

## Results

Models are evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) at varying signal-to-noise ratios (SNRs). All values are word error rate (WER, %); lower is better.

### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | 38.87 |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | 26.76 |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | 17.63 |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | 14.55 |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | 12.11 |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | 38.59 |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | 34.90 |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | 18.85 |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | 13.36 |
| 15 | 22.36 | 12.11 | 47.80 | 17.31 | 12.11 |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream forces the decoder to rely on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
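WER can exceed 100% because insertions count as errors: when heavy noise makes the acoustic decoder hallucinate extra words, the edit distance can be larger than the reference length. A minimal sketch of the metric, assuming `jiwer` is used for scoring (the actual evaluation script is in the repository):

```python
import jiwer

reference = "un exemplu de transcriere de referință"
hypothesis = "un exemplu exemplu de transcrie de referința extra cuvinte"
# jiwer.wer returns (S + D + I) / N, so many insertions can push it above 1.0.
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.2f}%")
```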

## Training hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
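These settings map directly onto the standard `transformers` training arguments. A minimal sketch of an equivalent configuration, with values copied from the table and everything else (output directory, omitted options) an assumption:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-vsro200-noisy",  # illustrative path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,             # effective batch size 8 * 4 = 32
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=100,
    num_train_epochs=3,
    fp16=True,                                 # native AMP mixed precision
    seed=42,
)
```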

## Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}

@article{diaconu2026ron3ws,
  title   = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author  = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year    = {2026}
}
```