# Noisy Whisper Small for Romanian AVSR
This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness.
It is a fine-tuned version of alexandradiaconu/whisper-small-echo-34, adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
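The MUSAN augmentation above amounts to mixing a noise clip into the speech waveform at a target signal-to-noise ratio. A minimal sketch of SNR-controlled mixing (the function name and the length-matching strategy are our illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    # Tile/truncate the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

The same routine covers both conditions in the tables below: Gaussian noise is just `noise` drawn from a normal distribution, while babble is a recorded multi-speaker clip (e.g. from MUSAN).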
The model is paired with our Romanian VSR models through shallow fusion at decoding time, combining acoustic and visual probabilities during beam search.
For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the GitHub repository.
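Shallow fusion, as used here, combines the two decoders' token distributions at every beam-search step. A minimal NumPy sketch of the per-step score (the interpolation weight `lam` and the function names are illustrative assumptions; the repository contains the actual implementation):

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def fused_log_probs(audio_logits: np.ndarray,
                    visual_logits: np.ndarray,
                    lam: float = 0.5) -> np.ndarray:
    """Shallow fusion: per-token score = audio log-prob + lam * visual log-prob.
    `lam` is an illustrative visual weight, not the paper's tuned value."""
    return log_softmax(audio_logits) + lam * log_softmax(visual_logits)
```

At each decoding step the beam is ranked by this fused score instead of the audio score alone, which is what lets the visual stream override unreliable acoustic evidence in noise.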
## Results
Models were evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) at varying signal-to-noise ratios (SNR). All values are word error rate (WER, %); lower is better.
### Gaussian noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | 38.87 |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | 26.76 |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | 17.63 |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | 14.55 |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | 12.11 |
### Babble noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|---|---|---|---|---|---|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | 38.59 |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | 34.90 |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | 18.85 |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | 13.36 |
| 15 | 22.36 | 12.11 | 47.80 | 17.31 | 12.11 |
At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream forces the decoder to rely on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
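WER is a word-level edit distance normalized by the reference length, which is why it can exceed 100% when the model hallucinates many insertions, as in the -5 dB babble row above. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level Levenshtein distance
    (substitutions + insertions + deletions) over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)
```

For example, `wer("a b", "a x y z")` yields 150%: one substitution plus two insertions against a two-word reference.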
## Training hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
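The settings above map roughly onto Hugging Face `Seq2SeqTrainingArguments` as follows. This is a sketch, not the authors' training script: `output_dir` is a placeholder, and the dataset/`Trainer` wiring is omitted.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the reported hyperparameters.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-vsro200",   # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,          # effective batch size 8 * 4 = 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",              # AdamW with torch fused kernels
    fp16=True,                              # native AMP mixed precision
    seed=42,
)
```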
## Framework versions
Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}

@article{diaconu2026ron3ws,
  title   = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author  = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year    = {2026}
}
```