VibeVoice-Realtime-0.5B - GGUF

GGUF conversion of microsoft/VibeVoice-Realtime-0.5B for use with CrispASR.

Model variants

| File | Quant | Size | Notes |
|---|---|---|---|
| vibevoice-realtime-0.5b-tts-f16.gguf | F16 | 2.0 GB | Full precision, reference quality |
| vibevoice-realtime-0.5b-q8_0.gguf | Q8_0 | 1.1 GB | Near-lossless |
| vibevoice-realtime-0.5b-q4_k.gguf | Q4_K | 607 MB | Recommended; perfect ASR round-trip |

Voice prompts

A voice prompt is required for TTS. Each .gguf voice pack ships pre-computed KV caches that establish a fixed speaker identity. The realtime variant does not clone voices from WAV input; for runtime cloning, use cstr/vibevoice-1.5b-GGUF instead.

This repo bundles all 25 demo voices from demo/voices/streaming_model/ in microsoft/VibeVoice@main, under the same MIT license as the model.

| File | Speaker | Language |
|---|---|---|
| vibevoice-voice-emma.gguf | Emma (F) | English (alias of en-Emma_woman) |
| vibevoice-voice-en-Emma_woman.gguf | Emma (F) | English |
| vibevoice-voice-en-Carter_man.gguf | Carter (M) | English |
| vibevoice-voice-en-Davis_man.gguf | Davis (M) | English |
| vibevoice-voice-en-Frank_man.gguf | Frank (M) | English |
| vibevoice-voice-en-Grace_woman.gguf | Grace (F) | English |
| vibevoice-voice-en-Mike_man.gguf | Mike (M) | English |
| vibevoice-voice-de-Spk0_man.gguf | Spk0 (M) | German |
| vibevoice-voice-de-Spk1_woman.gguf | Spk1 (F) | German |
| vibevoice-voice-fr-Spk0_man.gguf | Spk0 (M) | French |
| vibevoice-voice-fr-Spk1_woman.gguf | Spk1 (F) | French |
| vibevoice-voice-in-Samuel_man.gguf | Samuel (M) | Indian English |
| vibevoice-voice-it-Spk0_woman.gguf | Spk0 (F) | Italian |
| vibevoice-voice-it-Spk1_man.gguf | Spk1 (M) | Italian |
| vibevoice-voice-jp-Spk0_man.gguf | Spk0 (M) | Japanese |
| vibevoice-voice-jp-Spk1_woman.gguf | Spk1 (F) | Japanese |
| vibevoice-voice-kr-Spk0_woman.gguf | Spk0 (F) | Korean |
| vibevoice-voice-kr-Spk1_man.gguf | Spk1 (M) | Korean |
| vibevoice-voice-nl-Spk0_man.gguf | Spk0 (M) | Dutch |
| vibevoice-voice-nl-Spk1_woman.gguf | Spk1 (F) | Dutch |
| vibevoice-voice-pl-Spk0_man.gguf | Spk0 (M) | Polish |
| vibevoice-voice-pl-Spk1_woman.gguf | Spk1 (F) | Polish |
| vibevoice-voice-pt-Spk0_woman.gguf | Spk0 (F) | Portuguese |
| vibevoice-voice-pt-Spk1_man.gguf | Spk1 (M) | Portuguese |
| vibevoice-voice-sp-Spk0_woman.gguf | Spk0 (F) | Spanish |
| vibevoice-voice-sp-Spk1_man.gguf | Spk1 (M) | Spanish |

Each voice pack is ~2-6 MB. The vibevoice-voice-emma.gguf filename is kept as the legacy default (referenced by crispasr -m auto --backend vibevoice-tts in CrispASR's auto-download manifest); vibevoice-voice-en-Emma_woman.gguf is the canonical upstream-named copy.
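
The upstream-named packs follow a visible pattern: vibevoice-voice-<lang>-<speaker>_<man|woman>.gguf. As a minimal sketch, the helper below splits a filename into those parts. The function name, the VoicePack type and the regular expression are illustrative assumptions inferred from the file list above; they are not part of CrispASR.

```python
import re
from typing import NamedTuple, Optional


class VoicePack(NamedTuple):
    lang: str      # two-letter prefix used in the filename, e.g. "jp"
    speaker: str   # e.g. "Spk1" or "Carter"
    gender: str    # "man" or "woman", as spelled in the filename


# Pattern inferred from the filenames listed in this README (an assumption).
_VOICE_RE = re.compile(
    r"^vibevoice-voice-([a-z]{2})-([A-Za-z0-9]+)_(man|woman)\.gguf$"
)


def parse_voice_pack(filename: str) -> Optional[VoicePack]:
    """Return the parts of an upstream-named voice pack, or None.

    The legacy alias vibevoice-voice-emma.gguf does not follow the
    pattern, so it yields None.
    """
    match = _VOICE_RE.match(filename)
    if match is None:
        return None
    return VoicePack(*match.groups())


if __name__ == "__main__":
    print(parse_voice_pack("vibevoice-voice-jp-Spk1_woman.gguf"))
    print(parse_voice_pack("vibevoice-voice-emma.gguf"))
```

A helper like this can be used to list the available voices for a language code from a directory of downloaded packs.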

Usage

# English with Emma
crispasr --backend vibevoice-tts \
    -m vibevoice-realtime-0.5b-q4_k.gguf \
    --voice vibevoice-voice-emma.gguf \
    --tts "Hello, how are you today?" \
    --tts-output hello.wav

# Japanese with jp-Spk1_woman
crispasr --backend vibevoice-tts \
    -m vibevoice-realtime-0.5b-q4_k.gguf \
    --voice vibevoice-voice-jp-Spk1_woman.gguf \
    --tts "こんにちは、これは日本語の音声テストです。" \
    --tts-output jp.wav

Output is a 24 kHz mono WAV file. To auto-download the model and the default Emma voice, run crispasr -m auto --backend vibevoice-tts.
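
To confirm that a generated file has the stated format, a short check with Python's standard wave module may help. This is a sketch that assumes the output is a PCM-encoded WAV, which is the only kind the wave module reads; check_tts_wav is a hypothetical helper name, not part of CrispASR.

```python
import wave


def check_tts_wav(path: str, expected_rate: int = 24000,
                  expected_channels: int = 1) -> float:
    """Validate sample rate and channel count; return duration in seconds.

    Raises ValueError when the file does not match the expected format.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    if rate != expected_rate or channels != expected_channels:
        raise ValueError(
            f"{path}: got {rate} Hz, {channels} channel(s); "
            f"expected {expected_rate} Hz, {expected_channels} channel(s)"
        )
    return frames / rate


if __name__ == "__main__":
    # hello.wav is the file written by the English example above.
    print(f"duration: {check_tts_wav('hello.wav'):.2f} s")
```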

Architecture

VibeVoice-Realtime-0.5B is a streaming text-to-speech model:

  • Base LM: 4-layer Qwen2 (text encoding with voice context)
  • TTS LM: 20-layer Qwen2 (speech conditioning, autoregressive)
  • Prediction head: 4 AdaLN + SwiGLU layers (flow matching denoiser)
  • DPM-Solver++: 20-step 2nd-order midpoint solver (cosine schedule, v-prediction)
  • Classifier-Free Guidance: dual KV cache, cfg_scale=3.0
  • sigma-VAE decoder: 7-stage transposed ConvNeXt (3200× upsample to 24 kHz)
  • EOS classifier: automatic length detection
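
Two of the numbers above can be made concrete with a little arithmetic. A 3200× upsample to 24 kHz implies 24000 / 3200 = 7.5 latent frames per second of audio. The guidance step is sketched below with the usual classifier-free guidance formula, uncond + scale * (cond - uncond); this README does not spell out the exact formula CrispASR uses, so treat that part as an assumption. All function names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000    # output sample rate stated above
UPSAMPLE_FACTOR = 3_200    # decoder upsampling factor stated above


def latent_frame_rate() -> float:
    """Latent frames per second implied by the decoder upsampling."""
    return SAMPLE_RATE_HZ / UPSAMPLE_FACTOR


def samples_for_frames(n_frames: int) -> int:
    """Audio samples produced by decoding n_frames latent frames."""
    return n_frames * UPSAMPLE_FACTOR


def cfg_combine(cond, uncond, cfg_scale=3.0):
    """Standard classifier-free guidance mix (assumed formulation).

    cond and uncond are the denoiser outputs with and without
    conditioning, given here as plain lists of floats.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    print(latent_frame_rate())             # frames per second
    print(samples_for_frames(15))          # samples for 15 frames (2 s)
    print(cfg_combine([1.0, 0.5], [0.0, 0.5]))
```

With cfg_scale=1.0 the mix reduces to the conditional output; larger values push the prediction further away from the unconditional one.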

Quality verification

All quantisations produce word-exact ASR round-trip matches on English (the ASR output differs from the input only in added punctuation):

| Input text | Parakeet ASR output |
|---|---|
| "Hello world" | "Hello world." |
| "Hello, how are you today?" | "Hello, how are you today?" |
| "The quick brown fox jumps over the lazy dog" | "The quick brown fox jumps over the lazy dog." |
| "Good morning everyone" | "Good morning, everyone." |

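A round-trip comparison like the one above usually ignores case and punctuation, because the ASR model adds its own. The sketch below shows one way such a check could be written; the normalisation rule is an assumption, not necessarily the one used to produce these results.

```python
import string

# Input/output pairs copied from the results listed above.
PAIRS = [
    ("Hello world", "Hello world."),
    ("Hello, how are you today?", "Hello, how are you today?"),
    ("The quick brown fox jumps over the lazy dog",
     "The quick brown fox jumps over the lazy dog."),
    ("Good morning everyone", "Good morning, everyone."),
]

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    return text.lower().translate(_STRIP_PUNCT).split()


def round_trip_matches(reference, hypothesis):
    """True when both strings contain the same word sequence."""
    return normalize(reference) == normalize(hypothesis)


if __name__ == "__main__":
    for reference, hypothesis in PAIRS:
        print(round_trip_matches(reference, hypothesis), repr(reference))
```
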
Conversion

Model:

python models/convert-vibevoice-to-gguf.py \
    --input microsoft/VibeVoice-Realtime-0.5B \
    --output vibevoice-realtime-0.5b-tts-f16.gguf \
    --include-decoder

build/bin/crispasr-quantize vibevoice-realtime-0.5b-tts-f16.gguf \
    vibevoice-realtime-0.5b-q4_k.gguf q4_k

Voice packs (one per upstream .pt):

python models/convert-vibevoice-voice-to-gguf.py \
    --input demo/voices/streaming_model/en-Carter_man.pt \
    --output vibevoice-voice-en-Carter_man.gguf
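
Converting every upstream voice means repeating the command above once per .pt file. As a sketch, the loop below builds one command per file using only the flags shown above; the helper name and the output naming rule (vibevoice-voice-<stem>.gguf) are assumptions based on the filenames in this repo.

```python
import subprocess
from pathlib import Path

CONVERTER = "models/convert-vibevoice-voice-to-gguf.py"


def voice_conversion_commands(voices_dir, out_dir="."):
    """Build one converter command per .pt file found in voices_dir."""
    commands = []
    for pt_file in sorted(Path(voices_dir).glob("*.pt")):
        output = Path(out_dir) / f"vibevoice-voice-{pt_file.stem}.gguf"
        commands.append([
            "python", CONVERTER,
            "--input", str(pt_file),
            "--output", str(output),
        ])
    return commands


if __name__ == "__main__":
    for command in voice_conversion_commands("demo/voices/streaming_model"):
        print(" ".join(command))
        subprocess.run(command, check=True)  # stop on the first failure
```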

Attribution

License

MIT (same as the upstream model and voice prompts).
