# OmniVoice-bf16 🌍

BF16 quantized version of k2-fsa/OmniVoice.

Original Model | Paper | GitHub (Original) | HuggingFace Space | Demo Page | ComfyUI Node

## What is this?
This is a BF16 conversion of OmniVoice — a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages, built on a diffusion language model architecture. Converting from FP32 to BF16 halves the on-disk size and VRAM usage with negligible quality loss, making it the recommended variant for most users.
| | Original (FP32) | This (BF16) |
|---|---|---|
| Weight dtype | float32 | bfloat16 |
| Activation dtype | float32 | bfloat16 |
| File size | Full size | ~Half size |
| VRAM (inference) | Higher | ~Half |
| Quality | Reference | Virtually identical |
| Extra dependencies | None | None |
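The halved file size and VRAM figures follow directly from the dtype widths: BF16 stores 2 bytes per parameter versus 4 bytes for FP32. A quick back-of-the-envelope sketch (the 1-billion parameter count is hypothetical, not OmniVoice's actual size):

```python
def checkpoint_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight size in GiB, ignoring file-format metadata overhead."""
    return num_params * bytes_per_param / 1024**3

# Hypothetical 1-billion-parameter model (illustrative only):
fp32_gb = checkpoint_size_gb(1_000_000_000, 4)  # float32: 4 bytes/param
bf16_gb = checkpoint_size_gb(1_000_000_000, 2)  # bfloat16: 2 bytes/param
print(f"FP32: {fp32_gb:.2f} GiB, BF16: {bf16_gb:.2f} GiB")
```

The same 2:1 ratio applies to activation memory when inference also runs in BF16, which is why VRAM usage roughly halves as well.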
## Conversion Details
All model weights are converted from float32 to bfloat16. BF16 preserves the same dynamic range as FP32 (8 exponent bits) while halving memory usage; only the mantissa is shortened from 23 bits to 7, a negligible precision loss for inference, making it the practical choice on modern GPUs.
No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original omnivoice inference code.
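To see what a "direct dtype cast" means at the bit level, here is a minimal pure-Python sketch (illustrative only, not the actual conversion code): BF16 keeps the FP32 sign bit and all 8 exponent bits, and rounds the 23-bit mantissa down to 7 bits.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """FP32 -> BF16 bit pattern via round-to-nearest-even.

    Keeps the sign and all 8 exponent bits; rounds the 23-bit
    mantissa to 7 bits. (NaN payload handling omitted for brevity.)
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    round_bias = 0x7FFF + ((bits >> 16) & 1)  # nearest-even tie-break
    return (bits + round_bias) >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """BF16 bit pattern -> FP32 (exact: just zero-pad the mantissa)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.1415927
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
# y differs from x by less than 2**-8 (~0.4%) relative error, while the
# representable range stays identical to FP32 (same exponent width).
```

Because no values can overflow or underflow in this cast, no calibration or scale factors are needed, unlike INT8-style quantization.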
## Hardware Requirements
- GPU: NVIDIA GPU with CUDA support (BF16 natively supported on Ampere and newer; falls back gracefully on older hardware)
- CPU: Supported but slow
## Usage
This model is a drop-in replacement for k2-fsa/OmniVoice. Simply swap the model ID in any existing OmniVoice workflow.
### ComfyUI (Recommended)
The easiest way to use this model is with ComfyUI-OmniVoice-TTS, which has native support for this BF16 model with zero extra setup.
#### Installation

Install the ComfyUI node via ComfyUI Manager (search "OmniVoice") or manually:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
```
The model auto-downloads on first use; select `OmniVoice-bf16` from the model dropdown in any OmniVoice node. Or download it manually:

```bash
huggingface-cli download drbaph/OmniVoice-bf16 --local-dir ComfyUI/models/omnivoice/OmniVoice-bf16
```
#### Recommended Settings

- `dtype`: `auto` or `bf16` (matches this model's native dtype)
- `num_step`: `16` (balanced), `32` (higher quality)
- `keep_model_loaded`: `True` for repeated use
This is the recommended variant for most users — best balance of quality, VRAM usage, and compatibility.
### Python API
```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained(
    "drbaph/OmniVoice-bf16",
    device_map="cuda:0",
    dtype=torch.bfloat16,  # matches the native dtype of this model
)

# Voice cloning
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)
torchaudio.save("out.wav", audio[0], 24000)
```
Set `dtype=torch.bfloat16` or `dtype="auto"` to match this model's native dtype and avoid any unnecessary casting overhead.
#### Voice Design

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
```
#### Auto Voice

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```
#### Recommended Settings

- `dtype`: `auto` or `bf16` (matches this model's native dtype)
- `num_step`: `16` (balanced), `32` (higher quality)
- `speed`: `1.0` (default)
For the full Python API reference, generation parameters, non-verbal symbols, pronunciation control, and batch inference, see the original model card.
## About OmniVoice
OmniVoice is a state-of-the-art zero-shot multilingual TTS model from k2-fsa supporting 600+ languages. Built on a novel diffusion language model architecture, it generates high-quality speech with fast inference (RTF as low as 0.025, about 40× faster than real time) and supports both voice cloning and voice design.
Key features: 600+ languages, zero-shot voice cloning, voice design (gender, age, pitch, accent, dialect, etc.), and fast diffusion-based inference.
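The RTF (real-time factor) figure quoted above translates directly into wall-clock generation time, since RTF is processing time divided by audio duration. A quick sketch of the arithmetic:

```python
def generation_time_s(audio_duration_s: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip: duration x real-time factor."""
    return audio_duration_s * rtf

# At the reported RTF of 0.025, a 60-second clip takes about 1.5 s to
# generate, i.e. 1 / 0.025 = 40x faster than real time.
t = generation_time_s(60.0, 0.025)
```

Measured RTF varies with hardware, `num_step`, and batch size; the 0.025 figure is the best case reported for the original model.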
## License
This model inherits the Apache 2.0 License from k2-fsa/OmniVoice.
The BF16 conversion was produced by drbaph and is released under the same license.
## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```