# OmniVoice-bf16 🌍

BF16 quantized version of k2-fsa/OmniVoice.

Original Model | Paper | GitHub (Original) | HuggingFace Space | Demo Page | ComfyUI Node

## What is this?
This is a BF16 conversion of OmniVoice — a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages, built on a diffusion language model architecture. Converting from FP32 to BF16 halves the on-disk size and VRAM usage with negligible quality loss, making it the recommended variant for most users.
| | Original (FP32) | This (BF16) |
|---|---|---|
| Weight dtype | float32 | bfloat16 |
| Activation dtype | float32 | bfloat16 |
| File size | Full size | ~Half size |
| VRAM (inference) | Higher | ~Half |
| Quality | Reference | Virtually identical |
| Extra dependencies | None | None |
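The halved file size and VRAM figures follow directly from the dtype widths: BF16 stores 2 bytes per parameter versus 4 bytes for FP32. A quick back-of-the-envelope sketch (the 1-billion parameter count is hypothetical, not OmniVoice's actual size):

```python
def checkpoint_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight size in GiB, ignoring file-format metadata overhead."""
    return num_params * bytes_per_param / 1024**3

# Hypothetical 1-billion-parameter model (illustrative only):
fp32_gb = checkpoint_size_gb(1_000_000_000, 4)  # float32: 4 bytes/param
bf16_gb = checkpoint_size_gb(1_000_000_000, 2)  # bfloat16: 2 bytes/param
print(f"FP32: {fp32_gb:.2f} GiB, BF16: {bf16_gb:.2f} GiB")
```

The same 2:1 ratio applies to activation memory when inference also runs in BF16, which is why VRAM usage roughly halves as well.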
## Conversion Details
All model weights are converted from float32 to bfloat16. BF16 preserves the same dynamic range as FP32 (8 exponent bits) while halving memory usage; only the mantissa is shortened from 23 bits to 7, a negligible precision loss for inference, making it the practical choice on modern GPUs.
No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original omnivoice inference code.
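To see what a "direct dtype cast" means at the bit level, here is a minimal pure-Python sketch (illustrative only, not the actual conversion code): BF16 keeps the FP32 sign bit and all 8 exponent bits, and rounds the 23-bit mantissa down to 7 bits.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """FP32 -> BF16 bit pattern via round-to-nearest-even.

    Keeps the sign and all 8 exponent bits; rounds the 23-bit
    mantissa to 7 bits. (NaN payload handling omitted for brevity.)
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    round_bias = 0x7FFF + ((bits >> 16) & 1)  # nearest-even tie-break
    return (bits + round_bias) >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """BF16 bit pattern -> FP32 (exact: just zero-pad the mantissa)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.1415927
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
# y differs from x by less than 2**-8 (~0.4%) relative error, while the
# representable range stays identical to FP32 (same exponent width).
```

Because no values can overflow or underflow in this cast, no calibration or scale factors are needed, unlike INT8-style quantization.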
## Hardware Requirements
- GPU: NVIDIA GPU with CUDA support (BF16 natively supported on Ampere and newer; falls back gracefully on older hardware)
- CPU: Supported but slow
## Usage
This model is a drop-in replacement for k2-fsa/OmniVoice. Simply swap the model ID in any existing OmniVoice workflow.
### ComfyUI (Recommended)
The easiest way to use this model is with ComfyUI-OmniVoice-TTS, which has native support for this BF16 model with zero extra setup.
#### Installation

Install the ComfyUI node via ComfyUI Manager (search "OmniVoice") or manually:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
```
The model auto-downloads on first use; select `OmniVoice-bf16` from the model dropdown in any OmniVoice node. Or download it manually:

```bash
huggingface-cli download drbaph/OmniVoice-bf16 --local-dir ComfyUI/models/omnivoice/OmniVoice-bf16
```
#### Recommended Settings

- `dtype`: `auto` or `bf16` (matches this model's native dtype)
- `num_step`: `16` (balanced), `32` (higher quality)
- `keep_model_loaded`: `True` for repeated use
This is the recommended variant for most users — best balance of quality, VRAM usage, and compatibility.
### Python API
```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained(
    "drbaph/OmniVoice-bf16",
    device_map="cuda:0",
    dtype=torch.bfloat16,  # matches the native dtype of this model
)

# Voice cloning
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)
torchaudio.save("out.wav", audio[0], 24000)
```
Set `dtype=torch.bfloat16` or `dtype="auto"` to match this model's native dtype and avoid any unnecessary casting overhead.
#### Voice Design

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
```
#### Auto Voice

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```
#### Recommended Settings

- `dtype`: `auto` or `bf16` (matches this model's native dtype)
- `num_step`: `16` (balanced), `32` (higher quality)
- `speed`: `1.0` (default)
For the full Python API reference, generation parameters, non-verbal symbols, pronunciation control, and batch inference, see the original model card.
## About OmniVoice
OmniVoice is a state-of-the-art zero-shot multilingual TTS model from k2-fsa supporting 600+ languages. Built on a novel diffusion language model architecture, it generates high-quality speech with fast inference (RTF as low as 0.025, about 40× faster than real time) and supports both voice cloning and voice design.
Key features: 600+ languages, zero-shot voice cloning, voice design (gender, age, pitch, accent, dialect, etc.), and fast diffusion-based inference.
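The RTF (real-time factor) figure quoted above translates directly into wall-clock generation time, since RTF is processing time divided by audio duration. A quick sketch of the arithmetic:

```python
def generation_time_s(audio_duration_s: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip: duration x real-time factor."""
    return audio_duration_s * rtf

# At the reported RTF of 0.025, a 60-second clip takes about 1.5 s to
# generate, i.e. 1 / 0.025 = 40x faster than real time.
t = generation_time_s(60.0, 0.025)
```

Measured RTF varies with hardware, `num_step`, and batch size; the 0.025 figure is the best case reported for the original model.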
## License
This model inherits the Apache 2.0 License from k2-fsa/OmniVoice.
The BF16 conversion was produced by drbaph and is released under the same license.
## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```