OmniVoice-bf16 🌍

BF16 quantized version of k2-fsa/OmniVoice.


Original Model | Paper | GitHub (Original) | HuggingFace Space | Demo Page | ComfyUI Node


What is this?

This is a BF16 conversion of OmniVoice — a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages, built on a diffusion language model architecture. Converting from FP32 to BF16 halves the on-disk size and VRAM usage with negligible quality loss, making it the recommended variant for most users.

| | Original (FP32) | This (BF16) |
|---|---|---|
| Weight dtype | float32 | bfloat16 |
| Activation dtype | float32 | bfloat16 |
| File size | Full size | ~Half size |
| VRAM (inference) | Higher | ~Halved |
| Quality | Reference | Virtually identical |
| Extra dependencies | none | none |
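The size claim in the table is simple arithmetic: a dtype cast changes only the bytes per parameter. A quick sketch, using a hypothetical 0.6B parameter count for illustration (the card does not state the actual count):

```python
def model_size_gib(n_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in GiB for a given parameter count and dtype width."""
    return n_params * bytes_per_param / 1024**3

# Hypothetical parameter count, for illustration only:
n_params = 600_000_000

fp32 = model_size_gib(n_params, 4)  # float32: 4 bytes per weight
bf16 = model_size_gib(n_params, 2)  # bfloat16: 2 bytes per weight

assert bf16 == fp32 / 2  # a pure dtype cast exactly halves weight storage
print(f"FP32 ~ {fp32:.2f} GiB, BF16 ~ {bf16:.2f} GiB")
```

The same factor of two applies to weight VRAM at inference time, which is why the table lists it as roughly halved.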

Conversion Details

All model weights are converted from float32 to bfloat16. BF16 keeps the same dynamic range as FP32 (8 exponent bits) while halving memory usage; the only loss is mantissa precision (7 bits vs. 23), which is negligible for inference on modern GPUs.

No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original omnivoice inference code.
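Conceptually, the cast is trivial: a BF16 value is just the top 16 bits of the corresponding FP32 bit pattern, which is why no calibration or scale factors are involved. A minimal pure-Python sketch of the idea (using truncation for simplicity; real converters typically round to nearest):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Top 16 bits of the FP32 encoding: sign, 8 exponent bits, 7 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_float(bits16: int) -> float:
    """Decode a BF16 bit pattern by zero-padding it back to 32 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

value = 3.14159265
rounded = bf16_to_float(fp32_to_bf16_bits(value))
# BF16 keeps all 8 exponent bits (same range as FP32) but only 7 mantissa
# bits, so truncation costs at most about 2**-7 in relative error.
assert abs(rounded - value) / value < 2**-7
```

Because the exponent field is identical to FP32's, no value that fits in FP32 overflows in BF16; this is the property that makes a bare dtype cast safe where INT8 quantization would need calibration data.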


Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support (BF16 natively supported on Ampere and newer; falls back gracefully on older hardware)
  • CPU: Supported but slow
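To check whether your hardware has native BF16 support before loading, a small helper can query PyTorch directly (`pick_dtype` is a hypothetical name for this sketch; `torch.cuda.is_bf16_supported()` is a real PyTorch call):

```python
import importlib.util

def pick_dtype() -> str:
    """Return 'bfloat16' when a CUDA GPU with native BF16 support is present."""
    if importlib.util.find_spec("torch") is None:
        return "float32"  # PyTorch not installed: nothing to query
    import torch
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return "bfloat16"
    return "float32"  # older GPUs / CPU: fall back as noted above

print(pick_dtype())
```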

Usage

This model is a drop-in replacement for k2-fsa/OmniVoice. Simply swap the model ID in any existing OmniVoice workflow.


Usage — ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-OmniVoice-TTS, which has native support for this BF16 model with zero extra setup.

Installation

  1. Install the ComfyUI node via ComfyUI Manager (search OmniVoice) or manually:

     ```shell
     cd ComfyUI/custom_nodes
     git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
     ```

  2. The model auto-downloads on first use: select OmniVoice-bf16 from the model dropdown in any OmniVoice node.

  3. Alternatively, download it manually:

     ```shell
     huggingface-cli download drbaph/OmniVoice-bf16 --local-dir ComfyUI/models/omnivoice/OmniVoice-bf16
     ```

Recommended Settings

  • dtype: auto or bf16 — matches this model's native dtype
  • num_step: 16 (balanced), 32 (higher quality)
  • keep_model_loaded: True for repeated use

This is the recommended variant for most users — best balance of quality, VRAM usage, and compatibility.


Python API

```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained(
    "drbaph/OmniVoice-bf16",
    device_map="cuda:0",
    dtype=torch.bfloat16,  # matches the native dtype of this model
)

# Voice cloning
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

torchaudio.save("out.wav", audio[0], 24000)
```

Set dtype=torch.bfloat16 or dtype="auto" to match this model's native dtype and avoid any unnecessary casting overhead.

Voice Design

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
```

Auto Voice

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```

Recommended Settings

  • dtype: auto or bf16 — matches this model's native dtype
  • num_step: 16 (balanced), 32 (higher quality)
  • speed: 1.0 (default)

For the full Python API reference, generation parameters, non-verbal symbols, pronunciation control, and batch inference, see the original model card.


About OmniVoice

OmniVoice is a state-of-the-art zero-shot multilingual TTS model from k2-fsa supporting 600+ languages. Built on a novel diffusion language model architecture, it generates high-quality speech with superior inference speed (RTF as low as 0.025 — 40× faster than real-time), supporting voice cloning and voice design.

Key features: 600+ languages, zero-shot voice cloning, voice design (gender, age, pitch, accent, dialect, etc.), and fast diffusion-based inference.


License

This model inherits the Apache 2.0 License from k2-fsa/OmniVoice.

The BF16 conversion was produced by drbaph and is released under the same license.


Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```