Svara-TTS v1 — MLX 8-bit

Parent model: kenpath/svara-tts-v1 — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the Kenpath team. This repo only contains an MLX-format quantization for inference on Apple Silicon.

Orpheus base: canopylabs/3b-hi-ft-research_release — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

8-bit MLX-quantized port of kenpath/svara-tts-v1, an autoregressive multilingual text-to-speech model for 19 Indian languages in the Orpheus / SNAC family. Quantized at ~8.5 bits per weight (q-bits=8, q-group-size=64), which shrinks the weights from about 6.6 GB in bf16 to **3.5 GB**. Use this variant when you want quality closer to bf16 with a smaller memory footprint.
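The ~8.5 bits-per-weight figure follows from the quantization settings. The sketch below assumes MLX's affine scheme, in which every group of weights stores one 16-bit scale and one 16-bit bias alongside the packed values. The helper name `bits_per_weight` is illustrative, and the parameter estimate treats 1 GB as 10^9 bytes and ignores the few layers that stay unquantized.

```python
def bits_per_weight(q_bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Effective storage cost per weight for group-wise affine quantization.

    overhead_bits is the per-group metadata: a 16-bit scale plus a 16-bit bias
    (an assumption about the storage layout, not read from this repo).
    """
    return q_bits + overhead_bits / group_size


bpw = bits_per_weight(q_bits=8, group_size=64)

# Rough parameter count implied by a 3.5 GB checkpoint at that rate.
implied_params = 3.5e9 * 8 / bpw

print(f"{bpw} bits/weight, ~{implied_params / 1e9:.1f}B parameters")
# 8.5 bits/weight, ~3.3B parameters
```

The implied ~3.3B parameters is in line with a 3B-class Llama backbone.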

Built for mlx-audio on Apple Silicon.

Usage

Requires mlx-audio with TTS extras:

pip install "mlx-audio[tts]"

Python

import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

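# Fetches the weights from the Hugging Face Hub on first use.
model = load_model("mlx-community/svara-tts-v1-8bit")

# generate() yields results segment by segment; collect the audio from each.
chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

# Join the segments into one waveform and write it out as a WAV file.
audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz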

CLI

mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1-8bit \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9

Voices

Use a string of the form "<Language Name> (<Gender>)":

Language	Voices
Hindi	`Hindi (Male)`, `Hindi (Female)`
Bengali	`Bengali (Male)`, `Bengali (Female)`
Marathi	`Marathi (Male)`, `Marathi (Female)`
Telugu	`Telugu (Male)`, `Telugu (Female)`
Kannada	`Kannada (Male)`, `Kannada (Female)`
Tamil	`Tamil (Male)`, `Tamil (Female)`
Malayalam	`Malayalam (Male)`, `Malayalam (Female)`
Gujarati	`Gujarati (Male)`, `Gujarati (Female)`
Punjabi	`Punjabi (Male)`, `Punjabi (Female)`
Assamese	`Assamese (Male)`, `Assamese (Female)`
Bhojpuri	`Bhojpuri (Male)`, `Bhojpuri (Female)`
Magahi	`Magahi (Male)`, `Magahi (Female)`
Maithili	`Maithili (Male)`, `Maithili (Female)`
Chhattisgarhi	`Chhattisgarhi (Male)`, `Chhattisgarhi (Female)`
Bodo	`Bodo (Male)`, `Bodo (Female)`
Dogri	`Dogri (Male)`, `Dogri (Female)`
Nepali	`Nepali (Male)`, `Nepali (Female)`
Sanskrit	`Sanskrit (Male)`, `Sanskrit (Female)`
English (Indian)	`English (Indian) (Male)`, `English (Indian) (Female)`

Total: 38 voices across 19 languages.
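Because voice names follow a fixed pattern, they can be built and checked in code before calling the model. This is a minimal sketch: `LANGUAGES`, `GENDERS`, and `voice_name` are hypothetical helpers written for this README, not part of mlx-audio.

```python
# Languages copied from the table above.
LANGUAGES = (
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil", "Malayalam",
    "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi", "Maithili",
    "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit", "English (Indian)",
)
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return "<Language Name> (<Gender>)", rejecting unknown values early."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    gender = gender.strip().capitalize()
    if gender not in GENDERS:
        raise ValueError(f"gender must be one of {GENDERS}, got {gender!r}")
    return f"{language} ({gender})"


ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]

print(voice_name("Hindi", "female"))  # Hindi (Female)
print(len(ALL_VOICES))                # 38
```

Validating the string up front gives a clear error for a misspelled language instead of relying on however the model handles an unknown voice.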

Sampling Recommendations

The upstream svara-tts-inference repo uses these defaults; they're a good starting point:

Parameter	Value
`temperature`	0.75
`top_p`	0.9
`top_k`	40
`repetition_penalty`	1.1
`max_tokens`	1200–2048
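One way to keep these values in a single place is a small defaults dictionary that is unpacked into `model.generate(...)`. The sketch below is illustrative: `SAMPLING_DEFAULTS` and `sampling_kwargs` are hypothetical names, and the `max_tokens` default of 1200 is simply the low end of the range in the table.

```python
# Values copied from the table above.
SAMPLING_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,
}


def sampling_kwargs(**overrides):
    """Merge per-call overrides into the defaults, rejecting misspelled keys."""
    unknown = set(overrides) - set(SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown sampling parameter(s): {sorted(unknown)}")
    return {**SAMPLING_DEFAULTS, **overrides}


# Longer passages need a larger token budget; everything else stays at default.
params = sampling_kwargs(max_tokens=2048)
print(params)

# Intended use with the model loaded as in the Usage section:
# model.generate(text=..., voice="Hindi (Female)", **params)
```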

Architecture

Backbone: Llama-3.2-3B (fine-tuned from canopylabs/3b-hi-ft-research_release, Canopy's Orpheus Hindi base).
Codec: SNAC 24 kHz, a 3-level hierarchical RVQ that emits 7 codes per ~85 ms frame (1 + 2 + 4 across the three levels). Loaded automatically by mlx-audio.
Output: 24 kHz mono PCM.
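The codec layout also determines how much audio a given `max_tokens` budget can produce. The sketch below assumes each 7-code frame decodes to 2048 output samples, which corresponds to SNAC's 24 kHz configuration as I understand it and is not stated in this repo. Treat the results as an upper bound, since any special tokens the model emits also count against the budget.

```python
SAMPLE_RATE = 24_000       # Hz, from the Output line above
CODES_PER_FRAME = 7        # 1 + 2 + 4 codes across the three RVQ levels
SAMPLES_PER_FRAME = 2048   # assumption: samples decoded from one 7-code frame


def max_audio_seconds(max_tokens: int) -> float:
    """Upper bound on audio duration for a given generation budget."""
    frames = max_tokens // CODES_PER_FRAME  # only complete frames are decodable
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


for budget in (1200, 2048):
    print(f"max_tokens={budget}: up to ~{max_audio_seconds(budget):.1f} s of audio")
# max_tokens=1200: up to ~14.6 s of audio
# max_tokens=2048: up to ~24.9 s of audio
```

If the output stops mid-sentence, the text probably needs more than the current budget; raising `max_tokens` or splitting the text into shorter passages are the usual remedies.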