AIT-Syn 4L — Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting Kazakh, Russian, English, and Uzbek with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

Features

  • 4 languages: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
  • Voice cloning: clone a voice from a short reference clip (~5–10 s)
  • Two cloning modes: x-vector-only (no transcript needed) or ICL (in-context learning; uses the reference transcript for higher quality)
  • 12.5 Hz codec: 12.5 codec frames per second of audio, which keeps autoregressive sequences short
  • 24 kHz output: PCM 16-bit WAV

Quick Start

Installation

pip install qwen-tts torch soundfile

Generate Speech

from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

# ICL mode: ref_text must be the transcript of ref_audio; here a Kazakh reference voice drives Russian output
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)

API Reference

generate_voice_clone()

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str or list[str] | required | Text to synthesize |
| language | str | required | Language name: kazakh, russian, english, uzbek |
| ref_audio | str or (ndarray, sr) | required | Reference audio: file path, URL, base64, or (waveform, sample_rate) |
| ref_text | str or None | None | Transcript of ref audio (enables ICL mode) |
| x_vector_only_mode | bool | False | If True, use only x-vector speaker embedding (no ICL) |
| non_streaming_mode | bool | False | If True, return complete audio; if False, return generator |
| temperature | float | 0.9 | Sampling temperature |
| top_k | int | 50 | Top-k sampling |
| top_p | float | 1.0 | Nucleus sampling threshold |
| repetition_penalty | float | 1.05 | Repetition penalty |

Returns: (list[np.ndarray], int) — list of waveforms and sample rate (24000).
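
Because `text` may be a list, a single call can return several waveforms, while the Quick Start only saves `wavs[0]`. The sketch below writes every returned waveform to its own 16-bit PCM WAV file using only the standard library. `save_waveforms` is a hypothetical helper, not part of qwen-tts, and it assumes the samples are mono floats in [-1.0, 1.0].

```python
import struct
import wave


def save_waveforms(wavs, sample_rate, prefix="output"):
    """Write each mono waveform to '<prefix>_<i>.wav' as 16-bit PCM.

    Assumption: samples are floats in [-1.0, 1.0]; values outside
    that range are clipped before conversion.
    """
    paths = []
    for i, wav in enumerate(wavs):
        pcm = bytearray()
        for sample in wav:
            clipped = max(-1.0, min(1.0, float(sample)))
            pcm += struct.pack("<h", int(round(clipped * 32767)))
        path = f"{prefix}_{i}.wav"
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)            # mono
            wf.setsampwidth(2)            # 2 bytes = 16-bit samples
            wf.setframerate(sample_rate)  # 24000 for this model
            wf.writeframes(bytes(pcm))
        paths.append(path)
    return paths
```

With `soundfile` installed, `sf.write(path, wav, sr, subtype="PCM_16")` achieves the same result and is considerably faster on long NumPy arrays.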

Voice Cloning Modes

X-vector-only (x_vector_only_mode=True)

Uses only the speaker embedding extracted from reference audio. No transcript of the reference is needed. Good for quick cloning when you don't have a transcript.

ICL Mode (x_vector_only_mode=False, provide ref_text)

In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
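
The choice between the two modes can be made automatically from whether a transcript is on hand. This is a minimal sketch, assuming the parameter semantics listed in the API reference; `clone_kwargs` is a hypothetical helper, not part of qwen-tts.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for generate_voice_clone().

    Hypothetical helper: selects ICL mode when a reference transcript
    is supplied and falls back to x-vector-only mode otherwise.
    """
    supported = {"kazakh", "russian", "english", "uzbek"}
    if language not in supported:
        raise ValueError(f"unsupported language: {language!r}")

    kwargs = {
        "text": text,
        "language": language,
        "ref_audio": ref_audio,
        "non_streaming_mode": True,  # return complete audio
    }
    if ref_text and ref_text.strip():
        # Transcript available: use in-context learning.
        kwargs["ref_text"] = ref_text
        kwargs["x_vector_only_mode"] = False
    else:
        # No transcript: rely on the speaker embedding alone.
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

It would be used as `wavs, sr = model.generate_voice_clone(**clone_kwargs(text, "kazakh", "ref_audio_kk.wav"))`.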

Serving

A FastAPI server is available for production deployment:

pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /tts | POST | Synthesize speech (returns WAV) |
| /tts/batch | POST | Batch synthesis (returns ZIP of WAVs) |
| /health | GET | Health check |
| /languages | GET | List supported languages |
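
Loading the model can take some time after the server process starts, so a client may want to wait for `/health` before sending requests. The sketch below polls that endpoint with the standard library. It assumes `/health` answers with HTTP 200 once the server is ready; the response body format is not documented here, so it is ignored. `wait_for_health` is a hypothetical helper.

```python
import time
import urllib.request


def wait_for_health(base_url, timeout_s=120.0, interval_s=2.0):
    """Poll GET /health until it returns HTTP 200 or the timeout expires.

    Assumption: a 200 status means the server is ready to synthesize.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Not ready yet: connection refused, timeout, or HTTP error.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_health("http://localhost:8000")` returns `True` as soon as the server responds, or `False` after two minutes.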

Example Request

curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
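
The same request can be sent from Python. This sketch mirrors the curl call above using the third-party `requests` package; the form field names (`text`, `language`, `ref_audio`) are taken from that example, while `synthesize` and its defaults are hypothetical.

```python
def synthesize(text, language, ref_audio_path, out_path="output.wav",
               base_url="http://localhost:8000"):
    """POST to /tts with the same form fields as the curl example."""
    # Third-party dependency (pip install requests), imported here so
    # the helper can be defined without it being installed.
    import requests

    with open(ref_audio_path, "rb") as ref:
        resp = requests.post(
            f"{base_url}/tts",
            data={"text": text, "language": language},
            files={"ref_audio": ref},
            timeout=300,  # generous limit; long texts take a while
        )
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)  # response body is the WAV file
    return out_path
```

A call such as `synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav")` would write the result to `output.wav`.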

Technical Specs

| Spec | Value |
| --- | --- |
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~10 min audio) |
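
The ~10 min figure follows from the codec rate. A quick check, assuming each generated token corresponds to one 12.5 Hz codec frame (the specs do not state this explicitly):

```python
CODEC_RATE_HZ = 12.5  # codec frames per second of audio
SUB_CODECS = 16       # codes per frame
MAX_TOKENS = 8192     # maximum generation length


def max_audio_seconds(max_tokens=MAX_TOKENS, rate_hz=CODEC_RATE_HZ):
    # Assumption: one generated token equals one codec frame.
    return max_tokens / rate_hz


seconds = max_audio_seconds()
print(f"{seconds:.2f} s = {seconds / 60:.1f} min")  # 655.36 s = 10.9 min
print(f"{CODEC_RATE_HZ * SUB_CODECS:.0f} codes per second of audio")  # 200 codes per second of audio
```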

Reference Audio

A sample Kazakh male reference audio is included as ref_audio_kk.wav (mono, 24 kHz, ~10 s).

License

CC-BY-NC-4.0
