DeepSpeak-v1
DeepSpeak-v1 is an Urdu text-to-speech model built on a Rectified Flow Diffusion Transformer (DiT) architecture. It is trained from scratch on native Urdu speech and uses Qwen3.5-0.8B as the text backbone with a Semantic DACVAE codec for high-quality latent audio.
Note: This checkpoint was saved at step 18,000 of a planned 100,000 training steps. Quality is expected to improve as training continues.
Model Details
| Property | Value |
|---|---|
| Language | Urdu (اردو) |
| Task | Text-to-Speech |
| Architecture | Rectified Flow DiT |
| Text Backbone | Qwen/Qwen3.5-0.8B |
| Codec | Semantic DACVAE (32-dim) |
| Model Dimension | 1280 |
| Layers | 12 |
| Parameters | ~400M |
| Precision | BF16 |
| Training Steps | 18,000 |
| Validation Loss | 0.7959 |
Usage
Install
pip install torch safetensors transformers huggingface_hub soundfile dacvae
Inference
import soundfile as sf

from irodori_tts.inference_runtime import InferenceRuntime, RuntimeKey, SamplingRequest

# Describe which checkpoint to load and where to run it.
key = RuntimeKey(
    checkpoint="mahwizzzz/deepspeak-v1",
    model_device="cuda",
    model_precision="bf16",
    codec_device="cuda",
)
runtime = InferenceRuntime.from_key(key)

# Synthesize one Urdu sentence in the voice of the reference recording.
result = runtime.synthesize(
    SamplingRequest(
        text="یہ ایک آزمائشی جملہ ہے۔",
        ref_wav="reference.wav",  # speaker reference audio
        seconds=10.0,
        num_steps=40,
        cfg_scale_text=3.0,
        cfg_scale_speaker=5.0,
    )
)

# Write the generated waveform to disk.
sf.write("output.wav", result.audio.squeeze(0).numpy(), result.sample_rate)
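For intuition about `num_steps`: a rectified-flow sampler typically integrates a learned velocity field from noise (t = 0) to data (t = 1) using a fixed number of Euler steps. The sketch below is only an illustration and is not the DeepSpeak-v1 sampler. `euler_sample` and `make_toy_velocity` are hypothetical names, the trained network is replaced by a closed-form toy field, and text and speaker guidance are omitted.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # current time, always strictly below 1.0
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a trained network: points straight at `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # On a straight path from x to target, the remaining displacement
        # divided by the remaining time is the constant path velocity.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.5, -1.0, 2.0]    # stands in for a Gaussian latent sample
target = [1.0, 0.0, -1.0]   # stands in for a clean audio latent
sample = euler_sample(make_toy_velocity(target), noise, num_steps=40)
```

With a real model the velocity field is only approximately straight, so more steps generally trade speed for accuracy.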
Command Line
python infer.py \
    --checkpoint mahwizzzz/deepspeak-v1 \
    --text "یہ ایک آزمائشی جملہ ہے۔" \
    --ref-wav reference.wav \
    --output-wav output.wav \
    --model-device cuda \
    --model-precision bf16
Architecture
Training
| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Learning Rate | 1e-4 |
| LR Scheduler | WSD (Warmup-Stable-Decay) |
| Warmup Steps | 2,000 |
| Effective Batch Size | 64 |
| Precision | BF16 |
| Max Sequence Length | 256 tokens |
| Latent Steps | 750 |
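As a rough illustration of the WSD schedule listed in the table, the sketch below computes the learning rate at a given step. The peak rate (1e-4), warmup length (2,000 steps), and total length (100,000 steps) come from this card. The linear decay shape, the 10% decay fraction, and the final rate of zero are assumptions, and `wsd_lr` is a hypothetical helper rather than the training code.

```python
def wsd_lr(
    step: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 2_000,
    total_steps: int = 100_000,
    decay_fraction: float = 0.1,  # assumption: last 10% of training decays
    min_lr: float = 0.0,          # assumption: decay ends at zero
) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay: ramp linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


lr_at_checkpoint = wsd_lr(18_000)  # the step of this checkpoint
```

Under these assumptions, step 18,000 falls in the stable phase, so the checkpoint would have been trained at the peak learning rate.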
Limitations
- Speaker reference audio is required for the best output quality
- Currently supports Urdu only
- The model is still in active training (18,000 of 100,000 steps)
- Works best with sentences under 20 words
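One way to work within the sentence-length limitation is to split longer input into short chunks and synthesize each chunk separately. The sketch below is a minimal example of that idea, not part of the released code: `split_for_tts` is a hypothetical helper, and the choice of sentence-ending characters (Urdu full stop U+06D4, Arabic question mark U+061F, `!`, `?`) is an assumption.

```python
import re
from typing import List

# Split after the Urdu full stop, the Arabic question mark, "!" or "?".
_SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F!?])\s+")


def split_for_tts(text: str, max_words: int = 20) -> List[str]:
    """Split text into sentence-like chunks of fewer than `max_words` words."""
    chunks: List[str] = []
    window = max_words - 1  # "under 20 words" means at most 19 per chunk
    for sentence in _SENTENCE_END.split(text.strip()):
        words = sentence.split()
        # Over-long sentences fall back to fixed-size word windows.
        for start in range(0, len(words), window):
            chunks.append(" ".join(words[start:start + window]))
    return chunks


chunks = split_for_tts("یہ ایک آزمائشی جملہ ہے۔ کیا یہ کام کرتا ہے؟")
```

Each chunk could then be passed as the `text` of its own `SamplingRequest`, with the resulting waveforms concatenated. Windowing an over-long sentence may cut it at an unnatural point, so splitting at clause boundaries by hand will likely sound better.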
License
Apache 2.0: free for research and commercial use. The codec weights (Aratako/Semantic-DACVAE-Japanese-32dim) are distributed separately; refer to their license terms.
Citation
@misc{deepspeak2026,
  title     = {DeepSpeak-v1: Urdu Text-to-Speech with Rectified Flow DiT},
  author    = {Mahwiz Khalil},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mahwizzzz/deepspeak-v1}
}
