DeepSpeak-v1

DeepSpeak-v1 is an Urdu text-to-speech model built on a Rectified Flow Diffusion Transformer (DiT) architecture. It is trained from scratch on native Urdu speech, uses Qwen3.5-0.8B as the text backbone, and generates audio in the latent space of a Semantic DACVAE codec.

Note: This checkpoint was saved at step 18,000 of a planned 100,000 training steps. Quality is expected to improve as training continues.

Model Details

Property	Value
Language	Urdu (اردو)
Task	Text-to-Speech
Architecture	Rectified Flow DiT
Text Backbone	Qwen/Qwen3.5-0.8B
Codec	Semantic DACVAE (32-dim)
Model Dimension	1280
Layers	12
Parameters	~400M
Precision	BF16
Training Steps	18,000
Validation Loss	0.7959

Usage

Install

pip install torch safetensors transformers huggingface_hub soundfile dacvae

Inference

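import soundfile as sf

from irodori_tts.inference_runtime import InferenceRuntime, RuntimeKey, SamplingRequest

# Select the checkpoint and the devices for the model and the codec.
key = RuntimeKey(
    checkpoint="mahwizzzz/deepspeak-v1",
    model_device="cuda",
    model_precision="bf16",
    codec_device="cuda",
)

# Load the model and codec described by the key.
runtime = InferenceRuntime.from_key(key)

result = runtime.synthesize(
    SamplingRequest(
        text="یہ ایک آزمائشی جملہ ہے۔",
        ref_wav="reference.wav",  # speaker reference audio
        seconds=10.0,
        num_steps=40,             # number of sampling steps
        cfg_scale_text=3.0,       # guidance scale for the text condition
        cfg_scale_speaker=5.0,    # guidance scale for the speaker condition
    )
)

# Move the audio to CPU float32 before converting it to NumPy; BF16 and CUDA
# tensors cannot be converted directly.
audio = result.audio.squeeze(0).detach().float().cpu().numpy()
sf.write("output.wav", audio, result.sample_rate)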
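
The num_steps, cfg_scale_text, and cfg_scale_speaker arguments above are typical of rectified flow sampling with classifier-free guidance. The following is a minimal, self-contained sketch of that general technique, not the code this model runs. It assumes that sampling integrates a velocity field from noise at t=0 to data at t=1 with fixed Euler steps, and that the two guidance scales are applied independently against an unconditional branch. The function names combine_cfg, euler_sample, and toy_velocity are hypothetical, and the velocity field is a toy stand-in for the DiT.

```python
import random


def combine_cfg(v_uncond, v_text, v_speaker, scale_text, scale_speaker):
    """Assumed dual-guidance rule: shift the unconditional velocity toward
    the text-conditioned and speaker-conditioned velocities independently."""
    return [
        u + scale_text * (t - u) + scale_speaker * (s - u)
        for u, t, s in zip(v_uncond, v_text, v_speaker)
    ]


def euler_sample(velocity_fn, x, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the model: the velocity that moves x along a straight
# line toward a fixed target, which is the ideal rectified flow path.
target = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    v = [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
    # All three branches are identical in this toy, so guidance is a no-op.
    return combine_cfg(v, v, v, scale_text=3.0, scale_speaker=5.0)


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(toy_velocity, noise, num_steps=40)
```

In this toy setting the sampler lands on the target up to floating-point error because the path is exactly straight. With a learned velocity field the path is only approximately straight, which is why the number of steps affects output quality.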

Command Line

python infer.py \
  --checkpoint mahwizzzz/deepspeak-v1 \
  --text "یہ ایک آزمائشی جملہ ہے۔" \
  --ref-wav reference.wav \
  --output-wav output.wav \
  --model-device cuda \
  --model-precision bf16

Architecture

Training

Parameter	Value
Optimizer	Muon
Learning Rate	1e-4
LR Scheduler	WSD (Warmup-Stable-Decay)
Warmup Steps	2,000
Effective Batch Size	64
Precision	BF16
Max Sequence Length	256 tokens
Latent Steps	750
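
The learning-rate schedule in the table can be sketched as a plain function of the step number. The peak learning rate (1e-4) and warmup length (2,000 steps) come from the table, and the 100,000-step horizon comes from the training note. The linear warmup and decay shapes, the 10% decay fraction, and the final learning rate of zero are assumptions made for illustration. The function name wsd_lr is hypothetical.

```python
def wsd_lr(step, peak_lr=1e-4, warmup_steps=2_000, total_steps=100_000,
           decay_fraction=0.1, min_lr=0.0):
    """Warmup-Stable-Decay learning rate at a given training step.

    decay_fraction and min_lr are illustrative assumptions, not values
    taken from the model card.
    """
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr over the final decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Learning rate at a few points, including the current checkpoint (step 18,000).
for s in (0, 1_000, 2_000, 18_000, 95_000, 100_000):
    print(s, wsd_lr(s))
```

Under these assumptions, the 18,000-step checkpoint falls in the stable phase, so it was trained at the peak learning rate and has not yet gone through the decay phase.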

Limitations

A speaker reference recording is required for the best output quality.
Only Urdu is currently supported.
The model is still in active training (18k of 100k steps).
Results are best for sentences under 20 words.
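
Given the sentence-length limitation above, longer passages can be split into short chunks before synthesis. The following is a minimal sketch, assuming that splitting on the Urdu full stop (۔), the Arabic question mark (؟), and the exclamation mark is acceptable, and that a sentence that is still too long can be cut at word boundaries. The function name split_for_tts and the 19-word cap are illustrative and not part of the model's tooling.

```python
import re

# A sentence is a run of non-terminator characters followed by an optional
# terminator: Urdu full stop, Arabic question mark, or exclamation mark.
_SENTENCE = re.compile(r"[^۔؟!]+[۔؟!]?")


def split_for_tts(text, max_words=19):
    """Split text into chunks of at most max_words whitespace-separated words,
    cutting at sentence boundaries first and at word boundaries if needed."""
    chunks = []
    for sentence in _SENTENCE.findall(text):
        words = sentence.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


text = "یہ ایک آزمائشی جملہ ہے۔ یہ دوسرا جملہ ہے۔"
for chunk in split_for_tts(text):
    print(chunk)
```

Each chunk can then be synthesized separately with the same reference audio and the resulting waveforms concatenated. Cutting a long sentence at an arbitrary word boundary may affect prosody, so splitting at natural pauses is preferable where possible.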

License

Apache 2.0: free for research and commercial use. The codec weights (Aratako/Semantic-DACVAE-Japanese-32dim) are distributed separately; refer to their license terms.

Citation

@misc{deepspeak2026,
  title     = {DeepSpeak-v1: Urdu Text-to-Speech with Rectified Flow DiT},
  author    = {mahwiz Khalil},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mahwizzzz/deepspeak-v1}
}
