DeepSpeak-v1

DeepSpeak-v1 is an Urdu text-to-speech model built on a Rectified Flow Diffusion Transformer (DiT) architecture. It is trained from scratch on native Urdu speech and uses Qwen3.5-0.8B as the text backbone with a Semantic DACVAE codec for high-quality latent audio.

Note: This checkpoint was saved at step 18,000 of a planned 100,000 training steps. Quality is expected to improve with continued training.


Model Details

Property          Value
Language          Urdu (اردو)
Task              Text-to-Speech
Architecture      Rectified Flow DiT
Text Backbone     Qwen/Qwen3.5-0.8B
Codec             Semantic DACVAE (32-dim)
Model Dimension   1280
Layers            12
Parameters        ~400M
Precision         BF16
Training Steps    18,000
Validation Loss   0.7959

Usage

Install

pip install torch safetensors transformers huggingface_hub soundfile dacvae

Inference

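import soundfile as sf
from irodori_tts.inference_runtime import InferenceRuntime, RuntimeKey, SamplingRequest

# Select the checkpoint and the devices for the model and the codec.
key = RuntimeKey(
    checkpoint="mahwizzzz/deepspeak-v1",
    model_device="cuda",
    model_precision="bf16",
    codec_device="cuda",
)

# Load the model and codec described by the key.
runtime = InferenceRuntime.from_key(key)

# Synthesize the sentence using reference.wav as the speaker reference.
result = runtime.synthesize(
    SamplingRequest(
        text="یہ ایک آزمائشی جملہ ہے۔",
        ref_wav="reference.wav",
        seconds=10.0,
        num_steps=40,
        cfg_scale_text=3.0,
        cfg_scale_speaker=5.0,
    )
)

# Tensor.numpy() fails on CUDA and bfloat16 tensors, so move the audio to the
# CPU and cast to float32 first (both calls are no-ops if already the case).
audio = result.audio.squeeze(0).detach().cpu().float().numpy()
sf.write("output.wav", audio, result.sample_rate)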
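
The two guidance scales in the request above suggest classifier-free guidance applied separately to the text and speaker conditions. The sketch below shows one common way to combine two guidance signals. It is an assumption for illustration: the source does not state which formula DeepSpeak-v1 uses, and `combine_guidance` and its arguments are hypothetical names, not part of `irodori_tts`.

```python
def combine_guidance(v_uncond, v_text, v_speaker,
                     cfg_scale_text=3.0, cfg_scale_speaker=5.0):
    """Combine three velocity predictions with two independent guidance scales.

    v_uncond  : prediction with both conditions dropped
    v_text    : prediction with only the text condition (assumed setup)
    v_speaker : prediction with only the speaker condition (assumed setup)
    """
    return [
        u + cfg_scale_text * (t - u) + cfg_scale_speaker * (s - u)
        for u, t, s in zip(v_uncond, v_text, v_speaker)
    ]


# With both scales at zero the result is the unconditional prediction.
unguided = combine_guidance([0.2, -0.1], [1.0, 1.0], [2.0, 2.0], 0.0, 0.0)

# Larger scales push the result further along each conditional direction.
guided = combine_guidance([0.0], [1.0], [2.0])
```

Under this formulation, raising `cfg_scale_speaker` would strengthen adherence to the reference voice, while raising `cfg_scale_text` would strengthen adherence to the input text.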

Command Line

python infer.py \
  --checkpoint mahwizzzz/deepspeak-v1 \
  --text "یہ ایک آزمائشی جملہ ہے۔" \
  --ref-wav reference.wav \
  --output-wav output.wav \
  --model-device cuda \
  --model-precision bf16
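
To synthesize several sentences, the command above can be driven from a short script. This is a minimal sketch that uses only the flags shown above; the helper names and the `output_000.wav` naming scheme are assumptions, and it presumes `infer.py` is in the current directory.

```python
import subprocess
import sys


def build_infer_command(text, output_wav, ref_wav="reference.wav",
                        checkpoint="mahwizzzz/deepspeak-v1"):
    """Build the infer.py argument list for one sentence."""
    return [
        sys.executable, "infer.py",
        "--checkpoint", checkpoint,
        "--text", text,
        "--ref-wav", ref_wav,
        "--output-wav", output_wav,
        "--model-device", "cuda",
        "--model-precision", "bf16",
    ]


def synthesize_all(sentences, run=subprocess.run):
    """Run infer.py once per sentence, writing output_000.wav, output_001.wav, ..."""
    for index, sentence in enumerate(sentences):
        command = build_infer_command(sentence, f"output_{index:03d}.wav")
        run(command, check=True)  # raises CalledProcessError on failure


# Example call (not executed here):
# synthesize_all(["یہ ایک آزمائشی جملہ ہے۔"])
```

Each call reloads the checkpoint, so for many sentences the Python API, which loads the runtime once, is likely the faster route.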

Architecture

[Figure: architecture diagram]
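
As a rough illustration of how a rectified flow model generates latents at inference time, the sketch below integrates a velocity field from noise (t = 0) to data (t = 1) with fixed Euler steps. The Euler solver, the function names, and the toy velocity field are assumptions for illustration; the source does not specify the solver DeepSpeak-v1 uses, and in the real model the velocity would come from the DiT.

```python
import random


def euler_sample(velocity_fn, noise, num_steps=40):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the network: the exact velocity field that carries any
# starting point to a single fixed target along a straight line.
target = [0.5, -1.0, 2.0, 0.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
latent = euler_sample(toy_velocity, noise, num_steps=40)
# latent matches target up to floating-point error
```

In this picture, `num_steps` trades speed for integration accuracy: fewer steps run faster but follow the learned trajectory less closely.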


Training

Parameter              Value
Optimizer              Muon
Learning Rate          1e-4
LR Scheduler           WSD (Warmup-Stable-Decay)
Warmup Steps           2,000
Effective Batch Size   64
Precision              BF16
Max Sequence Length    256 tokens
Latent Steps           750
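
The WSD schedule can be sketched as a function of the training step. The peak rate (1e-4) and warmup length (2,000) come from the table above and the 100,000-step total from the note at the top; the linear warmup and decay shapes, the 10,000-step decay length, and the final rate of zero are assumptions, since the source does not give them.

```python
def wsd_lr(step, peak_lr=1e-4, warmup_steps=2_000, total_steps=100_000,
           decay_steps=10_000, min_lr=0.0):
    """Warmup-Stable-Decay learning rate at a given step.

    decay_steps and min_lr are assumed values, not taken from the model card.
    """
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak rate.
        return peak_lr
    # Linear decay from the peak rate to min_lr over the final decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Under these assumptions, step 18,000 falls in the stable phase.
lr_at_checkpoint = wsd_lr(18_000)
```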

Limitations

  • A speaker reference clip is required for best output quality
  • Currently supports Urdu only
  • The model is still in training (18,000 of 100,000 steps)
  • Works best with sentences under 20 words
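
Given the 20-word guideline, longer passages can be split before synthesis. The helper below is a minimal sketch, not part of the released code: `split_urdu_text` is a hypothetical name, and it assumes sentences end with the Urdu full stop (U+06D4), the Arabic question mark (U+061F), or an exclamation mark.

```python
import re

# Split positions: just after a sentence-ending mark, absorbing any whitespace.
_SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F!])\s*")


def split_urdu_text(text, max_words=20):
    """Split text into sentences, then cap each chunk at max_words words."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Over-long sentences are cut at word boundaries; this is crude and
        # may break a phrase, so splitting at commas first could sound better.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


chunks = split_urdu_text("یہ ایک آزمائشی جملہ ہے۔ یہ ایک آزمائشی جملہ ہے۔")
```

Each chunk can then be synthesized separately and the resulting audio concatenated.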

License

Apache 2.0: free for research and commercial use. The codec weights (Aratako/Semantic-DACVAE-Japanese-32dim) are covered by their own license; refer to that repository's license terms.


Citation

@misc{deepspeak2026,
  title     = {DeepSpeak-v1: Urdu Text-to-Speech with Rectified Flow DiT},
  author    = {mahwiz Khalil},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mahwizzzz/deepspeak-v1}
}