Chatterbox Arabic Fine-tuned TTS 🇸🇦

This is an Arabic-focused fine-tune of the ResembleAI/chatterbox multilingual model, trained with LoRA (Low-Rank Adaptation).

🎯 Model Description

This model has been fine-tuned specifically to improve the quality of Arabic text-to-speech synthesis, including:

  • Enhanced Arabic pronunciation and phonetics
  • Better handling of Arabic diacritics (تشكيل)
  • Improved intonation for Arabic speech patterns
  • Support for Modern Standard Arabic (MSA) and common dialects
  • Natural-sounding Arabic voice generation

✨ Key Features

  • 🗣️ High-quality Arabic speech synthesis
  • 🎭 Zero-shot voice cloning for Arabic speakers
  • ⚡ Fast inference (real-time capable)
  • 🎚️ Emotion/expression control
  • 🔄 Supports both Arabic and English (bilingual)

📦 Installation

pip install chatterbox-tts torch torchaudio huggingface_hub

🚀 Quick Start

Basic Arabic TTS

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download and apply Arabic fine-tuned weights
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned",
    filename="t3_cfg.pt"
)
t3_state = torch.load(t3_path, map_location="cpu")
model.t3.load_state_dict(t3_state)

# Generate Arabic speech
arabic_text = "مرحباً بك في نموذج تحويل النص إلى كلام المحسّن للغة العربية"  # "Welcome to the text-to-speech model enhanced for Arabic"
wav = model.generate(arabic_text, language_id="ar")
ta.save("arabic_output.wav", wav, model.sr)

Load All Fine-tuned Components

import torch
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download all fine-tuned weights
repo_id = "YOUR-USERNAME/chatterbox-arabic-finetuned"

t3_path = hf_hub_download(repo_id=repo_id, filename="t3_cfg.pt")
conds_path = hf_hub_download(repo_id=repo_id, filename="conds.pt")
s3gen_path = hf_hub_download(repo_id=repo_id, filename="s3gen.pt")
ve_path = hf_hub_download(repo_id=repo_id, filename="ve.pt")

# Load all components
model.t3.load_state_dict(torch.load(t3_path, map_location="cpu"))
model.conds.load_state_dict(torch.load(conds_path, map_location="cpu"))
model.s3gen.load_state_dict(torch.load(s3gen_path, map_location="cpu"))
model.ve.load_state_dict(torch.load(ve_path, map_location="cpu"))

# Generate
arabic_text = "هذا اختبار للنموذج المحسّن"  # "This is a test of the fine-tuned model"
wav = model.generate(arabic_text, language_id="ar")

Advanced Usage: Voice Cloning

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model with fine-tuned weights
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned",
    filename="t3_cfg.pt"
)
model.t3.load_state_dict(torch.load(t3_path, map_location="cpu"))

# Generate with reference audio (voice cloning)
arabic_text = "السلام عليكم ورحمة الله وبركاته"  # traditional greeting: "Peace be upon you"
reference_audio = "path/to/arabic_speaker.wav"  # 6+ seconds recommended

wav = model.generate(
    arabic_text,
    language_id="ar",
    audio_prompt_path=reference_audio,
    exaggeration=0.5,  # Control expressiveness (0.0-2.0)
    cfg_weight=0.5      # Control adherence to prompt (0.0-1.0)
)

ta.save("arabic_cloned_voice.wav", wav, model.sr)

Text with Diacritics (Tashkeel)

# The model handles Arabic text with or without diacritics
text_with_tashkeel = "مَرْحَباً بِكَ فِي عَالَمِ الذَّكَاءِ الاصْطِنَاعِيِّ"  # fully diacritized
text_without_tashkeel = "مرحبا بك في عالم الذكاء الاصطناعي"  # plain: "Welcome to the world of artificial intelligence"

# Both work well
wav1 = model.generate(text_with_tashkeel, language_id="ar")
wav2 = model.generate(text_without_tashkeel, language_id="ar")

🎛️ Parameters

  • exaggeration (0.0-2.0): Controls speech expressiveness
    • 0.25: More monotone, robotic
    • 0.5: Natural (default)
    • 1.0-2.0: More dramatic and expressive
  • cfg_weight (0.0-1.0): Controls adherence to the reference audio and affects pacing
    • 0.3: Faster pacing, looser adherence
    • 0.5: Balanced (default)
    • 0.7+: Closer to the reference voice
  • temperature (0.05-5.0): Controls sampling randomness
    • Lower: More consistent output
    • Higher: More variation
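
These can be combined in a single call. The sketch below reuses the model and reference audio from the voice-cloning example above and assumes your installed chatterbox version accepts a temperature keyword, as listed here:

# A hedged sketch: more expressive delivery with quicker pacing
wav = model.generate(
    arabic_text,
    language_id="ar",
    audio_prompt_path=reference_audio,
    exaggeration=1.2,   # more dramatic than the 0.5 default
    cfg_weight=0.3,     # faster pacing, looser adherence to the reference
    temperature=0.6     # assumed keyword; lower values give more consistent output
)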

📊 Training Details

  • Base Model: ResembleAI/chatterbox multilingual
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Target Language: Arabic (العربية)
  • Training Dataset: [Add your dataset info - e.g., "Arabic speech corpus with X hours"]
  • Training Duration: [Add training time/epochs]
  • Hardware: [Add GPU info if relevant]

🎯 Use Cases

  • Arabic audiobook narration
  • Arabic virtual assistants and voice agents
  • Arabic e-learning content
  • Arabic accessibility tools
  • Dubbing and voice-over for Arabic content
  • Arabic language learning applications

📁 Model Files

  • t3_cfg.pt - Text-to-speech transformer (main component) - 2.1 GB
  • conds.pt - Conditioning model - 107 KB
  • s3gen.pt - Speech generation model - 1.06 GB
  • ve.pt - Voice encoder - [size]
  • tokenizer.json - Tokenizer configuration
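
To fetch everything with one call instead of four hf_hub_download calls, huggingface_hub's snapshot_download can mirror the whole repository locally; loading then follows the same state-dict pattern shown earlier (the repo id below is the same placeholder used throughout this card):

import os
import torch
from huggingface_hub import snapshot_download
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the whole repo once; files land in the local Hugging Face cache
local_dir = snapshot_download(repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned")

model = ChatterboxMultilingualTTS.from_pretrained(device=device)
model.t3.load_state_dict(torch.load(os.path.join(local_dir, "t3_cfg.pt"), map_location="cpu"))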

🌍 Supported Languages

While this model is optimized for Arabic, it maintains support for:

  • Arabic (ar) - Primary focus
  • English (en) - Secondary support
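
Switching between the two only requires changing the language_id argument used in the earlier examples, for example:

# Same model instance, two languages
wav_ar = model.generate("أهلاً وسهلاً", language_id="ar")  # Arabic: "Welcome"
wav_en = model.generate("Welcome to the Arabic fine-tuned model.", language_id="en")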

📝 Example Outputs

Modern Standard Arabic (MSA):

text = "الذكاء الاصطناعي يغير العالم من حولنا بطرق لم نتخيلها من قبل"  # "AI is changing the world around us in ways we never imagined before"

Common Phrases:

greetings = [
    "السلام عليكم ورحمة الله وبركاته",  # Peace be upon you
    "صباح الخير",                      # Good morning
    "مساء الخير",                      # Good evening
    "أهلاً وسهلاً",                     # Welcome
    "كيف حالك؟"                        # How are you?
]
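
One way to batch these phrases, assuming the model loaded in the Quick Start section, is a simple loop that writes one file per greeting:

import torchaudio as ta

for i, phrase in enumerate(greetings):
    wav = model.generate(phrase, language_id="ar")
    ta.save(f"greeting_{i}.wav", wav, model.sr)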

Numbers and Dates:

text = "اليوم هو الخامس عشر من يناير عام ألفين وستة وعشرين"  # "Today is the fifteenth of January, two thousand twenty-six"

⚠️ Limitations

  • Works best with Modern Standard Arabic (MSA)
  • Dialectal Arabic may have varying quality depending on training data
  • Very long sentences (>200 words) should be split for best results (see the splitting sketch after this list)
  • Reference audio for voice cloning should be clear and 6+ seconds long
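
A minimal splitting sketch for long passages is shown below; the sentence delimiters and the torch.cat concatenation are assumptions, so adjust them (and add silence padding if needed) for your content:

import re
import torch
import torchaudio as ta

long_text = "..."  # a long Arabic passage

# Split on sentence-final punctuation (Latin and Arabic question marks included)
sentences = [s.strip() for s in re.split(r"[.!?؟]+", long_text) if s.strip()]

# Synthesize each sentence and concatenate along the sample axis
chunks = [model.generate(s, language_id="ar") for s in sentences]
wav = torch.cat(chunks, dim=-1)  # assumes each chunk is a (1, num_samples) tensor
ta.save("long_arabic_output.wav", wav, model.sr)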

📜 Citation

If you use this model, please cite the original Chatterbox work:

@misc{chatterboxtts2025,
  author = {{Resemble AI}},
  title = {{Chatterbox-TTS}},
  year = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note = {GitHub repository}
}

📄 License

This model inherits the MIT license from the base Chatterbox model.

🙏 Acknowledgments

  • Thanks to ResembleAI for the base Chatterbox model
  • [Add any dataset credits or collaborators]

📧 Contact

[Add your contact info or leave blank]

For issues or questions, please open an issue on the model repository.


Note: This model includes Resemble AI's Perth watermarking for generated audio.
