Chatterbox Arabic Fine-tuned TTS 🇸🇦

This is an Arabic-focused fine-tune of the ResembleAI/chatterbox multilingual model, trained with LoRA (Low-Rank Adaptation).

🎯 Model Description

This model has been fine-tuned specifically to improve the quality of Arabic text-to-speech synthesis, including:

  • Enhanced Arabic pronunciation and phonetics
  • Better handling of Arabic diacritics (تشكيل)
  • Improved intonation for Arabic speech patterns
  • Support for Modern Standard Arabic (MSA) and common dialects
  • Natural-sounding Arabic voice generation

✨ Key Features

  • 🗣️ High-quality Arabic speech synthesis
  • 🎭 Zero-shot voice cloning for Arabic speakers
  • ⚡ Fast inference (real-time capable)
  • 🎚️ Emotion/expression control
  • 🔄 Supports both Arabic and English (bilingual)

📦 Installation

pip install chatterbox-tts torch torchaudio huggingface_hub

🚀 Quick Start

Basic Arabic TTS

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download and apply Arabic fine-tuned weights
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned",
    filename="t3_cfg.pt"
)
t3_state = torch.load(t3_path, map_location="cpu")
model.t3.load_state_dict(t3_state)

# Generate Arabic speech
arabic_text = "مرحباً بك في نموذج تحويل النص إلى كلام المحسّن للغة العربية"  # "Welcome to the text-to-speech model enhanced for Arabic"
wav = model.generate(arabic_text, language_id="ar")
ta.save("arabic_output.wav", wav, model.sr)

Load All Fine-tuned Components

import torch
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download all fine-tuned weights
repo_id = "YOUR-USERNAME/chatterbox-arabic-finetuned"

t3_path = hf_hub_download(repo_id=repo_id, filename="t3_cfg.pt")
conds_path = hf_hub_download(repo_id=repo_id, filename="conds.pt")
s3gen_path = hf_hub_download(repo_id=repo_id, filename="s3gen.pt")
ve_path = hf_hub_download(repo_id=repo_id, filename="ve.pt")

# Load all components
model.t3.load_state_dict(torch.load(t3_path, map_location="cpu"))
model.conds.load_state_dict(torch.load(conds_path, map_location="cpu"))
model.s3gen.load_state_dict(torch.load(s3gen_path, map_location="cpu"))
model.ve.load_state_dict(torch.load(ve_path, map_location="cpu"))

# Generate
arabic_text = "هذا اختبار للنموذج المحسّن"  # "This is a test of the fine-tuned model"
wav = model.generate(arabic_text, language_id="ar")

Advanced Usage: Voice Cloning

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model with fine-tuned weights
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned",
    filename="t3_cfg.pt"
)
model.t3.load_state_dict(torch.load(t3_path, map_location="cpu"))

# Generate with reference audio (voice cloning)
arabic_text = "السلام عليكم ورحمة الله وبركاته"  # traditional greeting: "Peace be upon you"
reference_audio = "path/to/arabic_speaker.wav"  # 6+ seconds recommended

wav = model.generate(
    arabic_text,
    language_id="ar",
    audio_prompt_path=reference_audio,
    exaggeration=0.5,  # Control expressiveness (0.0-2.0)
    cfg_weight=0.5      # Control adherence to prompt (0.0-1.0)
)

ta.save("arabic_cloned_voice.wav", wav, model.sr)

Text with Diacritics (Tashkeel)

# The model handles Arabic text with or without diacritics
text_with_tashkeel = "مَرْحَباً بِكَ فِي عَالَمِ الذَّكَاءِ الاصْطِنَاعِيِّ"  # fully diacritized
text_without_tashkeel = "مرحبا بك في عالم الذكاء الاصطناعي"  # plain: "Welcome to the world of artificial intelligence"

# Both work well
wav1 = model.generate(text_with_tashkeel, language_id="ar")
wav2 = model.generate(text_without_tashkeel, language_id="ar")

🎛️ Parameters

  • exaggeration (0.0-2.0): Controls speech expressiveness
    • 0.25: More monotone, robotic
    • 0.5: Natural (default)
    • 1.0-2.0: More dramatic and expressive
  • cfg_weight (0.0-1.0): Controls adherence to the reference audio and affects pacing
    • 0.3: Faster pacing, looser adherence
    • 0.5: Balanced (default)
    • 0.7+: Closer to the reference voice
  • temperature (0.05-5.0): Controls sampling randomness
    • Lower: More consistent output
    • Higher: More variation
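
These can be combined in a single call. The sketch below reuses the model and reference audio from the voice-cloning example above and assumes your installed chatterbox version accepts a temperature keyword, as listed here:

# A hedged sketch: more expressive delivery with quicker pacing
wav = model.generate(
    arabic_text,
    language_id="ar",
    audio_prompt_path=reference_audio,
    exaggeration=1.2,   # more dramatic than the 0.5 default
    cfg_weight=0.3,     # faster pacing, looser adherence to the reference
    temperature=0.6     # assumed keyword; lower values give more consistent output
)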

📊 Training Details

  • Base Model: ResembleAI/chatterbox multilingual
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Target Language: Arabic (العربية)
  • Training Dataset: [Add your dataset info - e.g., "Arabic speech corpus with X hours"]
  • Training Duration: [Add training time/epochs]
  • Hardware: [Add GPU info if relevant]

🎯 Use Cases

  • Arabic audiobook narration
  • Arabic virtual assistants and voice agents
  • Arabic e-learning content
  • Arabic accessibility tools
  • Dubbing and voice-over for Arabic content
  • Arabic language learning applications

📁 Model Files

  • t3_cfg.pt - Text-to-speech transformer (main component) - 2.1 GB
  • conds.pt - Conditioning model - 107 KB
  • s3gen.pt - Speech generation model - 1.06 GB
  • ve.pt - Voice encoder - [size]
  • tokenizer.json - Tokenizer configuration
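
To fetch everything with one call instead of four hf_hub_download calls, huggingface_hub's snapshot_download can mirror the whole repository locally; loading then follows the same state-dict pattern shown earlier (the repo id below is the same placeholder used throughout this card):

import os
import torch
from huggingface_hub import snapshot_download
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the whole repo once; files land in the local Hugging Face cache
local_dir = snapshot_download(repo_id="YOUR-USERNAME/chatterbox-arabic-finetuned")

model = ChatterboxMultilingualTTS.from_pretrained(device=device)
model.t3.load_state_dict(torch.load(os.path.join(local_dir, "t3_cfg.pt"), map_location="cpu"))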

🌍 Supported Languages

While this model is optimized for Arabic, it maintains support for:

  • Arabic (ar) - Primary focus
  • English (en) - Secondary support
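
Switching between the two only requires changing the language_id argument used in the earlier examples, for example:

# Same model instance, two languages
wav_ar = model.generate("أهلاً وسهلاً", language_id="ar")  # Arabic: "Welcome"
wav_en = model.generate("Welcome to the Arabic fine-tuned model.", language_id="en")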

📝 Example Outputs

Modern Standard Arabic (MSA):

text = "الذكاء الاصطناعي يغير العالم من حولنا بطرق لم نتخيلها من قبل"  # "AI is changing the world around us in ways we never imagined before"

Common Phrases:

greetings = [
    "السلام عليكم ورحمة الله وبركاته",  # Peace be upon you
    "صباح الخير",                      # Good morning
    "مساء الخير",                      # Good evening
    "أهلاً وسهلاً",                     # Welcome
    "كيف حالك؟"                        # How are you?
]
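
One way to batch these phrases, assuming the model loaded in the Quick Start section, is a simple loop that writes one file per greeting:

import torchaudio as ta

for i, phrase in enumerate(greetings):
    wav = model.generate(phrase, language_id="ar")
    ta.save(f"greeting_{i}.wav", wav, model.sr)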

Numbers and Dates:

text = "اليوم هو الخامس عشر من يناير عام ألفين وستة وعشرين"  # "Today is the fifteenth of January, two thousand twenty-six"

⚠️ Limitations

  • Works best with Modern Standard Arabic (MSA)
  • Dialectal Arabic may have varying quality depending on training data
  • Very long sentences (>200 words) should be split for best results (see the splitting sketch after this list)
  • Reference audio for voice cloning should be clear and 6+ seconds long
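
A minimal splitting sketch for long passages is shown below; the sentence delimiters and the torch.cat concatenation are assumptions, so adjust them (and add silence padding if needed) for your content:

import re
import torch
import torchaudio as ta

long_text = "..."  # a long Arabic passage

# Split on sentence-final punctuation (Latin and Arabic question marks included)
sentences = [s.strip() for s in re.split(r"[.!?؟]+", long_text) if s.strip()]

# Synthesize each sentence and concatenate along the sample axis
chunks = [model.generate(s, language_id="ar") for s in sentences]
wav = torch.cat(chunks, dim=-1)  # assumes each chunk is a (1, num_samples) tensor
ta.save("long_arabic_output.wav", wav, model.sr)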

📜 Citation

If you use this model, please cite the original Chatterbox work:

@misc{chatterboxtts2025,
  author = {{Resemble AI}},
  title = {{Chatterbox-TTS}},
  year = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note = {GitHub repository}
}

📄 License

This model inherits the MIT license from the base Chatterbox model.

🙏 Acknowledgments

  • Thanks to ResembleAI for the base Chatterbox model
  • [Add any dataset credits or collaborators]

📧 Contact

[Add your contact info or leave blank]

For issues or questions, please open an issue on the model repository.


Note: This model includes Resemble AI's Perth watermarking for generated audio.
