# F5-TTS Hungarian — Native Magyar Text-to-Speech with Zero-Shot Voice Cloning
The first and only Hungarian fine-tuned F5-TTS model with zero-shot voice cloning from just 5-15 seconds of reference audio.
## Highlights
- Native Hungarian — Trained on 223,000+ Hungarian speech samples (~280 hours)
- Zero-shot Voice Cloning — Clone any voice with just 5-15 seconds of reference audio
- Runs Locally — No cloud API needed, runs on consumer GPUs (8GB+ VRAM)
- Fast Inference — Real-time factor < 0.5 on RTX 3060+
- Privacy First — All processing happens on your machine
## Audio Samples
Listen to what the model can do from just 5-15 seconds of Hungarian reference audio! Reference and cloned audio pairs are provided in the `samples/` directory of the repository.
## Model Details
| Property | Value |
|---|---|
| Base Model | F5-TTS v1 Base (DiT, 1024 dim / 22 layers / 16 heads) |
| Pretrained Checkpoint | SWivid/F5-TTS/F5TTS_v1_Base/model_1200000.safetensors |
| Training Hardware | NVIDIA H100 SXM 80GB HBM3 |
| Training Duration | ~50 epochs, stopped at loss plateau (0.63) |
| Total Samples | 223,103 |
| Total Audio | ~280 hours |
| Vocabulary | 67 tokens (Hungarian UTF-8 characters + digits) |
| Sample Rate | 24,000 Hz |
| Checkpoint Format | FP16 safetensors (~640 MB) |
## Training Datasets
| Dataset | Samples | Hours | Description |
|---|---|---|---|
| YodaLingua Hungarian | ~80,740 | ~206h | High-quality curated Hungarian speech corpus |
| Common Voice HU 17.0 | ~59,925 | ~54h | Mozilla community-contributed recordings |
| CSS10 Hungarian | ~4,474 | ~10h | Single-speaker professional narration (Egri Csillagok) |
## Training Parameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Warmup Steps | 500 |
| Batch Size | 38,400 frames |
| Precision | Mixed Precision (BF16) |
| Optimizer | AdamW |
| EMA | Yes (used for inference) |
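The warmup schedule above can be sketched as a linear ramp to the base learning rate (a simplified illustration under the assumption of linear warmup followed by a constant rate; the trainer's exact schedule may differ):

```python
def lr_at_step(step, base_lr=1e-5, warmup_steps=500):
    """Linear warmup to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at_step(249))   # halfway through warmup -> 5e-06
print(lr_at_step(1000))  # past warmup -> 1e-05
```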
## Quick Start

```python
import torch
import torchaudio
import soundfile as sf
import numpy as np

# Monkey-patch torchaudio.load for cross-platform compatibility:
# read audio with soundfile and return a (channels, samples) float32 tensor.
_orig_load = torchaudio.load

def _patched_load(fp, **kw):
    data, sr = sf.read(str(fp), dtype="float32")
    if data.ndim == 1:
        data = data[np.newaxis, :]  # mono -> (1, samples)
    else:
        data = data.T               # (samples, channels) -> (channels, samples)
    return torch.from_numpy(data), sr

torchaudio.load = _patched_load

from f5_tts.api import F5TTS

model = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_last_final.safetensors",
    vocab_file="vocab.txt",
    device="cuda",
    use_ema=True,
)

wav, sr, _ = model.infer(
    ref_file="your_reference.wav",
    ref_text="A referencia hang pontos átirata.",
    gen_text="Szia, ez egy teszt mondat a magyar szövegfelolvasáshoz.",
)
sf.write("output.wav", wav, sr)
```
## Critical: Reference Text Must Match Audio

The `ref_text` must be the exact transcription of the reference audio. Mismatched text causes garbled output. Use Whisper to transcribe:

```python
from faster_whisper import WhisperModel

whisper = WhisperModel("large-v3-turbo", device="cuda")
segments, _ = whisper.transcribe("your_reference.wav", language="hu")
ref_text = " ".join(s.text.strip() for s in segments)
```
## Benchmark
Measured on NVIDIA RTX 5060 Ti 16GB, PyTorch 2.x, CUDA, FP16:
| Metric | Value |
|---|---|
| Real-Time Factor (RTF) | ~0.25 (10s of audio generated in ~2.5s) |
| Model Load Time | ~4.4s (first load, CUDA) |
| Warmup Inference | ~2.5s (first generation) |
| Steady-State Latency | ~250ms per sentence |
| Peak VRAM Usage | ~2.4 GB (highly optimized for consumer GPUs) |
| Min VRAM Recommended | 4 GB (FP16) |
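The real-time factor in the table is simply compute time divided by the duration of the audio produced; a value below 1.0 means faster than real time. A minimal sketch of the calculation, using the benchmark numbers above:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of generated audio.
    RTF < 1.0 means the model runs faster than real time."""
    return generation_seconds / audio_seconds

# From the table above: 10 s of audio generated in ~2.5 s of compute.
print(real_time_factor(2.5, 10.0))  # -> 0.25
```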
## Reference Audio Guidelines
For optimal voice cloning quality:
- Use 5-15 seconds of clean speech (no background noise, no reverb, no music)
- Provide an exact transcript — this is the single most important quality factor
- Mono WAV, any sample rate (automatically resampled to 24kHz)
- Longer references (>15s) do NOT improve quality and increase VRAM usage
- End the reference audio with a natural sentence ending, not mid-word
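A small helper enforcing the 5-15 second guideline above (a hypothetical convenience function, not part of the repository; it derives duration from sample count and sample rate):

```python
def check_reference(num_samples, sample_rate, min_s=5.0, max_s=15.0):
    """Return (ok, duration_seconds) for a mono reference clip,
    following the 5-15 s guideline above."""
    duration = num_samples / sample_rate
    return (min_s <= duration <= max_s), duration

# A 10-second clip at the model's 24 kHz sample rate passes the check.
ok, dur = check_reference(num_samples=240_000, sample_rate=24_000)
print(ok, dur)  # -> True 10.0
```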
## Repository Contents

| File | Size | Description |
|---|---|---|
| `model_last_final.safetensors` | ~640 MB | FP16 EMA model weights |
| `vocab.txt` | <1 KB | 67-token Hungarian character vocabulary (required for loading) |
| `config.json` | <1 KB | Model architecture and training configuration |
| `inference_example.py` | ~6 KB | Complete inference script with Whisper transcription + artifact trimming |
| `hungarian_preprocessing.py` | ~3 KB | Text normalizer (numbers, symbols, acronyms to spoken Hungarian) |
| `samples/` | — | 20 reference + 20 cloned audio pairs for evaluation |
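To illustrate the kind of normalization `hungarian_preprocessing.py` performs, here is a hypothetical minimal sketch that maps single digits to their spoken Hungarian words (the real script also handles multi-digit numbers, symbols, and acronyms; this is not its actual code):

```python
# Hypothetical digit normalizer in the spirit of hungarian_preprocessing.py.
DIGITS_HU = {
    "0": "nulla", "1": "egy", "2": "kettő", "3": "három", "4": "négy",
    "5": "öt", "6": "hat", "7": "hét", "8": "nyolc", "9": "kilenc",
}

def spell_digits(text):
    """Replace each digit character with its spoken Hungarian word."""
    return "".join(DIGITS_HU.get(ch, ch) for ch in text)

print(spell_digits("Terem 5"))  # -> "Terem öt"
```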
## Known Issues & Workarounds

### Onset Artifact (First-Word Distortion)
The model occasionally produces a brief garbled sound (~200-400ms) at the very beginning of generated audio. This is a known characteristic of the F5-TTS DiT architecture — the hard mel-spectrogram boundary between reference and generated speech does not always align with the acoustic boundary.
Workarounds (in order of effectiveness):
- Prefix trick — Prepend a short filler word to `gen_text` (e.g., `"szóval, " + your_text`), then trim the first ~400ms from the output
- Silence padding — Add 500-700ms of silence to the end of your reference audio before inference
- VAD-based trimming — Use Silero VAD to detect the actual speech onset and trim everything before it
- Regeneration — Different random seeds produce different alignments; retry 2-3 times if needed
The included `inference_example.py` implements an energy-based adaptive trimmer. Community contributions with better solutions are welcome!
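A toy illustration of energy-based onset trimming like the one in `inference_example.py` (a hypothetical simplification on a plain list of samples with a fixed threshold; the real script is adaptive and works on the decoded waveform):

```python
def trim_leading_low_energy(samples, threshold=0.02, frame=256):
    """Drop leading frames whose mean absolute amplitude is below threshold,
    returning the audio from the first energetic frame onward."""
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        if sum(abs(s) for s in chunk) / len(chunk) >= threshold:
            return samples[start:]
    return []

# Near-silence followed by speech-like amplitude: quiet frames are dropped.
audio = [0.001] * 512 + [0.3] * 512
trimmed = trim_leading_low_energy(audio)
print(len(trimmed))  # -> 512
```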
### Duration Estimation

Very short texts (<5 characters) may produce stretched or compressed audio. The model estimates output duration from the `ref_text`/`gen_text` character ratio, so extremely short inputs can break this heuristic. Workaround: pad short texts with natural filler.
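The duration heuristic described above can be sketched as follows (a simplified illustration of a character-ratio estimate; the actual model works in mel frames, so this is not its exact formula):

```python
def estimate_gen_duration(ref_seconds, ref_text, gen_text):
    """Estimate generated-audio length from the character-count ratio
    between gen_text and ref_text (simplified sketch)."""
    ratio = len(gen_text) / max(len(ref_text), 1)
    return ref_seconds * ratio

# A 10 s reference with 50 chars and a 5-char target yields only ~1 s,
# which is where very short inputs break the heuristic.
print(estimate_gen_duration(10.0, "x" * 50, "Szia!"))  # -> 1.0
```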
## Roadmap
- Improved onset artifact elimination (Silero VAD + soft mel crossfade)
- Streaming inference support (sentence-level chunking)
- ComfyUI custom node
- Companion model: Whisper Hungarian LoRA for STT
- Extended vocabulary with English code-switching support
- INT8 quantized variant for lower VRAM (<4GB)
## Community & Contributions
The model weights are released under CC-BY-NC-4.0 (following the base F5-TTS license). The inference code, preprocessing scripts, and tooling are open for contributions!
We welcome:
- Better Hungarian text normalization rules
- Improved onset artifact handling
- New reference audio / evaluation pairs
- Bug reports and quality feedback
- Translations of documentation
- GitHub (inference toolkit & issues): github.com/Maxdorger/f5-tts-hungarian
- Hugging Face (model weights): huggingface.co/Maxdorger29/f5-tts-hungarian
## Support the Project
Training high-quality TTS models on H100 GPUs is expensive. If this model is useful to you, consider buying me a coffee — it directly funds compute time for future improvements!
## License
This model is released under CC-BY-NC-4.0, consistent with the base F5-TTS model license.
- Personal & Research use: Free
- Commercial use: Not permitted under this license (upstream restriction from F5-TTS base model)
- Attribution: Credit "F5-TTS Hungarian by Maxdorger29" in any published work
## Acknowledgments
- SWivid/F5-TTS — Base model architecture and pretrained weights
- Mozilla Common Voice — Hungarian speech data
- YodaLingua — Hungarian speech dataset
- CSS10 — Hungarian single-speaker data
## Citation

```bibtex
@misc{f5tts-hungarian-2026,
  title={F5-TTS Hungarian: Native Magyar Text-to-Speech with Zero-Shot Voice Cloning},
  author={Maxdorger29},
  year={2026},
  url={https://huggingface.co/Maxdorger29/f5-tts-hungarian},
  note={Fine-tuned on 223k Hungarian speech samples using NVIDIA H100}
}
```