F5-TTS Hungarian — Native Magyar Text-to-Speech with Zero-Shot Voice Cloning

The first and only Hungarian fine-tuned F5-TTS model with zero-shot voice cloning from just 5-15 seconds of reference audio.


Highlights

  • Native Hungarian — Trained on 223,000+ Hungarian speech samples (~280 hours)
  • Zero-shot Voice Cloning — Clone any voice with just 5-15 seconds of reference audio
  • Runs Locally — No cloud API needed, runs on consumer GPUs (8GB+ VRAM)
  • Fast Inference — Real-time factor < 0.5 on RTX 3060+
  • Privacy First — All processing happens on your machine

Audio Samples

Listen to what the model can do from just 5-15 seconds of Hungarian reference audio. Each sample pairs the original reference recording with the F5-TTS clone generated from it (see the samples/ directory).

Model Details

  • Base Model: F5-TTS v1 Base (DiT: 1024-dim, depth 22, 16 attention heads)
  • Pretrained Checkpoint: SWivid/F5-TTS/F5TTS_v1_Base/model_1200000.safetensors
  • Training Hardware: NVIDIA H100 SXM 80 GB HBM3
  • Training Duration: ~50 epochs, stopped at loss plateau (0.63)
  • Total Samples: 223,103
  • Total Audio: ~280 hours
  • Vocabulary: 67 tokens (Hungarian UTF-8 characters + digits)
  • Sample Rate: 24,000 Hz
  • Checkpoint Format: FP16 safetensors (~640 MB)

Training Datasets

  • YodaLingua Hungarian: ~80,740 samples (~206 h), high-quality curated Hungarian speech corpus
  • Common Voice HU 17.0: ~59,925 samples (~54 h), Mozilla community-contributed recordings
  • CSS10 Hungarian: ~4,474 samples (~10 h), single-speaker professional narration (Egri Csillagok)

Training Parameters

  • Learning Rate: 1e-5
  • Warmup Steps: 500
  • Batch Size: 38,400 frames
  • Precision: mixed precision (BF16)
  • Optimizer: AdamW
  • EMA: yes (EMA weights used for inference)

Quick Start

import torch
import torchaudio
import soundfile as sf
import numpy as np

# Monkey-patch torchaudio.load to read via soundfile, avoiding backend
# issues on platforms where torchaudio has no audio backend available
_orig_load = torchaudio.load  # keep a handle to the original
def _patched_load(fp, **kw):
    data, sr = sf.read(str(fp), dtype="float32")
    if data.ndim == 1:
        data = data[np.newaxis, :]  # mono: add channel dimension
    else:
        data = data.T               # (frames, channels) -> (channels, frames)
    return torch.from_numpy(data), sr
torchaudio.load = _patched_load

from f5_tts.api import F5TTS

model = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_last_final.safetensors",
    vocab_file="vocab.txt",
    device="cuda",
    use_ema=True,
)

wav, sr, _ = model.infer(
    ref_file="your_reference.wav",
    ref_text="A referencia hang pontos átirata.",
    gen_text="Szia, ez egy teszt mondat a magyar szövegfelolvasáshoz.",
)

sf.write("output.wav", wav, sr)

Critical: Reference Text Must Match Audio

The ref_text must be the exact transcription of the audio passed as ref_file. Mismatched text causes garbled output. Use Whisper (e.g. faster-whisper) to transcribe automatically:

from faster_whisper import WhisperModel
whisper = WhisperModel("large-v3-turbo", device="cuda")
segments, _ = whisper.transcribe("your_reference.wav", language="hu")
ref_text = " ".join(s.text.strip() for s in segments)

Benchmark

Measured on NVIDIA RTX 5060 Ti 16GB, PyTorch 2.x, CUDA, FP16:

  • Real-Time Factor (RTF): ~0.25 (10 s of audio generated in ~2.5 s)
  • Model Load Time: ~4.4 s (first load, CUDA)
  • Warmup Inference: ~2.5 s (first generation)
  • Steady-State Latency: ~250 ms per sentence
  • Peak VRAM Usage: ~2.4 GB (well within consumer-GPU budgets)
  • Minimum Recommended VRAM: 4 GB (FP16)
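The RTF figure above can be reproduced with a simple wall-clock measurement. The sketch below assumes any generator callable that returns audio samples; the names are placeholders, not part of the F5-TTS API:

```python
import time

def real_time_factor(generate, text, sample_rate=24_000):
    """Wall-clock generation time divided by the duration of the produced audio."""
    start = time.perf_counter()
    wav = generate(text)               # returns a 1-D sequence of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / sample_rate
    return elapsed / audio_seconds

# Example with a stand-in generator that "produces" 10 s of silence instantly:
rtf = real_time_factor(lambda t: [0.0] * 240_000, "teszt")
print(f"RTF: {rtf:.4f}")  # values < 1.0 mean faster than real time
```

For a fair measurement on a real model, discard the first (warmup) generation before averaging.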

Reference Audio Guidelines

For optimal voice cloning quality:

  • Use 5-15 seconds of clean speech (no background noise, no reverb, no music)
  • Provide an exact transcript — this is the single most important quality factor
  • Mono WAV, any sample rate (automatically resampled to 24 kHz)
  • Longer references (>15 s) do NOT improve quality but do increase VRAM usage
  • End the reference audio with a natural sentence ending, not mid-word
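A minimal preprocessing sketch following the guidelines above. It downmixes to mono and resamples with naive linear interpolation purely for illustration (the model's own loader handles resampling; proper resampling would use a polyphase filter):

```python
import numpy as np

TARGET_SR = 24_000  # the model's sample rate

def prepare_reference(data: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 24 kHz (naive linear interpolation)."""
    if data.ndim == 2:                       # (frames, channels) -> mono
        data = data.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(data) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(data), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        data = np.interp(x_new, x_old, data)
    return data.astype(np.float32)

# 10 s of stereo audio at 48 kHz -> 10 s of mono audio at 24 kHz
stereo = np.zeros((480_000, 2), dtype=np.float32)
mono = prepare_reference(stereo, 48_000)
print(mono.shape)  # (240000,)
```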

Repository Contents

  • model_last_final.safetensors (~640 MB): FP16 EMA model weights
  • vocab.txt (<1 KB): 67-token Hungarian character vocabulary (required for loading)
  • config.json (<1 KB): model architecture and training configuration
  • inference_example.py (~6 KB): complete inference script with Whisper transcription and artifact trimming
  • hungarian_preprocessing.py (~3 KB): text normalizer (numbers, symbols, acronyms to spoken Hungarian)
  • samples/: 20 reference + 20 cloned audio pairs for evaluation
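As an illustration of what such a text normalizer does (a toy sketch, not the shipped hungarian_preprocessing.py), digits can be mapped to their spoken Hungarian forms before synthesis:

```python
import re

# Spoken Hungarian forms for single digits. A real normalizer handles
# full numbers, symbols, and acronyms; this spells digit-by-digit.
DIGITS = {
    "0": "nulla", "1": "egy", "2": "kettő", "3": "három", "4": "négy",
    "5": "öt", "6": "hat", "7": "hét", "8": "nyolc", "9": "kilenc",
}

def spell_digits(text: str) -> str:
    """Replace each digit with its spoken Hungarian word."""
    out = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    return re.sub(r"\s+", " ", out).strip()

print(spell_digits("5 perc"))  # öt perc
```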

Known Issues & Workarounds

Onset Artifact (First-Word Distortion)

The model occasionally produces a brief garbled sound (~200-400ms) at the very beginning of generated audio. This is a known characteristic of the F5-TTS DiT architecture — the hard mel-spectrogram boundary between reference and generated speech does not always align with the acoustic boundary.

Workarounds (in order of effectiveness):

  1. Prefix trick: prepend a short filler word to gen_text (e.g. "szóval, " + your_text), then trim the first ~400 ms from the output
  2. Silence padding: add 500-700 ms of silence to the end of the reference audio before inference
  3. VAD-based trimming: use Silero VAD to detect the actual speech onset and trim everything before it
  4. Regeneration: different random seeds produce different alignments; retry 2-3 times if needed
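The prefix trick can be sketched as follows, assuming the infer() API from the Quick Start; the filler word and the 400 ms trim are the heuristics described above, not tuned constants:

```python
import numpy as np

FILLER = "Szóval, "    # short Hungarian filler prepended to gen_text
TRIM_SECONDS = 0.4     # roughly the span the filler (plus artifact) occupies

def trim_onset(wav: np.ndarray, sr: int, seconds: float = TRIM_SECONDS) -> np.ndarray:
    """Drop the first `seconds` of audio, where the filler and any onset
    artifact land."""
    return wav[int(seconds * sr):]

# Usage with the Quick Start API (variable names are placeholders):
# wav, sr, _ = model.infer(ref_file=..., ref_text=..., gen_text=FILLER + text)
# wav = trim_onset(np.asarray(wav), sr)

ten_seconds = np.ones(240_000, dtype=np.float32)  # dummy 10 s @ 24 kHz
trimmed = trim_onset(ten_seconds, 24_000)
print(len(trimmed) / 24_000)  # 9.6
```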

The included inference_example.py implements an energy-based adaptive trimmer. Community contributions for better solutions are welcome!

Duration Estimation

Very short texts (<5 characters) may produce stretched or compressed audio. The model estimates duration from the ref_text/gen_text character ratio — extremely short inputs can break this heuristic. Workaround: pad short texts with natural filler.
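A minimal guard against this edge case (a sketch; the filler phrase and threshold are just one workable choice):

```python
MIN_CHARS = 5  # below this, the duration heuristic becomes unreliable

def pad_short_text(text: str, filler: str = "Nos, ") -> str:
    """Prepend a natural Hungarian filler until the text is long enough
    for the character-ratio duration estimate."""
    while len(text) < MIN_CHARS:
        text = filler + text
    return text

print(pad_short_text("Jó."))  # Nos, Jó.
```

As with the prefix trick above, the filler span can be trimmed from the generated audio afterwards.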

Roadmap

  • Improved onset artifact elimination (Silero VAD + soft mel crossfade)
  • Streaming inference support (sentence-level chunking)
  • ComfyUI custom node
  • Companion model: Whisper Hungarian LoRA for STT
  • Extended vocabulary with English code-switching support
  • INT8 quantized variant for lower VRAM (<4GB)

Community & Contributions

The model weights are released under CC-BY-NC-4.0 (following the base F5-TTS license). The inference code, preprocessing scripts, and tooling are open for contributions!

We welcome:

  • Better Hungarian text normalization rules
  • Improved onset artifact handling
  • New reference audio / evaluation pairs
  • Bug reports and quality feedback
  • Translations of documentation

GitHub (inference toolkit & issues): github.com/Maxdorger/f5-tts-hungarian
Hugging Face (model weights): huggingface.co/Maxdorger29/f5-tts-hungarian

Support the Project

Training high-quality TTS models on H100 GPUs is expensive. If this model is useful to you, consider buying me a coffee — it directly funds compute time for future improvements!

Ko-fi

License

This model is released under CC-BY-NC-4.0, consistent with the base F5-TTS model license.

  • Personal & Research use: Free
  • Commercial use: Not permitted under this license (upstream restriction from F5-TTS base model)
  • Attribution: Credit "F5-TTS Hungarian by Maxdorger29" in any published work

Acknowledgments

Citation

@misc{f5tts-hungarian-2026,
  title={F5-TTS Hungarian: Native Magyar Text-to-Speech with Zero-Shot Voice Cloning},
  author={Maxdorger29},
  year={2026},
  url={https://huggingface.co/Maxdorger29/f5-tts-hungarian},
  note={Fine-tuned on 223k Hungarian speech samples using NVIDIA H100}
}