F5-TTS Hungarian — Native Magyar Text-to-Speech with Zero-Shot Voice Cloning

The first and only Hungarian fine-tuned F5-TTS model with zero-shot voice cloning from just 5-15 seconds of reference audio.


Highlights

  • Native Hungarian — Trained on 223,000+ Hungarian speech samples (~280 hours)
  • Zero-shot Voice Cloning — Clone any voice with just 5-15 seconds of reference audio
  • Runs Locally — No cloud API needed, runs on consumer GPUs (8GB+ VRAM)
  • Fast Inference — Real-time factor < 0.5 on RTX 3060+
  • Privacy First — All processing happens on your machine

Audio Samples

Listen to what the model can do from just 5-15 seconds of Hungarian reference audio. Each sample pairs the original reference recording with the F5-TTS clone generated from it (see the samples/ directory).

Model Details

  • Base Model: F5-TTS v1 Base (DiT: 1024-dim, depth 22, 16 attention heads)
  • Pretrained Checkpoint: SWivid/F5-TTS/F5TTS_v1_Base/model_1200000.safetensors
  • Training Hardware: NVIDIA H100 SXM 80 GB HBM3
  • Training Duration: ~50 epochs, stopped at loss plateau (0.63)
  • Total Samples: 223,103
  • Total Audio: ~280 hours
  • Vocabulary: 67 tokens (Hungarian UTF-8 characters + digits)
  • Sample Rate: 24,000 Hz
  • Checkpoint Format: FP16 safetensors (~640 MB)

Training Datasets

  • YodaLingua Hungarian: ~80,740 samples (~206 h), high-quality curated Hungarian speech corpus
  • Common Voice HU 17.0: ~59,925 samples (~54 h), Mozilla community-contributed recordings
  • CSS10 Hungarian: ~4,474 samples (~10 h), single-speaker professional narration (Egri Csillagok)

Training Parameters

  • Learning Rate: 1e-5
  • Warmup Steps: 500
  • Batch Size: 38,400 frames
  • Precision: mixed precision (BF16)
  • Optimizer: AdamW
  • EMA: yes (EMA weights used for inference)

Quick Start

import torch
import torchaudio
import soundfile as sf
import numpy as np

# Monkey-patch torchaudio.load to read via soundfile, avoiding backend
# issues on platforms where torchaudio has no audio backend available
_orig_load = torchaudio.load  # keep a handle to the original
def _patched_load(fp, **kw):
    data, sr = sf.read(str(fp), dtype="float32")
    if data.ndim == 1:
        data = data[np.newaxis, :]  # mono: add channel dimension
    else:
        data = data.T               # (frames, channels) -> (channels, frames)
    return torch.from_numpy(data), sr
torchaudio.load = _patched_load

from f5_tts.api import F5TTS

model = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_last_final.safetensors",
    vocab_file="vocab.txt",
    device="cuda",
    use_ema=True,
)

wav, sr, _ = model.infer(
    ref_file="your_reference.wav",
    ref_text="A referencia hang pontos átirata.",
    gen_text="Szia, ez egy teszt mondat a magyar szövegfelolvasáshoz.",
)

sf.write("output.wav", wav, sr)

Critical: Reference Text Must Match Audio

The ref_text must be the exact transcription of the audio passed as ref_file. Mismatched text causes garbled output. Use Whisper (e.g. faster-whisper) to transcribe automatically:

from faster_whisper import WhisperModel
whisper = WhisperModel("large-v3-turbo", device="cuda")
segments, _ = whisper.transcribe("your_reference.wav", language="hu")
ref_text = " ".join(s.text.strip() for s in segments)

Benchmark

Measured on NVIDIA RTX 5060 Ti 16GB, PyTorch 2.x, CUDA, FP16:

  • Real-Time Factor (RTF): ~0.25 (10 s of audio generated in ~2.5 s)
  • Model Load Time: ~4.4 s (first load, CUDA)
  • Warmup Inference: ~2.5 s (first generation)
  • Steady-State Latency: ~250 ms per sentence
  • Peak VRAM Usage: ~2.4 GB (well within consumer-GPU budgets)
  • Minimum Recommended VRAM: 4 GB (FP16)
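The RTF figure above can be reproduced with a simple wall-clock measurement. The sketch below assumes any generator callable that returns audio samples; the names are placeholders, not part of the F5-TTS API:

```python
import time

def real_time_factor(generate, text, sample_rate=24_000):
    """Wall-clock generation time divided by the duration of the produced audio."""
    start = time.perf_counter()
    wav = generate(text)               # returns a 1-D sequence of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / sample_rate
    return elapsed / audio_seconds

# Example with a stand-in generator that "produces" 10 s of silence instantly:
rtf = real_time_factor(lambda t: [0.0] * 240_000, "teszt")
print(f"RTF: {rtf:.4f}")  # values < 1.0 mean faster than real time
```

For a fair measurement on a real model, discard the first (warmup) generation before averaging.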

Reference Audio Guidelines

For optimal voice cloning quality:

  • Use 5-15 seconds of clean speech (no background noise, no reverb, no music)
  • Provide an exact transcript — this is the single most important quality factor
  • Mono WAV, any sample rate (automatically resampled to 24 kHz)
  • Longer references (>15 s) do NOT improve quality but do increase VRAM usage
  • End the reference audio with a natural sentence ending, not mid-word
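A minimal preprocessing sketch following the guidelines above. It downmixes to mono and resamples with naive linear interpolation purely for illustration (the model's own loader handles resampling; proper resampling would use a polyphase filter):

```python
import numpy as np

TARGET_SR = 24_000  # the model's sample rate

def prepare_reference(data: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 24 kHz (naive linear interpolation)."""
    if data.ndim == 2:                       # (frames, channels) -> mono
        data = data.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(data) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(data), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        data = np.interp(x_new, x_old, data)
    return data.astype(np.float32)

# 10 s of stereo audio at 48 kHz -> 10 s of mono audio at 24 kHz
stereo = np.zeros((480_000, 2), dtype=np.float32)
mono = prepare_reference(stereo, 48_000)
print(mono.shape)  # (240000,)
```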

Repository Contents

  • model_last_final.safetensors (~640 MB): FP16 EMA model weights
  • vocab.txt (<1 KB): 67-token Hungarian character vocabulary (required for loading)
  • config.json (<1 KB): model architecture and training configuration
  • inference_example.py (~6 KB): complete inference script with Whisper transcription and artifact trimming
  • hungarian_preprocessing.py (~3 KB): text normalizer (numbers, symbols, acronyms to spoken Hungarian)
  • samples/: 20 reference + 20 cloned audio pairs for evaluation
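As an illustration of what such a text normalizer does (a toy sketch, not the shipped hungarian_preprocessing.py), digits can be mapped to their spoken Hungarian forms before synthesis:

```python
import re

# Spoken Hungarian forms for single digits. A real normalizer handles
# full numbers, symbols, and acronyms; this spells digit-by-digit.
DIGITS = {
    "0": "nulla", "1": "egy", "2": "kettő", "3": "három", "4": "négy",
    "5": "öt", "6": "hat", "7": "hét", "8": "nyolc", "9": "kilenc",
}

def spell_digits(text: str) -> str:
    """Replace each digit with its spoken Hungarian word."""
    out = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    return re.sub(r"\s+", " ", out).strip()

print(spell_digits("5 perc"))  # öt perc
```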

Known Issues & Workarounds

Onset Artifact (First-Word Distortion)

The model occasionally produces a brief garbled sound (~200-400ms) at the very beginning of generated audio. This is a known characteristic of the F5-TTS DiT architecture — the hard mel-spectrogram boundary between reference and generated speech does not always align with the acoustic boundary.

Workarounds (in order of effectiveness):

  1. Prefix trick: prepend a short filler word to gen_text (e.g. "szóval, " + your_text), then trim the first ~400 ms from the output
  2. Silence padding: add 500-700 ms of silence to the end of the reference audio before inference
  3. VAD-based trimming: use Silero VAD to detect the actual speech onset and trim everything before it
  4. Regeneration: different random seeds produce different alignments; retry 2-3 times if needed
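The prefix trick can be sketched as follows, assuming the infer() API from the Quick Start; the filler word and the 400 ms trim are the heuristics described above, not tuned constants:

```python
import numpy as np

FILLER = "Szóval, "    # short Hungarian filler prepended to gen_text
TRIM_SECONDS = 0.4     # roughly the span the filler (plus artifact) occupies

def trim_onset(wav: np.ndarray, sr: int, seconds: float = TRIM_SECONDS) -> np.ndarray:
    """Drop the first `seconds` of audio, where the filler and any onset
    artifact land."""
    return wav[int(seconds * sr):]

# Usage with the Quick Start API (variable names are placeholders):
# wav, sr, _ = model.infer(ref_file=..., ref_text=..., gen_text=FILLER + text)
# wav = trim_onset(np.asarray(wav), sr)

ten_seconds = np.ones(240_000, dtype=np.float32)  # dummy 10 s @ 24 kHz
trimmed = trim_onset(ten_seconds, 24_000)
print(len(trimmed) / 24_000)  # 9.6
```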

The included inference_example.py implements an energy-based adaptive trimmer. Community contributions for better solutions are welcome!

Duration Estimation

Very short texts (<5 characters) may produce stretched or compressed audio. The model estimates duration from the ref_text/gen_text character ratio — extremely short inputs can break this heuristic. Workaround: pad short texts with natural filler.
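A minimal guard against this edge case (a sketch; the filler phrase and threshold are just one workable choice):

```python
MIN_CHARS = 5  # below this, the duration heuristic becomes unreliable

def pad_short_text(text: str, filler: str = "Nos, ") -> str:
    """Prepend a natural Hungarian filler until the text is long enough
    for the character-ratio duration estimate."""
    while len(text) < MIN_CHARS:
        text = filler + text
    return text

print(pad_short_text("Jó."))  # Nos, Jó.
```

As with the prefix trick above, the filler span can be trimmed from the generated audio afterwards.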

Roadmap

  • Improved onset artifact elimination (Silero VAD + soft mel crossfade)
  • Streaming inference support (sentence-level chunking)
  • ComfyUI custom node
  • Companion model: Whisper Hungarian LoRA for STT
  • Extended vocabulary with English code-switching support
  • INT8 quantized variant for lower VRAM (<4GB)

Community & Contributions

The model weights are released under CC-BY-NC-4.0 (following the base F5-TTS license). The inference code, preprocessing scripts, and tooling are open for contributions!

We welcome:

  • Better Hungarian text normalization rules
  • Improved onset artifact handling
  • New reference audio / evaluation pairs
  • Bug reports and quality feedback
  • Translations of documentation

GitHub (inference toolkit & issues): github.com/Maxdorger/f5-tts-hungarian
Hugging Face (model weights): huggingface.co/Maxdorger29/f5-tts-hungarian

Support the Project

Training high-quality TTS models on H100 GPUs is expensive. If this model is useful to you, consider buying me a coffee — it directly funds compute time for future improvements!

Ko-fi

License

This model is released under CC-BY-NC-4.0, consistent with the base F5-TTS model license.

  • Personal & Research use: Free
  • Commercial use: Not permitted under this license (upstream restriction from F5-TTS base model)
  • Attribution: Credit "F5-TTS Hungarian by Maxdorger29" in any published work

Acknowledgments

Citation

@misc{f5tts-hungarian-2026,
  title={F5-TTS Hungarian: Native Magyar Text-to-Speech with Zero-Shot Voice Cloning},
  author={Maxdorger29},
  year={2026},
  url={https://huggingface.co/Maxdorger29/f5-tts-hungarian},
  note={Fine-tuned on 223k Hungarian speech samples using NVIDIA H100}
}