# Kokoro-82M CoreML INT8
End-to-end CoreML export of hexgrad/Kokoro-82M with INT8 k-means palettization, optimized for Apple Neural Engine. Requires iOS 18+ / macOS 15+.
A single `kokoro_5s.mlmodelc` runs the full pipeline (BERT → duration prediction → fixed-shape alignment → prosody → decoder) in one CoreML call. G2P (grapheme-to-phoneme) is a separate pair of CoreML models.
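If you want to bypass speech-swift and load the compiled models yourself, plain CoreML is enough. A minimal sketch (the directory path is a placeholder, and the models' input/output feature names are not shown here):

```swift
import CoreML

// Minimal sketch: load the pre-compiled models directly with plain CoreML.
// speech-swift does this internally; the directory path is a placeholder.
func loadKokoroModels(from dir: URL) throws -> (kokoro: MLModel, g2pEncoder: MLModel, g2pDecoder: MLModel) {
    let config = MLModelConfiguration()
    config.computeUnits = .all  // let CoreML schedule work on the Neural Engine where possible

    let kokoro = try MLModel(contentsOf: dir.appendingPathComponent("kokoro_5s.mlmodelc"),
                             configuration: config)
    let g2pEncoder = try MLModel(contentsOf: dir.appendingPathComponent("G2PEncoder.mlmodelc"),
                                 configuration: config)
    let g2pDecoder = try MLModel(contentsOf: dir.appendingPathComponent("G2PDecoder.mlmodelc"),
                                 configuration: config)
    return (kokoro, g2pEncoder, g2pDecoder)
}
```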
## Model
| Parameter | Value |
|---|---|
| Parameters | 82M |
| Precision | INT8 k-means palettization |
| Max audio length | 5 s (200 frames @ 40 fps) |
| Sample rate | 24 kHz |
| Style dimension | 256 |
| Max phonemes per pass | 128 |
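The fixed shapes above translate into a few hard limits for callers. A rough sketch of the derived constants and phoneme padding (the pad token ID and the helper function are illustrative, not part of the shipped pipeline):

```swift
// Derived from the table above: 200 frames at 40 fps = 5 s of audio,
// and 24 000 Hz / 40 fps = 600 samples per frame, i.e. at most 120 000 samples per pass.
enum KokoroShapes {
    static let sampleRate = 24_000
    static let framesPerSecond = 40
    static let maxFrames = 200                                  // 5 s ceiling per CoreML call
    static let samplesPerFrame = sampleRate / framesPerSecond   // 600
    static let maxSamples = maxFrames * samplesPerFrame         // 120_000
    static let maxPhonemes = 128
}

// Illustrative only: pad or truncate a phoneme ID sequence to the fixed 128-slot input.
// The actual pad token ID comes from vocab_index.json and is an assumption here.
func fitPhonemes(_ ids: [Int32], padID: Int32 = 0) -> [Int32] {
    let clipped = Array(ids.prefix(KokoroShapes.maxPhonemes))
    return clipped + Array(repeating: padID, count: KokoroShapes.maxPhonemes - clipped.count)
}
```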
## Files
| File | Size | Description |
|---|---|---|
| `kokoro_5s.mlmodelc` | 83 MB | Pre-compiled end-to-end model; loads directly on-device |
| `G2PEncoder.mlmodelc` | 0.7 MB | Grapheme-to-phoneme encoder |
| `G2PDecoder.mlmodelc` | 0.8 MB | Grapheme-to-phoneme decoder |
| `voices/` | 0.5 MB | 54 preset voice embeddings (10 languages) |
| `vocab_index.json` | 4 KB | Phoneme vocabulary |
| `g2p_vocab.json` | 4 KB | G2P vocabulary |
| `us_gold.json`, `us_silver.json` | 6 MB | English pronunciation dictionaries |
| `pipeline_config.json` | 4 KB | Swift pipeline config |
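One plausible way the dictionaries fit into G2P is a dictionary-first lookup with the neural encoder/decoder as fallback. The sketch below assumes `us_gold.json` and `us_silver.json` map lowercased words to phoneme strings, which is an assumption about their format rather than documented behavior:

```swift
import Foundation

// Sketch of a dictionary-first G2P lookup. Assumption: the JSON files map
// lowercased words to phoneme strings; the real format may differ.
func loadDictionary(_ url: URL) throws -> [String: String] {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode([String: String].self, from: data)
}

func phonemes(for word: String,
              gold: [String: String],
              silver: [String: String]) -> String? {
    let key = word.lowercased()
    // Prefer the curated "gold" entries, fall back to "silver",
    // and return nil so the caller can fall back to the neural G2P models.
    return gold[key] ?? silver[key]
}
```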
## Quality
Measured against the FP16 reference export on a 1-second test utterance (af_heart voice, 14 phonemes), using the same CoreML inference path:
| Metric | Value |
|---|---|
| Predicted duration Δ | 0 frames |
| Output sample count | identical |
| Log-spec distance | 0.42 (difference close to inaudible) |
| SI-SDR (waveform) | +0.01 dB |
| Size vs FP16 | −74% (83 MB vs 310 MB) |
Because CoreML k-means palettization is not deterministic (scikit-learn's k-means is unseeded), different exports land at different losses. This checkpoint was selected as the best of multiple export runs.
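For reference, the SI-SDR figure follows the standard scale-invariant definition; a minimal sketch of that metric (not the evaluation script used for this card):

```swift
import Foundation

// Scale-invariant SDR between a reference waveform and an estimate, in dB.
// Standard definition: project the estimate onto the reference, then compare
// the energy of that projection to the energy of the residual.
func siSDR(reference: [Float], estimate: [Float]) -> Float {
    precondition(reference.count == estimate.count, "waveforms must be the same length")
    let dot = zip(estimate, reference).map { $0 * $1 }.reduce(0, +)
    let refEnergy = reference.map { $0 * $0 }.reduce(0, +)
    let scale = dot / refEnergy
    let target = reference.map { scale * $0 }          // scaled projection onto the reference
    let noise = zip(estimate, target).map { $0 - $1 }  // residual
    let targetEnergy = target.map { $0 * $0 }.reduce(0, +)
    let noiseEnergy = noise.map { $0 * $0 }.reduce(0, +)
    return 10 * log10(targetEnergy / noiseEnergy)
}
```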
## Voices
54 preset voices across 10 languages: English (US/UK), Spanish, French, Hindi, Italian, Japanese, Korean, Portuguese, Chinese.
## Usage
Add speech-swift to your `Package.swift` dependencies:

```swift
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
```
Then synthesize:

```swift
import KokoroTTS

let tts = try await KokoroTTSModel.fromPretrained(
    modelId: "aufklarer/Kokoro-82M-CoreML-INT8"
)
let audio = try await tts.synthesize(
    "Hello world, this is a Kokoro test.",
    voice: "af_heart"
)
```
CLI:

```bash
swift run audio kokoro "Hello world" --voice af_heart --output out.wav
```
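To keep the result, you can write the samples out as a 24 kHz mono WAV with AVFoundation. This assumes `synthesize` returns an array of Float samples, which is an assumption about the speech-swift API rather than a documented guarantee:

```swift
import AVFoundation

// Assumption: `samples` is mono Float32 PCM at 24 kHz, matching the model's output rate.
func writeWAV(_ samples: [Float], to url: URL, sampleRate: Double = 24_000) throws {
    let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                               sampleRate: sampleRate,
                               channels: 1,
                               interleaved: false)!
    let file = try AVAudioFile(forWriting: url, settings: format.settings)
    let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                  frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    for (i, sample) in samples.enumerated() {
        buffer.floatChannelData![0][i] = sample   // copy into the single mono channel
    }
    try file.write(from: buffer)
}

// e.g. try writeWAV(audio, to: URL(fileURLWithPath: "out.wav"))
```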
## Source
- Base model: hexgrad/Kokoro-82M (Apache-2.0)
- Dictionaries and G2P: Apache-2.0
## License
- Model weights: Apache-2.0
- CoreML conversion: Apache-2.0
## Links
- speech-swift – Apple SDK
- soniqo.audio – website
- MLX vs CoreML on Apple Silicon – a practical guide (related blog post)
- soniqo.audio/blog – blog