Streaming speech recognition with end-of-utterance detection, converted to CoreML for Apple Neural Engine inference. Part of speech-swift — on-device speech AI for Apple Silicon.
Based on nvidia/parakeet_realtime_eou_120m-v1 (FastConformer-RNNT, cache-aware streaming).
```swift
// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()

// Streaming transcription from microphone
for await partial in model.transcribeStream(audioChunks: micStream) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}
```
Or via CLI:
```sh
git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe --engine parakeet-eou --stream recording.wav
```
| Property | Value |
|---|---|
| Parameters | 120M |
| Architecture | FastConformer-RNNT (17-layer encoder, 1-layer LSTM decoder) |
| Format | CoreML (.mlmodelc) |
| Quantization | INT8 palettization (encoder) |
| Vocabulary | 1024 BPE + EOU + EOB + blank (1027 total) |
| Sample rate | 16 kHz |
| Streaming chunk | 320ms (configurable) |
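At 16 kHz, the default 320 ms streaming chunk corresponds to 5120 samples. A minimal sketch of slicing a mono buffer into such chunks (the `chunked` helper is illustrative, not part of the ParakeetStreamingASR API):

```swift
import Foundation

// Split a 16 kHz mono buffer into 320 ms streaming chunks (5120 samples each).
// Sample rate and chunk duration mirror the table above; this helper is a
// hypothetical sketch, not part of the library API.
func chunked(_ samples: [Float], sampleRate: Int = 16_000, chunkMs: Int = 320) -> [[Float]] {
    let chunkLen = sampleRate * chunkMs / 1000   // 5120 samples per chunk
    return stride(from: 0, to: samples.count, by: chunkLen).map {
        Array(samples[$0 ..< min($0 + chunkLen, samples.count)])
    }
}

// One second of audio -> 3 full chunks of 5120 samples plus a 640-sample tail.
let chunks = chunked([Float](repeating: 0, count: 16_000))
```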
| File | Size | Description |
|---|---|---|
| `encoder.mlmodelc` | 102 MB | Cache-aware FastConformer encoder (INT8) |
| `decoder.mlmodelc` | 7.5 MB | 1-layer LSTM prediction network |
| `joint.mlmodelc` | 2.7 MB | RNNT joint network (1027 outputs) |
| `config.json` | <1 KB | Model configuration and streaming params |
| `vocab.json` | <1 KB | BPE vocabulary (1026 tokens) |
Benchmarked on Apple M2 Max, 320ms chunks, CoreML CPU+NE compute units.
| Metric | Value |
|---|---|
| Encoder latency (mean) | 17.0 ms |
| Full pipeline latency (mean) | 17.7 ms |
| Full pipeline latency (P95) | 22.4 ms |
| RTF | 0.055 (18x real-time) |
| Streaming cache | 2.6 MB |
| Total model size | 112.4 MB |
| Peak RSS delta (100 chunks) | 0 MB |
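The RTF figure follows directly from the mean pipeline latency and the chunk duration:

```swift
// RTF = processing time per chunk / audio duration per chunk,
// using the mean full-pipeline latency and chunk size from the tables above.
let pipelineLatencyMs = 17.7
let chunkMs = 320.0
let rtf = pipelineLatencyMs / chunkMs   // ≈ 0.055
let speedup = 1.0 / rtf                 // ≈ 18x real-time
```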
```
cache_last_channel:     [17, 1, 70, 512]  float32  2.4 MB
cache_last_time:        [17, 1, 512, 8]   float32  0.3 MB
cache_last_channel_len: [1]               int32
LSTM h:                 [1, 1, 640]       float16
LSTM c:                 [1, 1, 640]       float16
```
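The 2.6 MB streaming-cache figure is just the sum of the two float32 cache tensors above:

```swift
// Cache footprint from the tensor shapes above (float32 = 4 bytes per element).
let channelCache = 17 * 1 * 70 * 512 * 4   // 2_437_120 B ≈ 2.4 MB
let timeCache    = 17 * 1 * 512 * 8 * 4    //   278_528 B ≈ 0.3 MB
let totalMB = Double(channelCache + timeCache) / 1_048_576   // ≈ 2.6 MB
```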
End-of-utterance is detected via a special `<EOU>` token (ID 1024) emitted by the joint network. No external VAD is required for utterance segmentation.
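Since `<EOU>` is an ordinary token ID in the joint output, utterance segmentation can be sketched as a split over the decoded token-ID stream. The token IDs below are hypothetical examples, not real vocabulary entries:

```swift
// Illustrative sketch: split a decoded RNNT token-ID stream into utterances
// at the <EOU> token (ID 1024). The non-EOU IDs are made-up examples.
let eouID = 1024
let decoded = [17, 402, 9, 1024, 88, 301, 1024]
let utterances = decoded.split(separator: eouID).map(Array.init)
// Two utterances: [17, 402, 9] and [88, 301]
```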
Converted from nvidia/parakeet_realtime_eou_120m-v1 using coremltools 8.3 with INT8 palettization.