Streaming speech recognition with end-of-utterance detection, converted to CoreML for Apple Neural Engine inference. Part of speech-swift — on-device speech AI for Apple Silicon.
Based on nvidia/parakeet_realtime_eou_120m-v1 (FastConformer-RNNT, cache-aware streaming).
```swift
// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()

// Streaming transcription from microphone
for await partial in model.transcribeStream(audioChunks: micStream) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}
```
Or via CLI:
```sh
git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe --engine parakeet-eou --stream recording.wav
```
| Property | Value |
|---|---|
| Parameters | 120M |
| Architecture | FastConformer-RNNT (17-layer encoder, 1-layer LSTM decoder) |
| Format | CoreML (.mlmodelc) |
| Quantization | INT8 palettization (encoder) |
| Vocabulary | 1024 BPE + EOU + EOB + blank (1027 total) |
| Sample rate | 16 kHz |
| Streaming chunk | 320ms (configurable) |
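At 16 kHz, the default 320 ms streaming chunk corresponds to 5120 samples. A minimal sketch of slicing a mono buffer into such chunks (the `chunked` helper is illustrative, not part of the ParakeetStreamingASR API):

```swift
import Foundation

// Split a 16 kHz mono buffer into 320 ms streaming chunks (5120 samples each).
// Sample rate and chunk duration mirror the table above; this helper is a
// hypothetical sketch, not part of the library API.
func chunked(_ samples: [Float], sampleRate: Int = 16_000, chunkMs: Int = 320) -> [[Float]] {
    let chunkLen = sampleRate * chunkMs / 1000   // 5120 samples per chunk
    return stride(from: 0, to: samples.count, by: chunkLen).map {
        Array(samples[$0 ..< min($0 + chunkLen, samples.count)])
    }
}

// One second of audio -> 3 full chunks of 5120 samples plus a 640-sample tail.
let chunks = chunked([Float](repeating: 0, count: 16_000))
```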
| File | Size | Description |
|---|---|---|
| `encoder.mlmodelc` | 102 MB | Cache-aware FastConformer encoder (INT8) |
| `decoder.mlmodelc` | 7.5 MB | 1-layer LSTM prediction network |
| `joint.mlmodelc` | 2.7 MB | RNNT joint network (1027 outputs) |
| `config.json` | <1 KB | Model configuration and streaming params |
| `vocab.json` | <1 KB | BPE vocabulary (1026 tokens) |
Benchmarked on Apple M2 Max, 320ms chunks, CoreML CPU+NE compute units.
| Metric | Value |
|---|---|
| Encoder latency (mean) | 17.0 ms |
| Full pipeline latency (mean) | 17.7 ms |
| Full pipeline latency (P95) | 22.4 ms |
| RTF | 0.055 (18x real-time) |
| Streaming cache | 2.6 MB |
| Total model size | 112.4 MB |
| Peak RSS delta (100 chunks) | 0 MB |
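The RTF figure follows directly from the mean pipeline latency and the chunk duration:

```swift
// RTF = processing time per chunk / audio duration per chunk,
// using the mean full-pipeline latency and chunk size from the tables above.
let pipelineLatencyMs = 17.7
let chunkMs = 320.0
let rtf = pipelineLatencyMs / chunkMs   // ≈ 0.055
let speedup = 1.0 / rtf                 // ≈ 18x real-time
```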
```
cache_last_channel:     [17, 1, 70, 512]  float32  2.4 MB
cache_last_time:        [17, 1, 512, 8]   float32  0.3 MB
cache_last_channel_len: [1]               int32
LSTM h:                 [1, 1, 640]       float16
LSTM c:                 [1, 1, 640]       float16
```
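The 2.6 MB streaming-cache figure is just the sum of the two float32 cache tensors above:

```swift
// Cache footprint from the tensor shapes above (float32 = 4 bytes per element).
let channelCache = 17 * 1 * 70 * 512 * 4   // 2_437_120 B ≈ 2.4 MB
let timeCache    = 17 * 1 * 512 * 8 * 4    //   278_528 B ≈ 0.3 MB
let totalMB = Double(channelCache + timeCache) / 1_048_576   // ≈ 2.6 MB
```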
End-of-utterance is detected via a special `<EOU>` token (ID 1024) emitted by the joint network. No external VAD is required for utterance segmentation.
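Since `<EOU>` is an ordinary token ID in the joint output, utterance segmentation can be sketched as a split over the decoded token-ID stream. The token IDs below are hypothetical examples, not real vocabulary entries:

```swift
// Illustrative sketch: split a decoded RNNT token-ID stream into utterances
// at the <EOU> token (ID 1024). The non-EOU IDs are made-up examples.
let eouID = 1024
let decoded = [17, 402, 9, 1024, 88, 301, 1024]
let utterances = decoded.split(separator: eouID).map(Array.init)
// Two utterances: [17, 402, 9] and [88, 301]
```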
Converted from nvidia/parakeet_realtime_eou_120m-v1 using coremltools 8.3 with INT8 palettization.