Parakeet-EOU-120M CoreML INT8

Streaming speech recognition with end-of-utterance (EOU) detection, converted to CoreML for inference on the Apple Neural Engine. Part of speech-swift, on-device speech AI for Apple Silicon.

Based on nvidia/parakeet_realtime_eou_120m-v1 (FastConformer-RNNT, cache-aware streaming).

Quick Start

// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()

// Streaming transcription from the microphone.
// micStream is your own async stream of audio chunks; it is not defined in this snippet.
for await partial in model.transcribeStream(audioChunks: micStream) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}

Or via CLI:

git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe --engine parakeet-eou --stream recording.wav

Model

| Property | Value |
| --- | --- |
| Parameters | 120M |
| Architecture | FastConformer-RNNT (17-layer encoder, 1-layer LSTM decoder) |
| Format | CoreML (.mlmodelc) |
| Quantization | INT8 palettization (encoder) |
| Vocabulary | 1024 BPE + EOU + EOB + blank (1027 total) |
| Sample rate | 16 kHz |
| Streaming chunk | 320 ms (configurable) |
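
At 16 kHz, a 320 ms chunk corresponds to 5120 samples. The following is a minimal sketch of slicing a mono sample buffer into chunks of that size. The helper name splitIntoChunks and the zero-padding of the final partial chunk are assumptions made for illustration; they are not taken from speech-swift, which may buffer audio differently.

```swift
import Foundation

// Values from the Model table: 16 kHz sample rate, 320 ms streaming chunk.
let sampleRate = 16_000
let chunkMilliseconds = 320
let samplesPerChunk = sampleRate * chunkMilliseconds / 1000

/// Splits mono samples into fixed-size chunks.
/// Assumption: the last chunk is zero-padded up to chunkSize.
func splitIntoChunks(_ samples: [Float], chunkSize: Int) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    return stride(from: 0, to: samples.count, by: chunkSize).map { start in
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        return chunk
    }
}

// 0.75 s of placeholder audio yields three chunks, the last one padded.
let demoSamples = [Float](repeating: 0.5, count: 12_000)
let demoChunks = splitIntoChunks(demoSamples, chunkSize: samplesPerChunk)
print("samples per chunk: \(samplesPerChunk), chunks: \(demoChunks.count)")
```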

Files

| File | Size | Description |
| --- | --- | --- |
| encoder.mlmodelc | 102 MB | Cache-aware FastConformer encoder (INT8) |
| decoder.mlmodelc | 7.5 MB | 1-layer LSTM prediction network |
| joint.mlmodelc | 2.7 MB | RNNT joint network (1027 outputs) |
| config.json | <1 KB | Model configuration and streaming params |
| vocab.json | <1 KB | BPE vocabulary (1026 tokens) |
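
ParakeetStreamingASRModel.fromPretrained() presumably takes care of loading these files. If you want to inspect the compiled models directly, the sketch below shows one way to open them with the standard CoreML API. It assumes the three .mlmodelc bundles sit in a single local directory; the helper names componentURLs and loadComponents are hypothetical, and the models' input and output feature names are not covered here.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

// Component names taken from the Files table.
let componentNames = ["encoder", "decoder", "joint"]

/// Builds the expected .mlmodelc locations inside a local model directory (assumed layout).
func componentURLs(in directory: URL) -> [URL] {
    componentNames.map { directory.appendingPathComponent("\($0).mlmodelc") }
}

#if canImport(CoreML)
/// Loads each compiled model, asking CoreML to use the CPU and the Neural Engine.
@available(macOS 13.0, iOS 16.0, *)
func loadComponents(from directory: URL) throws -> [MLModel] {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .cpuAndNeuralEngine
    return try componentURLs(in: directory).map { url in
        try MLModel(contentsOf: url, configuration: configuration)
    }
}
#endif
```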

Performance

Benchmarked on an Apple M2 Max with 320 ms chunks and CoreML compute units set to CPU + Neural Engine.

| Metric | Value |
| --- | --- |
| Encoder latency (mean) | 17.0 ms |
| Full pipeline latency (mean) | 17.7 ms |
| Full pipeline latency (P95) | 22.4 ms |
| RTF (real-time factor) | 0.055 (18x real-time) |
| Streaming cache | 2.6 MB |
| Total model size | 112.4 MB |
| Peak RSS delta (100 chunks) | 0 MB |
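
The RTF figure follows from the latency numbers: real-time factor is processing time divided by the duration of audio processed. The sketch below redoes that arithmetic for the mean and P95 pipeline latencies over a 320 ms chunk. The function name realTimeFactor is illustrative only.

```swift
import Foundation

/// Real-time factor: time spent processing divided by the duration of audio processed.
func realTimeFactor(processingMs: Double, audioMs: Double) -> Double {
    processingMs / audioMs
}

let chunkMs = 320.0
let meanRTF = realTimeFactor(processingMs: 17.7, audioMs: chunkMs)
let p95RTF = realTimeFactor(processingMs: 22.4, audioMs: chunkMs)
let meanSpeedup = 1.0 / meanRTF  // how many times faster than real time

print(String(format: "mean RTF %.3f, P95 RTF %.3f, speedup %.1fx", meanRTF, p95RTF, meanSpeedup))
```

Values below 1.0 mean each chunk is processed faster than it arrives, which is what leaves headroom for streaming.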

Streaming Cache Shapes

cache_last_channel:     [17, 1, 70, 512]   float32   2.4 MB
cache_last_time:        [17, 1, 512, 8]    float32   0.3 MB
cache_last_channel_len: [1]                int32
LSTM h:                 [1, 1, 640]        float16
LSTM c:                 [1, 1, 640]        float16
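
The per-tensor sizes can be recomputed from the shapes and element widths listed above. The sketch below does that arithmetic, assuming 4 bytes for float32 and int32 and 2 bytes for float16. The CacheTensor type is hypothetical and only serves the calculation. The total comes to roughly 2.7 million bytes, or about 2.6 MiB, which appears to correspond to the streaming cache figure in the Performance section.

```swift
import Foundation

/// Hypothetical helper describing one state tensor carried between chunks.
struct CacheTensor {
    let name: String
    let shape: [Int]
    let bytesPerElement: Int

    /// Number of elements times element width.
    var byteCount: Int { shape.reduce(1, *) * bytesPerElement }
}

let cacheTensors = [
    CacheTensor(name: "cache_last_channel", shape: [17, 1, 70, 512], bytesPerElement: 4),
    CacheTensor(name: "cache_last_time", shape: [17, 1, 512, 8], bytesPerElement: 4),
    CacheTensor(name: "cache_last_channel_len", shape: [1], bytesPerElement: 4),
    CacheTensor(name: "LSTM h", shape: [1, 1, 640], bytesPerElement: 2),
    CacheTensor(name: "LSTM c", shape: [1, 1, 640], bytesPerElement: 2),
]

let totalCacheBytes = cacheTensors.reduce(0) { $0 + $1.byteCount }

for tensor in cacheTensors {
    print("\(tensor.name): \(tensor.byteCount) bytes")
}
print(String(format: "total: %d bytes (%.2f MiB)", totalCacheBytes, Double(totalCacheBytes) / 1_048_576))
```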

Usage

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()

for await partial in model.transcribeStream(audioChunks: micStream) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}
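
One way to turn the stream of partials into a running transcript is a small accumulator that keeps finalized utterances and the current in-progress text separately. This is a sketch, not part of speech-swift: TranscriptAccumulator is a hypothetical type, and it assumes each partial carries the full text of the current utterance so far rather than only the newly decoded words.

```swift
import Foundation

/// Hypothetical accumulator for streaming results.
/// Assumption: non-final partials replace the in-progress text; final ones close the utterance.
struct TranscriptAccumulator {
    private(set) var finalized: [String] = []
    private(set) var inProgress: String = ""

    mutating func ingest(text: String, isFinal: Bool) {
        if isFinal {
            finalized.append(text)
            inProgress = ""
        } else {
            inProgress = text
        }
    }

    /// Finalized utterances followed by whatever is still being decoded.
    var display: String {
        (finalized + (inProgress.isEmpty ? [] : [inProgress])).joined(separator: " ")
    }
}

// Simulated partials standing in for the values yielded by transcribeStream.
var accumulator = TranscriptAccumulator()
accumulator.ingest(text: "hello", isFinal: false)
accumulator.ingest(text: "hello world", isFinal: true)
accumulator.ingest(text: "how", isFinal: false)
print(accumulator.display)
```

Inside the loop above, the equivalent call would be accumulator.ingest(text: partial.text, isFinal: partial.isFinal).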

EOU Detection

End-of-utterance is detected via a special <EOU> token (ID 1024) emitted by the joint network, so no external VAD is required for utterance segmentation.
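
To illustrate the idea, the sketch below splits a decoded token ID sequence into utterances at each <EOU> token. Only the ID 1024 comes from this card. The function name splitAtEOU is hypothetical, and the input is assumed to be the emitted token sequence with blanks already removed by the RNNT decoding loop.

```swift
import Foundation

// <EOU> token ID as stated in this model card.
let eouTokenID = 1024

/// Splits emitted token IDs into completed utterances plus the tokens still pending.
/// Assumption: blank tokens have already been dropped by the decoder.
func splitAtEOU(_ tokenIDs: [Int]) -> (utterances: [[Int]], pending: [Int]) {
    var utterances: [[Int]] = []
    var current: [Int] = []
    for id in tokenIDs {
        if id == eouTokenID {
            // Close the current utterance; ignore an <EOU> with nothing before it.
            if !current.isEmpty {
                utterances.append(current)
            }
            current = []
        } else {
            current.append(id)
        }
    }
    return (utterances, current)
}

let segmented = splitAtEOU([5, 9, 1024, 7, 1024, 3])
print("utterances: \(segmented.utterances), pending: \(segmented.pending)")
```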

Source

Converted from nvidia/parakeet_realtime_eou_120m-v1 using coremltools 8.3 with INT8 palettization.
