Parakeet-TDT-CTC-110M CoreML

NVIDIA's Parakeet-TDT-CTC-110M model converted to CoreML format for efficient inference on Apple Silicon.

Model Description

This is a hybrid ASR model with a shared Conformer encoder and two decoder heads:

CTC Head: Fast greedy decoding, ideal for keyword spotting
TDT Head: Token-Duration Transducer for high-quality transcription

Architecture

Component	Description	Size
Preprocessor	Mel spectrogram extraction	~1 MB
Encoder	Conformer encoder (shared)	~400 MB
CTCHead	CTC output projection	~4 MB
Decoder	TDT prediction network (LSTM)	~25 MB
JointDecision	TDT joint network	~6 MB

Total size: ~436 MB

Performance

Benchmarked on the Earnings22 dataset (772 audio files):

Metric	Value
Keyword Recall (CTC mode)	100% (1309/1309)
WER (TDT mode)	17.97%
RTFx (M4 Pro)	358x real-time
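As a rough illustration of how the three metrics above are usually defined, here is a minimal sketch. The function names are hypothetical and not part of this repository, and the text normalization used in the actual benchmark is not specified here, so this version simply splits on whitespace. Corpus-level WER is normally computed by summing errors and reference words over all files rather than averaging per-file rates.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds


def keyword_recall(found: int, total: int) -> float:
    """Fraction of reference keywords that were detected."""
    return found / total
```

With these definitions, an RTFx of 358 means that 358 seconds of audio are processed in about one second of wall-clock time.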

Requirements

macOS 13+ (Ventura or later)
Apple Silicon (M1/M2/M3/M4)
Python 3.10+

Installation

# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

# For audio file support (WAV, MP3, etc.)
pip install -e ".[audio]"

Usage

Python Inference

from scripts.inference import ParakeetCoreML

# Load model (from current directory with .mlpackage files)
model = ParakeetCoreML(".")

# Transcribe with TDT (higher quality)
text = model.transcribe("audio.wav", mode="tdt")
print(text)

# Or use CTC for faster keyword spotting
text = model.transcribe("audio.wav", mode="ctc")
print(text)
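To transcribe several files in one go, the same transcribe call can be wrapped in a small loop. This is a sketch only: transcribe_directory is a hypothetical helper, not part of scripts/inference.py, and it assumes nothing about the model beyond the transcribe(path, mode=...) call shown above.

```python
from pathlib import Path
from typing import Dict


def transcribe_directory(model, audio_dir: str, mode: str = "tdt",
                         pattern: str = "*.wav") -> Dict[str, str]:
    """Transcribe every file in audio_dir matching pattern, keyed by file name."""
    results: Dict[str, str] = {}
    # Sort for a stable, reproducible processing order.
    for path in sorted(Path(audio_dir).glob(pattern)):
        results[path.name] = model.transcribe(str(path), mode=mode)
    return results
```

For example, transcribe_directory(model, "recordings", mode="ctc") would return a dictionary mapping each WAV file name in a recordings folder to its CTC transcript.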

Command Line

# TDT decoding (default, higher quality)
uv run scripts/inference.py --audio audio.wav

# CTC decoding (faster, good for keyword spotting)
uv run scripts/inference.py --audio audio.wav --mode ctc

Model Conversion

To convert from the original NeMo model:

# Install conversion dependencies
uv sync --extra convert

# Run conversion
uv run scripts/convert_nemo_to_coreml.py --output-dir ./model

This will:

Download the original model from NVIDIA (nvidia/parakeet-tdt_ctc-110m)
Convert each component to CoreML format
Extract vocabulary and create metadata

File Structure

./
├── Preprocessor.mlpackage    # Audio → Mel spectrogram
├── Encoder.mlpackage         # Mel → Encoder features
├── CTCHead.mlpackage         # Encoder → CTC log probs
├── Decoder.mlpackage         # TDT prediction network
├── JointDecision.mlpackage   # TDT joint network
├── vocab.json                # Token vocabulary (1024 tokens)
├── metadata.json             # Model configuration
├── pyproject.toml            # Python dependencies
├── uv.lock                   # Locked dependencies
└── scripts/                  # Inference & conversion scripts
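Before loading the model, it can help to confirm that the directory actually contains the files listed above. The following is a minimal sketch; missing_model_files is a hypothetical helper, and the list of names is copied from the tree above rather than read from the inference script.

```python
from pathlib import Path
from typing import List

# File and package names taken from the File Structure listing.
EXPECTED_FILES = [
    "Preprocessor.mlpackage",
    "Encoder.mlpackage",
    "CTCHead.mlpackage",
    "Decoder.mlpackage",
    "JointDecision.mlpackage",
    "vocab.json",
    "metadata.json",
]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    # .mlpackage entries are directories, so exists() is used rather than is_file().
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

An empty list from missing_model_files(".") indicates that all five packages and both JSON files are present.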

Decoding Modes

TDT Mode (Recommended for Transcription)

Uses Token-Duration Transducer decoding
Higher accuracy (17.97% WER on Earnings22)
Predicts both tokens and durations
Best for full transcription tasks
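The idea of predicting both a token and a duration can be sketched as a greedy loop in which each predicted duration tells the decoder how many encoder frames to skip. This is an illustration under stated assumptions, not the code in scripts/inference.py: predict and joint are hypothetical callables standing in for the Decoder and JointDecision packages, joint is assumed to return an already-selected (token_id, duration) pair, and the blank token is assumed to double as the start symbol.

```python
from typing import Any, Callable, List, Sequence, Tuple


def tdt_greedy_decode(
    encoder_frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],
    joint: Callable[[Any, Any], Tuple[int, int]],
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy TDT decoding: emit non-blank tokens and jump ahead by the predicted duration."""
    tokens: List[int] = []
    # Prime the prediction network with the blank token (assumed start symbol).
    dec_out, state = predict(blank_id, None)
    t = 0
    emitted_here = 0  # tokens emitted without advancing past frame t
    while t < len(encoder_frames):
        token, duration = joint(encoder_frames[t], dec_out)
        if token != blank_id:
            tokens.append(token)
            dec_out, state = predict(token, state)
            emitted_here += 1
        # A zero duration keeps the decoder on the same frame. Force progress if
        # nothing was emitted or the per-frame emission limit has been reached.
        if duration == 0 and (token == blank_id or emitted_here >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            emitted_here = 0
        t += duration
    return tokens
```

Because durations greater than one skip frames entirely, the joint network is evaluated on fewer frames than in a conventional transducer, which steps through every frame.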

CTC Mode (Recommended for Keyword Spotting)

Greedy CTC decoding
Faster inference
100% keyword recall on Earnings22
Best for detecting specific words/phrases
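Greedy CTC decoding takes the most likely token at each frame, collapses consecutive repeats, and drops blanks. The sketch below shows that rule on plain Python lists. ctc_greedy_decode is a hypothetical name, and the blank index is passed in explicitly because this README does not state which ID the CTCHead output uses for blank.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(log_probs: Sequence[Sequence[float]], blank_id: int) -> List[int]:
    """Collapse per-frame argmax predictions into a token ID sequence."""
    tokens: List[int] = []
    previous: Optional[int] = None
    for frame in log_probs:
        # Index of the highest-scoring class for this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the prediction changes and is not the blank symbol.
        if best != previous and best != blank_id:
            tokens.append(best)
        previous = best
    return tokens
```

A blank between two identical predictions separates them, so the frame-level sequence 1, 1, blank, 1 decodes to two copies of token 1 rather than one.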

Custom Vocabulary / Keyword Spotting

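For keyword spotting, CTC mode with custom vocabulary boosting achieves 100% recall on Earnings22 (1309/1309 keywords):

import json

# Helper assumed by the loop below; this version allows gaps between matched IDs
def is_subsequence(needle, haystack):
    it = iter(haystack)
    return all(token in it for token in needle)

# Load custom vocabulary with token IDs
with open("custom_vocab.json") as f:
    keywords = json.load(f)  # {"keyword": [token_ids], ...}

# Run CTC decoding (encoder_output is the Encoder output for the audio)
tokens = model.decode_ctc(encoder_output)

# Check for keyword matches
for keyword, expected_ids in keywords.items():
    if is_subsequence(expected_ids, tokens):
        print(f"Found keyword: {keyword}")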

License

This model conversion is released under the Apache 2.0 License, same as the original NVIDIA model.

Citation

If you use this model, please cite the original NVIDIA work:

@misc{nvidia_parakeet_tdt_ctc,
  title={Parakeet-TDT-CTC-110M},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-110m}
}

Acknowledgments

Original model by NVIDIA NeMo
CoreML conversion by FluidInference

Evaluation results

Self-reported test-set WER:

Dataset	Test WER
AMI (Meetings test)	15.880
Earnings-22	12.420
GigaSpeech	10.520
LibriSpeech (clean)	2.400
LibriSpeech (other)	5.200
SPGI Speech	2.540
tedlium-v3	4.160
Vox Populi	6.910