# Parakeet TDT 0.6B v3 (MLX, Encoder INT8)

NVIDIA Parakeet TDT v3 with encoder-only INT8 quantization, the recommended variant for most users: zero WER degradation, 30% faster inference, and 58% lower peak memory than BF16.

## Why INT8?

| Metric | BF16 | INT8 | Change |
|---|---|---|---|
| WER (LibriSpeech) | 0.82% | 0.82% | None |
| WER (TED-LIUM) | 15.1% | 15.1% | None |
| RTFx | 73x | 95x | +30% |
| Peak memory | 3,002 MB | 1,268 MB | -58% |
| Weight size | 1,254 MB | 755 MB | -40% |

Benchmarked on Apple M3 Max (64 GB), macOS Sequoia 15.7.3, MLX 0.30.4.
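To put the RTFx numbers in wall-clock terms (RTFx is the ratio of audio duration to processing time), a quick back-of-the-envelope calculation:

```python
# RTFx = audio duration / processing time, so time = duration / RTFx.
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time to transcribe a clip at a given real-time factor."""
    return audio_seconds / rtfx

hour = 3600.0
print(f"BF16 (73x): {processing_seconds(hour, 73.0):.1f} s per hour of audio")
print(f"INT8 (95x): {processing_seconds(hour, 95.0):.1f} s per hour of audio")
```

At 95x, an hour of audio transcribes in under 40 seconds, versus roughly 49 seconds at 73x.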

## Quantization Details

Only the Conformer encoder (~85% of parameters) is quantized to INT8 (group_size=64). The decoder and joint network remain in BF16, preserving precision for token generation.
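The reported weight size can be loosely cross-checked from these numbers. A rough sketch, assuming ~0.6B total parameters, ~85% of them in the encoder, and one BF16 scale and bias stored per 64-weight group (all assumptions; the reported 755 MB likely includes additional buffers not counted here):

```python
# Rough weight-size estimate for encoder-only INT8 (assumed: 0.6B params,
# 85% in the encoder, one BF16 scale + bias per group of 64 weights).
params = 0.6e9
enc_params = 0.85 * params
rest_params = params - enc_params

int8_bytes = enc_params * 1 + (enc_params / 64) * 2 * 2  # weights + group scale/bias
bf16_bytes = rest_params * 2                             # decoder + joint stay BF16

total_mb = (int8_bytes + bf16_bytes) / 2**20
print(f"~{total_mb:.0f} MB")  # a lower bound on the reported 755 MB
```

The same arithmetic for full BF16 (2 bytes per parameter) gives ~1,144 MB, in the same ballpark as the reported 1,254 MB.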

## Usage

```python
from parakeet import from_pretrained
import mlx.core as mx

model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8", dtype=mx.bfloat16)
result = model.transcribe("audio.wav")
```

Install: `pip install parakeet-mlx`

## Origin

Quantized from sonic-speech/parakeet-tdt-0.6b-v3 using `mlx.nn.quantize` (encoder-only, 8-bit, group_size=64).
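The encoder-only selection can be expressed through `mlx.nn.quantize`'s `class_predicate` argument, a callable that decides per-module whether to quantize. A minimal sketch; the `"encoder"` path prefix is an assumption about this model's module layout:

```python
# Quantize only modules whose parameter path sits under the encoder;
# everything else (decoder, joint network) keeps its original precision.
def encoder_only(path: str, module=None) -> bool:
    return path.startswith("encoder")

# With MLX installed and the model loaded (not run here):
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=8, class_predicate=encoder_only)
print(encoder_only("encoder.layers.0.self_attn.q_proj"))  # True
print(encoder_only("decoder.prediction.embed"))           # False
```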

Part of the Sonic Speech model collection.
