# Parakeet TDT 0.6B v3 (MLX, Encoder INT8)

NVIDIA Parakeet TDT v3 with encoder-only INT8 quantization, the recommended variant for most users: zero WER degradation, 30% faster inference, and 58% lower peak memory than BF16.

## Why INT8?

| Metric | BF16 | INT8 | Change |
|---|---|---|---|
| WER (LibriSpeech) | 0.82% | 0.82% | None |
| WER (TED-LIUM) | 15.1% | 15.1% | None |
| RTFx | 73x | 95x | +30% |
| Peak memory | 3,002 MB | 1,268 MB | -58% |
| Weight size | 1,254 MB | 755 MB | -40% |

Benchmarked on Apple M3 Max (64 GB), macOS Sequoia 15.7.3, MLX 0.30.4.
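To put the RTFx numbers in wall-clock terms (RTFx is the ratio of audio duration to processing time), a quick back-of-the-envelope calculation:

```python
# RTFx = audio duration / processing time, so time = duration / RTFx.
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time to transcribe a clip at a given real-time factor."""
    return audio_seconds / rtfx

hour = 3600.0
print(f"BF16 (73x): {processing_seconds(hour, 73.0):.1f} s per hour of audio")
print(f"INT8 (95x): {processing_seconds(hour, 95.0):.1f} s per hour of audio")
```

At 95x, an hour of audio transcribes in under 40 seconds, versus roughly 49 seconds at 73x.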

## Quantization Details

Only the Conformer encoder (~85% of parameters) is quantized to INT8 (group_size=64). The decoder and joint network remain in BF16, preserving precision for token generation.
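The reported weight size can be loosely cross-checked from these numbers. A rough sketch, assuming ~0.6B total parameters, ~85% of them in the encoder, and one BF16 scale and bias stored per 64-weight group (all assumptions; the reported 755 MB likely includes additional buffers not counted here):

```python
# Rough weight-size estimate for encoder-only INT8 (assumed: 0.6B params,
# 85% in the encoder, one BF16 scale + bias per group of 64 weights).
params = 0.6e9
enc_params = 0.85 * params
rest_params = params - enc_params

int8_bytes = enc_params * 1 + (enc_params / 64) * 2 * 2  # weights + group scale/bias
bf16_bytes = rest_params * 2                             # decoder + joint stay BF16

total_mb = (int8_bytes + bf16_bytes) / 2**20
print(f"~{total_mb:.0f} MB")  # a lower bound on the reported 755 MB
```

The same arithmetic for full BF16 (2 bytes per parameter) gives ~1,144 MB, in the same ballpark as the reported 1,254 MB.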

## Usage

```python
from parakeet import from_pretrained
import mlx.core as mx

model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8", dtype=mx.bfloat16)
result = model.transcribe("audio.wav")
```

Install: `pip install parakeet-mlx`

## Origin

Quantized from sonic-speech/parakeet-tdt-0.6b-v3 using `mlx.nn.quantize` (encoder-only, 8-bit, group_size=64).
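The encoder-only selection can be expressed through `mlx.nn.quantize`'s `class_predicate` argument, a callable that decides per-module whether to quantize. A minimal sketch; the `"encoder"` path prefix is an assumption about this model's module layout:

```python
# Quantize only modules whose parameter path sits under the encoder;
# everything else (decoder, joint network) keeps its original precision.
def encoder_only(path: str, module=None) -> bool:
    return path.startswith("encoder")

# With MLX installed and the model loaded (not run here):
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=8, class_predicate=encoder_only)
print(encoder_only("encoder.layers.0.self_attn.q_proj"))  # True
print(encoder_only("decoder.prediction.embed"))           # False
```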

Part of the Sonic Speech model collection.
