KenLM n-gram LMs for Parakeet TDT shallow fusion

Subword-level KenLM ARPAs built over the nvidia/parakeet-tdt-0.6b-v3 tokenizer, intended for use with Vernacula's Parakeet beam decoder via shallow LM fusion. Each file's "words" are Parakeet subword IDs (integers), not natural-language words — this lets the decoder score the LM at every beam expansion without any subword-to-word round-trip.

Other Parakeet checkpoints share the same vocabulary layout, so these LMs may work with them too, but only v3 has been validated.
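Concretely, an LM "sentence" here is a space-joined string of subword IDs. A toy sketch of the convention (the vocabulary and IDs below are invented for illustration; the real mapping lives in Parakeet's tokenizer.json):

```python
# Toy illustration of the subword-ID convention these ARPAs use.
# This vocabulary is made up; real IDs come from the Parakeet tokenizer.
toy_vocab = {"▁pat": 311, "ient": 94, "▁is": 27, "▁on": 55}

def to_lm_sentence(subwords):
    """Map subword pieces to the space-joined ID string the LM scores."""
    return " ".join(str(toy_vocab[p]) for p in subwords)

sentence = to_lm_sentence(["▁pat", "ient", "▁is", "▁on"])
print(sentence)  # "311 94 27 55" — one LM "sentence"
```

Because the LM's vocabulary is exactly the tokenizer's ID space, the decoder can query it at every beam expansion with no detour through word-level text.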

Files

| File | Corpus | Order | Size (gz) | Target register |
|---|---|---|---|---|
| en-general.arpa.gz | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
| en-medical.arpa.gz | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | 17 MB | Medical English: clinical dictation + patient↔doctor dialogue + specialty drug names |

More languages / domains coming as they're validated.

Design: speech-register per-domain corpus

Early iterations of en-medical mixed MTSamples (spoken dictation) with HealthCareMagic (patient-written forum Q&A) and PubMed abstracts (formal written prose), then layered that over an en-general base. This performed worse than plain en-general on medical-entity F1.

Investigation showed shallow fusion is a style predictor, not a knowledge predictor: the LM biases the decoder toward sequences it's seen, and written-prose medical text pushes the decoder into patterns that don't match conversational clinical audio. The "general base" layer was helping purely because GigaSpeech and People's Speech are spoken transcripts — not because of their domain coverage.

The current en-medical is therefore built entirely from spoken-register medical content:

  • MTSamples (~2.2 M words, natural clinical dictation: H&P, SOAP notes, op reports) — upweighted 5× to compensate for its small raw size.
  • CodCodingCode/cleaned-clinical-conversations (~25 M deduped words of synthetic doctor↔patient dialogue spanning dozens of specialties and presentations) — upweighted 2×.

No en-general base, no forum text, no journal abstracts. The result matches earlier layered variants on fluency and disease recall while being 3–4× smaller.

Specialty drug coverage via class-aware template dialogue

MTSamples + the synthetic-dialogue corpus use mostly lay drug names ("paracetamol", "ibuprofen"); specialty drugs (sertraline, olanzapine, apixaban, paclitaxel) appear rarely. Earlier log-prob probes showed weak per-token priors on these (−2.0 to −3.0, vs −0.7 to −1.0 for well-covered phrases).

The current build adds 2× of template-generated speech-register drug dialogue, filled from a curated catalog of ~330 generic drug names across 35 WHO-ATC-style classes paired with class-appropriate conditions:

"started on lithium for bipolar disorder"
"I've been taking apixaban for my atrial fibrillation"
"patient is on paclitaxel for breast cancer"
"can I get a refill of my sertraline"

Class-aware pairing means we don't produce implausible combinations like "lithium for sinusitis". This lifts specialty-drug log-probs to the −0.7 to −1.2 range (comparable to well-covered general phrases) without degrading general fluency.
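A minimal sketch of the class-aware generation step (the catalog slice and templates below are illustrative stand-ins for drug_classes.json and the real template set):

```python
import itertools
import random

# Illustrative slice of the catalog; the real build reads ~330 drugs
# across 35 WHO-ATC-style classes from drug_classes.json.
CATALOG = {
    "anticoagulant": {
        "drugs": ["apixaban", "warfarin"],
        "conditions": ["atrial fibrillation", "deep vein thrombosis"],
    },
    "mood_stabilizer": {
        "drugs": ["lithium"],
        "conditions": ["bipolar disorder"],
    },
}

# Speech-register templates. {condition} is always filled from the same
# class as {drug}, so cross-class pairings can never be emitted.
TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "can I get a refill of my {drug}",
]

def generate(catalog, templates):
    lines = []
    for cls in catalog.values():
        for drug, tmpl in itertools.product(cls["drugs"], templates):
            condition = random.choice(cls["conditions"])
            lines.append(tmpl.format(drug=drug, condition=condition))
    return lines

lines = generate(CATALOG, TEMPLATES)
```

The class key is the only join between drugs and conditions, which is what rules out "lithium for sinusitis"-style outputs by construction.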

PriMock-57 validation ties the earlier variant on entity F1 (since PriMock audio doesn't contain specialty drug mentions to exercise the improved priors). Users with dictated psychiatry / oncology / cardiology content should see the benefit directly.
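For reference, the effective-word arithmetic behind the ~77 M figure, using the per-source counts quoted above (MTSamples ~2.2 M, synthetic dialogue ~25 M, drug dialogue ~8 M):

```python
# Effective word counts implied by the en-medical layering (millions).
mtsamples   = 5 * 2.2   # clinical dictation, upweighted 5x
dialogue    = 2 * 25.0  # synthetic doctor-patient turns, 2x
drug_dialog = 2 * 8.0   # class-aware drug templates, 2x

total = mtsamples + dialogue + drug_dialog
print(total)  # 77.0 — the "~77 M effective words" in the Files table
```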

Usage

In Vernacula (recommended)

Settings → Speech Recognition → Language model → pick "English (General)". The app downloads the file automatically and wires it into the beam decoder.

From the Vernacula CLI

vernacula --audio sample.wav --model <parakeet-dir> \
  --lm en-general.arpa.gz \
  --lm-weight 0.15

Passing --lm auto-bumps beam width to 4 (fusion has no effect in greedy mode). Typical fusion weight is 0.1–0.3; 0.15 is the default used by Vernacula's Settings picker.
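Shallow fusion itself is just a weighted log-prob sum at each beam expansion. A schematic of the ranking rule (not Vernacula's actual decoder code; the numbers are invented, with lm_logp standing in for a KenLM query):

```python
LM_WEIGHT = 0.15  # the default used by Vernacula's Settings picker

def fused_score(am_logp, lm_logp, lm_weight=LM_WEIGHT):
    """Score used to rank a beam expansion: acoustic log-prob plus
    the weighted LM log-prob of the candidate subword."""
    return am_logp + lm_weight * lm_logp

# A candidate the acoustic model slightly prefers can still lose if
# the LM penalizes it much more heavily.
yeah = fused_score(am_logp=-1.10, lm_logp=-1.0)  # well-covered token
aja  = fused_score(am_logp=-1.05, lm_logp=-3.0)  # weak LM prior
print(yeah > aja)  # True: fusion flips the acoustic ranking
```

This is also why fusion has no effect in greedy mode: with a single hypothesis per step there is no ranking for the LM term to change.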

Directly with the KenLM Python bindings

import kenlm

# KenLM reads gzip-compressed ARPA files directly; no need to unpack.
lm = kenlm.Model("en-general.arpa.gz")

# Tokens must be Parakeet subword IDs, stringified and space-joined.
# bos/eos are off because beam hypotheses are arbitrary subword spans,
# not complete sentences.
score = lm.score("42 17 5", bos=False, eos=False)

How en-medical.arpa.gz was built

# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
# Extract CodCodingCode/cleaned-clinical-conversations, stripping
#   DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
# Generate ~8M words of class-aware drug dialogue using the curated
#   drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
# Layer:
#   (for i in $(seq 5); do cat mtsamples.txt; done;
#    for i in $(seq 2); do cat synthetic-dialogue.txt; done;
#    for i in $(seq 2); do cat drug-dialogue.txt; done) > en-medical.corpus.txt
# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
#   lmplz --order 3 --prune "0 0 1" --discount_fallback
# gzip the ARPA.

The upstream corpora have public-redistribution precedent in the medical-NLP literature; attribution is required under this derivative's CC-BY-4.0 umbrella.

How en-general.arpa.gz was built

# Corpus:
#   3x GigaSpeech `s` subset  (Apache 2.0, cased + punctuation-tagged)
#   1x People's Speech `clean` subset  (CC-BY-4.0, lowercase, no punct)
# Rationale: People's Speech carries conversational-register priors
# (backchannels, disfluencies). GigaSpeech carries the case + punctuation
# style Parakeet's output expects. 3x upweight on GigaSpeech balances the
# raw size asymmetry so the case signal survives without being drowned out.

# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json:
#   ~27M subwords across ~1M sentences.

# Built with KenLM's lmplz:
lmplz -o 4 --prune 0 0 1 1 --discount_fallback \
      --vocab_estimate 8193 \
      --text en-mixed.tok --arpa en-general.arpa
gzip en-general.arpa

See Vernacula's scripts/kenlm_build/ for the exact scripts.
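The 3×/1× upweighted mix above can be sketched in plain shell (the input files here are one-line placeholders standing in for the extracted corpora):

```shell
# Placeholder inputs standing in for the extracted GigaSpeech and
# People's Speech text (one transcript line each, for illustration).
printf 'giga transcript line\n' > gigaspeech.txt
printf 'peoples speech line\n' > peoples-speech.txt

# 3x GigaSpeech + 1x People's Speech, concatenated into one corpus.
for i in 1 2 3; do cat gigaspeech.txt; done > en-mixed.txt
cat peoples-speech.txt >> en-mixed.txt

wc -l en-mixed.txt   # 4 en-mixed.txt
```

Repetition at the text level is the standard way to upweight a source before lmplz, since lmplz itself has no per-corpus weighting flag.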

Validation

On a 600 s en-US conversational sample (157 VAD segments, held-out from the corpus), Vernacula's Parakeet decoder exhibits a known beam=4 multilingual-drift regression where an English backchannel "Uh uh. Ya." transcribes as Spanish "ajá, ya". With this LM fused at weight 0.15 the line recovers to "Uh uh, yeah." and ~500 other lines stay unchanged vs greedy decoding — proper nouns preserved, punctuation preserved.

License & attribution

This LM is a derivative of its training corpora. It's released under CC-BY-4.0, the most restrictive of its sources' terms.

Upstream corpora (attribution required):

  • GigaSpeech, `s` subset (Apache 2.0)
  • People's Speech, `clean` subset (CC-BY-4.0)
  • MTSamples (via galileo-ai/medical_transcription_40)
  • CodCodingCode/cleaned-clinical-conversations

Caveats

  • Subword-ID-keyed. Not a word-level LM — you can't load it in a word-level ASR decoder without first mapping through Parakeet's tokenizer.
  • Pruned. en-general is a 4-gram built with --prune 0 0 1 1 (drops 3-grams and 4-grams seen exactly once); en-medical is a 3-gram built with --prune 0 0 1. If you need tighter priors, rebuild with --prune 0 0 0 0 at the cost of roughly 3× file size.
  • Best at capturing local lexical choice (e.g. "uh huh" vs "ajá"). Doesn't carry sentence-level priors that might rescue wholly-ambiguous short utterances (e.g. a standalone "ajá" with no context).