# Parakeet TDT 0.6B – Fine-Tuned for Children's Speech Recognition
## Try It Live – No Code Needed!

We built a free online demo where anyone can upload audio and get children's speech transcribed instantly!

[Click here to try the demo](https://harphool17-parakeet-asr-competition-winner.hf.space/)
What you can do on the website:

- Upload any audio file (WAV, MP3, FLAC, OGG, M4A)
- Record directly from your microphone
- Get instant transcription results
- See processing time
- Clear and try again
The website is completely free and requires no account or installation!
## What is This Model?
Think of this model as a specialist teacher's assistant who has been specially trained to understand children's speech.
Most speech recognition systems fail on children's voices because:
| Problem | Example |
|---|---|
| Higher pitch | Children's voices are 1-2 octaves higher than adults |
| Mispronunciation | "elephant" → "efant", "spaghetti" → "pasketti" |
| Different rhythm | Children speak with irregular pace |
| Incomplete words | "gonna" instead of "going to" |
| Speech disorders | Lisping, stuttering, articulation issues |
Our Solution: We took NVIDIA's powerful Parakeet TDT 0.6B model and added a specially trained adapter: a small but powerful neural network module that teaches the model to understand children's voices without forgetting what it already knows about adult speech.
## Competition Background
This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData, a competition with a $120,000 prize pool to build the world's best children's speech recognition system.
### Competition Results
| Model | Validation WER | Public Leaderboard WER |
|---|---|---|
| Our Parakeet Adapter | 10.64% | 29.26% |
| Competition #1 | - | 19.37% |
| Official Parakeet Baseline | - | 31.77% |
| Whisper Baseline | - | ~60% |
WER = Word Error Rate; lower is better. If a child says 10 words and the model gets 2 wrong, that's 20% WER.
### Key Achievement

We beat the official Parakeet baseline (31.77% → 29.26% WER) using only adapter fine-tuning, with just 0.26% of parameters trainable!
## What's Inside This Repository

This repository contains one file, but it's everything you need!
### ASR-Adapter.nemo (2.5 GB) – The Complete Model Package

A `.nemo` file is NVIDIA's packaging format that bundles everything the model needs into a single archive. Think of it like a USB drive containing all the model's knowledge.
What's packed inside this single file:
| Component | What it Does |
|---|---|
| Base Model Weights | The original Parakeet TDT 0.6B knowledge (619M parameters) |
| Adapter Weights | Our specially trained children's speech adapter (1.6M parameters) |
| Audio Preprocessor | Converts raw audio → mel spectrograms automatically |
| Tokenizer | Converts predicted tokens → readable English text |
| Model Config | Architecture settings (encoder layers, decoder type, etc.) |
| Decoding Config | How to generate the final text output |
The adapter we trained:
- Only 1,622,016 trainable parameters (0.26% of the total!)
- Added as linear layers to each encoder layer
- Base model kept frozen (its knowledge preserved)
- The adapter learned children's acoustic patterns
## How to Use This Model

### Method 1 – Use Our Live Website (Easiest!)
No installation needed. Just go to: https://harphool17-parakeet-asr-competition-winner.hf.space/
Upload your audio file and get the transcription instantly!
### Method 2 – Use in Python Code

#### Step 1 – Install Requirements

```bash
# Install the NeMo framework (required for Parakeet)
pip install "nemo_toolkit[asr]"

# Install audio processing libraries
pip install librosa soundfile
```

⚠️ Note: NeMo installation can take 5-10 minutes. Be patient!
#### Step 2 – Download the Model

```python
from huggingface_hub import hf_hub_download

# Download the model (2.5 GB, takes a few minutes)
model_path = hf_hub_download(
    repo_id="harphool17/parakeet-asr-adapter",
    filename="ASR-Adapter.nemo",
)

print(f"Model downloaded to: {model_path}")
```
#### Step 3 – Load and Use the Model

```python
import os
import tempfile

import librosa
import soundfile as sf
import torch
from nemo.collections.asr.models import ASRModel
from omegaconf import open_dict

# ── Load Model ──
print("Loading model... (may take 1-2 minutes)")
model = ASRModel.restore_from(
    model_path,
    map_location="cuda" if torch.cuda.is_available() else "cpu",
)

# Disable the CUDA graph decoder (compatibility fix)
with open_dict(model.cfg):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(model.cfg.decoding)

# Enable our trained adapter
if model.is_adapter_available():
    model.set_enabled_adapters(enabled=True)
    print("Adapter enabled!")

model.eval()
print("Model ready!")


# ── Transcribe Audio ──
def transcribe_audio(audio_path):
    """
    Transcribe an audio file containing children's speech.

    Args:
        audio_path: Path to audio file (WAV, FLAC, MP3, etc.)

    Returns:
        Transcribed text string
    """
    # Load audio and convert to 16 kHz mono (the model's required format)
    audio, sr = sf.read(audio_path, dtype="float32")

    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to 16 kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    # Save as a temporary WAV file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        sf.write(f.name, audio, 16000)
        temp_path = f.name

    try:
        # Run transcription
        result = model.transcribe([temp_path], verbose=False)

        # Extract text from the result
        if isinstance(result, tuple):
            result = result[0]
        text = result[0].text if hasattr(result[0], "text") else result[0]
        return text.lower().strip()
    finally:
        os.unlink(temp_path)


# ── Example Usage ──
transcription = transcribe_audio("child_speaking.wav")
print(f"Child said: '{transcription}'")
```
#### Step 4 – Batch Processing (Multiple Files)

```python
# Process multiple audio files at once (faster!)
audio_files = [
    "child1.wav",
    "child2.wav",
    "child3.wav",
]

results = model.transcribe(audio_files, batch_size=8, verbose=False)
if isinstance(results, tuple):
    results = results[0]

for audio_file, result in zip(audio_files, results):
    text = result.text if hasattr(result, "text") else result
    print(f"{audio_file}: {text}")
```
### Supported Audio Formats

| Format | Extension | Quality | Notes |
|---|---|---|---|
| WAV | `.wav` | Best | Recommended format |
| FLAC | `.flac` | Best | Lossless audio |
| MP3 | `.mp3` | Good | Lossy but works |
| OGG | `.ogg` | Good | Open format |
| M4A | `.m4a` | Good | Apple format |

Best practice: Use WAV or FLAC for highest accuracy. The code above automatically converts any sample rate to 16 kHz mono.
### Common Errors and How to Fix Them

| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: nemo` | NeMo not installed | `pip install "nemo_toolkit[asr]"` |
| `CUDA out of memory` | Not enough GPU memory | Use CPU: `map_location="cpu"` |
| `Channel selector average not found` | Multi-channel audio issue | Convert to mono first (the code above does this) |
| `FileNotFoundError` | Wrong audio path | Check that the file path is correct |
| Model loads but gives bad results | Adapter not enabled | Make sure `model.set_enabled_adapters(enabled=True)` runs |
## Technical Details – How We Trained This Model

### What is an Adapter?

Imagine the base Parakeet model is like a brilliant doctor who trained for 10 years on adult patients. We can't retrain all 10 years of knowledge (too expensive!). Instead, we gave this doctor a short specialized course on treating children: that's the adapter!
```text
Base Parakeet Model (619M params)  ←── FROZEN, not changed
              +
Adapter Layers (1.6M params)       ←── TRAINED on children's speech
              =
Final Model (620M total, only 0.26% changed!)
```
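The freeze-then-train pattern above can be sketched in plain PyTorch. This is a toy stand-in, not the actual NeMo training code; the two `nn.Linear` modules and their sizes are illustrative only.

```python
import torch.nn as nn

# Toy stand-ins: a "base" model that stays frozen and a small
# "adapter" that remains trainable. Sizes are illustrative.
base = nn.Linear(512, 512)
adapter = nn.Linear(512, 64)

# Freeze the base model so its knowledge is preserved
for p in base.parameters():
    p.requires_grad = False

all_params = list(base.parameters()) + list(adapter.parameters())
trainable = sum(p.numel() for p in all_params if p.requires_grad)
total = sum(p.numel() for p in all_params)
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

With the real model, only the adapter's 1,622,016 parameters out of ~620M end up with `requires_grad=True`, which is where the 0.26% figure comes from.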
### Training Configuration

```python
# Our exact training settings
BATCH_SIZE = 16                  # Audio clips processed at once
LEARNING_RATE = 0.001            # How fast the adapter learns
MAX_STEPS = 8000                 # Total training iterations
VAL_INTERVAL = 700               # Evaluate every 700 steps
PRECISION = "bf16-mixed"         # Memory-efficient mixed precision
OPTIMIZER = "AdamW"              # Optimization algorithm
LR_SCHEDULE = "CosineAnnealing"  # Learning rate decay
WARMUP_STEPS = 500               # Gradual LR warmup
```
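To show how the warmup and cosine-annealing settings combine, here is our reading of the schedule as a per-step learning-rate multiplier (a sketch, not the exact NeMo scheduler implementation):

```python
import math

MAX_STEPS = 8000
WARMUP_STEPS = 500

def lr_multiplier(step: int) -> float:
    """Linear warmup for WARMUP_STEPS steps, then cosine decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# The multiplier ramps 0 → 1 during warmup, then decays smoothly back to 0
print(lr_multiplier(0), lr_multiplier(500), round(lr_multiplier(8000), 6))
```

This multiplier would scale `LEARNING_RATE` at each step, e.g. by plugging the function into `torch.optim.lr_scheduler.LambdaLR`.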
### Training Data
| Property | Value |
|---|---|
| Total utterances | 90,083 (after filtering) |
| Training set | 85,578 samples |
| Validation set | 4,505 samples |
| Total audio | ~148 hours |
| Age groups | 3-4, 5-7, 8-11 years |
| Max clip duration | 25 seconds |
| Sample rate | 16kHz mono |
| Source | DrivenData Pasketti Competition |
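The 25-second cap above implies a duration filter over the training manifest. A hypothetical sketch of that step (field names follow NeMo's JSONL manifest convention; the actual filtering script is not included in this repo):

```python
import json

def filter_manifest(lines, max_duration=25.0):
    """Keep only manifest entries whose clip is at most max_duration seconds."""
    kept = []
    for line in lines:
        entry = json.loads(line)
        if entry["duration"] <= max_duration:
            kept.append(entry)
    return kept

# Two example entries: the 31-second clip gets dropped
manifest = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "i like dogs"}',
    '{"audio_filepath": "b.wav", "duration": 31.0, "text": "a long story"}',
]
print(len(filter_manifest(manifest)))  # 1
```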
### Training Progress
| Step | Validation WER | Notes |
|---|---|---|
| 700 | 12.90% | First evaluation |
| 1,400 | ~11.5% | Improving |
| 2,800 | ~11.0% | Steady improvement |
| 5,000 | ~10.8% | Continuing |
| 7,449 | 10.64% | Best model saved! |
| 8,000 | Training stopped | max_steps reached |
### Hardware Used
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 4500 Ada Generation |
| VRAM | 24 GB |
| Training Time | ~3 hours |
| Framework | NVIDIA NeMo 2.7.2 |
| PyTorch | 2.6.0 + CUDA 12.4 |
## Understanding WER (Word Error Rate)

WER is the main metric for speech recognition. Here's how it works:

```text
Child says:     "I have two cats and a dog"  (7 words)
Model predicts: "I have to cats in a dog"    (7 words)

Errors:
- "two" → "to" = 1 substitution
- "and" → "in" = 1 substitution

WER = (Substitutions + Deletions + Insertions) / Total Reference Words
WER = (2 + 0 + 0) / 7 = 0.286 = 28.6%
```
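The arithmetic above can be reproduced with a small word-level edit-distance function. This is a minimal reference implementation for illustration; in practice you would use a library such as `jiwer`:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("i have two cats and a dog", "i have to cats in a dog")
print(f"WER = {wer:.1%}")  # 2 errors over 7 words ≈ 28.6%
```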
Our scores:

- Validation WER: 10.64% (about 1 error per 10 words)
- Public test WER: 29.26% (the competition test set: harder, unseen data)
## Architecture Details

This model uses the Parakeet TDT (Token-and-Duration Transducer) architecture:

```text
Audio Input (WAV file)
        ↓
Audio Preprocessor (converts to 80-bin mel spectrogram)
        ↓
Conformer Encoder (with our adapter layers)  ← Our adapter plugs in here!
        ↓
RNN-T Decoder
        ↓
Text Output ("the child said hello")
```
Why RNN-T instead of Whisper's approach?

- Whisper generates text autoregressively, one token at a time (slow for long audio)
- RNN-T decodes in a streaming fashion (faster, more efficient)
- This model outputs only lowercase a-z and spaces: no capital letters, punctuation, or digits (convenient for WER evaluation!)
## Related Resources

| Resource | Link |
|---|---|
| Live Demo Website | harphool17-parakeet-asr-competition-winner.hf.space |
| GitHub Code | harphool-singh/whisper-children-asr |
| Competition | DrivenData Pasketti Challenge |
| Base Model | nvidia/parakeet-tdt-0.6b-v2 |
| Whisper Model | harphool17/whisper-large-v3-children-asr |
## Lessons Learned

For anyone wanting to build similar models:

1. **Adapters are incredibly efficient.** We trained only 0.26% of the parameters and got excellent results. You don't always need to retrain everything!
2. **Pre-compute features.** Computing mel spectrograms during training slows everything down. Pre-compute them once and save them to disk.
3. **Library versions matter.** NeMo 2.7.2 locally vs. 2.5.x on the server caused our adapter not to be applied during inference. Always test on the same version you'll deploy!
4. **Children's speech is genuinely hard.** Even with fine-tuning, the gap between validation (10.64% WER) and the public test set (29.26%) shows how diverse children's speech really is.
5. **Validation WER ≠ real-world performance.** Our model looked excellent in validation but faced distribution shift on the real test set. Always evaluate on completely unseen data!
## About the Author

Harphool Singh built this model as part of an NLP course project and his DrivenData competition entry.

- GitHub: [@harphool-singh](https://github.com/harphool-singh)
- Hugging Face: [@harphool17](https://huggingface.co/harphool17)
- Demo: [Try it live!](https://harphool17-parakeet-asr-competition-winner.hf.space/)
## License
This model is released under the MIT License: free to use, modify, and distribute for any purpose, including commercial use.

The base model (Parakeet TDT 0.6B) is released by NVIDIA under its own license; please check NVIDIA's model page for terms.
*Built with ❤️ to make AI work better for children's education*