# Parakeet TDT 0.6B – Fine-Tuned for Children's Speech Recognition
## Try It Live – No Code Needed!

We built a free online demo where anyone can upload audio and get children's speech transcribed instantly!

[Click here to try the demo](https://harphool17-parakeet-asr-competition-winner.hf.space/)
What you can do on the website:

- Upload any audio file (WAV, MP3, FLAC, OGG, M4A)
- Record directly from your microphone
- Get instant transcription results
- See processing time
- Clear and try again
The website is completely free and requires no account or installation!
## What is This Model?
Think of this model as a specialist teacher's assistant who has been specially trained to understand children's speech.
Most speech recognition systems fail on children's voices because:
| Problem | Example |
|---|---|
| Higher pitch | Children's voices are 1-2 octaves higher than adults |
| Mispronunciation | "elephant" → "efant", "spaghetti" → "pasketti" |
| Different rhythm | Children speak with irregular pace |
| Incomplete words | "gonna" instead of "going to" |
| Speech disorders | Lisping, stuttering, articulation issues |
Our Solution: We took NVIDIA's powerful Parakeet TDT 0.6B model and added a specially trained adapter: a small but powerful neural network module that teaches the model to understand children's voices without forgetting what it already knows about adult speech.
## Competition Background
This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData, a competition with a $120,000 prize pool to build the world's best children's speech recognition system.
### Competition Results
| Model | Validation WER | Public Leaderboard WER |
|---|---|---|
| Our Parakeet Adapter | 10.64% | 29.26% |
| Competition #1 | - | 19.37% |
| Official Parakeet Baseline | - | 31.77% |
| Whisper Baseline | - | ~60% |
WER = Word Error Rate; lower is better. If a child says 10 words and the model gets 2 wrong, that's 20% WER.
### Key Achievement

We beat the official Parakeet baseline (31.77% → 29.26% WER) using only adapter fine-tuning, with just 0.26% of parameters trainable!
## What's Inside This Repository

This repository contains one file, but it's everything you need!
### ASR-Adapter.nemo (2.5 GB) – The Complete Model Package

A `.nemo` file is NVIDIA's packaging format that bundles everything the model needs into a single archive. Think of it like a USB drive containing all the model's knowledge.
What's packed inside this single file:
| Component | What it Does |
|---|---|
| Base Model Weights | The original Parakeet TDT 0.6B knowledge (619M parameters) |
| Adapter Weights | Our specially trained children's speech adapter (1.6M parameters) |
| Audio Preprocessor | Converts raw audio → mel spectrograms automatically |
| Tokenizer | Converts predicted tokens → readable English text |
| Model Config | Architecture settings (encoder layers, decoder type, etc.) |
| Decoding Config | How to generate the final text output |
The adapter we trained:
- Only 1,622,016 trainable parameters (0.26% of the total!)
- Added as linear layers to each encoder layer
- Base model kept frozen (its knowledge preserved)
- The adapter learned children's acoustic patterns
## How to Use This Model

### Method 1 – Use Our Live Website (Easiest!)
No installation needed. Just go to: https://harphool17-parakeet-asr-competition-winner.hf.space/
Upload your audio file and get the transcription instantly!
### Method 2 – Use in Python Code

#### Step 1 – Install Requirements

```bash
# Install the NeMo framework (required for Parakeet)
pip install "nemo_toolkit[asr]"

# Install audio processing libraries
pip install librosa soundfile
```

⚠️ Note: NeMo installation can take 5-10 minutes. Be patient!
#### Step 2 – Download the Model

```python
from huggingface_hub import hf_hub_download

# Download the model (2.5 GB, takes a few minutes)
model_path = hf_hub_download(
    repo_id="harphool17/parakeet-asr-adapter",
    filename="ASR-Adapter.nemo",
)

print(f"Model downloaded to: {model_path}")
```
#### Step 3 – Load and Use the Model

```python
import os
import tempfile

import librosa
import soundfile as sf
import torch
from nemo.collections.asr.models import ASRModel
from omegaconf import open_dict

# ── Load Model ──
print("Loading model... (may take 1-2 minutes)")
model = ASRModel.restore_from(
    model_path,
    map_location="cuda" if torch.cuda.is_available() else "cpu",
)

# Disable the CUDA graph decoder (compatibility fix)
with open_dict(model.cfg):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(model.cfg.decoding)

# Enable our trained adapter
if model.is_adapter_available():
    model.set_enabled_adapters(enabled=True)
    print("Adapter enabled!")

model.eval()
print("Model ready!")


# ── Transcribe Audio ──
def transcribe_audio(audio_path):
    """
    Transcribe an audio file containing children's speech.

    Args:
        audio_path: Path to audio file (WAV, FLAC, MP3, etc.)

    Returns:
        Transcribed text string
    """
    # Load audio and convert to 16 kHz mono (the model's required format)
    audio, sr = sf.read(audio_path, dtype="float32")

    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to 16 kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    # Save as a temporary WAV file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        sf.write(f.name, audio, 16000)
        temp_path = f.name

    try:
        # Run transcription
        result = model.transcribe([temp_path], verbose=False)

        # Extract text from the result
        if isinstance(result, tuple):
            result = result[0]
        text = result[0].text if hasattr(result[0], "text") else result[0]
        return text.lower().strip()
    finally:
        os.unlink(temp_path)


# ── Example Usage ──
transcription = transcribe_audio("child_speaking.wav")
print(f"Child said: '{transcription}'")
```
#### Step 4 – Batch Processing (Multiple Files)

```python
# Process multiple audio files at once (faster!)
audio_files = [
    "child1.wav",
    "child2.wav",
    "child3.wav",
]

results = model.transcribe(audio_files, batch_size=8, verbose=False)
if isinstance(results, tuple):
    results = results[0]

for audio_file, result in zip(audio_files, results):
    text = result.text if hasattr(result, "text") else result
    print(f"{audio_file}: {text}")
```
### Supported Audio Formats

| Format | Extension | Quality | Notes |
|---|---|---|---|
| WAV | `.wav` | Best | Recommended format |
| FLAC | `.flac` | Best | Lossless audio |
| MP3 | `.mp3` | Good | Lossy but works |
| OGG | `.ogg` | Good | Open format |
| M4A | `.m4a` | Good | Apple format |

Best practice: Use WAV or FLAC for highest accuracy. The code above automatically converts any sample rate to 16 kHz mono.
### Common Errors and How to Fix Them

| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: nemo` | NeMo not installed | `pip install "nemo_toolkit[asr]"` |
| `CUDA out of memory` | Not enough GPU memory | Use CPU: `map_location="cpu"` |
| `Channel selector average not found` | Multi-channel audio issue | Convert to mono first (the code above does this) |
| `FileNotFoundError` | Wrong audio path | Check that the file path is correct |
| Model loads but gives bad results | Adapter not enabled | Make sure `model.set_enabled_adapters(enabled=True)` runs |
## Technical Details – How We Trained This Model

### What is an Adapter?

Imagine the base Parakeet model is like a brilliant doctor who trained for 10 years on adult patients. We can't retrain all 10 years of knowledge (too expensive!). Instead, we gave this doctor a short specialized course on treating children: that's the adapter!
```text
Base Parakeet Model (619M params)  ←── FROZEN, not changed
              +
Adapter Layers (1.6M params)       ←── TRAINED on children's speech
              =
Final Model (620M total, only 0.26% changed!)
```
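The freeze-then-train pattern above can be sketched in plain PyTorch. This is a toy stand-in, not the actual NeMo training code; the two `nn.Linear` modules and their sizes are illustrative only.

```python
import torch.nn as nn

# Toy stand-ins: a "base" model that stays frozen and a small
# "adapter" that remains trainable. Sizes are illustrative.
base = nn.Linear(512, 512)
adapter = nn.Linear(512, 64)

# Freeze the base model so its knowledge is preserved
for p in base.parameters():
    p.requires_grad = False

all_params = list(base.parameters()) + list(adapter.parameters())
trainable = sum(p.numel() for p in all_params if p.requires_grad)
total = sum(p.numel() for p in all_params)
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

With the real model, only the adapter's 1,622,016 parameters out of ~620M end up with `requires_grad=True`, which is where the 0.26% figure comes from.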
### Training Configuration

```python
# Our exact training settings
BATCH_SIZE = 16                  # Audio clips processed at once
LEARNING_RATE = 0.001            # How fast the adapter learns
MAX_STEPS = 8000                 # Total training iterations
VAL_INTERVAL = 700               # Evaluate every 700 steps
PRECISION = "bf16-mixed"         # Memory-efficient mixed precision
OPTIMIZER = "AdamW"              # Optimization algorithm
LR_SCHEDULE = "CosineAnnealing"  # Learning rate decay
WARMUP_STEPS = 500               # Gradual LR warmup
```
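To show how the warmup and cosine-annealing settings combine, here is our reading of the schedule as a per-step learning-rate multiplier (a sketch, not the exact NeMo scheduler implementation):

```python
import math

MAX_STEPS = 8000
WARMUP_STEPS = 500

def lr_multiplier(step: int) -> float:
    """Linear warmup for WARMUP_STEPS steps, then cosine decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# The multiplier ramps 0 → 1 during warmup, then decays smoothly back to 0
print(lr_multiplier(0), lr_multiplier(500), round(lr_multiplier(8000), 6))
```

This multiplier would scale `LEARNING_RATE` at each step, e.g. by plugging the function into `torch.optim.lr_scheduler.LambdaLR`.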
### Training Data
| Property | Value |
|---|---|
| Total utterances | 90,083 (after filtering) |
| Training set | 85,578 samples |
| Validation set | 4,505 samples |
| Total audio | ~148 hours |
| Age groups | 3-4, 5-7, 8-11 years |
| Max clip duration | 25 seconds |
| Sample rate | 16kHz mono |
| Source | DrivenData Pasketti Competition |
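The 25-second cap above implies a duration filter over the training manifest. A hypothetical sketch of that step (field names follow NeMo's JSONL manifest convention; the actual filtering script is not included in this repo):

```python
import json

def filter_manifest(lines, max_duration=25.0):
    """Keep only manifest entries whose clip is at most max_duration seconds."""
    kept = []
    for line in lines:
        entry = json.loads(line)
        if entry["duration"] <= max_duration:
            kept.append(entry)
    return kept

# Two example entries: the 31-second clip gets dropped
manifest = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "i like dogs"}',
    '{"audio_filepath": "b.wav", "duration": 31.0, "text": "a long story"}',
]
print(len(filter_manifest(manifest)))  # 1
```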
### Training Progress
| Step | Validation WER | Notes |
|---|---|---|
| 700 | 12.90% | First evaluation |
| 1,400 | ~11.5% | Improving |
| 2,800 | ~11.0% | Steady improvement |
| 5,000 | ~10.8% | Continuing |
| 7,449 | 10.64% | Best model saved! |
| 8,000 | Training stopped | max_steps reached |
### Hardware Used
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 4500 Ada Generation |
| VRAM | 24 GB |
| Training Time | ~3 hours |
| Framework | NVIDIA NeMo 2.7.2 |
| PyTorch | 2.6.0 + CUDA 12.4 |
## Understanding WER (Word Error Rate)

WER is the main metric for speech recognition. Here's how it works:

```text
Child says:     "I have two cats and a dog"  (7 words)
Model predicts: "I have to cats in a dog"    (7 words)

Errors:
- "two" → "to" = 1 substitution
- "and" → "in" = 1 substitution

WER = (Substitutions + Deletions + Insertions) / Total Reference Words
WER = (2 + 0 + 0) / 7 = 0.286 = 28.6%
```
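The arithmetic above can be reproduced with a small word-level edit-distance function. This is a minimal reference implementation for illustration; in practice you would use a library such as `jiwer`:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("i have two cats and a dog", "i have to cats in a dog")
print(f"WER = {wer:.1%}")  # 2 errors over 7 words ≈ 28.6%
```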
Our scores:

- Validation WER: 10.64% (about 1 error per 10 words)
- Public test WER: 29.26% (the competition test set: harder, unseen data)
## Architecture Details

This model uses the Parakeet TDT (Token-and-Duration Transducer) architecture:

```text
Audio Input (WAV file)
        ↓
Audio Preprocessor (converts to 80-bin mel spectrogram)
        ↓
Conformer Encoder (with our adapter layers)  ← Our adapter plugs in here!
        ↓
RNN-T Decoder
        ↓
Text Output ("the child said hello")
```
Why RNN-T instead of Whisper's approach?

- Whisper generates text autoregressively, one token at a time (slow for long audio)
- RNN-T decodes in a streaming fashion (faster, more efficient)
- This model outputs only lowercase a-z and spaces: no capital letters, punctuation, or digits (convenient for WER evaluation!)
## Related Resources

| Resource | Link |
|---|---|
| Live Demo Website | harphool17-parakeet-asr-competition-winner.hf.space |
| GitHub Code | harphool-singh/whisper-children-asr |
| Competition | DrivenData Pasketti Challenge |
| Base Model | nvidia/parakeet-tdt-0.6b-v2 |
| Whisper Model | harphool17/whisper-large-v3-children-asr |
## Lessons Learned

For anyone wanting to build similar models:

1. **Adapters are incredibly efficient.** We trained only 0.26% of the parameters and got excellent results. You don't always need to retrain everything!
2. **Pre-compute features.** Computing mel spectrograms during training slows everything down. Pre-compute them once and save them to disk.
3. **Library versions matter.** NeMo 2.7.2 locally vs. 2.5.x on the server caused our adapter not to be applied during inference. Always test on the same version you'll deploy!
4. **Children's speech is genuinely hard.** Even with fine-tuning, the gap between validation (10.64% WER) and the public test set (29.26%) shows how diverse children's speech really is.
5. **Validation WER ≠ real-world performance.** Our model looked excellent in validation but faced distribution shift on the real test set. Always evaluate on completely unseen data!
## About the Author

Harphool Singh built this model as part of an NLP course project and his DrivenData competition entry.

- GitHub: [@harphool-singh](https://github.com/harphool-singh)
- Hugging Face: [@harphool17](https://huggingface.co/harphool17)
- Demo: [Try it live!](https://harphool17-parakeet-asr-competition-winner.hf.space/)
## License
This model is released under the MIT License: free to use, modify, and distribute for any purpose, including commercial use.

The base model (Parakeet TDT 0.6B) is released by NVIDIA under its own license; please check NVIDIA's model page for terms.
*Built with ❤️ to make AI work better for children's education*