πŸŽ™οΈ Parakeet TDT 0.6B β€” Fine-Tuned for Children's Speech Recognition



🌐 Try It Live β€” No Code Needed!

We built a free online demo where anyone can upload audio and get children's speech transcribed instantly!

πŸ‘‰ Click Here to Try the Demo

What you can do on the website:

  • πŸ“ Upload any audio file (WAV, MP3, FLAC, OGG, M4A)
  • 🎀 Record directly from your microphone
  • ⚑ Get instant transcription results
  • ⏱️ See processing time
  • πŸ”„ Clear and try again

The website is completely free and requires no account or installation!


🧠 What is This Model?

Think of this model as a teaching assistant who has been specially trained to understand children's speech.

Most speech recognition systems fail on children's voices because:

| Problem | Example |
|---|---|
| Higher pitch | Children's voices are 1-2 octaves higher than adults |
| Mispronunciation | "elephant" β†’ "efant", "spaghetti" β†’ "pasketti" |
| Different rhythm | Children speak with an irregular pace |
| Incomplete words | "gonna" instead of "going to" |
| Speech disorders | Lisping, stuttering, articulation issues |

Our Solution: We took NVIDIA's powerful Parakeet TDT 0.6B model and added a specially trained adapter β€” a small but powerful neural network module that teaches the model to understand children's voices without forgetting what it already knows about adult speech.


πŸ† Competition Background

This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData β€” a competition with a $120,000 prize pool to build the world's best children's speech recognition system.

Competition Results

| Model | Validation WER | Public Leaderboard WER |
|---|---|---|
| Our Parakeet Adapter | 10.64% πŸ”₯ | 29.26% |
| Competition #1 | - | 19.37% |
| Official Parakeet Baseline | - | 31.77% |
| Whisper Baseline | - | ~60%+ |

WER = Word Error Rate β€” lower is better. If a child says 10 words and the model gets 2 wrong, that's a 20% WER.

🎯 Key Achievement

We beat the official baseline Parakeet model (31.77% β†’ 29.26%) using only adapter fine-tuning with 0.26% trainable parameters!


πŸ“ What's Inside This Repository

This repository contains one file β€” but it's everything you need!

ASR-Adapter.nemo (2.5 GB) β€” The Complete Model Package

A .nemo file is NVIDIA's special format that packages everything the model needs into a single file. Think of it like a USB drive containing all the model's knowledge.

What's packed inside this single file:

| Component | What it Does |
|---|---|
| Base Model Weights | The original Parakeet TDT 0.6B knowledge (619M parameters) |
| Adapter Weights | Our specially trained children's speech adapter (1.6M parameters) |
| Audio Preprocessor | Converts raw audio β†’ mel spectrograms automatically |
| Tokenizer | Converts predicted tokens β†’ readable English text |
| Model Config | Architecture settings (encoder layers, decoder type, etc.) |
| Decoding Config | How to generate the final text output |

The adapter we trained:

  • Only 1,622,016 parameters trainable (0.26% of total!)
  • Added as linear layers to each encoder layer
  • Base model was frozen (its knowledge preserved)
  • Adapter learned children's acoustic patterns
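The exact adapter module lives inside the `.nemo` file, but a typical linear (bottleneck) adapter of the kind described above looks roughly like this. This is a minimal PyTorch sketch with illustrative dimensions (`d_model=512` and `bottleneck=64` are assumptions, not the actual NeMo implementation):

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Bottleneck adapter: project down, nonlinearity, project up, residual add."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()
        # Zero-initialize the up-projection so the adapter starts as an identity
        # mapping and cannot disturb the frozen base model at step 0
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Only adapter parameters get gradients; the base encoder stays frozen
adapter = LinearAdapter(d_model=512, bottleneck=64)
trainable = sum(p.numel() for p in adapter.parameters())
print(f"Adapter parameters: {trainable:,}")  # Adapter parameters: 66,112
```

One such module per encoder layer adds up to the ~1.6M trainable parameters reported above.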

πŸš€ How to Use This Model

Method 1 β€” Use Our Live Website (Easiest!)

No installation needed. Just go to: https://harphool17-parakeet-asr-competition-winner.hf.space/

Upload your audio file and get the transcription instantly! βœ…


Method 2 β€” Use in Python Code

Step 1 β€” Install Requirements

```bash
# Install the NeMo framework (required for Parakeet);
# quote the extras so shells like zsh don't expand the brackets
pip install "nemo_toolkit[asr]"

# Install audio processing libraries
pip install librosa soundfile
```

⚠️ Note: NeMo installation can take 5-10 minutes. Be patient!

Step 2 β€” Download the Model

```python
from huggingface_hub import hf_hub_download

# Download the model (2.5 GB β€” takes a few minutes)
model_path = hf_hub_download(
    repo_id="harphool17/parakeet-asr-adapter",
    filename="ASR-Adapter.nemo"
)
print(f"Model downloaded to: {model_path}")
```

Step 3 β€” Load and Use the Model

```python
import os
import tempfile

import librosa
import soundfile as sf
import torch
from nemo.collections.asr.models import ASRModel
from omegaconf import open_dict

# ── Load Model ──
print("Loading model... (may take 1-2 minutes)")
model = ASRModel.restore_from(
    model_path,
    map_location="cuda" if torch.cuda.is_available() else "cpu"
)

# Disable CUDA graph decoder (compatibility fix)
with open_dict(model.cfg):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(model.cfg.decoding)

# Enable our trained adapter
if model.is_adapter_available():
    model.set_enabled_adapters(enabled=True)
    print("βœ… Adapter enabled!")

model.eval()
print("βœ… Model ready!")

# ── Transcribe Audio ──
def transcribe_audio(audio_path):
    """
    Transcribe an audio file containing children's speech.

    Args:
        audio_path: Path to an audio file (WAV, FLAC, MP3, etc.)

    Returns:
        Transcribed text string
    """
    # Load the audio and convert it to 16 kHz mono (the format the model expects)
    audio, sr = sf.read(audio_path, dtype="float32")

    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to 16 kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    # Write the normalized audio to a temporary WAV file for transcription
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        sf.write(f.name, audio, 16000)
        temp_path = f.name

    try:
        # Run transcription
        result = model.transcribe([temp_path], verbose=False)

        # Extract the text from the result
        if isinstance(result, tuple):
            result = result[0]
        text = result[0].text if hasattr(result[0], "text") else result[0]

        return text.lower().strip()
    finally:
        os.unlink(temp_path)

# ── Example Usage ──
transcription = transcribe_audio("child_speaking.wav")
print(f"Child said: '{transcription}'")
```

Step 4 β€” Batch Processing (Multiple Files)

# Process multiple audio files at once (faster!)
audio_files = [
    "child1.wav",
    "child2.wav", 
    "child3.wav"
]

results = model.transcribe(audio_files, batch_size=8, verbose=False)
if isinstance(results, tuple):
    results = results[0]

for audio_file, result in zip(audio_files, results):
    text = result.text if hasattr(result, "text") else result
    print(f"{audio_file}: {text}")

Supported Audio Formats

| Format | Extension | Quality | Notes |
|---|---|---|---|
| WAV | .wav | ⭐ Best | Recommended format |
| FLAC | .flac | ⭐ Best | Lossless audio |
| MP3 | .mp3 | βœ… Good | Lossy but works |
| OGG | .ogg | βœ… Good | Open format |
| M4A | .m4a | βœ… Good | Apple format |

Best practice: Use WAV or FLAC for highest accuracy. The code automatically converts any sample rate to 16kHz mono.


Common Errors and How to Fix Them

| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: nemo` | NeMo not installed | `pip install "nemo_toolkit[asr]"` |
| CUDA out of memory | Not enough GPU memory | Use the CPU: `map_location="cpu"` |
| Channel selector `average` not found | Multi-channel audio issue | Convert to mono first (the code above does this) |
| `FileNotFoundError` | Wrong audio path | Check that the file path is correct |
| Model loads but gives bad results | Adapter not enabled | Make sure `model.set_enabled_adapters(enabled=True)` runs |

πŸ”¬ Technical Details β€” How We Trained This Model

What is an Adapter?

Imagine the base Parakeet model is like a brilliant doctor who trained for 10 years on adult patients. We can't retrain all 10 years of knowledge (too expensive!). Instead, we gave this doctor a short specialized course on treating children β€” that's the adapter!

```
Base Parakeet Model (619M params) ←── FROZEN, not changed
         +
Adapter Layers (1.6M params)     ←── TRAINED on children's speech
         =
Final Model (620M total, only 0.26% changed!)
```
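The parameter arithmetic can be checked in a few lines:

```python
base_params = 619_000_000    # frozen Parakeet TDT 0.6B weights
adapter_params = 1_622_016   # trainable adapter weights

total = base_params + adapter_params
print(f"Trainable fraction: {adapter_params / total:.2%}")  # Trainable fraction: 0.26%
```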

Training Configuration

```python
# Our exact training settings
BATCH_SIZE    = 16      # Audio clips processed at once
LEARNING_RATE = 0.001   # How fast the adapter learns
MAX_STEPS     = 8000    # Total training iterations
VAL_INTERVAL  = 700     # Evaluate every 700 steps
PRECISION     = "bf16-mixed"       # Memory-efficient format
OPTIMIZER     = "AdamW"            # Learning algorithm
LR_SCHEDULE   = "CosineAnnealing"  # Learning rate decay
WARMUP_STEPS  = 500     # Gradual LR warmup
```
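The warmup-plus-cosine schedule in that config can be sketched in plain Python. This is an illustrative reimplementation, not the exact NeMo scheduler (a minimum LR of 0 is an assumption):

```python
import math

MAX_STEPS = 8000
WARMUP_STEPS = 500
BASE_LR = 0.001
MIN_LR = 0.0  # assumed final LR

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(250))   # halfway through warmup: 0.0005
print(lr_at(500))   # peak: 0.001
print(lr_at(8000))  # end of training: 0.0
```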

Training Data

| Property | Value |
|---|---|
| Total utterances | 90,083 (after filtering) |
| Training set | 85,578 samples |
| Validation set | 4,505 samples |
| Total audio | ~148 hours |
| Age groups | 3-4, 5-7, 8-11 years |
| Max clip duration | 25 seconds |
| Sample rate | 16 kHz mono |
| Source | DrivenData Pasketti Competition |
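"After filtering" refers to dropping clips over the 25-second maximum. NeMo training manifests are JSON-lines files with `audio_filepath`, `duration`, and `text` fields, so the filter is a few lines (a sketch; the field names follow the standard NeMo manifest format, and the file paths are hypothetical):

```python
import json

MAX_DURATION = 25.0  # seconds, per the table above

def filter_manifest(in_path: str, out_path: str) -> int:
    """Keep only utterances at or under MAX_DURATION; return the kept count."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            entry = json.loads(line)
            if entry["duration"] <= MAX_DURATION:
                fout.write(json.dumps(entry) + "\n")
                kept += 1
    return kept

# kept = filter_manifest("train_manifest.json", "train_manifest_filtered.json")
```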

Training Progress

| Step | Validation WER | Notes |
|---|---|---|
| 700 | 12.90% | First evaluation |
| 1,400 | ~11.5% | Improving |
| 2,800 | ~11.0% | Steady improvement |
| 5,000 | ~10.8% | Continuing |
| 7,449 | 10.64% πŸ”₯ | Best model saved! |
| 8,000 | - | Training stopped (max_steps reached) |

Hardware Used

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 4500 Ada Generation |
| VRAM | 24 GB |
| Training time | ~3 hours |
| Framework | NVIDIA NeMo 2.7.2 |
| PyTorch | 2.6.0 + CUDA 12.4 |

πŸ“Š Understanding WER (Word Error Rate)

WER is the main metric for speech recognition. Here's how it works:

```
Child says:     "I have two cats and a dog"   (7 words)
Model predicts: "I have to cats in a dog"     (7 words)

Errors:
- "two" β†’ "to"     = 1 substitution
- "and" β†’ "in"     = 1 substitution

WER = (Substitutions + Deletions + Insertions) / Total Reference Words
WER = (2 + 0 + 0) / 7 = 0.286 = 28.6%
```
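The same calculation in code, using a standard word-level Levenshtein distance (in practice a library like `jiwer` does this for you):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("i have two cats and a dog", "i have to cats in a dog"))  # 2/7 β‰ˆ 0.286
```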

Our scores:

  • Validation WER: 10.64% β€” about 1 error per 10 words βœ…
  • Public test WER: 29.26% β€” competition test set (harder, unseen data)

πŸ—οΈ Architecture Details

This model uses Parakeet TDT (Token-and-Duration Transducer) architecture:

```
Audio Input (WAV file)
       ↓
Audio Preprocessor (converts to 80-bin mel spectrogram)
       ↓
Conformer Encoder (with our adapter layers) ← our adapter plugs in here
       ↓
RNN-T Decoder
       ↓
Text Output ("the child said hello")
```

Why RNN-T instead of Whisper's approach?

  • Whisper's encoder-decoder attends over the whole clip and generates text one token at a time (slow for long audio)
  • RNN-T decodes frame by frame in a streaming fashion (faster and more efficient)
  • This model outputs only lowercase a-z and spaces (no capital letters, punctuation, or digits), which matches how WER is evaluated
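Because the output alphabet is just lowercase a-z and spaces, reference transcripts need the same normalization before scoring. A small sketch (the competition's exact normalization rules may differ):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip everything except a-z and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)   # non-letters become spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces

print(normalize("Hello, World! It's 3 PM."))  # hello world it s pm
```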

πŸ”— Related Resources

| Resource | Link |
|---|---|
| 🌐 Live Demo Website | harphool17-parakeet-asr-competition-winner.hf.space |
| πŸ’» GitHub Code | harphool-singh/whisper-children-asr |
| πŸ† Competition | DrivenData Pasketti Challenge |
| πŸ“¦ Base Model | nvidia/parakeet-tdt-0.6b-v2 |
| πŸ€— Whisper Model | harphool17/whisper-large-v3-children-asr |

πŸ’‘ Lessons Learned

For anyone wanting to build similar models:

  1. Adapters are incredibly efficient β€” We trained only 0.26% of parameters and got excellent results. You don't always need to retrain everything!

  2. Pre-compute features β€” Computing mel spectrograms during training slows everything down. Pre-compute once and save as files.

  3. Library versions matter β€” A mismatch between NeMo 2.7.2 locally and 2.5.x on the server caused our adapter not to apply during inference. Always test on the same version you'll deploy!

  4. Children's speech is genuinely hard β€” Even with fine-tuning, the gap between validation (10.6% WER) and public test (29.26%) shows how diverse children's speech really is.

  5. Validation WER β‰  Real Performance β€” Our model scored well in validation but hit a distribution shift on the real test set. Always test on completely unseen data!


πŸ‘€ About the Author

Harphool Singh β€” built this model as part of an NLP course project and DrivenData competition participation.


πŸ“„ License

This model is released under the MIT License β€” free to use, modify, and distribute for any purpose including commercial use.

The base model (Parakeet TDT 0.6B) is released by NVIDIA under their own license β€” please check NVIDIA's model page for terms.


Built with ❀️ to make AI work better for children's education
