# Tiny-LLM 54M
A small transformer language model (~54.93M parameters) trained from scratch for educational and experimental purposes.
## Model Description
This is a decoder-only transformer trained from scratch on Wikipedia text. It demonstrates that meaningful language models can be trained on consumer hardware with modest compute budgets.
## Architecture
| Component | Value |
|---|---|
| Parameters | 54.93M |
| Layers | 12 |
| Hidden Size | 512 |
| Attention Heads | 8 |
| Intermediate (FFN) | 1408 |
| Vocab Size | 32,000 |
| Max Sequence Length | 512 |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Weight Tying | Yes |
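
The ~54.93M figure can be reproduced from the table above. A minimal sketch, assuming a LLaMA-style layout: no bias terms, a three-matrix SwiGLU FFN, tied input/output embeddings, and RoPE contributing no learned weights:

```python
# Approximate parameter count from the architecture table
vocab, d_model, n_layers, d_ffn = 32_000, 512, 12, 1408

embedding = vocab * d_model              # 16.38M, shared with the output head (weight tying)
attention = 4 * d_model * d_model        # Q, K, V, O projections
ffn       = 3 * d_model * d_ffn          # SwiGLU: gate, up, down projections
norms     = 2 * d_model                  # two RMSNorm weight vectors per layer

total = embedding + n_layers * (attention + ffn + norms) + d_model  # + final RMSNorm
print(f"~{total / 1e6:.2f}M parameters")  # ~54.93M
```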
## Training Details
| Parameter | Value |
|---|---|
| Training Steps | 50,000 |
| Tokens | ~100M |
| Batch Size | 32 |
| Learning Rate | 3e-4 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
| Training Time | ~3 hours |
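
For reference, here is a sketch of the learning-rate schedule implied by the table: linear warmup over 2,000 steps to a peak of 3e-4. The decay shape after warmup is not stated in this card; cosine decay to zero over the remaining steps is assumed purely for illustration:

```python
import math

def lr_at(step: int, peak: float = 3e-4, warmup: int = 2_000, total: int = 50_000) -> float:
    """Linear warmup to `peak`, then (assumed) cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

print(lr_at(1_000), lr_at(2_000), lr_at(50_000))  # warming up, peak, ~0
```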
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer (32K-vocabulary SentencePiece; see Training Data below)
tokenizer = AutoTokenizer.from_pretrained("jonmabe/tiny-llm-54m")

# The model itself uses a custom architecture; loading and inference code
# lives in the repository's scripts/ directory.
```
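
The tokenizer can be exercised on its own even without the custom model code. A quick sanity check:

```python
# Round-trip a prompt through the tokenizer
ids = tokenizer("The history of artificial intelligence").input_ids
print(len(ids), ids[:8])
print(tokenizer.decode(ids))
```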
## Generation Example

```python
# This model uses a custom architecture; full inference code is available
# in the repository.
prompt = "The history of artificial intelligence"
# The model generates a continuation based on patterns learned from Wikipedia.
```
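
As a rough guide to what the repository's inference code does, here is a minimal greedy-decoding sketch. The `model` object and its call signature are assumptions: any causal LM that maps a batch of token ids to next-token logits of shape `(batch, seq, vocab)` would work this way; the actual loader lives in `scripts/`.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    # `model` is assumed to return raw logits of shape (batch, seq, vocab)
    ids = torch.tensor([tokenizer(prompt).input_ids])
    for _ in range(max_new_tokens):
        logits = model(ids[:, -512:])                             # respect the 512-token context
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0].tolist())
```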
## Intended Use
- Educational: Understanding transformer training from scratch
- Experimental: Testing fine-tuning approaches on small models
- Personal LLM: Base for personal voice/style fine-tuning
- Research: Lightweight model for NLP experiments
## Limitations
- Small model size limits knowledge and capabilities
- Trained only on English Wikipedia, so domain coverage is limited
- Not suitable for production use cases requiring high quality
- May generate factually incorrect information
- No RLHF or instruction tuning
## Training Data
- Source: Wikipedia (English)
- Processing: Tokenized with 32K vocabulary SentencePiece tokenizer
- Format: Standard causal language modeling (next token prediction)
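
Concretely, "next token prediction" means the targets are the input ids shifted one position to the left. A generic sketch of the objective (not the repository's training code):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); position t predicts token t+1
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
```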
## Future Work
This model is intended as a base for:
- Personal Fine-tuning: Adapt to individual writing style using personal data
- Domain Adaptation: Specialize for specific topics or tasks
- Instruction Tuning: Add instruction-following capabilities
## Hardware Requirements
- Inference: ~300MB GPU memory, runs on any modern GPU or Apple Silicon
- Fine-tuning: ~2GB GPU memory recommended
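
These figures roughly follow from the parameter count; the weights account for most of the inference footprint, with the remainder going to activations and framework overhead:

```python
params = 54.93e6
print(f"fp32 weights: ~{params * 4 / 1e6:.0f} MB")  # ~220 MB -> ~300 MB with activations/overhead
print(f"fp16 weights: ~{params * 2 / 1e6:.0f} MB")  # ~110 MB
```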
## Related Work
Inspired by:
- Andrej Karpathy's nanoGPT
- Geddy Duke's small LLM experiments
- LLaMA architecture design choices
## Citation

```bibtex
@misc{tiny-llm-54m,
  author    = {jonmabe},
  title     = {Tiny-LLM: A 54M Parameter Language Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jonmabe/tiny-llm-54m}
}
```