| --- |
| language: |
| - en |
| license: mit |
| library_name: picochat |
| tags: |
| - pytorch |
| - mps |
| - macbook |
| - education |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| pipeline_tag: text-generation |
| inference: false |
| spaces: |
| - MGow/PicoChat |
| --- |
| |
| # PicoChat |
|
|
**PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16 GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's
[NanoChat](https://github.com/karpathy/nanochat), adapted to run on an M2 MacBook Air as [PicoChat](https://github.com/MichalGow/PicoChat).
| It serves as a "lab notebook" proof-of-concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders). |
|
|
| > **Links:** |
| > - **Space:** https://huggingface.co/spaces/MGow/PicoChat |
|
|
| ## Model Details |
|
|
| - **Architecture:** GPT-style Transformer (Decoder-only) |
| - **Parameters:** ~335 Million |
| - **Layers:** 16 |
| - **Embedding Dimension:** 1024 |
| - **Heads:** 8 Query heads, 8 KV heads (GQA) |
| - **Context Length:** 1024 tokens |
| - **Vocabulary:** 65,536 (Custom BPE) |
| - **Training Data:** ~377 Million tokens |
| - **Precision:** Trained in mixed precision (bfloat16/float32) on MPS. |
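
For orientation, the hyperparameters above map to roughly the following configuration object (an illustrative sketch only; the field names are hypothetical and may not match picochat's actual code):

```python
from dataclasses import dataclass

@dataclass
class PicoChatConfig:
    # Hypothetical config sketch; field names are illustrative,
    # not necessarily picochat's actual API.
    n_layer: int = 16        # transformer blocks
    n_embd: int = 1024       # embedding dimension
    n_head: int = 8          # query heads
    n_kv_head: int = 8       # KV heads (1:1 GQA ratio here)
    block_size: int = 1024   # context length in tokens
    vocab_size: int = 65536  # custom BPE vocabulary
```

With untied input/output embeddings (2 × 65,536 × 1,024 ≈ 134M) plus 16 transformer blocks at this width, the total lands near the quoted ~335M parameters.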
|
|
| ### Key Features |
| - **Rotary Embeddings (RoPE)** (No absolute positional embeddings) |
- **ReLU²** (squared ReLU) activations in the MLP
- **RMSNorm** (with no learnable parameters; see the sketch after this list)
| - **Untied embeddings** (Input and Output embeddings are separate matrices) |
| - **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity) |
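
Two of these components are small enough to show inline. A minimal PyTorch sketch of the parameter-free RMSNorm and the squared-ReLU activation (illustrative, not picochat's exact implementation):

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Parameter-free RMSNorm: rescale by the root-mean-square of
    # the last dimension; no learnable gain or bias.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU MLP activation used in nanochat-style blocks.
    return F.relu(x).square()
```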
|
|
| ## Training Recipe |
|
|
| The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS: |
|
|
| 1. **Base Pretraining (~5 days):** |
- **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset, shuffled)
| - **Steps:** ~60,000 |
| - **Tokens:** ~442M |
- **Objective:** Next-token prediction (see the loss sketch after this list)
|
|
| 2. **Midtraining (~16 hours):** |
| - **Data:** Mixed pretraining data + synthetic conversation/instruction formats. |
| - **Tokens:** ~33M |
| - **Objective:** Adaptation to chat format and Q&A style. |
|
|
| 3. **Supervised Finetuning (SFT) (~4 hours):** |
| - **Data Mixture:** |
| - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples) |
| - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples) |
| - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples) |
| - Identity & Synthetic Spelling tasks (~1.6k examples) |
| - **Steps:** 1,000 (Batch size 8) |
| - **Tokens:** ~1M |
| - **Objective:** Instruction following and personality alignment. |
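
All three phases optimize the same next-token objective, only on different data mixtures. A minimal PyTorch sketch of that loss, assuming `logits` of shape `(B, T, V)` from the model and integer `tokens` of shape `(B, T+1)`:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V) predictions for positions 0..T-1.
    # tokens: (B, T+1) token ids; position t's target is token t+1.
    targets = tokens[:, 1:]                   # (B, T) shift-by-one targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
    )
```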
|
|
| ## Character & Limitations |
|
|
| - **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact. |
- **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
| - **Context Window:** Limited to 1024 tokens. |
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.
|
|
| ## Usage |
|
|
| This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability. |
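
The checkpoint files can be fetched with the standard `huggingface_hub` client and then passed to picochat's scripts (the repo id below is an assumption based on the linked Space; see the PicoChat README for the actual loading entry points):

```python
from huggingface_hub import snapshot_download

# Assumed repo id, mirroring the linked Space; adjust if the
# weights live under a different repository.
local_dir = snapshot_download(repo_id="MGow/PicoChat")
print(local_dir)  # point picochat's loading/chat scripts at this directory
```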
|
|
| ## License |
|
|
| MIT |
|
|