| --- |
| language: |
| - en |
| license: mit |
| library_name: picochat |
| tags: |
| - pytorch |
| - mps |
| - macbook |
| - education |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| pipeline_tag: text-generation |
| inference: false |
| spaces: |
| - MGow/PicoChat |
| --- |
| |
| # PicoChat |
|
|
**PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16 GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's
[NanoChat](https://github.com/karpathy/nanochat), adapted to run on an M2 MacBook Air as [PicoChat](https://github.com/MichalGow/PicoChat).
| It serves as a "lab notebook" proof-of-concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders). |
|
|
| > **Links:** |
| > - **Space:** https://huggingface.co/spaces/MGow/PicoChat |
|
|
| ## Model Details |
|
|
| - **Architecture:** GPT-style Transformer (Decoder-only) |
| - **Parameters:** ~335 Million |
| - **Layers:** 16 |
| - **Embedding Dimension:** 1024 |
| - **Heads:** 8 Query heads, 8 KV heads (GQA) |
| - **Context Length:** 1024 tokens |
| - **Vocabulary:** 65,536 (Custom BPE) |
| - **Training Data:** ~377 Million tokens |
| - **Precision:** Trained in mixed precision (bfloat16/float32) on MPS. |
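
For orientation, the hyperparameters above map to roughly the following configuration object (an illustrative sketch only; the field names are hypothetical and may not match picochat's actual code):

```python
from dataclasses import dataclass

@dataclass
class PicoChatConfig:
    # Hypothetical config sketch; field names are illustrative,
    # not necessarily picochat's actual API.
    n_layer: int = 16        # transformer blocks
    n_embd: int = 1024       # embedding dimension
    n_head: int = 8          # query heads
    n_kv_head: int = 8       # KV heads (1:1 GQA ratio here)
    block_size: int = 1024   # context length in tokens
    vocab_size: int = 65536  # custom BPE vocabulary
```

With untied input/output embeddings (2 × 65,536 × 1,024 ≈ 134M) plus 16 transformer blocks at this width, the total lands near the quoted ~335M parameters.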
|
|
| ### Key Features |
| - **Rotary Embeddings (RoPE)** (No absolute positional embeddings) |
- **ReLU²** (squared ReLU) activations in the MLP
- **RMSNorm** (with no learnable parameters; see the sketch after this list)
| - **Untied embeddings** (Input and Output embeddings are separate matrices) |
| - **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity) |
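
Two of these components are small enough to show inline. A minimal PyTorch sketch of the parameter-free RMSNorm and the squared-ReLU activation (illustrative, not picochat's exact implementation):

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Parameter-free RMSNorm: rescale by the root-mean-square of
    # the last dimension; no learnable gain or bias.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU MLP activation used in nanochat-style blocks.
    return F.relu(x).square()
```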
|
|
| ## Training Recipe |
|
|
| The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS: |
|
|
| 1. **Base Pretraining (~5 days):** |
- **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset, shuffled)
| - **Steps:** ~60,000 |
| - **Tokens:** ~442M |
- **Objective:** Next-token prediction (see the loss sketch after this list)
|
|
| 2. **Midtraining (~16 hours):** |
| - **Data:** Mixed pretraining data + synthetic conversation/instruction formats. |
| - **Tokens:** ~33M |
| - **Objective:** Adaptation to chat format and Q&A style. |
|
|
| 3. **Supervised Finetuning (SFT) (~4 hours):** |
| - **Data Mixture:** |
| - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples) |
| - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples) |
| - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples) |
| - Identity & Synthetic Spelling tasks (~1.6k examples) |
| - **Steps:** 1,000 (Batch size 8) |
| - **Tokens:** ~1M |
| - **Objective:** Instruction following and personality alignment. |
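
All three phases optimize the same next-token objective, only on different data mixtures. A minimal PyTorch sketch of that loss, assuming `logits` of shape `(B, T, V)` from the model and integer `tokens` of shape `(B, T+1)`:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V) predictions for positions 0..T-1.
    # tokens: (B, T+1) token ids; position t's target is token t+1.
    targets = tokens[:, 1:]                   # (B, T) shift-by-one targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
    )
```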
|
|
| ## Character & Limitations |
|
|
| - **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact. |
- **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
| - **Context Window:** Limited to 1024 tokens. |
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.
|
|
| ## Usage |
|
|
| This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability. |
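
The checkpoint files can be fetched with the standard `huggingface_hub` client and then passed to picochat's scripts (the repo id below is an assumption based on the linked Space; see the PicoChat README for the actual loading entry points):

```python
from huggingface_hub import snapshot_download

# Assumed repo id, mirroring the linked Space; adjust if the
# weights live under a different repository.
local_dir = snapshot_download(repo_id="MGow/PicoChat")
print(local_dir)  # point picochat's loading/chat scripts at this directory
```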
|
|
| ## License |
|
|
| MIT |
|
|