# Julian 600M-40B Instruct SFT-30K
A fine-tuned version of the Julian 600M base model, trained for 30,000 steps of supervised fine-tuning (SFT) on instruction-following data.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (decoder-only) |
| Parameters | ~600M |
| Hidden size | 1280 |
| Layers | 18 |
| Attention heads | 20 |
| FFN size | 5120 (SwiGLU) |
| Vocab size | 50,000 (SentencePiece) |
| Context length | 2048 |
| Precision | bfloat16 |
| Norm | RMSNorm |
| Position encoding | RoPE |
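These hyperparameters map directly onto a standard Hugging Face `LlamaConfig`. The sketch below is illustrative only, built from the table above; the `config.json` shipped with the checkpoint is authoritative.

```python
# Illustrative config mirroring the table above. Field values are copied from
# the table, not read from the repository; the shipped config.json is the
# source of truth.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=50_000,             # SentencePiece vocabulary
    hidden_size=1280,
    num_hidden_layers=18,
    num_attention_heads=20,
    intermediate_size=5120,        # SwiGLU FFN width
    max_position_embeddings=2048,  # context length
    torch_dtype="bfloat16",
)
# LlamaForCausalLM applies RMSNorm, RoPE, and a gated-SiLU (SwiGLU) MLP by
# default, matching the table.
```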
## Benchmark Results (0-shot)
| Benchmark | Base (39B tokens) | SFT 30K | SFT 100K |
|---|---|---|---|
| HellaSwag (acc_norm) | 53.5% | 41.7% | 41.6% |
| PIQA (acc) | 66.8% | 66.8% | 66.6% |
| LAMBADA (acc) | 37.3% | 37.7% | 37.7% |
| LAMBADA (ppl↓) | — | 15.38 | 15.33 |
| ARC Easy (acc) | — | 53.5% | 53.8% |
| ARC Challenge (acc_norm) | — | 27.1% | 26.7% |
| WinoGrande (acc) | — | 53.8% | 52.8% |
| BoolQ (acc) | — | 60.6% | 60.8% |
SFT 30K and 100K yield near-identical benchmark scores, so 30K steps appears to be the sweet spot: additional SFT steps do not improve the knowledge benchmarks, and WinoGrande starts to degrade (likely overfitting).
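The metric names (`acc`, `acc_norm`) follow EleutherAI's lm-evaluation-harness conventions. Assuming that harness produced these numbers (the card does not name the evaluation tool), a 0-shot run could be reproduced roughly as below; the task identifiers are the harness's defaults, not values taken from this card.

```python
# A minimal sketch, assuming EleutherAI's lm-evaluation-harness (pip install
# lm-eval); the card itself does not state which tool was used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JulianKrgd/julian-600m-40b-instruct-sft30k,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "lambada_openai", "arc_easy",
           "arc_challenge", "winogrande", "boolq"],
    num_fewshot=0,  # all scores in the table are 0-shot
)
print(results["results"])
```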
## Comparison with Other Models
| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA | ARC-E | ARC-C | WinoGrande |
|---|---|---|---|---|---|---|---|---|
| GPT-2 Small | 124M | 100B+ | 31.5% | — | 46.0% | — | — | 50.4% |
| OPT-125M | 125M | 300B | 29.2% | 63.0% | 37.9% | 43.5% | 18.9% | 50.3% |
| OPT-350M | 331M | 300B | 32.0% | 64.4% | 45.2% | 44.0% | 20.7% | 52.3% |
| Pythia-410M | 405M | 300B | 33.3% | 66.8% | 50.5% | 50.4% | 21.3% | 53.0% |
| Julian 600M SFT-30K | 600M | 39B+2B | 41.7% | 66.8% | 37.7% | 53.5% | 27.1% | 53.8% |
| Julian 600M Base | 600M | 39B | 53.5% | 66.8% | 37.3% | — | — | — |
| Pythia-1B | 1B | 300B | 37.6% | 70.5% | 56.6% | 55.9% | 24.3% | 54.5% |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 57.9% | 57.0% | 23.4% | 59.5% |
| GPT-2 XL | 1558M | 100B+ | 50.9% | 70.8% | 63.2% | — | — | 59.4% |
Julian 600M Base outperforms OPT-1.3B on HellaSwag (53.5% vs 41.5%) despite being 2x smaller and trained on 8x fewer tokens. The SFT version trades some HellaSwag performance for instruction-following ability, while maintaining competitive scores on PIQA, ARC, and WinoGrande.
Sources: GPT-2 (OpenAI), OPT (Meta), Pythia (EleutherAI). See the references at the end of this card.
## Training
### Base Model
- Pre-training: ~40B tokens (70% EN / 30% FR)
- Data: Wikipedia, OSCAR, Gutenberg, The Stack
- Infrastructure: TPU v4-32, JAX/Flax
### SFT
- Steps: 30,000 (from pretrained checkpoint_300000)
- Dataset: 2.47M instruction examples (tokenized)
- Batch size: 32 global (2/device × 4 devices × 4 hosts)
- Sequence length: 2048
- Tokens seen: ~1.97B (see the arithmetic check after this list)
- Final loss: 1.86
- Infrastructure: TPU v4-32, JAX/Flax
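The token count follows directly from the schedule above: 30,000 steps × 32 sequences × 2048 tokens per sequence, assuming fully packed sequences.

```python
# Sanity check on the "~1.97B tokens seen" figure, assuming every sequence
# is packed to the full 2048-token context.
steps = 30_000
global_batch = 2 * 4 * 4  # 2/device × 4 devices × 4 hosts = 32
seq_len = 2048

tokens_seen = steps * global_batch * seq_len
print(f"{tokens_seen / 1e9:.2f}B tokens")  # -> 1.97B
```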
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "JulianKrgd/julian-600m-40b-instruct-sft30k", torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("JulianKrgd/julian-600m-40b-instruct-sft30k")

inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt")
# do_sample=True is needed for temperature to take effect; otherwise
# generate() falls back to greedy decoding and ignores the setting.
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.8, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Limitations
- Small model (600M params) — limited reasoning and factual accuracy
- Instruction following is basic compared to larger models
- May hallucinate or generate incorrect information
- Bilingual (EN/FR) but stronger in English
## Framework
Trained from scratch using JAX/Flax on Google Cloud TPU v4-32. Converted to HuggingFace safetensors format for compatibility.
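The conversion script is not published; the sketch below shows one standard route through transformers and is an assumption about the procedure, not the author's actual script. Paths are placeholders.

```python
# Hypothetical conversion sketch: load the Flax checkpoint into the PyTorch
# model class, then re-save with safetensors serialization. Requires both
# torch and flax installed. Paths below are placeholders.
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("path/to/flax_checkpoint", from_flax=True)
model.save_pretrained("path/to/hf_export", safe_serialization=True)
```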
## License
Apache 2.0
## References
- Biderman et al., "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling" (EleutherAI, 2023)
- Zhang et al., "OPT: Open Pre-trained Transformer Language Models" (Meta AI, 2022)