Julian 600M-40B Instruct SFT-30K

A fine-tuned version of the Julian 600M base model, trained for 30,000 steps of supervised fine-tuning (SFT) on instruction-following data.

Model Details

| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (decoder-only) |
| Parameters | ~600M |
| Hidden size | 1280 |
| Layers | 18 |
| Attention heads | 20 |
| FFN size | 5120 (SwiGLU) |
| Vocab size | 50,000 (SentencePiece) |
| Context length | 2048 |
| Precision | bfloat16 |
| Norm | RMSNorm |
| Position encoding | RoPE |
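As a quick consistency check, the dimensions listed above roughly reproduce the stated ~600M parameter count. This is a sketch only: it assumes an untied LM head and ignores norm weights, neither of which the card confirms.

```python
# Rough parameter count from the config above.
# Assumptions (not confirmed by the card): untied LM head, norm weights ignored.
vocab, d, layers, ffn = 50_000, 1280, 18, 5120

embed = vocab * d          # token embedding table
attn  = 4 * d * d          # Q, K, V, O projections per layer
mlp   = 3 * d * ffn        # SwiGLU: gate, up, and down projections per layer
head  = vocab * d          # LM head (assumed untied)

total = embed + layers * (attn + mlp) + head
print(f"{total / 1e6:.0f}M parameters")  # ≈ 600M
```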

Benchmark Results (0-shot)

| Benchmark | Base (39B tokens) | SFT 30K | SFT 100K |
|---|---|---|---|
| HellaSwag (acc_norm) | 53.5% | 41.7% | 41.6% |
| PIQA (acc) | 66.8% | 66.8% | 66.6% |
| LAMBADA (acc) | 37.3% | 37.7% | 37.7% |
| LAMBADA (ppl ↓) | — | 15.38 | 15.33 |
| ARC Easy (acc) | — | 53.5% | 53.8% |
| ARC Challenge (acc_norm) | — | 27.1% | 26.7% |
| WinoGrande (acc) | — | 53.8% | 52.8% |
| BoolQ (acc) | — | 60.6% | 60.8% |

SFT 30K and SFT 100K yield near-identical benchmark scores, which makes 30,000 steps the sweet spot: additional SFT steps do not improve knowledge benchmarks, and WinoGrande begins to degrade (likely overfitting).

Comparison with Other Models

| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA | ARC-E | ARC-C | WinoGrande |
|---|---|---|---|---|---|---|---|---|
| GPT-2 Small | 124M | 100B+ | 31.5% | — | 46.0% | — | — | 50.4% |
| OPT-125M | 125M | 300B | 29.2% | 63.0% | 37.9% | 43.5% | 18.9% | 50.3% |
| OPT-350M | 331M | 300B | 32.0% | 64.4% | 45.2% | 44.0% | 20.7% | 52.3% |
| Pythia-410M | 405M | 300B | 33.3% | 66.8% | 50.5% | 50.4% | 21.3% | 53.0% |
| Julian 600M SFT-30K | 600M | 39B+2B | 41.7% | 66.8% | 37.7% | 53.5% | 27.1% | 53.8% |
| Julian 600M Base | 600M | 39B | 53.5% | 66.8% | 37.3% | — | — | — |
| GPT-2 XL | 1558M | 100B+ | 50.9% | 70.8% | 63.2% | — | — | 59.4% |
| Pythia-1B | 1B | 300B | 37.6% | 70.5% | 56.6% | 55.9% | 24.3% | 54.5% |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 57.9% | 57.0% | 23.4% | 59.5% |

Julian 600M Base outperforms OPT-1.3B on HellaSwag (53.5% vs 41.5%) despite being 2x smaller and trained on 8x fewer tokens. The SFT version trades some HellaSwag performance for instruction-following ability, while maintaining competitive scores on PIQA, ARC, and WinoGrande.

Sources: GPT-2 — OpenAI; OPT — Meta; Pythia — EleutherAI

Training

Base Model

  • Pre-training: ~40B tokens (70% EN / 30% FR)
  • Data: Wikipedia, OSCAR, Gutenberg, The Stack
  • Infrastructure: TPU v4-32, JAX/Flax

SFT Fine-tuning

  • Steps: 30,000 (from pretrained checkpoint_300000)
  • Dataset: 2.47M instruction examples (tokenized)
  • Batch size: 32 global (2/device × 4 devices × 4 hosts)
  • Sequence length: 2048
  • Tokens seen: ~1.97B
  • Final loss: 1.86
  • Infrastructure: TPU v4-32, JAX/Flax
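The throughput figures above can be sanity-checked with a few lines of arithmetic. The perplexity line is an assumption about how the loss was measured (mean token cross-entropy in nats), which the card does not state.

```python
import math

# Global batch: 2 sequences/device x 4 devices x 4 hosts
global_batch = 2 * 4 * 4
assert global_batch == 32

# Tokens seen during SFT: steps x global batch x sequence length
steps, seq_len = 30_000, 2048
tokens = steps * global_batch * seq_len
print(f"{tokens / 1e9:.2f}B tokens")  # 1.97B

# If the final loss of 1.86 is mean cross-entropy in nats (an assumption),
# it implies a training-mix perplexity of exp(1.86).
print(f"perplexity ~ {math.exp(1.86):.2f}")
```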

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JulianKrgd/julian-600m-40b-instruct-sft30k"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Limitations

  • Small model (600M params) — limited reasoning and factual accuracy
  • Instruction following is basic compared to larger models
  • May hallucinate or generate incorrect information
  • Bilingual (EN/FR) but stronger in English

Framework

Trained from scratch using JAX/Flax on Google Cloud TPU v4-32. Converted to HuggingFace safetensors format for compatibility.

License

Apache 2.0
