LFM2-2.6B-ttt-rl-2

LoRA adapter (rank 8) from the second round of CISPO training for Tic Tac Toe, applied on top of anakin87/LFM2-2.6B-ttt-rl-merged.

This adapter must be loaded on top of the RL round 1 merged model (anakin87/LFM2-2.6B-ttt-rl-merged). A merged version of this round, with the adapter already applied, is available as anakin87/LFM2-2.6B-mr-tictactoe.
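
For reference, a minimal loading sketch using the standard transformers + peft path (repo ids from this card; device and dtype settings omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "anakin87/LFM2-2.6B-ttt-rl-merged"  # RL round 1 merged model
adapter_id = "anakin87/LFM2-2.6B-ttt-rl-2"    # this adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the rank-8 LoRA adapter on top of the round 1 merged model
model = PeftModel.from_pretrained(model, adapter_id)

# Optional: fuse the adapter weights into the base model for inference
model = model.merge_and_unload()
```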

This is a checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe).

🤗🕹️ Play against the final model

Training

  • Algorithm: CISPO via Verifiers RLTrainer
  • Environment: anakin87/tictactoe
  • Opponents: play a random move with 0–25% probability (lower randomness than round 1, so harder opponents)
  • Temperature: 1.25 (to encourage exploration)
  • Steps: 400, batch size 480, lr 3e-5, LoRA rank 8
  • Hardware: 2x NVIDIA H200 141GB (~8 hours)
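
As a rough sketch of how these settings could map onto a Verifiers training script. Hyperparameters come from the list above; the verifiers calls used here (get_model_and_tokenizer, load_environment, grpo_defaults, lora_defaults, RLTrainer) follow common verifiers usage but are assumptions that may differ across library versions:

```python
import verifiers as vf

# Load the round 1 merged model and the Tic Tac Toe environment
model, tokenizer = vf.get_model_and_tokenizer("anakin87/LFM2-2.6B-ttt-rl-merged")
env = vf.load_environment("anakin87/tictactoe")

# Training arguments from the list above; the effective batch size of 480
# is assembled from per-device batch size, gradient accumulation, and
# generations per prompt (exact split not stated in this card)
args = vf.grpo_defaults(run_name="lfm2-2.6b-ttt-rl-2")
args.max_steps = 400
args.learning_rate = 3e-5
args.temperature = 1.25  # high sampling temperature to encourage exploration

# Rank-8 LoRA, matching the adapter published in this repo
peft_cfg = vf.lora_defaults()
peft_cfg.r = 8

trainer = vf.RLTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
    peft_config=peft_cfg,
)
trainer.train()
```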

Evaluation (merged model)

100 games per setting.

| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games w/ invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |
| anakin87/LFM2-2.6B-ttt-rl | 86 | 12 | 2 | 100 | 1 |
| anakin87/LFM2-2.6B-ttt-rl-2 | 90 | 10 | 0 | 100 | 0 |

| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games w/ invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
| anakin87/LFM2-2.6B-ttt-rl | 0 | 85 | 15 | 100 | 1 |
| anakin87/LFM2-2.6B-ttt-rl-2 | 0 | 97 | 3 | 99.8 | 0 |
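
For reference, a minimal, self-contained sketch of the evaluation loop against a random opponent. `model_move` is a hypothetical stand-in for actually querying the model; the real evaluation also tracks format adherence and invalid moves:

```python
import random

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    # Return "X" or "O" if a line is completed, else None
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def model_move(board):
    # Placeholder: a real run would prompt the model with the board state
    # and parse the cell it picks. Here we just choose a legal move at random.
    return random.choice(legal_moves(board))

def play_game(model_symbol="X"):
    board = [" "] * 9
    turn = "X"  # X always moves first
    while legal_moves(board) and winner(board) is None:
        if turn == model_symbol:
            move = model_move(board)
        else:
            move = random.choice(legal_moves(board))  # random opponent
        board[move] = turn
        turn = "O" if turn == "X" else "X"
    return winner(board)  # "X", "O", or None for a draw

results = {"win": 0, "draw": 0, "loss": 0}
for _ in range(100):  # 100 games per setting, as in the tables above
    w = play_game()
    results["win" if w == "X" else "loss" if w == "O" else "draw"] += 1
print(results)
```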