LFM2-2.6B-ttt-rl-2

LoRA adapter (rank 8) from the second round of CISPO training for Tic Tac Toe, applied on top of anakin87/LFM2-2.6B-ttt-rl-merged.

This adapter must be loaded on top of the RL round 1 merged model (anakin87/LFM2-2.6B-ttt-rl-merged). A merged version of this round, with the adapter already applied, is available as anakin87/LFM2-2.6B-mr-tictactoe.
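
For reference, a minimal loading sketch using the standard transformers + peft path (repo ids from this card; device and dtype settings omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "anakin87/LFM2-2.6B-ttt-rl-merged"  # RL round 1 merged model
adapter_id = "anakin87/LFM2-2.6B-ttt-rl-2"    # this adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the rank-8 LoRA adapter on top of the round 1 merged model
model = PeftModel.from_pretrained(model, adapter_id)

# Optional: fuse the adapter weights into the base model for inference
model = model.merge_and_unload()
```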

This is a checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe).

🤗🕹️ Play against the final model

Training

  • Algorithm: CISPO via Verifiers RLTrainer
  • Environment: anakin87/tictactoe
  • Opponents: play a random move with 0–25% probability (lower randomness than round 1, so harder opponents)
  • Temperature: 1.25 (to encourage exploration)
  • Steps: 400, batch size 480, lr 3e-5, LoRA rank 8
  • Hardware: 2x NVIDIA H200 141GB (~8 hours)
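
As a rough sketch of how these settings could map onto a Verifiers training script. Hyperparameters come from the list above; the verifiers calls used here (get_model_and_tokenizer, load_environment, grpo_defaults, lora_defaults, RLTrainer) follow common verifiers usage but are assumptions that may differ across library versions:

```python
import verifiers as vf

# Load the round 1 merged model and the Tic Tac Toe environment
model, tokenizer = vf.get_model_and_tokenizer("anakin87/LFM2-2.6B-ttt-rl-merged")
env = vf.load_environment("anakin87/tictactoe")

# Training arguments from the list above; the effective batch size of 480
# is assembled from per-device batch size, gradient accumulation, and
# generations per prompt (exact split not stated in this card)
args = vf.grpo_defaults(run_name="lfm2-2.6b-ttt-rl-2")
args.max_steps = 400
args.learning_rate = 3e-5
args.temperature = 1.25  # high sampling temperature to encourage exploration

# Rank-8 LoRA, matching the adapter published in this repo
peft_cfg = vf.lora_defaults()
peft_cfg.r = 8

trainer = vf.RLTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
    peft_config=peft_cfg,
)
trainer.train()
```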

Evaluation (merged model)

100 games per setting.

| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games w/ invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |
| anakin87/LFM2-2.6B-ttt-rl | 86 | 12 | 2 | 100 | 1 |
| anakin87/LFM2-2.6B-ttt-rl-2 | 90 | 10 | 0 | 100 | 0 |

| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games w/ invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
| anakin87/LFM2-2.6B-ttt-rl | 0 | 85 | 15 | 100 | 1 |
| anakin87/LFM2-2.6B-ttt-rl-2 | 0 | 97 | 3 | 99.8 | 0 |
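
For reference, a minimal, self-contained sketch of the evaluation loop against a random opponent. `model_move` is a hypothetical stand-in for actually querying the model; the real evaluation also tracks format adherence and invalid moves:

```python
import random

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    # Return "X" or "O" if a line is completed, else None
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def model_move(board):
    # Placeholder: a real run would prompt the model with the board state
    # and parse the cell it picks. Here we just choose a legal move at random.
    return random.choice(legal_moves(board))

def play_game(model_symbol="X"):
    board = [" "] * 9
    turn = "X"  # X always moves first
    while legal_moves(board) and winner(board) is None:
        if turn == model_symbol:
            move = model_move(board)
        else:
            move = random.choice(legal_moves(board))  # random opponent
        board[move] = turn
        turn = "O" if turn == "X" else "X"
    return winner(board)  # "X", "O", or None for a draw

results = {"win": 0, "draw": 0, "loss": 0}
for _ in range(100):  # 100 games per setting, as in the tables above
    w = play_game()
    results["win" if w == "X" else "loss" if w == "O" else "draw"] += 1
print(results)
```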