Your RL environment is an SFT data factory 🏭
In LLM post-training, it's common to do a Supervised Fine-Tuning warm-up before Reinforcement Learning.
When teaching a new task, RL needs some signal to amplify, and SFT builds a good starting point, for example by teaching the expected output format.
If you've built an RL env, generating SFT synthetic data is basically free.
An env already has: task data, rollout logic, rewards.
1️⃣ pick a strong model
2️⃣ run it through the env
3️⃣ filter rollouts by reward
This works out of the box with Verifiers (Prime Intellect) and Atropos (Nous Research).
🧑‍💻 Example: https://github.com/anakin87/llm-rl-environments-lil-course/blob/main/chapters/05.md
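The three steps above can be sketched in a few lines. This is a minimal illustration, not the Verifiers or Atropos API: the `env` and `model` interfaces (`sample_task`, `score`, `generate`) and the reward threshold are hypothetical placeholders.

```python
# Sketch of the recipe: run a strong model through an RL env and keep
# only high-reward rollouts as (prompt, completion) SFT pairs.
# `env` and `model` interfaces are hypothetical, for illustration only.

def collect_sft_data(env, model, n_episodes=1000, threshold=0.8):
    """Roll out `model` in `env`; keep pairs whose reward >= threshold."""
    dataset = []
    for _ in range(n_episodes):
        prompt = env.sample_task()              # task data the env already has
        completion = model.generate(prompt)     # rollout logic
        reward = env.score(prompt, completion)  # reward function
        if reward >= threshold:                 # rejection sampling by reward
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

The resulting list can be dumped to JSONL and fed straight into any SFT trainer.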
