# ThinkTwice-Olmo3-7B-Instruct
This model is fine-tuned from allenai/Olmo-3-7B-Instruct using the ThinkTwice framework.
Paper: ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement (arXiv: 2604.01591)
Code: https://github.com/CSSLab/ThinkTwice
## Overview
ThinkTwice is a simple two-phase GRPO-based framework that jointly trains LLMs to (1) solve reasoning problems and (2) refine their own solutions. In each pair of training steps, the model is first optimized on solving a reasoning problem, then optimized on refining its own solution to the same problem — using the same binary correctness reward in both phases, with no correctness signals or critique annotations required.
ThinkTwice reveals an implicit rectify-then-fortify curriculum: early in training, refinement predominantly corrects errors; as the model improves, refinement naturally shifts toward preserving already-correct solutions, yielding a progressively more reliable reward signal.
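The alternating schedule can be sketched as follows. This is a minimal, self-contained illustration, not the actual training code: `policy_sample`, `grpo_update`, and the prompt formats are placeholder assumptions, while the shared binary correctness reward and the solve-then-refine alternation follow the description above.

```python
import random

def binary_reward(answer: str, gold: str) -> float:
    # Same signal in both phases: 1 if the final answer is correct, else 0.
    # No critique annotations or intermediate correctness labels are needed.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def think_twice_step(policy_sample, grpo_update, problem: str, gold: str,
                     group_size: int = 4):
    # Phase 1: optimize on solving the problem (GRPO over a group of samples).
    solutions = [policy_sample(f"Solve: {problem}") for _ in range(group_size)]
    solve_rewards = [binary_reward(s, gold) for s in solutions]
    grpo_update(solutions, solve_rewards)

    # Phase 2: optimize on refining one of the model's own solutions,
    # reusing the same binary correctness reward.
    seed = random.choice(solutions)
    refinements = [policy_sample(f"Problem: {problem}\nDraft: {seed}\nRefine:")
                   for _ in range(group_size)]
    refine_rewards = [binary_reward(r, gold) for r in refinements]
    grpo_update(refinements, refine_rewards)
    return solve_rewards, refine_rewards
```

In this sketch, `grpo_update` stands in for a full GRPO policy-gradient step; the key point is that both phases share one reward function and one pass over the same problem.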
## Usage
This model supports both direct solving and self-refinement. Use it in two passes:
- Solve: prompt the model with the problem to get an initial answer.
- Self-Refine: prompt the model with the problem + its initial solution to get a refined answer.
See the GitHub repository for full usage instructions and evaluation scripts.
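The two-pass usage above can be sketched with small helpers. The prompt templates here are illustrative assumptions (the exact formats expected by the model are documented in the GitHub repository); `generate` is any callable that maps a prompt string to a completion string, e.g. a wrapped `transformers` text-generation pipeline.

```python
def build_solve_prompt(problem: str) -> str:
    """First pass: ask the model to solve the problem directly.
    (Hypothetical template; see the repo for the trained format.)"""
    return f"Solve the following problem step by step.\n\nProblem: {problem}"

def build_refine_prompt(problem: str, initial_solution: str) -> str:
    """Second pass: ask the model to review and refine its own solution."""
    return (
        f"Problem: {problem}\n\n"
        f"Initial solution:\n{initial_solution}\n\n"
        "Review the solution above. Fix any errors, or keep it if correct."
    )

def think_twice(generate, problem: str) -> tuple:
    """Run both passes; returns (initial_answer, refined_answer)."""
    initial = generate(build_solve_prompt(problem))
    refined = generate(build_refine_prompt(problem, initial))
    return initial, refined
```

Because refinement conditions on both the problem and the first-pass output, the second call's prompt is just the concatenation built by `build_refine_prompt`; no extra correctness signal is passed in at inference time.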
## Citation
@article{jiao2026thinktwice,
  title={ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement},
  author={Jiao, Difan and Wen, Qianfeng and Yang, Blair and Tang, Zhenwei and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.01591},
  year={2026}
}