arxiv:2602.03075

ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Published on Feb 3

Submitted by DI YIN on Feb 9

Tencent

Authors:

Di Yin ,

Abstract

ReMiT introduces a bidirectional training approach where reinforcement learning-guided mid-training token reweighting improves large language model pre-training and post-training performance through an iterative feedback loop.

AI-generated summary

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, without requiring a specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training and uses high-quality corpora under a rapidly decaying learning rate. Building on this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop that enables continuous, self-reinforcing evolution of LLMs.
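The abstract does not give the weighting rule, so the following is only an illustrative sketch of what token-level reweighting guided by an RL-tuned model could look like, not ReMiT's actual formulation. It assumes, hypothetically, that a token's weight grows with the gap between the RL-tuned model's log-probability and the base model's log-probability for that token, that weights are clipped and normalized to mean 1, and that they scale a standard per-token negative log-likelihood. The function names, the `temperature` and `clip` parameters, and the sample numbers are all made up for illustration.

```python
import math


def token_weights(rl_logprobs, base_logprobs, temperature=1.0, clip=(0.5, 2.0)):
    """Hypothetical rule: upweight tokens that the RL-tuned model scores
    higher than the base model. This is an assumption, not the paper's rule."""
    lo, hi = clip
    raw = [math.exp((r - b) / temperature)
           for r, b in zip(rl_logprobs, base_logprobs)]
    clipped = [min(max(w, lo), hi) for w in raw]
    mean = sum(clipped) / len(clipped)
    # Normalize so the average weight is 1 and the loss scale is preserved.
    return [w / mean for w in clipped]


def reweighted_nll(base_logprobs, weights):
    """Per-token negative log-likelihood of the base model, scaled by weights."""
    return -sum(w * lp for w, lp in zip(weights, base_logprobs)) / len(weights)


if __name__ == "__main__":
    # Made-up log-probabilities of the target tokens under each model.
    rl_lp = [-0.1, -2.0, -0.5]
    base_lp = [-1.0, -2.0, -0.4]
    w = token_weights(rl_lp, base_lp)
    print("weights:", w)
    print("reweighted loss:", reweighted_nll(base_lp, w))
```

With uniform weights the loss reduces to the ordinary mean negative log-likelihood, so a scheme of this shape changes only which tokens dominate the gradient during mid-training, not the form of the objective.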

Community

DIYIN

Paper author Paper submitter about 9 hours ago

RL-Guided Mid-Training for Iterative LLM Evolution

avahal

about 5 hours ago

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/remit-rl-guided-mid-training-for-iterative-llm-evolution-6696-41fd7ebd

Executive Summary
Detailed Breakdown
Practical Applications
