arXiv:2510.09295

MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

Published on Oct 10, 2025
Authors:

Abstract

AI-generated summary: MaP, a framework that combines checkpoint merging with the Pass@k metric, addresses instability in LLM evaluation by reducing parameter and measurement noise, enabling more reliable observation of training dynamics.

Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: Parameter Instability from training stochasticity and Evaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce MaP, a dual-pronged framework that synergistically integrates checkpoint Merging and the Pass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
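
The abstract names two concrete mechanisms, so a small sketch may help make them tangible. The snippet below is an illustration only, not the paper's implementation: it assumes a plain uniform average over a window of recent checkpoints (the paper's window size and weighting scheme are not stated in the abstract) and uses the standard unbiased pass@k estimator of Chen et al. (2021), the conventional way the Pass@k metric is computed. The function names merge_checkpoints and pass_at_k are hypothetical.

```python
import numpy as np
import torch

def merge_checkpoints(state_dicts):
    """Uniformly average the parameters of several recent checkpoints.

    Assumes all state dicts share the same keys and tensor shapes.
    A plain mean is an assumption here; the paper may use a
    different window size or weighting.
    """
    merged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        merged[key] = stacked.mean(dim=0)
    return merged

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator (Chen et al., 2021):
    probability that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 3 of 10 sampled completions solve a task; estimate pass@5.
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```

In this reading, the evaluation curve at each logging step would be computed on the merged weights rather than a single raw checkpoint, and scored with pass@k instead of a single greedy attempt, which is how the two components together reduce parameter and measurement noise.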

