arxiv:2602.01734

MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Published on Feb 2

· Submitted by

Xiao Liu on Feb 9

·

microsoft

Authors:

,

,

Xiao Liu ,

,

,

Abstract

Training instability in large language models is linked to weight matrix stable rank decline and Jacobian alignment, which MSign addresses through matrix sign operations to prevent gradient explosions.

AI-generated summary

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.

View arXiv page View PDF Add to collection

Community

Paper author Paper submitter about 19 hours ago

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.

avahal

about 8 hours ago

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/msign-an-optimizer-preventing-training-instability-in-large-language-models-via-stable-rank-restoration-861-19d6df5f

Executive Summary
Detailed Breakdown
Practical Applications

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.01734 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.01734 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.01734 in a Space README.md to link it from this page.

Collections including this paper 1