Qwen3-4B-Base-GRPO


Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the verl framework. It starts from Qwen3-4B-Base and applies Group Relative Policy Optimization (GRPO) on the DAPO-Math-17k-Processed dataset to strengthen mathematical reasoning and problem-solving.

This model is associated with the paper:
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Paper link: https://arxiv.org/abs/2604.13016

Model Description

This model was obtained by applying GRPO reinforcement learning to Qwen3-4B-Base with verl. The training is intended to improve math-focused reasoning performance in the on-policy distillation setting studied in the associated paper.

Key characteristics

  • Base model: Qwen3-4B-Base
  • Training framework: verl
  • Training stage: Reinforcement Learning (GRPO)
  • Parameter update: Full-parameter actor update
  • Primary domain: Mathematical reasoning
  • Reward model: Not used (reward_model.enable: false)
  • Rollout engine: vLLM
  • Context length: 32768 tokens
  • Responses per prompt: 8
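
Because GRPO uses no learned critic or reward model, each prompt's advantage comes from comparing the sampled responses against one another: the rule-based reward is computed for every response in the group (8 per prompt here), then normalized within the group. The snippet below is a minimal illustrative sketch of that idea, not code from the training run; the \boxed{}-based answer check and the variable names are hypothetical.

import re
import statistics

def math_reward(response: str, ground_truth: str) -> float:
    # Hypothetical rule-based check: take the content of the last
    # \boxed{...} in the response and compare it to the reference answer.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if matches and matches[-1].strip() == ground_truth.strip() else 0.0

def grpo_advantages(responses: list[str], ground_truth: str) -> list[float]:
    # Group-relative advantage: (reward - group mean) / group std.
    rewards = [math_reward(r, ground_truth) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]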

Training Details

Training configuration

  • Framework: verl
  • Algorithm: grpo
  • GRPO outcome weight: 1.0
  • Learned reward model: disabled (reward_model.enable: false)
  • Reward source: custom rule-based math reward function
  • Training dataset: DAPO-Math-17k-Processed
  • Training file: datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
  • Validation datasets: AIME25, AMC23, AIME24
  • Max prompt length: 1024 tokens
  • Max response length: 7168 tokens
  • Validation max response length: 31744 tokens
  • Max model length: 32768 tokens
  • Rollout temperature: 1.0
  • Repetition penalty: 1.0
  • KL loss: disabled
  • Format reward: disabled
  • Loss aggregation: token-mean
  • Learning rate: 1e-6
  • PPO mini-batch size: 64
  • PPO micro-batch size per GPU: 1
  • Tensor parallel size: 1
  • Number of GPUs: 8
  • Number of epochs: 1
  • Save frequency: every 20 steps
  • Test frequency: every 20 steps
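
For readers who want to set up a similar run, the sketch below shows how the settings above could map onto verl's Hydra-style command-line overrides. This is an illustrative reconstruction based on verl's public GRPO examples, not the authors' actual launch script; key names may differ across verl versions.

import subprocess

# Reconstructed overrides mirroring the configuration listed above.
overrides = [
    "algorithm.adv_estimator=grpo",
    "data.train_files=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet",
    "data.max_prompt_length=1024",
    "data.max_response_length=7168",
    "actor_rollout_ref.model.path=Qwen/Qwen3-4B-Base",
    "actor_rollout_ref.actor.optim.lr=1e-6",
    "actor_rollout_ref.actor.ppo_mini_batch_size=64",
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1",
    "actor_rollout_ref.actor.use_kl_loss=False",
    "actor_rollout_ref.actor.loss_agg_mode=token-mean",
    "actor_rollout_ref.rollout.name=vllm",
    "actor_rollout_ref.rollout.n=8",
    "actor_rollout_ref.rollout.temperature=1.0",
    "actor_rollout_ref.rollout.tensor_model_parallel_size=1",
    "reward_model.enable=False",
    "trainer.n_gpus_per_node=8",
    "trainer.total_epochs=1",
    "trainer.save_freq=20",
    "trainer.test_freq=20",
]

# Launch verl's PPO/GRPO entry point with the overrides above.
subprocess.run(["python3", "-m", "verl.trainer.main_ppo", *overrides], check=True)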

Dataset

  • Training dataset: DAPO-Math-17k-Processed
  • Validation datasets: AIME25, AMC23, AIME24
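
To sanity-check the data locally, the parquet file listed in the training configuration can be inspected directly. A minimal sketch, assuming only that the file is a standard parquet table (the actual column names depend on the preprocessing, so the code prints the schema instead of assuming specific fields):

import pandas as pd

# Path as listed in the training configuration above.
df = pd.read_parquet("datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet")

print(df.shape)    # number of examples and columns
print(df.columns)  # actual field names depend on the preprocessing
print(df.iloc[0])  # peek at the first example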

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lllyx/Qwen3-4B-Base-GRPO"

# Load the tokenizer and weights; torch_dtype="auto" picks up the checkpoint's
# native BF16 precision, and device_map="auto" places the model on available
# hardware automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
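
A minimal generation example follows. The prompt format is an assumption (this checkpoint descends from a base model, so a plain math prompt is used rather than a chat template), and the sampling temperature mirrors the rollout temperature listed above:

prompt = "If 3x + 5 = 20, what is x? Put the final answer in \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # matches the rollout temperature used during training
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))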

Citation

If you use this model, please consider citing the related paper:

@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}