Nalanda Qwen 2.5 7B GRPO

A fine-tuned version of Qwen/Qwen2.5-7B-Instruct specialized for Indian competitive exam questions (JEE Mains, JEE Advanced, NEET UG) across Physics, Chemistry, Mathematics, and Biology.

Training Methodology

This model was trained using a two-stage pipeline inspired by Yoshihara et al. (2025, ICML) and DeepSeekMath (Shao et al., 2024):

Stage 1: Light Supervised Fine-Tuning (SFT)

  • 200 training steps (~5% of dataset)
  • Data mixing: 70% JEE/NEET questions + 30% general instruction data (SlimOrca)
  • LoRA rank 8, attention layers only, learning rate 3e-5
  • NEFTune noise (alpha=5) for improved generalization
  • Purpose: Introduce domain vocabulary and question formats without overwriting general knowledge
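
The 70/30 data mix above can be sketched in plain Python. This is illustrative only: the dataset variables and the `mix_datasets` helper are assumptions, not the actual training code.

```python
# Sketch of the Stage 1 mix: 70% JEE/NEET questions + 30% general
# instruction data (SlimOrca). Dataset contents here are placeholders.
import random

def mix_datasets(domain, general, domain_frac=0.7, total=None, seed=0):
    """Sample domain and general examples at the given ratio, then shuffle."""
    rng = random.Random(seed)
    total = total or len(domain) + len(general)
    n_domain = int(total * domain_frac)
    n_general = total - n_domain
    mixed = (rng.sample(domain, min(n_domain, len(domain)))
             + rng.sample(general, min(n_general, len(general))))
    rng.shuffle(mixed)
    return mixed

jee_neet = [{"q": f"domain-{i}"} for i in range(1000)]
slimorca = [{"q": f"general-{i}"} for i in range(1000)]
mixed = mix_datasets(jee_neet, slimorca, total=100)  # 70 domain, 30 general
```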

Stage 2: Group Relative Policy Optimization (GRPO)

  • 600 training steps with 8 model generations per prompt
  • 10,000 MCQs with verified correct answers (balanced: 2,500 per subject)
  • Three reward functions:
    • Correctness (max 2.0): High reward for correct answer
    • Format compliance (max 1.0): Reward for structured <answer> tags
    • Reasoning quality (max 1.0): Reward for showing work (equations, step indicators)
  • Learning rate 5e-6
  • Purpose: Teach the model to arrive at correct answers through its own reasoning
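
The three reward functions could look roughly like the following. The exact implementations are not published with this card, so the regexes and partial-credit weights below are assumptions that only mirror the stated maxima (2.0 / 1.0 / 1.0):

```python
# Hedged sketch of the three GRPO rewards listed above.
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Max 2.0: the letter inside <answer> tags matches the verified answer."""
    m = re.search(r"<answer>\s*\(?([A-D])\)?\s*</answer>", completion)
    return 2.0 if m and m.group(1) == gold else 0.0

def format_reward(completion: str) -> float:
    """Max 1.0: exactly one well-formed <answer>...</answer> block."""
    blocks = re.findall(r"<answer>.*?</answer>", completion, re.S)
    return 1.0 if len(blocks) == 1 else 0.0

def reasoning_reward(completion: str) -> float:
    """Max 1.0: partial credit for visible work (equations, step indicators)."""
    score = 0.0
    if "=" in completion:
        score += 0.5                      # at least one equation
    if re.search(r"(?i)step\s*\d", completion):
        score += 0.5                      # numbered step indicators
    return score

completion = "Step 1: v = 3t^2 - 6t + 2. Integrate over [0, 3]. <answer>A</answer>"
total = (correctness_reward(completion, "A")
         + format_reward(completion)
         + reasoning_reward(completion))  # max possible: 4.0
```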

Why GRPO Instead of SFT?

Standard SFT on the same 126K questions caused catastrophic forgetting (a 15pp accuracy drop): forcing the model to mimic specific solution patterns destroys its general reasoning ability. GRPO instead rewards the model for reaching correct answers through its own reasoning path, which preserves and even enhances general capabilities.
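
Concretely, GRPO scores the eight generations per prompt as a group and normalizes each reward against the group's mean and standard deviation, so the model is pushed toward whichever of its own reasoning paths scored best. A minimal sketch, assuming the standard formulation from Shao et al. (2024):

```python
# Group-relative advantage: each generation's reward is normalized
# against the statistics of its own group of samples.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 generations for one prompt, 3 of which earned the
# correctness reward of 2.0.
rewards = [2.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]
adv = group_advantages(rewards)  # positive for correct, negative for wrong
```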

Results

JEE/NEET Exam Accuracy (800 held-out MCQs)

| Subject     | Qwen 2.5 7B Baseline | This Model | Improvement |
|-------------|----------------------|------------|-------------|
| Physics     | 51.0%                | 65.0%      | +14.0pp     |
| Chemistry   | 61.5%                | 71.5%      | +10.0pp     |
| Mathematics | 56.0%                | 64.5%      | +8.5pp      |
| Biology     | 73.5%                | 77.5%      | +4.0pp      |
| Overall     | 60.5%                | 69.6%      | +9.1pp      |

Public Benchmark Preservation

| Benchmark      | Baseline | This Model | Delta  |
|----------------|----------|------------|--------|
| GSM8K          | 94.7%    | 96.0%      | +1.3pp |
| ARC-Challenge  | 90.0%    | 90.0%      | 0.0pp  |
| MMLU-Physics   | 81.1%    | 83.8%      | +2.7pp |
| MMLU-Chemistry | 62.0%    | 68.0%      | +6.0pp |

General reasoning is fully preserved: the fine-tuned model matches or exceeds the baseline on every public benchmark.

Training Data

Trained on 116,831 expert-curated JEE/NEET exam questions from Nalanda Data. The dataset covers:

  • JEE Mains & JEE Advanced (Physics, Chemistry, Mathematics)
  • NEET UG (Physics, Chemistry, Biology)
  • Each question includes: question text, four options, verified correct answer, step-by-step solution
  • Questions contain LaTeX mathematical notation
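
A record with these fields might look like the following. The field names and the sample question are hypothetical, not the actual Nalanda Data schema:

```python
# Illustrative record shape matching the fields listed above.
example = {
    "question": r"If $\int_0^1 x^2\,dx = k$, the value of $k$ is:",
    "options": {"A": "1", "B": "1/2", "C": "1/3", "D": "1/4"},
    "correct_answer": "C",
    "solution": r"$\int_0^1 x^2\,dx = [x^3/3]_0^1 = 1/3$, so the answer is (C).",
    "exam": "JEE Mains",
    "subject": "Mathematics",
}
```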

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Nalandadata/nalanda-qwen-7b-grpo",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")

messages = [
    {"role": "system", "content": "You are an expert at solving JEE and NEET exam questions. Think step by step, then state your final answer."},
    {"role": "user", "content": "A particle moves along the x-axis with velocity v = 3t^2 - 6t + 2 m/s. Find the displacement in the first 3 seconds.\n\n(A) 2 m\n(B) 4 m\n(C) 6 m\n(D) 8 m"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is required for the temperature setting to take effect
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
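
Since the model is rewarded during GRPO for placing its final choice inside structured <answer> tags, a small parser is useful for downstream scoring. This helper is a sketch, not part of the released code:

```python
# Extract the option letter from the model's <answer>...</answer> block.
import re

def extract_answer(text):
    """Return the option letter from the last <answer> block, or None."""
    matches = re.findall(r"<answer>\s*\(?([A-D])\)?\s*</answer>", text)
    return matches[-1] if matches else None

letter = extract_answer("Step 1: integrate v(t)... <answer>(C)</answer>")  # "C"
```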

With vLLM (recommended for production)

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(model="Nalandadata/nalanda-qwen-7b-grpo")
sampling = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=512)

# Apply the same chat template to the messages list shown above
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)

Training Infrastructure

  • Platform: Modal serverless GPU cloud
  • Training GPU: NVIDIA A10G (24GB)
  • Evaluation GPU: NVIDIA A100-40GB
  • Total compute cost: ~$47 USD across all experiments
  • Quantization: 4-bit QLoRA during training, saved as merged 16-bit for inference

Limitations

  • Mathematics accuracy improvement (+8.5pp) is lower than Physics/Chemistry, likely because math reasoning requires deeper structural changes
  • Model was trained on Indian competitive exam format; performance on non-MCQ or non-Indian-curriculum questions may vary
  • The model uses Qwen 2.5's chat template — ensure you apply it correctly for best results

Citation

If you use this model, please cite:

@misc{nalanda-qwen-grpo-2026,
  title={Nalanda Qwen 2.5 7B GRPO: Domain Data Drives LLM Fine-Tuning Performance},
  author={Nalanda Data},
  year={2026},
  url={https://huggingface.co/Nalandadata/nalanda-qwen-7b-grpo}
}

License

This model is released under the Apache 2.0 license, consistent with the base Qwen 2.5 model license.
