# Nalanda Qwen 2.5 7B GRPO
A fine-tuned version of Qwen/Qwen2.5-7B-Instruct specialized for Indian competitive exam questions (JEE Mains, JEE Advanced, NEET UG) across Physics, Chemistry, Mathematics, and Biology.
## Training Methodology
This model was trained using a two-stage pipeline inspired by Yoshihara et al. (2025, ICML) and DeepSeekMath (Shao et al., 2024):
### Stage 1: Light Supervised Fine-Tuning (SFT)
- 200 training steps (~5% of dataset)
- Data mixing: 70% JEE/NEET questions + 30% general instruction data (SlimOrca)
- LoRA rank 8, attention layers only, learning rate 3e-5
- NEFTune noise (alpha=5) for improved generalization
- Purpose: Introduce domain vocabulary and question formats without overwriting general knowledge
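For illustration, the Stage 1 adapter setup above can be sketched with `peft` and TRL. This is a hypothetical reconstruction under the stated hyperparameters (rank 8, attention-only, lr 3e-5, NEFTune alpha 5); the `lora_alpha`, dropout, and output path are assumptions not given in this card.

```python
from peft import LoraConfig
from trl import SFTConfig

# Rank-8 LoRA on attention projections only (module names follow
# Qwen 2.5's attention layers).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,            # assumed; the card only specifies the rank
    lora_dropout=0.05,        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Learning rate, step budget, and NEFTune noise are trainer-level settings.
sft_config = SFTConfig(
    learning_rate=3e-5,
    max_steps=200,
    neftune_noise_alpha=5,
    output_dir="stage1-sft",  # placeholder path
)
```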
### Stage 2: Group Relative Policy Optimization (GRPO)
- 600 training steps with 8 model generations per prompt
- 10,000 MCQs with verified correct answers (balanced: 2,500 per subject)
- Three reward functions:
  - Correctness (max 2.0): high reward for the correct final answer
  - Format compliance (max 1.0): reward for structured `<answer>` tags
  - Reasoning quality (max 1.0): reward for showing work (equations, step indicators)
- Learning rate 5e-6
- Purpose: Teach the model to arrive at correct answers through its own reasoning
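The three reward functions above can be sketched as plain Python scorers. This is a hypothetical re-implementation: the exact regexes, partial-credit scheme, and answer-extraction logic used in the actual run are assumptions.

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Max 2.0: full reward when the extracted answer letter matches."""
    m = re.search(r"<answer>\s*\(?([A-D])\)?\s*</answer>", completion)
    return 2.0 if m and m.group(1) == gold else 0.0

def format_reward(completion: str) -> float:
    """Max 1.0: reward for emitting a well-formed <answer>...</answer> tag."""
    return 1.0 if re.search(r"<answer>.*?</answer>", completion, re.DOTALL) else 0.0

def reasoning_reward(completion: str) -> float:
    """Max 1.0: partial credit for visible work (equations, step markers)."""
    score = 0.0
    if "=" in completion:                                      # equations present
        score += 0.5
    if re.search(r"(?i)step\s*\d|first,|then,", completion):   # step indicators
        score += 0.5
    return score

def total_reward(completion: str, gold: str) -> float:
    """Combined reward, max 4.0, matching the three components above."""
    return (correctness_reward(completion, gold)
            + format_reward(completion)
            + reasoning_reward(completion))
```

In GRPO, each of the 8 generations per prompt is scored this way, and advantages are computed relative to the group mean, so no separate value model is needed.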
## Why GRPO Instead of SFT?
Standard SFT on the same 126K questions caused catastrophic forgetting (-15pp accuracy drop). SFT forces the model to mimic specific solution patterns, destroying reasoning ability. GRPO rewards the model for arriving at correct answers through its own reasoning path, which preserves and enhances general capabilities.
## Results
### JEE/NEET Exam Accuracy (800 held-out MCQs)
| Subject | Qwen 2.5 7B Baseline | This Model | Improvement |
|---|---|---|---|
| Physics | 51.0% | 65.0% | +14.0pp |
| Chemistry | 61.5% | 71.5% | +10.0pp |
| Mathematics | 56.0% | 64.5% | +8.5pp |
| Biology | 73.5% | 77.5% | +4.0pp |
| Overall | 60.5% | 69.6% | +9.1pp |
### Public Benchmark Preservation
| Benchmark | Baseline | This Model | Delta |
|---|---|---|---|
| GSM8K | 94.7% | 96.0% | +1.3pp |
| ARC-Challenge | 90.0% | 90.0% | 0.0pp |
| MMLU-Physics | 81.1% | 83.8% | +2.7pp |
| MMLU-Chemistry | 62.0% | 68.0% | +6.0pp |
General reasoning is fully preserved: the fine-tuned model matches or exceeds the baseline on every public benchmark.
## Training Data
Trained on 116,831 expert-curated JEE/NEET exam questions from Nalanda Data. The dataset covers:
- JEE Mains & JEE Advanced (Physics, Chemistry, Mathematics)
- NEET UG (Physics, Chemistry, Biology)
- Each question includes: question text, four options, verified correct answer, step-by-step solution
- Questions contain LaTeX mathematical notation
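For illustration, a dataset record with the fields listed above might look like the following. The field names and this specific question are hypothetical, not taken from the actual dataset:

```json
{
  "exam": "NEET UG",
  "subject": "Physics",
  "question": "A ball is dropped from a height of 20 m. Taking $g = 10\\ \\mathrm{m/s^2}$, the time to reach the ground is",
  "options": {"A": "1 s", "B": "2 s", "C": "3 s", "D": "4 s"},
  "answer": "B",
  "solution": "Step 1: $h = \\frac{1}{2}gt^2$. Step 2: $t = \\sqrt{2h/g} = \\sqrt{40/10} = 2$ s."
}
```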
## Usage
### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Nalandadata/nalanda-qwen-7b-grpo",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")

messages = [
    {"role": "system", "content": "You are an expert at solving JEE and NEET exam questions. Think step by step, then state your final answer."},
    {"role": "user", "content": "A particle moves along the x-axis with velocity v = 3t^2 - 6t + 2 m/s. Find the displacement in the first 3 seconds.\n\n(A) 2 m\n(B) 3 m\n(C) 5 m\n(D) 8 m"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### With vLLM (recommended for production)
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(model="Nalandadata/nalanda-qwen-7b-grpo")
sampling = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=512)

# Build the prompt with the same chat template and messages as above
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
```
## Training Infrastructure
- Platform: Modal serverless GPU cloud
- Training GPU: NVIDIA A10G (24GB)
- Evaluation GPU: NVIDIA A100-40GB
- Total compute cost: ~$47 USD across all experiments
- Quantization: 4-bit QLoRA during training, saved as merged 16-bit for inference
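The 4-bit QLoRA loading described above might be configured as follows. This is a sketch: the quantization type, compute dtype, and double-quantization setting are common QLoRA defaults assumed here, not confirmed by this card.

```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for QLoRA training (assumed settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

# After training, the LoRA adapters are merged into the base weights and
# the model is saved in 16-bit for inference, e.g. via
# PeftModel.from_pretrained(base, adapter).merge_and_unload().
```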
## Limitations
- Mathematics accuracy improvement (+8.5pp) is lower than Physics/Chemistry, likely because math reasoning requires deeper structural changes
- Model was trained on Indian competitive exam format; performance on non-MCQ or non-Indian-curriculum questions may vary
- The model uses Qwen 2.5's chat template — ensure you apply it correctly for best results
## Citation
If you use this model, please cite:
```bibtex
@misc{nalanda-qwen-grpo-2026,
  title={Nalanda Qwen 2.5 7B GRPO: Domain Data Drives LLM Fine-Tuning Performance},
  author={Nalanda Data},
  year={2026},
  url={https://huggingface.co/Nalandadata/nalanda-qwen-7b-grpo}
}
```
## License
This model is released under the Apache 2.0 license, consistent with the base Qwen 2.5 model license.