Nalanda Qwen 2.5 7B GRPO

A fine-tuned version of Qwen/Qwen2.5-7B-Instruct specialized for Indian competitive exam questions (JEE Mains, JEE Advanced, NEET UG) across Physics, Chemistry, Mathematics, and Biology.

Training Methodology

This model was trained using a two-stage pipeline inspired by Yoshihara et al. (2025, ICML) and DeepSeekMath (Shao et al., 2024):

Stage 1: Light Supervised Fine-Tuning (SFT)

  • 200 training steps (~5% of dataset)
  • Data mixing: 70% JEE/NEET questions + 30% general instruction data (SlimOrca)
  • LoRA rank 8, attention layers only, learning rate 3e-5
  • NEFTune noise (alpha=5) for improved generalization
  • Purpose: Introduce domain vocabulary and question formats without overwriting general knowledge
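
The 70/30 data mix above can be sketched in plain Python. This is illustrative only: the dataset variables and the `mix_datasets` helper are assumptions, not the actual training code.

```python
# Sketch of the Stage 1 mix: 70% JEE/NEET questions + 30% general
# instruction data (SlimOrca). Dataset contents here are placeholders.
import random

def mix_datasets(domain, general, domain_frac=0.7, total=None, seed=0):
    """Sample domain and general examples at the given ratio, then shuffle."""
    rng = random.Random(seed)
    total = total or len(domain) + len(general)
    n_domain = int(total * domain_frac)
    n_general = total - n_domain
    mixed = (rng.sample(domain, min(n_domain, len(domain)))
             + rng.sample(general, min(n_general, len(general))))
    rng.shuffle(mixed)
    return mixed

jee_neet = [{"q": f"domain-{i}"} for i in range(1000)]
slimorca = [{"q": f"general-{i}"} for i in range(1000)]
mixed = mix_datasets(jee_neet, slimorca, total=100)  # 70 domain, 30 general
```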

Stage 2: Group Relative Policy Optimization (GRPO)

  • 600 training steps with 8 model generations per prompt
  • 10,000 MCQs with verified correct answers (balanced: 2,500 per subject)
  • Three reward functions:
    • Correctness (max 2.0): High reward for correct answer
    • Format compliance (max 1.0): Reward for structured <answer> tags
    • Reasoning quality (max 1.0): Reward for showing work (equations, step indicators)
  • Learning rate 5e-6
  • Purpose: Teach the model to arrive at correct answers through its own reasoning
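
The three reward functions could look roughly like the following. The exact implementations are not published with this card, so the regexes and partial-credit weights below are assumptions that only mirror the stated maxima (2.0 / 1.0 / 1.0):

```python
# Hedged sketch of the three GRPO rewards listed above.
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Max 2.0: the letter inside <answer> tags matches the verified answer."""
    m = re.search(r"<answer>\s*\(?([A-D])\)?\s*</answer>", completion)
    return 2.0 if m and m.group(1) == gold else 0.0

def format_reward(completion: str) -> float:
    """Max 1.0: exactly one well-formed <answer>...</answer> block."""
    blocks = re.findall(r"<answer>.*?</answer>", completion, re.S)
    return 1.0 if len(blocks) == 1 else 0.0

def reasoning_reward(completion: str) -> float:
    """Max 1.0: partial credit for visible work (equations, step indicators)."""
    score = 0.0
    if "=" in completion:
        score += 0.5                      # at least one equation
    if re.search(r"(?i)step\s*\d", completion):
        score += 0.5                      # numbered step indicators
    return score

completion = "Step 1: v = 3t^2 - 6t + 2. Integrate over [0, 3]. <answer>A</answer>"
total = (correctness_reward(completion, "A")
         + format_reward(completion)
         + reasoning_reward(completion))  # max possible: 4.0
```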

Why GRPO Instead of SFT?

Standard SFT on the same 126K questions caused catastrophic forgetting (a 15pp accuracy drop): forcing the model to mimic specific solution patterns destroys its general reasoning ability. GRPO instead rewards the model for reaching correct answers through its own reasoning path, which preserves and even enhances general capabilities.
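
Concretely, GRPO scores the eight generations per prompt as a group and normalizes each reward against the group's mean and standard deviation, so the model is pushed toward whichever of its own reasoning paths scored best. A minimal sketch, assuming the standard formulation from Shao et al. (2024):

```python
# Group-relative advantage: each generation's reward is normalized
# against the statistics of its own group of samples.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 generations for one prompt, 3 of which earned the
# correctness reward of 2.0.
rewards = [2.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]
adv = group_advantages(rewards)  # positive for correct, negative for wrong
```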

Results

JEE/NEET Exam Accuracy (800 held-out MCQs)

| Subject     | Qwen 2.5 7B Baseline | This Model | Improvement |
|-------------|----------------------|------------|-------------|
| Physics     | 51.0%                | 65.0%      | +14.0pp     |
| Chemistry   | 61.5%                | 71.5%      | +10.0pp     |
| Mathematics | 56.0%                | 64.5%      | +8.5pp      |
| Biology     | 73.5%                | 77.5%      | +4.0pp      |
| Overall     | 60.5%                | 69.6%      | +9.1pp      |

Public Benchmark Preservation

| Benchmark      | Baseline | This Model | Delta  |
|----------------|----------|------------|--------|
| GSM8K          | 94.7%    | 96.0%      | +1.3pp |
| ARC-Challenge  | 90.0%    | 90.0%      | 0.0pp  |
| MMLU-Physics   | 81.1%    | 83.8%      | +2.7pp |
| MMLU-Chemistry | 62.0%    | 68.0%      | +6.0pp |

General reasoning is fully preserved: the fine-tuned model matches or exceeds the baseline on every public benchmark.

Training Data

Trained on 116,831 expert-curated JEE/NEET exam questions from Nalanda Data. The dataset covers:

  • JEE Mains & JEE Advanced (Physics, Chemistry, Mathematics)
  • NEET UG (Physics, Chemistry, Biology)
  • Each question includes: question text, four options, verified correct answer, step-by-step solution
  • Questions contain LaTeX mathematical notation
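
A record with these fields might look like the following. The field names and the sample question are hypothetical, not the actual Nalanda Data schema:

```python
# Illustrative record shape matching the fields listed above.
example = {
    "question": r"If $\int_0^1 x^2\,dx = k$, the value of $k$ is:",
    "options": {"A": "1", "B": "1/2", "C": "1/3", "D": "1/4"},
    "correct_answer": "C",
    "solution": r"$\int_0^1 x^2\,dx = [x^3/3]_0^1 = 1/3$, so the answer is (C).",
    "exam": "JEE Mains",
    "subject": "Mathematics",
}
```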

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Nalandadata/nalanda-qwen-7b-grpo",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")

messages = [
    {"role": "system", "content": "You are an expert at solving JEE and NEET exam questions. Think step by step, then state your final answer."},
    {"role": "user", "content": "A particle moves along the x-axis with velocity v = 3t^2 - 6t + 2 m/s. Find the displacement in the first 3 seconds.\n\n(A) 2 m\n(B) 4 m\n(C) 6 m\n(D) 8 m"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is required for the temperature setting to take effect
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
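
Since the model is rewarded during GRPO for placing its final choice inside structured <answer> tags, a small parser is useful for downstream scoring. This helper is a sketch, not part of the released code:

```python
# Extract the option letter from the model's <answer>...</answer> block.
import re

def extract_answer(text):
    """Return the option letter from the last <answer> block, or None."""
    matches = re.findall(r"<answer>\s*\(?([A-D])\)?\s*</answer>", text)
    return matches[-1] if matches else None

letter = extract_answer("Step 1: integrate v(t)... <answer>(C)</answer>")  # "C"
```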

With vLLM (recommended for production)

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(model="Nalandadata/nalanda-qwen-7b-grpo")
sampling = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=512)

# Apply the same chat template to the messages list shown above
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)

Training Infrastructure

  • Platform: Modal serverless GPU cloud
  • Training GPU: NVIDIA A10G (24GB)
  • Evaluation GPU: NVIDIA A100-40GB
  • Total compute cost: ~$47 USD across all experiments
  • Quantization: 4-bit QLoRA during training, saved as merged 16-bit for inference

Limitations

  • Mathematics accuracy improvement (+8.5pp) is lower than Physics/Chemistry, likely because math reasoning requires deeper structural changes
  • Model was trained on Indian competitive exam format; performance on non-MCQ or non-Indian-curriculum questions may vary
  • The model uses Qwen 2.5's chat template — ensure you apply it correctly for best results

Citation

If you use this model, please cite:

@misc{nalanda-qwen-grpo-2026,
  title={Nalanda Qwen 2.5 7B GRPO: Domain Data Drives LLM Fine-Tuning Performance},
  author={Nalanda Data},
  year={2026},
  url={https://huggingface.co/Nalandadata/nalanda-qwen-7b-grpo}
}

License

This model is released under the Apache 2.0 license, consistent with the base Qwen 2.5 model license.
