WebArbiter-3B

A principle-guided reasoning Process Reward Model for web agents

Published at ICLR 2026

Paper | Code | Website | Collection | Demo

Introduction

WebArbiter-3B is a 3B reasoning Process Reward Model (PRM) for web agents, built on Qwen2.5-3B-Instruct. Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

Despite its compact size, WebArbiter-3B achieves an average Best-of-N (BoN) accuracy of 59.06% on WEBPRMBENCH, outperforming the previous state-of-the-art WebPRM, WebShepherd-3B, by 15.5 points and surpassing all open-source LLM-as-judge baselines of up to 70B parameters. For larger, stronger variants, see WebArbiter-7B and WebArbiter-8B-Qwen3 under Related Resources.

Highlights

  • Reasoning as reward: Generates structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
  • Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
  • Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
  • Efficient and deployable: Strong performance at 3B parameters, suitable for resource-constrained deployment scenarios.

Results on WebPRMBench

Models marked with ⋆ are ours. Bold = best at comparable scale.

| Model | Mind2Web Pair | Mind2Web BoN | WebArena Pair | WebArena BoN | AssistantBench Pair | AssistantBench BoN | WorkArena Pair | WorkArena BoN | Avg. Pair | Avg. BoN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary LLM-as-judge** | | | | | | | | | | |
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| **Open-source LLM-as-judge** | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 76.46 | 36.93 | 60.32 | 15.42 | 75.83 | 33.33 | 64.45 | 19.34 | 69.27 | 26.76 |
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| **WebPRMs (3B)** | | | | | | | | | | |
| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | **46.67** | 50.00 | 21.23 | 68.08 | 43.60 |
| ⋆ WebArbiter-3B | **93.32** | **78.42** | **81.97** | **56.22** | **78.33** | **46.67** | **81.01** | **54.81** | **83.65** | **59.06** |
| **WebPRMs (7B+)** | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ WebArbiter-7B | **97.07** | **89.53** | **88.43** | **68.66** | **89.17** | **70.00** | **82.09** | **70.19** | **89.19** | **74.60** |

Notably, WebArbiter-3B also outperforms WebShepherd-8B, a checklist-based WebPRM more than twice its size, on average BoN accuracy (59.06 vs. 43.28), demonstrating the efficiency of the principle-guided reasoning approach.

Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-3B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```

Example output:

```
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
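
To consume the verdict programmatically, parse the final <Answer> tag out of the generated text. A minimal sketch, continuing from the Quick Start `response` (the `parse_verdict` helper below is our illustration, not part of the model's API):

```python
import re

def parse_verdict(response: str) -> str | None:
    """Extract the preferred response from the final <Answer> tag, if present."""
    match = re.search(r"<Answer>\s*(Response\s*[12])\s*</Answer>", response)
    return match.group(1) if match else None

print(parse_verdict(response))  # -> "Response 1" for the example above
```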

Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
| --- | --- | --- |
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | — |
| Base Model | Qwen2.5-3B-Instruct | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 9e-6) |
| Framework | LLaMA-Factory | veRL |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |

Source data for both stages: the WebPRM Collection (~30k step-level preference pairs from Mind2Web).
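
The Stage 2 reward can be made concrete with a short sketch. This is our own illustration of a binary verifiable reward (the paper's exact implementation, e.g. any format checks, may differ), reusing the `parse_verdict` helper sketched under Quick Start:

```python
def verifiable_reward(rollout: str, gold_choice: str) -> float:
    """Binary verifiable reward for GRPO: 1.0 iff the rollout's parsed
    verdict matches the ground-truth preferred response, else 0.0.
    `gold_choice` is e.g. "Response 1".
    """
    return 1.0 if parse_verdict(rollout) == gold_choice else 0.0
```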

Intended Uses

WebArbiter-3B is designed to:

  • Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
  • Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution (a minimal selection sketch follows this list).
  • Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.
  • Deploy in resource-constrained settings: A practical choice where 7B+ models are too large, while still significantly outperforming larger checklist-based WebPRMs.
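
Because verdicts are pairwise, Best-of-N selection can be run as a sequential knockout over the candidate actions. A minimal sketch, assuming a user-supplied `compare(state, a, b)` wrapper around the Quick Start generation code that returns whichever of the two actions the model prefers (all names here are illustrative):

```python
def best_of_n(state: str, candidates: list[str], compare) -> str:
    """Select the best candidate action via a knockout tournament:
    keep a running champion and challenge it with each remaining
    candidate. Costs N - 1 PRM calls for N candidates.
    """
    champion = candidates[0]
    for challenger in candidates[1:]:
        champion = compare(state, champion, challenger)
    return champion
```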

Limitations

  • Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
  • English-only: Training and evaluation are conducted exclusively in English-language web environments.
  • Safe-action bias: The model may overvalue cautious actions (e.g., preferring a hover over a click) because the accessibility tree does not encode interaction effects.
  • Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust that semantic signal over low-level element-id (bid) verification, potentially missing incorrect element references.

License

This model is released under Apache 2.0, following the base model Qwen2.5-3B-Instruct.

Related Resources

| Resource | Link |
| --- | --- |
| WebArbiter-8B-Qwen3 (strongest) | ZYao720/WebArbiter-8B-Qwen3 |
| WebArbiter-7B | ZYao720/WebArbiter-7B |
| WebArbiter-4B-Qwen3 | ZYao720/WebArbiter-4B-Qwen3 |
| WEBPRMBENCH (benchmark) | ZYao720/WEBPRMBENCH |
| Training Data | ZYao720/WebArbiter-Data |
| Search Trajectories | ZYao720/WebArbiter-Trajectories |

Citation

```bibtex
@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872},
}
```