---
license: apache-2.0
tags:
  - llm-routing
  - model-selection
  - budget-optimization
  - knn
language:
  - en
library_name: sklearn
pipeline_tag: text-classification
---

# R2-Router: LLM Router with Joint Model-Budget Optimization

R2-Router routes each query to the optimal (LLM, token budget) pair, jointly optimizing accuracy and inference cost. Ranked #1 on the RouterArena leaderboard.

Paper: R2-Router (arXiv)

## RouterArena Performance

Official leaderboard results on 8,400 queries:

| Metric | Value |
|---|---|
| Accuracy | 71.23% |
| Cost per 1K Queries | $0.061 |
| Arena Score (beta = 0.1) | 71.60 |
| Robustness Score | 45.71% |
| Rank | #1 |

## Quick Start

### Installation

```bash
pip install scikit-learn numpy joblib huggingface_hub
```

### Load Pre-trained Checkpoints

```python
from huggingface_hub import snapshot_download
import sys

# Download the repository and make router.py importable
path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

# Load pre-trained KNN checkpoints (no training needed)
router = R2Router.from_pretrained(path)

# Route a query (requires a 1024-dim Qwen3-0.6B embedding; see Get Query Embeddings below)
result = router.route(embedding)
print(f"Model: {result['model_full_name']}")
print(f"Token Budget: {result['token_limit']}")
print(f"Predicted Quality: {result['predicted_quality']:.3f}")
```

### Train from Scratch

```python
from huggingface_hub import snapshot_download
import sys

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

# Fit the KNN from the provided sub_10 training data
router = R2Router.from_training_data(path, k=80)

# Route a query (same 1024-dim embedding as above)
result = router.route(embedding)
```

### Get Query Embeddings

R2-Router uses Qwen3-0.6B embeddings (1024-dim). You can generate them with:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-0.6B")
embedding = model.encode("What is the capital of France?")
```

Or with vLLM for faster batch inference:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B", runner="pooling")
outputs = llm.embed(["What is the capital of France?"])
embedding = outputs[0].outputs.embedding
```
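
End to end, the embedding and routing halves wire together as below. This just recombines the snippets above; nothing here is new API:

```python
import sys

from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)
from router import R2Router

router = R2Router.from_pretrained(path)
embedder = SentenceTransformer("Qwen/Qwen3-0.6B")

embedding = embedder.encode("What is the capital of France?")  # 1024-dim vector
result = router.route(embedding)
print(result["model_full_name"], result["token_limit"])
```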

## Architecture

R2-Router jointly optimizes which model to use and how many tokens to allocate per query.

### Routing Formula

```
risk(M, b) = (1 - lambda) * predicted_quality(query, M, b)
             - lambda * predicted_tokens(query, M) * price_M / 1e6

(M*, b*) = argmax over all (model, budget) pairs of risk(M, b)
```
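
A worked illustration using values from the tables below (illustrative, not a leaderboard number): with lambda = 0.999, Qwen3-235B-A22B at $0.463 per million output tokens, predicted quality 0.80, and 400 predicted output tokens, risk = 0.001 * 0.80 - 0.999 * 400 * 0.463 / 1e6 ≈ 0.0008 - 0.000185 ≈ 0.000615. The large lambda scales the quality term down to the same order of magnitude as the per-query cost, so cost differences between models carry real weight in the argmax.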

### Pipeline

```
Input Query
    |
[1] Embed with Qwen3-0.6B -> 1024-dim vector
    |
[2] For each (model, budget) pair:
      - KNN predicts quality (accuracy)
      - KNN predicts output token count
      - Compute risk = (1 - lambda) * quality - lambda * cost
    |
[3] Select the (model, budget) pair with the highest risk
    |
Output: (model_name, token_budget)
```
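
Steps [2]-[3] amount to a small grid search. A minimal sketch, assuming one fitted quality regressor per (model, budget) pair and one token regressor per model; the dictionary layout and function name are illustrative, not the repo's actual API:

```python
import numpy as np

def select_pair(embedding, quality_knn, token_knn, models, budgets, prices, lam=0.999):
    """Score every (model, budget) pair and return the highest-scoring one."""
    x = np.asarray(embedding, dtype=np.float32).reshape(1, -1)   # (1, 1024)
    best_pair, best_risk = None, float("-inf")
    for m in models:
        pred_tokens = float(token_knn[m].predict(x)[0])          # expected output length
        cost = pred_tokens * prices[m] / 1e6                     # dollars for this query
        for b in budgets:
            quality = float(quality_knn[(m, b)].predict(x)[0])   # predicted accuracy
            risk = (1 - lam) * quality - lam * cost
            if risk > best_risk:
                best_pair, best_risk = (m, b), risk
    return best_pair                                             # (model_name, token_budget)
```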

## Model Pool (6 LLMs)

| Model | Output price ($/M tokens) |
|---|---|
| Qwen3-235B-A22B | $0.463 |
| Qwen3-Next-80B-A3B | $1.10 |
| Qwen3-30B-A3B | $0.33 |
| Qwen3-Coder-Next | $0.30 |
| Gemini 2.5 Flash | $2.50 |
| Claude 3 Haiku | $1.25 |

## Token Budgets

Four output token limits: 100, 200, 400, and 800 tokens.

## Key Parameters

| Parameter | Value |
|---|---|
| KNN K | 80 |
| Lambda | 0.999 |
| Distance metric | Cosine |
| KNN weights | Distance-weighted |
| Embedding dim | 1024 |

## Repository Contents

```
config.json             # Router configuration (models, budgets, prices, hyperparams)
router.py               # Self-contained inference code
training_data/
  embeddings.npy        # Sub_10 training embeddings (809 x 1024)
  labels.json           # Per-(model, budget) accuracy & token labels
checkpoints/
  quality_knn_*.joblib  # Pre-fitted KNN quality predictors (18 total)
  token_knn_*.joblib    # Pre-fitted KNN token predictors (6 total)
```
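
The checkpoint files can also be inspected directly with joblib. A minimal sketch that globs the files rather than assuming their exact names:

```python
import glob
import os

import joblib

# `path` is the snapshot_download() directory from the Quick Start
for f in sorted(glob.glob(os.path.join(path, "checkpoints", "quality_knn_*.joblib"))):
    knn = joblib.load(f)  # a fitted scikit-learn KNeighborsRegressor
    print(os.path.basename(f), type(knn).__name__)
```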

## Two Ways to Use

1. Load checkpoints (`from_pretrained`): directly load the pre-fitted KNN models; no training needed.
2. Train from data (`from_training_data`): use the provided training embeddings and labels to fit your own KNN with custom hyperparameters (e.g., a different K or distance metric), as in the example below.
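
For instance, refitting with a smaller neighborhood (only the `k` argument is documented in this card; other hyperparameter names are not shown here):

```python
router = R2Router.from_training_data(path, k=40)
```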

## Training Details

- Training Data: RouterArena sub_10 split (809 queries, roughly 10% of the full 8,400)
- Method: KNeighborsRegressor with cosine distance and distance weighting (see the sketch below)
- Evaluation: full 8,400 RouterArena queries (no data leakage)
- Training Time: < 1 second (KNN fitting)
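
A minimal sketch of that fit using the shipped training data. The internal layout of `labels.json` is an assumption for illustration; the real parsing lives in `router.py`:

```python
import json
import os

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.load(os.path.join(path, "training_data", "embeddings.npy"))  # (809, 1024)
with open(os.path.join(path, "training_data", "labels.json")) as f:
    labels = json.load(f)  # per-(model, budget) labels; layout below is assumed

# One regressor per prediction target, matching the Key Parameters table.
# `y` is a hypothetical per-query accuracy vector for one (model, budget) target.
y = np.asarray(labels["Qwen3-30B-A3B"]["400"]["accuracy"], dtype=np.float32)
knn = KNeighborsRegressor(n_neighbors=80, metric="cosine", weights="distance")
knn.fit(X, y)
```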

## Citation

```bibtex
@article{r2router2026,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={TODO},
  year={2026},
  url={https://arxiv.org/abs/TODO}
}
```

## License

Apache 2.0