---
license: apache-2.0
tags:
- llm-routing
- model-selection
- budget-optimization
- knn
language:
- en
library_name: sklearn
pipeline_tag: text-classification
---
# R2-Router: LLM Router with Joint Model-Budget Optimization
**R2-Router** routes each query to an (LLM, token budget) pair, jointly optimizing accuracy and inference cost. It ranks **#1** on the [RouterArena](https://routerarena.github.io/) leaderboard.
**Paper**: [R2-Router (arxiv)](https://arxiv.org/abs/TODO)
## RouterArena Performance
Official leaderboard results on 8,400 queries:
| Metric | Value |
|--------|-------|
| Accuracy | 71.23% |
| Cost per 1K Queries | $0.061 |
| Arena Score (beta=0.1) | **71.60** |
| Robustness Score | 45.71% |
| Rank | **#1** |
## Quick Start
### Installation
```bash
pip install scikit-learn numpy joblib huggingface_hub
```
### Load Pre-trained Checkpoints
```python
from huggingface_hub import snapshot_download
import sys
# Download model
path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)
from router import R2Router
# Load pre-trained KNN checkpoints (no training needed)
router = R2Router.from_pretrained(path)
# Route a query. `embedding` must be a 1024-dim Qwen3-0.6B vector (see "Get Query Embeddings")
result = router.route(embedding)
print(f"Model: {result['model_full_name']}")
print(f"Token Budget: {result['token_limit']}")
print(f"Predicted Quality: {result['predicted_quality']:.3f}")
```
### Train from Scratch
```python
from huggingface_hub import snapshot_download
import sys
path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)
from router import R2Router
# Train KNN from the provided sub_10 training data
router = R2Router.from_training_data(path, k=80)
# Route a query. `embedding` must be a 1024-dim Qwen3-0.6B vector (see "Get Query Embeddings")
result = router.route(embedding)
```
### Get Query Embeddings
R2-Router uses [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) embeddings (1024-dim). You can generate them with:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-0.6B")
embedding = model.encode("What is the capital of France?")
```
Or with vLLM for faster batch inference:
```python
from vllm import LLM
llm = LLM(model="Qwen/Qwen3-0.6B", runner="pooling")
outputs = llm.embed(["What is the capital of France?"])
embedding = outputs[0].outputs.embedding
```
## Architecture
R2-Router jointly optimizes **which model** to use and **how many tokens** to allocate per query.
### Routing Formula
```
risk(M, b) = (1 - lambda) * predicted_quality(query, M, b)
           - lambda * predicted_tokens(query, M) * price_M / 1e6

(M*, b*) = argmax over (M, b) of risk(M, b)
```
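
The selection rule above can be sketched in plain Python. This is a minimal illustration, not the code in `router.py`: the function name `select_pair` and the dictionary layout are assumptions made for the example, and the quality and token predictions are made-up numbers. The prices come from the Model Pool table.

```python
def select_pair(quality, tokens, prices, lam):
    """Return the (model, budget) pair with the highest routing score.

    quality: {model: {budget: predicted quality in [0, 1]}}
    tokens:  {model: predicted output tokens}
    prices:  {model: output price in USD per 1M tokens}
    lam:     trade-off weight lambda in [0, 1]
    """
    best_pair, best_score = None, float("-inf")
    for model, per_budget in quality.items():
        # Predicted cost in USD; per the formula it depends on the model only.
        cost = tokens[model] * prices[model] / 1e6
        for budget, q in per_budget.items():
            score = (1 - lam) * q - lam * cost
            if score > best_score:
                best_pair, best_score = (model, budget), score
    return best_pair, best_score


# Hypothetical predictions for a single query (illustrative values only)
quality = {
    "Qwen3-30B-A3B": {100: 0.50, 200: 0.60, 400: 0.70, 800: 0.72},
    "Gemini 2.5 Flash": {100: 0.55, 200: 0.65, 400: 0.80, 800: 0.85},
}
tokens = {"Qwen3-30B-A3B": 300, "Gemini 2.5 Flash": 250}
prices = {"Qwen3-30B-A3B": 0.33, "Gemini 2.5 Flash": 2.50}

pair_cost_aware, _ = select_pair(quality, tokens, prices, lam=0.999)
pair_quality_only, _ = select_pair(quality, tokens, prices, lam=0.0)
print(pair_cost_aware)    # ('Qwen3-30B-A3B', 800)
print(pair_quality_only)  # ('Gemini 2.5 Flash', 800)
```

With `lam=0.0` the rule simply picks the highest predicted quality; with `lam=0.999` the cheaper model wins because its quality deficit is smaller than the weighted cost difference.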
### Pipeline
```
Input Query
    |
    [1] Embed with Qwen3-0.6B -> 1024-dim vector
    |
    [2] For each (model, budget) pair:
        - KNN predicts quality (accuracy)
        - KNN predicts output token count
        - Compute risk = (1-lambda) * quality - lambda * cost
    |
    [3] Select (model, budget) with highest risk
    |
Output: (model_name, token_budget)
```
### Model Pool (6 LLMs)
| Model | Output price (USD per 1M tokens) |
|-------|----------------------------------|
| Qwen3-235B-A22B | $0.463 |
| Qwen3-Next-80B-A3B | $1.10 |
| Qwen3-30B-A3B | $0.33 |
| Qwen3-Coder-Next | $0.30 |
| Gemini 2.5 Flash | $2.50 |
| Claude 3 Haiku | $1.25 |
### Token Budgets
Four output token limits: **100, 200, 400, 800** tokens.
### Key Parameters
| Parameter | Value |
|-----------|-------|
| KNN K | 80 |
| Lambda | 0.999 |
| Distance Metric | Cosine |
| KNN Weights | Distance-weighted |
| Embedding Dim | 1024 |
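
A lambda of 0.999 looks extreme, but the two terms in the routing formula live on very different scales: quality lies in [0, 1], while the reported cost is $0.061 per 1K queries, or about $0.00006 per query. Rearranging the formula, a more expensive option only wins if its predicted quality gain exceeds `lambda / (1 - lambda)` times the extra cost. The sketch below works through that arithmetic; the helper name `break_even_quality_gain` is hypothetical and not part of `router.py`.

```python
def break_even_quality_gain(extra_cost_usd, lam):
    """Quality gain needed for a pricier option to reach the same score.

    Setting (1 - lam) * dq - lam * dc = 0 gives dq = lam / (1 - lam) * dc.
    """
    return lam / (1 - lam) * extra_cost_usd


# Extra cost of 100 more output tokens on Gemini 2.5 Flash ($2.50 per 1M tokens)
extra_cost = 100 * 2.50 / 1e6  # 0.00025 USD
gain = break_even_quality_gain(extra_cost, lam=0.999)
print(f"{gain:.3f}")  # 0.250
```

Under these assumptions, 100 additional output tokens on the most expensive model in the pool must buy roughly 0.25 of predicted quality to be worth selecting.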
## Repository Contents
```
config.json               # Router configuration (models, budgets, prices, hyperparams)
router.py                 # Self-contained inference code
training_data/
  embeddings.npy          # Sub_10 training embeddings (809 x 1024)
  labels.json             # Per-(model, budget) accuracy & token labels
checkpoints/
  quality_knn_*.joblib    # Pre-fitted KNN quality predictors (18 total)
  token_knn_*.joblib      # Pre-fitted KNN token predictors (6 total)
```
### Two Ways to Use
1. **Load checkpoints** (`from_pretrained`): Directly load pre-fitted KNN models. No training needed.
2. **Train from data** (`from_training_data`): Use the provided training embeddings and labels to fit your own KNN with custom hyperparameters (e.g., different K, distance metric).
## Training Details
- **Training Data**: RouterArena sub_10 split (809 queries, roughly 10% of the full 8,400)
- **Method**: KNeighborsRegressor with cosine distance, distance-weighted
- **Evaluation**: Full 8,400 RouterArena queries (no data leakage)
- **Training Time**: < 1 second (KNN fitting)
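
To make the method concrete, here is a dependency-free sketch of distance-weighted KNN regression under cosine distance. It is meant to mirror what scikit-learn's `KNeighborsRegressor` computes with `metric="cosine"` and `weights="distance"`, but it is not the code in `router.py`; the helper names are hypothetical, and toy 2-dimensional vectors stand in for the 1024-dim embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(train_x, train_y, query, k):
    """Predict a label as the inverse-distance-weighted mean of k neighbors."""
    neighbors = sorted(
        (cosine_distance(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # A zero distance would give an infinite weight, so exact matches
    # are averaged on their own instead.
    exact = [y for d, y in neighbors if d <= 1e-12]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Toy training set: three "embeddings" with per-query accuracy labels
train_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_y = [1.0, 0.0, 0.5]

pred = knn_predict(train_x, train_y, query=[2.0, 1.0], k=2)
print(f"{pred:.3f}")  # 0.664
```

The query direction is closest to `[1.0, 1.0]`, so the prediction is pulled toward that neighbor's label of 0.5 rather than sitting at the plain average of 0.75.
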
## Citation
```bibtex
@article{r2router2026,
title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
author={TODO},
year={2026},
url={https://arxiv.org/abs/TODO}
}
```
## License
Apache 2.0