docs: simplify README, remove GPU/CPU examples
README.md (CHANGED)
````diff
@@ -44,42 +44,7 @@ uv venv .venv && source .venv/bin/activate
 uv pip install scikit-learn numpy joblib huggingface_hub vllm
 ```
 
-###
-
-```python
-from huggingface_hub import snapshot_download
-import sys
-
-# 1. Download router
-path = snapshot_download("JiaqiXue/r2-router")
-sys.path.insert(0, path)
-
-from router import R2Router
-
-# 2. Load pre-trained KNN checkpoints
-router = R2Router.from_pretrained(path)
-
-# 3. Route a query (auto-embeds with Qwen3-0.6B via vLLM)
-result = router.route_text("What is the capital of France?")
-print(f"Model: {result['model_full_name']}")
-print(f"Token Budget: {result['token_limit']}")
-print(f"Predicted Quality: {result['predicted_quality']:.3f}")
-```
-
-`route_text()` automatically loads Qwen3-0.6B via vLLM on first call and caches it. Batch routing is also supported:
-
-```python
-queries = [
-    "What is the capital of France?",
-    "Write a Python function to sort a list",
-    "Translate 'hello' to Japanese",
-]
-results = router.route_text(queries)
-for q, r in zip(queries, results):
-    print(f"{q[:40]:40s} -> {r['model']} (budget={r['token_limit']})")
-```
-
-### With vLLM Server (Recommended for Production)
+### With vLLM Server (Recommended)
 
 Start the embedding server once, then route from any process without reloading the model:
 
@@ -104,18 +69,6 @@ result = router.route_text("What is the capital of France?")
 print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
 ```
 
-### CPU-Only (No GPU)
-
-If you don't have a GPU, provide pre-computed embeddings directly:
-
-```python
-router = R2Router.from_pretrained(path)
-
-# Your own 1024-dim embedding (e.g., from an API or pre-computed)
-embedding = np.random.randn(1024)  # replace with real embedding
-result = router.route(embedding)
-```
-
 ### Adjusting Lambda (Cost-Accuracy Tradeoff)
 
 The `lambda` parameter controls the tradeoff between accuracy and cost:
````