docs: simplify README, remove GPU/CPU examples
README.md (CHANGED)
````diff
@@ -44,42 +44,7 @@ uv venv .venv && source .venv/bin/activate
 uv pip install scikit-learn numpy joblib huggingface_hub vllm
 ```
 
-###
-
-```python
-from huggingface_hub import snapshot_download
-import sys
-
-# 1. Download router
-path = snapshot_download("JiaqiXue/r2-router")
-sys.path.insert(0, path)
-
-from router import R2Router
-
-# 2. Load pre-trained KNN checkpoints
-router = R2Router.from_pretrained(path)
-
-# 3. Route a query (auto-embeds with Qwen3-0.6B via vLLM)
-result = router.route_text("What is the capital of France?")
-print(f"Model: {result['model_full_name']}")
-print(f"Token Budget: {result['token_limit']}")
-print(f"Predicted Quality: {result['predicted_quality']:.3f}")
-```
-
-`route_text()` automatically loads Qwen3-0.6B via vLLM on first call and caches it. Batch routing is also supported:
-
-```python
-queries = [
-    "What is the capital of France?",
-    "Write a Python function to sort a list",
-    "Translate 'hello' to Japanese",
-]
-results = router.route_text(queries)
-for q, r in zip(queries, results):
-    print(f"{q[:40]:40s} -> {r['model']} (budget={r['token_limit']})")
-```
-
-### With vLLM Server (Recommended for Production)
+### With vLLM Server (Recommended)
 
 Start the embedding server once, then route from any process without reloading the model:
 
@@ -104,18 +69,6 @@ result = router.route_text("What is the capital of France?")
 print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
 ```
 
-### CPU-Only (No GPU)
-
-If you don't have a GPU, provide pre-computed embeddings directly:
-
-```python
-router = R2Router.from_pretrained(path)
-
-# Your own 1024-dim embedding (e.g., from an API or pre-computed)
-embedding = np.random.randn(1024)  # replace with real embedding
-result = router.route(embedding)
-```
-
 ### Adjusting Lambda (Cost-Accuracy Tradeoff)
 
 The `lambda` parameter controls the tradeoff between accuracy and cost:
````