JiaqiXue committed on
Commit 8ac636e · verified · 1 Parent(s): 47a8f8e

docs: simplify README, remove GPU/CPU examples

Files changed (1)
  1. README.md +1 -48
README.md CHANGED
@@ -44,42 +44,7 @@ uv venv .venv && source .venv/bin/activate
 uv pip install scikit-learn numpy joblib huggingface_hub vllm
 ```
 
-### Complete Example (GPU)
-
-```python
-from huggingface_hub import snapshot_download
-import sys
-
-# 1. Download router
-path = snapshot_download("JiaqiXue/r2-router")
-sys.path.insert(0, path)
-
-from router import R2Router
-
-# 2. Load pre-trained KNN checkpoints
-router = R2Router.from_pretrained(path)
-
-# 3. Route a query (auto-embeds with Qwen3-0.6B via vLLM)
-result = router.route_text("What is the capital of France?")
-print(f"Model: {result['model_full_name']}")
-print(f"Token Budget: {result['token_limit']}")
-print(f"Predicted Quality: {result['predicted_quality']:.3f}")
-```
-
-`route_text()` automatically loads Qwen3-0.6B via vLLM on first call and caches it. Batch routing is also supported:
-
-```python
-queries = [
-    "What is the capital of France?",
-    "Write a Python function to sort a list",
-    "Translate 'hello' to Japanese",
-]
-results = router.route_text(queries)
-for q, r in zip(queries, results):
-    print(f"{q[:40]:40s} -> {r['model']} (budget={r['token_limit']})")
-```
-
-### With vLLM Server (Recommended for Production)
+### With vLLM Server (Recommended)
 
 Start the embedding server once, then route from any process without reloading the model:
 
@@ -104,18 +69,6 @@ result = router.route_text("What is the capital of France?")
 print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
 ```
 
-### CPU-Only (No GPU)
-
-If you don't have a GPU, provide pre-computed embeddings directly:
-
-```python
-router = R2Router.from_pretrained(path)
-
-# Your own 1024-dim embedding (e.g., from an API or pre-computed)
-embedding = np.random.randn(1024)  # replace with real embedding
-result = router.route(embedding)
-```
-
 ### Adjusting Lambda (Cost-Accuracy Tradeoff)
 
 The `lambda` parameter controls the tradeoff between accuracy and cost:
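The diff's tail context mentions the README's `lambda` cost-accuracy tradeoff but cuts off before showing how it works. As a minimal illustrative sketch only (the `pick_model` function, model names, quality scores, and costs below are hypothetical and not part of the r2-router API), one common formulation scores each candidate as `quality - lambda * cost` and picks the maximizer:

```python
# Hypothetical sketch of a lambda cost-accuracy tradeoff, not r2-router's actual code.

def pick_model(candidates, lam):
    """Select the candidate maximizing predicted quality minus lambda-weighted cost."""
    return max(candidates, key=lambda c: c["quality"] - lam * c["cost"])

# Illustrative numbers only.
candidates = [
    {"model": "small-8b", "quality": 0.71, "cost": 0.2},
    {"model": "large-70b", "quality": 0.90, "cost": 1.0},
]

# A low lambda weights accuracy heavily; a high lambda favors cheaper models.
print(pick_model(candidates, lam=0.1)["model"])  # large-70b
print(pick_model(candidates, lam=0.5)["model"])  # small-8b
```

Under this formulation, sweeping `lambda` from 0 upward traces the router's accuracy-cost frontier: at 0 it always picks the highest-quality model, and as `lambda` grows it progressively shifts traffic to cheaper ones.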