A Unified Approach to Routing and Cascading for LLMs
Abstract
Researchers develop a unified theoretical framework that combines routing and cascading strategies for optimal large language model selection, identifying quality estimators as the key factor for improved cost-performance tradeoffs.
The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found. However, current approaches face three key limitations: they (1) lack formal proofs of optimality, (2) fail to identify the conditions under which these strategies are most effective at improving the cost-performance tradeoff, and (3) are unable to combine both paradigms for further improvements. To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose cascade routing, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin, and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection.
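To make the two baseline paradigms concrete, here is a minimal sketch (not the paper's implementation) of routing and cascading driven by a quality estimator. The `Model` class, the `estimate_quality` callable, the cost-penalty weight `lam`, and the `threshold` parameter are all illustrative assumptions, not names from the paper.

```python
# Illustrative sketch of routing vs. cascading with a quality estimator.
# All names here (Model, estimate_quality, lam, threshold) are assumptions
# for illustration, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Model:
    name: str
    cost: float
    # Hypothetical quality estimator: predicted answer quality in [0, 1]
    # for this model on a given query.
    estimate_quality: Callable[[str], float]


def route(models: List[Model], query: str, lam: float) -> Model:
    """Routing: pick one model per query, trading estimated quality
    against cost via a penalty weight lam."""
    return max(models, key=lambda m: m.estimate_quality(query) - lam * m.cost)


def cascade(models: List[Model], query: str, threshold: float) -> Model:
    """Cascading: try models from cheapest to most expensive, stopping
    at the first whose estimated quality clears the threshold."""
    ordered = sorted(models, key=lambda m: m.cost)
    for m in ordered:
        if m.estimate_quality(query) >= threshold:
            return m
    return ordered[-1]  # fall back to the most expensive model
```

Cascade routing generalizes both: at each step it may stop with the current answer or pick any remaining model, rather than committing up front (routing) or following a fixed cost order (cascading). The sketch also makes the paper's central observation tangible: both strategies are only as good as `estimate_quality`.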
Community
The following papers were recommended by the Semantic Scholar API
- Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey (2026)
- RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models (2026)
- Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents (2026)
- Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints (2026)
- ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing (2026)
- Scalable Prompt Routing via Fine-Grained Latent Task Discovery (2026)
- Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios (2026)