Real benchmarks
I'm struggling to understand your benchmark claims. The chart comparing against gpt-5.1 suggests that gpt-5.1 is beating your model, but on what? Did you benchmark routing performance, or some other metric?
It's very confusing and ultimately reads as if gpt-5.1 is better at routing than your routing model, but then it wouldn't make much sense for you to advertise that with such pride. It feels like the marketing team wrote this model card.
GPT-5.1 outperforms Plano-Orchestrator-4B, but it doesn't beat Plano-Orchestrator-30B-A3B. Note that our models are much smaller and faster than GPT-5.1. In our experiments, we benchmark routing performance in three scenarios (general, coding, and long-context), with routing accuracy as the only metric. Please let me know if you have any other questions.
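
To make the metric concrete: routing accuracy is just the fraction of prompts the router sends to the correct (gold-labeled) target, computed per scenario. Below is a minimal sketch of that computation; the dataset fields (`prompt`, `scenario`, `gold_route`) and the `route_prompt` function are hypothetical placeholders for illustration, not our actual evaluation harness.

```python
from collections import defaultdict

def routing_accuracy(examples, route_prompt):
    """Per-scenario accuracy: fraction of prompts routed to the gold route."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        predicted = route_prompt(ex["prompt"])  # route chosen by the router under test
        total[ex["scenario"]] += 1
        correct[ex["scenario"]] += int(predicted == ex["gold_route"])
    return {scenario: correct[scenario] / total[scenario] for scenario in total}

# Toy dataset covering the three scenarios (labels are made up for illustration).
examples = [
    {"prompt": "Summarize this email", "scenario": "general", "gold_route": "small-model"},
    {"prompt": "Fix this Python bug", "scenario": "coding", "gold_route": "code-model"},
    {"prompt": "Answer a question about a 200-page report", "scenario": "long-context", "gold_route": "long-context-model"},
]

# A trivial router that always picks the same route, just to show the call pattern.
print(routing_accuracy(examples, lambda prompt: "small-model"))
```

So when the chart says gpt-5.1 scores X on "coding", it means gpt-5.1 acting as the router picked the correct target for X% of the coding prompts; the same measurement is applied to our models.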