Real benchmarks
I'm struggling to understand your benchmark claims. The chart comparing against gpt-5.1 suggests that gpt-5.1 is beating your model, but on what? Did you benchmark routing performance, or some other metric?
It's very confusing and ultimately reads as if gpt-5.1 is better at routing than your routing model, but then it wouldn't make much sense for you to advertise that with such pride. It feels like the marketing team wrote this model card.
GPT-5.1 outperforms Plano-Orchestrator-4B, but it doesn't beat Plano-Orchestrator-30B-A3B. Note that our models are much smaller and faster than GPT-5.1. In our experiments, we benchmark routing performance in three scenarios (general, coding, and long-context), with routing accuracy as the only metric. Please let me know if you have any other questions.
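
To make the metric concrete: routing accuracy is just the fraction of prompts the router sends to the correct (gold-labeled) target, computed per scenario. Below is a minimal sketch of that computation; the dataset fields (`prompt`, `scenario`, `gold_route`) and the `route_prompt` function are hypothetical placeholders for illustration, not our actual evaluation harness.

```python
from collections import defaultdict

def routing_accuracy(examples, route_prompt):
    """Per-scenario accuracy: fraction of prompts routed to the gold route."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        predicted = route_prompt(ex["prompt"])  # route chosen by the router under test
        total[ex["scenario"]] += 1
        correct[ex["scenario"]] += int(predicted == ex["gold_route"])
    return {scenario: correct[scenario] / total[scenario] for scenario in total}

# Toy dataset covering the three scenarios (labels are made up for illustration).
examples = [
    {"prompt": "Summarize this email", "scenario": "general", "gold_route": "small-model"},
    {"prompt": "Fix this Python bug", "scenario": "coding", "gold_route": "code-model"},
    {"prompt": "Answer a question about a 200-page report", "scenario": "long-context", "gold_route": "long-context-model"},
]

# A trivial router that always picks the same route, just to show the call pattern.
print(routing_accuracy(examples, lambda prompt: "small-model"))
```

So when the chart says gpt-5.1 scores X on "coding", it means gpt-5.1 acting as the router picked the correct target for X% of the coding prompts; the same measurement is applied to our models.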