Broken MTP (speculative decoding)

#2
by Arien0 - opened

Well, now that it runs well, it's time to test MTP, and as with other quant methods, spec decoding is broken here too.

vLLM (0.19.0) detects the MTP layer (--speculative-config '{"method":"mtp","num_speculative_tokens":5}') and loads it:

(APIServer pid=4510) INFO 04-07 17:14:07 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=4510) INFO 04-07 17:14:07 [model.py:1678] Using max model len 32768
(APIServer pid=4510) INFO 04-07 17:14:07 [model.py:549] Resolved architecture: Qwen3_5MTP
(APIServer pid=4510) INFO 04-07 17:14:07 [model.py:1678] Using max model len 262144
(EngineCore pid=4574) INFO 04-07 17:14:15 [eagle.py:1376] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=4574) INFO 04-07 17:14:15 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.

However, it drafts plenty of tokens (Drafted: 1104 tokens), but the Avg Draft acceptance rate is always 0.0%:

(APIServer pid=4741) INFO 04-07 17:19:49 [loggers.py:259] Engine 000: Avg prompt throughput: 812.8 tokens/s, Avg generation throughput: 27.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:19:49 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 5.66 tokens/s, Accepted: 0 tokens, Drafted: 1104 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:19:59 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:19:59 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 169.20 tokens/s, Accepted: 0 tokens, Drafted: 1692 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:09 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:09 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 169.19 tokens/s, Accepted: 0 tokens, Drafted: 1692 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:19 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:19 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 168.80 tokens/s, Accepted: 0 tokens, Drafted: 1688 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:29 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:29 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 154.80 tokens/s, Accepted: 0 tokens, Drafted: 1548 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:39 [loggers.py:259] Engine 000: Avg prompt throughput: 755.4 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.3%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:39 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 90.79 tokens/s, Accepted: 0 tokens, Drafted: 908 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:49 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:49 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 169.59 tokens/s, Accepted: 0 tokens, Drafted: 1696 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:59 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:20:59 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 169.21 tokens/s, Accepted: 0 tokens, Drafted: 1692 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:09 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:09 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 168.39 tokens/s, Accepted: 0 tokens, Drafted: 1684 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:19 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:19 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 168.38 tokens/s, Accepted: 0 tokens, Drafted: 1684 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:29 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.8%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:29 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 168.39 tokens/s, Accepted: 0 tokens, Drafted: 1684 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:39 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.8%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:39 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 168.00 tokens/s, Accepted: 0 tokens, Drafted: 1680 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:49 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.0%, Prefix cache hit rate: 0.0%
(APIServer pid=4741) INFO 04-07 17:21:49 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 72.40 tokens/s, Accepted: 0 tokens, Drafted: 724 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%

This happens in every quant I've found so far, but it works correctly with the base model (Qwopus3.5-9B-v3 BF16), reaching an Avg Draft acceptance rate of 85.8%:

(APIServer pid=5069) INFO 04-07 17:29:14 [loggers.py:259] Engine 000: Avg prompt throughput: 1543.4 tokens/s, Avg generation throughput: 38.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:29:14 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.23, Accepted throughput: 5.79 tokens/s, Drafted throughput: 12.99 tokens/s, Accepted: 263 tokens, Drafted: 590 tokens, Per-position acceptance rate: 0.780, 0.576, 0.424, 0.297, 0.153, Avg Draft acceptance rate: 44.6%
(APIServer pid=5069) INFO 04-07 17:29:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:29:34 [loggers.py:259] Engine 000: Avg prompt throughput: 399.1 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:29:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.94, Accepted throughput: 8.65 tokens/s, Drafted throughput: 22.25 tokens/s, Accepted: 173 tokens, Drafted: 445 tokens, Per-position acceptance rate: 0.730, 0.517, 0.348, 0.191, 0.157, Avg Draft acceptance rate: 38.9%
(APIServer pid=5069) INFO 04-07 17:29:44 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 80.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:29:44 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.29, Accepted throughput: 55.80 tokens/s, Drafted throughput: 122.00 tokens/s, Accepted: 558 tokens, Drafted: 1220 tokens, Per-position acceptance rate: 0.816, 0.598, 0.402, 0.270, 0.201, Avg Draft acceptance rate: 45.7%
(APIServer pid=5069) INFO 04-07 17:29:54 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:29:54 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 47.80 tokens/s, Drafted throughput: 121.50 tokens/s, Accepted: 478 tokens, Drafted: 1215 tokens, Per-position acceptance rate: 0.741, 0.477, 0.321, 0.247, 0.181, Avg Draft acceptance rate: 39.3%
(APIServer pid=5069) INFO 04-07 17:30:04 [loggers.py:259] Engine 000: Avg prompt throughput: 783.5 tokens/s, Avg generation throughput: 38.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:30:04 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.79, Accepted throughput: 25.00 tokens/s, Drafted throughput: 70.00 tokens/s, Accepted: 250 tokens, Drafted: 700 tokens, Per-position acceptance rate: 0.786, 0.457, 0.279, 0.164, 0.100, Avg Draft acceptance rate: 35.7%
(APIServer pid=5069) INFO 04-07 17:30:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:30:14 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.96, Accepted throughput: 47.50 tokens/s, Drafted throughput: 120.99 tokens/s, Accepted: 475 tokens, Drafted: 1210 tokens, Per-position acceptance rate: 0.748, 0.521, 0.331, 0.223, 0.140, Avg Draft acceptance rate: 39.3%
(APIServer pid=5069) INFO 04-07 17:30:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 116.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:30:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 4.82, Accepted throughput: 92.00 tokens/s, Drafted throughput: 120.49 tokens/s, Accepted: 920 tokens, Drafted: 1205 tokens, Per-position acceptance rate: 0.929, 0.846, 0.772, 0.680, 0.589, Avg Draft acceptance rate: 76.3%
(APIServer pid=5069) INFO 04-07 17:30:34 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5069) INFO 04-07 17:30:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 5.29, Accepted throughput: 50.60 tokens/s, Drafted throughput: 59.00 tokens/s, Accepted: 506 tokens, Drafted: 590 tokens, Per-position acceptance rate: 0.966, 0.898, 0.864, 0.822, 0.737, Avg Draft acceptance rate: 85.8%

Do you think this is fixable with PolarQuant?
Greetings!

@Arien0 β€” great catch, and thanks for the detailed comparison with BF16!

The issue: MTP (Multi-Token Prediction) speculative decoding requires that the draft model's logits closely match the target model's logits. When weights are quantized to INT4, the small numerical differences accumulate across the MTP layers, causing the draft tokens to consistently diverge from what the target model would produce β†’ 0% acceptance.

This is a known limitation of weight-only INT4 quantization for speculative decoding β€” not specific to PolarQuant. It affects GPTQ, AWQ, and any INT4 method because:

  1. The main model runs in INT4 (slightly different logits than BF16)
  2. The MTP draft layers share lm_head and embed_tokens (which are BF16 in our model)
  3. But the intermediate representations from INT4 layers are different enough that MTP predictions never match
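That collapse can be illustrated with a toy simulation (this is an illustration of the general rejection-sampling scheme, not vLLM's actual implementation): speculative decoding accepts a draft token x with probability min(1, p_target(x) / p_draft(x)), so as the draft distribution drifts away from the target's, for example through accumulated quantization error, the acceptance rate falls. All numbers below (vocab size, noise scale) are illustrative assumptions.

```python
# Toy model of speculative-decoding acceptance under logit noise.
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def acceptance_rate(noise_sigma, vocab=64, trials=2000, seed=0):
    """Fraction of draft tokens accepted when draft logits = target logits + noise."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        target_logits = [rng.gauss(0, 2) for _ in range(vocab)]
        # Noise stands in for quantization-induced divergence of the draft path.
        draft_logits = [l + rng.gauss(0, noise_sigma) for l in target_logits]
        p_t = softmax(target_logits)
        p_d = softmax(draft_logits)
        # Sample the draft token from the draft distribution...
        tok = rng.choices(range(vocab), weights=p_d)[0]
        # ...and accept it with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_t[tok] / p_d[tok]):
            accepted += 1
    return accepted / trials

for sigma in (0.0, 0.5, 2.0):
    print(f"logit noise {sigma}: acceptance {acceptance_rate(sigma):.2f}")
```

With zero noise every draft is accepted; as the noise grows, acceptance drops, which is the qualitative pattern the 0.0% metric above suggests, just taken to the extreme.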

What could help:

  • FP8 quantization instead of INT4 β€” much closer to BF16 logits, MTP acceptance should be near-baseline. We could add an FP8 CompressedTensors variant.
  • KV cache quantization only (keep weights BF16) β€” MTP works, saves VRAM on context length instead of weights
  • Higher bit-width (INT8) β€” might recover enough precision for MTP acceptance
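For the KV-cache-only option, a rough sketch of the invocation (flag availability depends on your vLLM version and GPU; the model ID is a placeholder):

```shell
# Keep weights in BF16 so the MTP draft path stays accurate,
# and save VRAM on context by quantizing only the KV cache to FP8.
# Check that your GPU/vLLM build supports fp8 KV cache first.
vllm serve <model-id> \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
```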

What won't help:

  • PolarQuant's Hadamard rotation makes its INT4 more accurate than standard INT4, but the gap to BF16 is still too large for MTP draft acceptance

Bottom line: For speculative decoding, you need FP8 or BF16 weights. INT4 (any method) breaks MTP because the logit distributions diverge too much. If there's demand, we can publish an FP8 CompressedTensors variant of this model β€” same native vLLM loading, MTP compatible.

Would an FP8 version be useful for you?

I don't think FP8 would be useful with my RTX 3090. The MTP issue is secondary anyway, since the Marlin kernel kicks in with your Q5.
I was just guessing at why MTP is broken in all quants, but your reasoning seems fair enough. Thank you!

Thanks @Arien0 β€” agreed on the 3090 assessment. FP8 requires Ada/Hopper/Blackwell (SM89+), so it's not an option for your hardware anyway. Marlin INT4 is the right call.

Appreciate the persistent testing through this thread β€” your HumanEval benchmark was the catalyst for the v7 breakthrough (67.07% vs 66.87% BF16 on the 9B in standard mode). The 9B v7-GPTQ is now published if you want to rerun with Marlin kernel: https://huggingface.co/caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-v7-GPTQ

I think I've seen somewhere that MTP is broken in base model (Jackrong/Qwopus3.5-9B-v3)

Great note! I appreciate you.

I think I've seen somewhere that MTP is broken in base model (Jackrong/Qwopus3.5-9B-v3)

Not in 9B; there it works out of the box. It's broken in the 27B base.

You're absolutely right. Thank you for clarification!
