kimi-k2.6-eagle3-mla

Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.6.

This is a fine-tuned draft, anchored to the official lightseekorg/kimi-k2.6-eagle3-mla initialization. It targets multi-hop (downstream-position) acceptance while preserving the first-hop gain, evaluated by runtime accept-length on a frozen full-context held-out set.

Fine-tune setup

  • Init: lightseekorg/kimi-k2.6-eagle3-mla (official MLA weights)
  • Objective: Eagle3 distillation + multi-step TTT supervision (ttt_steps=4, ttt_step_loss_decay=1.0, off-policy downstream tokens)
  • Anti-over-specialization: L2-SP weight-space anchor toward the init (penalize trainable-param drift; lambda=1e-4)
  • Optimizer: lr 2e-5, cosine schedule
  • Checkpoint: best by held-out runtime accept-length
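
The objective and anchor terms listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the training code used for this checkpoint: the function names are hypothetical, weighting step k by decay**k in an unnormalized sum is an assumption about how ttt_step_loss_decay is applied, and the L2-SP term is written as lambda times the squared distance to the init (some formulations include a factor of 1/2).

```python
def ttt_step_weights(ttt_steps: int, decay: float) -> list[float]:
    """Per-step loss weights; with decay=1.0 every TTT step counts equally."""
    return [decay ** k for k in range(ttt_steps)]


def l2_sp_penalty(params: list[float], init_params: list[float], lam: float) -> float:
    """L2-SP anchor: penalize squared drift of trainable params from the init."""
    return lam * sum((p - p0) ** 2 for p, p0 in zip(params, init_params))


def total_loss(step_losses: list[float], params: list[float],
               init_params: list[float], decay: float = 1.0,
               lam: float = 1e-4) -> float:
    """Weighted sum of per-step distillation losses plus the L2-SP anchor.

    Assumption: step_losses[k] is the distillation loss at TTT step k.
    """
    weights = ttt_step_weights(len(step_losses), decay)
    distill = sum(w * loss for w, loss in zip(weights, step_losses))
    return distill + l2_sp_penalty(params, init_params, lam)


# Toy numbers for illustration only (not measured values).
print(ttt_step_weights(4, 1.0))
print(total_loss([2.0, 1.5, 1.0, 0.5], params=[1.0, 2.0], init_params=[1.0, 1.0]))
```

With ttt_step_loss_decay=1.0 the later hops are weighted the same as the first, which matches the stated goal of improving downstream-position acceptance rather than only the first hop.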

Performance

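Primary metric is accept_length: the average number of tokens produced per speculation step with num_speculative_tokens=3 (higher is better). Per-position numbers are acceptance rates at hop 0/1/2, measured per speculation step. The table satisfies accept_len = 1 + pos-0 + pos-1 + pos-2, so each rate reads as the fraction of steps in which that draft position was accepted (not a rate conditioned on the previous hop), and accept_len includes the one token the target model contributes at every step. Evaluated on a frozen full-context held-out judge set (912 prompts, greedy), vLLM 0.20.0, 8x H200, TP=8, max-model-len 32768.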

Model                       accept_len  pos-0  pos-1  pos-2
lightseek (official init)   2.30        0.633  0.404  0.264
this model                  2.345       0.648  0.419  0.278

This draft improves first-hop acceptance over the official init (pos-0: 0.633 to 0.648) while also lifting the downstream positions (pos-1: 0.404 to 0.419; pos-2: 0.264 to 0.278), raising overall accept length from 2.30 to 2.345 (about 2%).
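
As a sanity check on how the columns relate, the reported numbers are consistent with accept_len = 1 + pos-0 + pos-1 + pos-2. The sketch below recomputes accept length under that reading, which is inferred from the table rather than stated by the author. It also derives hop-to-hop conditional rates, assuming a draft position can only be accepted when all earlier positions were accepted. The helper names are hypothetical.

```python
# Values copied from the table above: (reported accept_len, [pos-0, pos-1, pos-2]).
ROWS = {
    "lightseek (official init)": (2.30, [0.633, 0.404, 0.264]),
    "this model": (2.345, [0.648, 0.419, 0.278]),
}


def accept_length_from_positions(rates: list[float]) -> float:
    """One token from the target model per step, plus the expected
    number of accepted draft tokens (sum of per-position rates)."""
    return 1.0 + sum(rates)


def conditional_rates(rates: list[float]) -> list[float]:
    """Rate at hop k given hop k-1 was accepted (prefix-acceptance assumption)."""
    return [rates[0]] + [rates[k] / rates[k - 1] for k in range(1, len(rates))]


for name, (reported, rates) in ROWS.items():
    derived = accept_length_from_positions(rates)
    cond = ", ".join(f"{r:.3f}" for r in conditional_rates(rates))
    print(f"{name}: reported={reported}, derived={derived:.3f}, conditional=[{cond}]")
```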

Usage

Serve with vLLM as the speculative draft for Kimi-K2.6, with num_speculative_tokens=3 in the speculative-config.
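
A minimal sketch of how the serve command might be assembled. Several things here are assumptions rather than facts from this card: TARGET_MODEL is a placeholder for wherever the Kimi-K2.6 weights live, the "eagle3" method name and the JSON keys should be checked against the documentation of the installed vLLM version, and the tensor-parallel and max-model-len values simply mirror the evaluation setup above. The draft repo id is the one shown on this model's page. The helper name build_serve_command is hypothetical.

```python
import json
import shlex

# Placeholder: replace with the local path or repo id of the Kimi-K2.6 target model.
TARGET_MODEL = "<path-or-repo-of-Kimi-K2.6>"
# Draft repo id as shown on this model's page.
DRAFT_MODEL = "k-l-lambda/kimi-k2.6-eagle3-mla"


def build_serve_command(target: str, draft: str, num_speculative_tokens: int = 3,
                        tensor_parallel_size: int = 8,
                        max_model_len: int = 32768) -> list[str]:
    """Assemble a `vllm serve` argument list with a speculative-config JSON."""
    speculative_config = {
        "method": "eagle3",  # assumed method name; verify for your vLLM version
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", target,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--speculative-config", json.dumps(speculative_config),
    ]


cmd = build_serve_command(TARGET_MODEL, DRAFT_MODEL)
# Print a shell-quoted command line; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```

Keeping num_speculative_tokens at 3 matches the setting under which the accept-length numbers above were measured; other values are not covered by this evaluation.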

Model size: 3B params, tensor type BF16 (Safetensors)
