# eagle3-kimik2.5-w4a8

An EAGLE-3 speculative decoding draft model for Kimi-K2.5, quantized to INT4 using AMD Quark.

## Model Details

| Property | Value |
|---|---|
| Base draft model | `lightseekorg/kimi-k2.5-eagle3` |
| Architecture | `LlamaForCausalLMEagle3` (1-layer Transformer) |
| Hidden size | 7168 |
| Vocab size | 163840 |
| Quantization | INT4 per-channel symmetric (Quark 0.11.1) |
| Model size | ~4.8 GB (vs. 6.0 GB BF16 original) |
| Quantized layers | `midlayer.self_attn.{q,k,v,o}_proj`, `midlayer.mlp.{gate,up,down}_proj` |
| Excluded from quantization | `embed_tokens`, `lm_head`, `fc` (fusion layer), all norms |

## How It Was Made

The BF16 EAGLE-3 draft model (`lightseekorg/kimi-k2.5-eagle3`) was quantized with AMD Quark using the following configuration:

- Weight quantization: INT4 per-channel symmetric
- Excluded layers: `embed_tokens`, `lm_head`, `fc` (EAGLE-3 fusion layer), all norm layers
- Tool: AMD Quark 0.11.1
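Per-channel symmetric quantization gives each output channel (row) of a weight matrix its own scale, with the zero point fixed at 0. The sketch below illustrates the arithmetic in plain NumPy; it is an illustration of the scheme, not Quark's actual implementation.

```python
import numpy as np

def quantize_int4_per_channel(w: np.ndarray):
    """Symmetric per-channel INT4: one scale per output channel (row),
    zero point fixed at 0, quantized values clipped to [-8, 7]."""
    # Map each row's max magnitude to 7 (largest positive INT4 value),
    # so the representable range is symmetric around zero.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix
w = np.array([[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]], dtype=np.float32)
q, s = quantize_int4_per_channel(w)
w_hat = dequantize(q, s)
```

The excluded layers (embeddings, LM head, the EAGLE-3 fusion layer, norms) stay in BF16 because they are either small or disproportionately sensitive to quantization error.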

## Performance (with Kimi K2.5 on 8x MI325X)

| Metric | Baseline (no EAGLE) | EAGLE-3 BF16 |
|---|---|---|
| TPOT, concurrency 2 (ms) | 23.79 | 8.26 |
| Output tok/s, concurrency 2 | 78.87 | 204.12 |
| TPOT, concurrency 40 (ms) | 59.33 | 40.87 |
| Output tok/s, concurrency 40 | 500.11 | 699.55 |
| GSM8K accuracy (10-shot) | 0.93 | 0.91 |
| Accept length | N/A | ~3.97 |
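The throughput numbers above imply roughly a 2.6x speedup at concurrency 2, shrinking to about 1.4x at concurrency 40 (speculative decoding pays off most when the GPU is otherwise underutilized). A quick check of the arithmetic:

```python
# Speedup implied by the table: EAGLE-3 output tok/s vs. baseline.
speedup_c2 = 204.12 / 78.87    # concurrency 2
speedup_c40 = 699.55 / 500.11  # concurrency 40
print(f"{speedup_c2:.2f}x at concurrency 2")   # ~2.59x
print(f"{speedup_c40:.2f}x at concurrency 40") # ~1.40x
```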

## Usage with SGLang

```bash
python -m sglang.launch_server \
    --model <path-to-kimi-k2.5> \
    --tp 8 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path <path-to-this-model> \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.80 \
    --trust-remote-code \
    --disable-radix-cache
```
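With `--speculative-eagle-topk 1` the draft model proposes a single chain of tokens, one per step, and the verifier scores those plus the root token. Under that assumption (this relation is our reading of SGLang's chain-drafting mode, not documented behavior), the flag values above are consistent with each other:

```python
# Chain drafting (topk = 1): one drafted token per step, plus the root
# token that the verifier always scores.
num_steps, topk = 3, 1
num_draft_tokens = num_steps * topk + 1
print(num_draft_tokens)  # 4, matching --speculative-num-draft-tokens
```

The ~3.97 accept length reported above sits just under this cap of 4, i.e. nearly every drafted chain is accepted in full.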

## Notes

- For AMD ROCm with the AITER attention backend, a patch is needed: guard the bare `if _use_mla_ps_kernel:` with `if self.use_mla and _use_mla_ps_kernel:` in `aiter_backend.py` (SGLang PR #20409).
- The `KimiK25ForConditionalGeneration` model wrapper needs `get_embed_and_head()`, `set_embed_and_head()`, and `set_eagle3_layers_to_capture()` methods added, delegating to its inner `language_model`.
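A minimal sketch of the delegation described in the second note, assuming the wrapper holds the wrapped model as `self.language_model` and that the inner model already implements these three methods (the method names come from the note; everything else here is illustrative):

```python
class KimiK25ForConditionalGeneration:  # illustrative skeleton, not the real class
    def __init__(self, language_model):
        self.language_model = language_model

    # EAGLE-3 shares the target model's embeddings and LM head with the
    # draft model; expose them by delegating to the inner language model.
    def get_embed_and_head(self):
        return self.language_model.get_embed_and_head()

    def set_embed_and_head(self, embed, head):
        self.language_model.set_embed_and_head(embed, head)

    # Tell the inner model which hidden layers to capture as the
    # auxiliary features EAGLE-3 conditions on.
    def set_eagle3_layers_to_capture(self, layer_ids=None):
        self.language_model.set_eagle3_layers_to_capture(layer_ids)
```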