Kimi-K2.5-DFlash
This model is still in training; the results reported here come from an early checkpoint.
DFlash is a speculative decoding method that uses a lightweight block diffusion model for drafting. The drafter proposes a block of tokens in parallel, which keeps drafting efficient while maintaining draft quality.
This model is the drafter component. It must be used in conjunction with the target model moonshotai/Kimi-K2.5. It was trained with a context length of 4096 tokens.
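To make the drafter/target split concrete, here is a minimal sketch of one generic greedy speculative decoding step. It is not the DFlash implementation: the acceptance rule, the toy `toy_target` and `toy_draft` functions, and the name `speculative_step` are all assumptions made for illustration. A real server scores the whole drafted block in a single target forward pass; the loop below calls the target once per position only for readability.

```python
from typing import Callable, List

Token = int


def speculative_step(
    prefix: List[Token],
    draft_fn: Callable[[List[Token], int], List[Token]],
    target_next: Callable[[List[Token]], Token],
    k: int,
) -> List[Token]:
    """Return the tokens committed by one draft-then-verify step.

    The drafter proposes k tokens at once. The target keeps the longest
    prefix of the draft that matches its own greedy choice, then adds one
    token of its own (a correction on mismatch, a bonus token otherwise).
    """
    draft = draft_fn(prefix, k)
    committed: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + committed)
        if tok != expected:
            committed.append(expected)  # target's correction ends the step
            return committed
        committed.append(tok)
    committed.append(target_next(prefix + committed))  # bonus token
    return committed


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


# The toy drafter agrees with the target except at block position 3.
def toy_draft(seq: List[Token], k: int) -> List[Token]:
    out = [(seq[-1] + i + 1) % 10 for i in range(k)]
    if k > 3:
        out[3] = (out[3] + 5) % 10
    return out


print(speculative_step([0], toy_draft, toy_target, 5))
```

In this toy run the first three drafted tokens are accepted and the fourth is replaced by the target's choice, so the step commits four tokens for one round of verification. The output is identical to what the target alone would have produced greedily.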
Quick Start
Installation
vLLM:
uv pip install vllm
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
SGLang:
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
Launch Server
vLLM:
vllm serve moonshotai/Kimi-K2.5 \
--speculative-config '{"method": "dflash", "model": "z-lab/Kimi-K2.5-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768
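The `--speculative-config` value is a JSON string, which is easy to mis-quote in a shell. As a small sketch, the same command can be assembled in Python so the JSON is serialized and quoted mechanically. The helper name `build_vllm_command` is hypothetical; the flags and values are copied from the command above.

```python
import json
import shlex
from typing import Dict, Union

SPEC_CONFIG: Dict[str, Union[str, int]] = {
    "method": "dflash",
    "model": "z-lab/Kimi-K2.5-DFlash",
    "num_speculative_tokens": 15,
}


def build_vllm_command(target: str, spec_config: Dict[str, Union[str, int]]) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    args = [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(spec_config),
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


print(build_vllm_command("moonshotai/Kimi-K2.5", SPEC_CONFIG))
```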
SGLang:
# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
python -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.5 \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Kimi-K2.5-DFlash \
--speculative-num-draft-tokens 16 \
--tp-size 8 \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--trust-remote-code
Tip: For long-context or agentic workloads, add --speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the drafter.
Usage
from openai import OpenAI
# Port 30000 is SGLang's default; vLLM serves on port 8000 unless --port is set.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[{"role": "user", "content": "Write a quicksort in Python."}],
max_tokens=4096,
)
print(response.choices[0].message.content)
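To see the effect of speculative decoding from the client side, the request above can be streamed and timed. The sketch below is one way to do that. The helper names (`iter_text`, `measure_stream`, `stream_from_server`) are hypothetical, and the base URL assumes the SGLang server started above. With speculative decoding a single streamed piece may carry several tokens, so the piece count is not a token count.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the non-empty text pieces from a chat-completions stream."""
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


def measure_stream(pieces: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a stream of text pieces and report simple timing statistics."""
    start = time.perf_counter()
    first_piece_s: Optional[float] = None
    n_pieces = 0
    n_chars = 0
    for piece in pieces:
        if first_piece_s is None:
            first_piece_s = time.perf_counter() - start
        n_pieces += 1
        n_chars += len(piece)
    total_s = time.perf_counter() - start
    return {
        "pieces": n_pieces,
        "chars": n_chars,
        "first_piece_s": first_piece_s,
        "total_s": total_s,
    }


def stream_from_server(prompt: str, base_url: str = "http://localhost:30000/v1"):
    """Open a streaming request against a running server (not called here)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    return client.chat.completions.create(
        model="moonshotai/Kimi-K2.5",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
        stream=True,
    )


# Offline demo on a fake stream; no server is needed for this line.
print(measure_stream(["def ", "quicksort", "(xs):"]))
```

With a server running, `measure_stream(iter_text(stream_from_server("Write a quicksort in Python.")))` gives the same statistics for a real response. Comparing runs with and without the speculative flags is one rough way to check the speedup on your own workload.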
Early Results
- Thinking: enabled
- Max new tokens: 4096
- Block size: 16
- Backend: SGLang (vLLM results may differ)
- Checkpoint: epoch 1.8
| Dataset | Accept Length |
| --- | --- |
| GSM8K | 5.7 |
| Math500 | 6.1 |
| HumanEval | 5.7 |
| MBPP | 4.7 |
| MT-Bench | 4.0 |
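As a rough way to read these numbers, the sketch below assumes that accept length means the average number of tokens committed per target verification step; the card does not define the term, so treat this as an interpretation. Under that assumption, generating N tokens takes about N / accept length target passes instead of N. The helper name `target_passes` is hypothetical, and the estimate ignores the drafter's own cost.

```python
import math
from typing import Dict

# Values copied from the results above.
ACCEPT_LENGTH: Dict[str, float] = {
    "GSM8K": 5.7,
    "Math500": 6.1,
    "HumanEval": 5.7,
    "MBPP": 4.7,
    "MT-Bench": 4.0,
}
BLOCK_SIZE = 16
MAX_NEW_TOKENS = 4096


def target_passes(n_tokens: int, accept_length: float) -> int:
    """Estimated target forward passes to emit n_tokens (assumed definition)."""
    return math.ceil(n_tokens / accept_length)


mean_accept = sum(ACCEPT_LENGTH.values()) / len(ACCEPT_LENGTH)
print(f"mean accept length: {mean_accept:.2f}")

for name, accept in ACCEPT_LENGTH.items():
    share = accept / BLOCK_SIZE  # portion of a drafted block that is kept
    passes = target_passes(MAX_NEW_TOKENS, accept)
    print(f"{name:10s} accept={accept:.1f} block share={share:.2f} passes~{passes}")
```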