lujangusface committed
Commit 03af6d3 · verified · 1 Parent(s): da1a0b9

Initial release: EAGLE3 draft head for Qwen3-Coder-Next (Exp E, acc_0=0.97)

Files changed (3)
  1. README.md +179 -0
  2. config.json +33 -0
  3. model.safetensors +3 -0
README.md ADDED
---
library_name: transformers
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
- gdn
- hybrid-attention
- code
---
<!-- Internal: exp-e (gpu/qwen3-coder-next) -->

# EAGLE3 Draft Head — Qwen3-Coder-Next

A lightweight EAGLE3 draft head for [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) Training-Time Test (TTT) objective.

Qwen3-Coder-Next uses a hybrid layer design that interleaves standard multi-head attention with GDN (linear recurrence) layers. Only 12 of the 48 layers are attention layers (every 4th: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from attention layers only — GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically, selecting layers 3, 23, and 47 (the first, middle, and last attention layers).
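The layer-selection rule above can be sketched in a few lines. This is an illustration, not the actual model code, and `select_aux_layers` is a hypothetical helper:

```python
# Sketch of EAGLE3 auxiliary-layer selection for a hybrid stack where every
# 4th layer (0-indexed 3, 7, ..., 47) is a standard attention layer.
def select_aux_layers(num_layers: int = 48, attn_period: int = 4) -> list[int]:
    # GDN layers are never eligible; keep only the attention layers
    attn_layers = [i for i in range(num_layers) if i % attn_period == attn_period - 1]
    # pick the first, middle, and last attention layers
    middle = attn_layers[(len(attn_layers) - 1) // 2]
    return [attn_layers[0], middle, attn_layers[-1]]

print(select_aux_layers())  # [3, 23, 47]
```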
**Blog post**: [TODO: link after publication]

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for Qwen3-Coder-Next EAGLE3 support.
**B=1 server** (wide tree — optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --tp 4 \
  --trust-remote-code \
  --attention-backend triton \
  --port 30000
```
**B=32 server** (narrow tree — eliminates the Terminal-Bench regression):

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --tp 4 \
  --trust-remote-code \
  --attention-backend triton \
  --port 30002
```
**Important**: The wide tree (topk=4) maximizes MT-Bench throughput at B=32 (1.31x) but regresses Terminal-Bench (0.89x). The narrow tree (topk=1) eliminates that regression at the cost of a lower peak speedup (1.10x on MT-Bench). Use the narrow tree for mixed or unknown workloads.
### Python Client

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |
### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token from hidden states captured at three auxiliary layers of the target model (layers 3, 23, and 47 — the first, middle, and last of the 12 attention layers). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.

GDN (linear recurrence) layers are excluded from auxiliary-layer selection because their hidden states encode sequential recurrence rather than per-token representations, making them incompatible with EAGLE3's draft prediction.
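As a toy illustration of why the objective targets accepted tokens (this is not SpecForge's actual TTT loss): if each drafted token is accepted with probability p and verification stops at the first rejection, a depth-k draft chain yields an expected accept length of p + p² + ... + p^k.

```python
# Toy model of greedy chain acceptance (illustrative only, not the TTT loss):
# token i of a depth-k draft survives only if tokens 1..i are all accepted,
# so E[accepted] = sum_{i=1..k} p**i for a per-token acceptance probability p.
def expected_accept_length(p: float, k: int) -> float:
    return sum(p ** i for i in range(1, k + 1))

print(round(expected_accept_length(0.8, 7), 2))  # 3.16 tokens per draft at p=0.8
```

Raising the per-token acceptance probability compounds along the chain, which is why TTT optimizes the whole simulated accept/reject trajectory rather than single-step accuracy.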
## Performance

### B=1 Inference Benchmarks (temp=0, TP=4, Triton backend)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---------|-----------------|----------------|---------|-------------|---------------|
| SWEBench-Verified | 163.9 | 249.7 | **1.52x** | 37.5% | 3.00 |
| HumanEval | 171.1 | 237.9 | **1.39x** | 20.0% | 1.60 |
| Terminal-Bench | 166.0 | 231.0 | **1.39x** | 34.7% | 2.77 |
| MT-Bench | 166.5 | 196.0 | **1.18x** | 30.6% | 2.45 |
| **Mean** | **166.9** | **228.7** | **1.37x** | **30.7%** | **2.46** |
### B=32 Inference Benchmarks (temp=0, TP=4, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| MT-Bench | 1,529.1 | 2,009.4 | **1.31x** |
| SWEBench-Verified | 2,010.4 | 2,186.5 | **1.09x** |
| HumanEval | 1,740.2 | 1,793.8 | **1.03x** |
| Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x |
| **Mean** | **1,897.5** | **2,011.7** | **1.06x** |
### B=32 Inference Benchmarks (temp=0, TP=4, narrow tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| MT-Bench | 1,529.1 | 1,688.6 | **1.10x** |
| Terminal-Bench | 2,310.5 | 1,785.4 | **1.03x** |
| HumanEval | 1,740.2 | 1,756.3 | **1.01x** |
| SWEBench-Verified | 2,010.4 | 1,998.7 | **1.00x** |
| **Mean** | **1,897.5** | **1,807.3** | **1.03x** |

*Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit `63291f7f51`.*
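The Speedup columns are simply EAGLE3 throughput divided by baseline throughput; the B=1 rows can be reproduced directly from the tok/s pairs:

```python
# Recompute the B=1 speedups from the (baseline, EAGLE3) tok/s pairs above.
b1 = {
    "SWEBench-Verified": (163.9, 249.7),
    "HumanEval": (171.1, 237.9),
    "Terminal-Bench": (166.0, 231.0),
    "MT-Bench": (166.5, 196.0),
}
speedups = {name: round(eagle / base, 2) for name, (base, eagle) in b1.items()}
print(speedups)  # {'SWEBench-Verified': 1.52, 'HumanEval': 1.39, 'Terminal-Bench': 1.39, 'MT-Bench': 1.18}
```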
## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 2048 |
| Num hidden layers | 1 |
| Num attention heads | 16 (4 KV heads) |
| head_dim | 128 |
| Intermediate size | 8192 |
| Auxiliary layers | [3, 23, 47] (attention layers only) |
| Vocab size | 151936 (target) / 32000 (draft) |
| Checkpoint size | ~278 MB |
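These dimensions are internally consistent: hidden_size splits evenly across the 16 attention heads, and grouped-query attention shares each of the 4 KV heads among 4 query heads. A quick sanity check, with the values copied from this repo's config.json:

```python
# Excerpt of the draft head's config.json (values copied from this repo)
cfg = {
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "head_dim": 128,
}
# per-head width must equal head_dim
assert cfg["hidden_size"] // cfg["num_attention_heads"] == cfg["head_dim"]
# GQA ratio: query heads sharing each KV head
print(cfg["num_attention_heads"] // cfg["num_key_value_heads"])  # 4
```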
## Limitations

- **TP=4 required.** With TP=8, the shared-expert dimension of 512 shards to 512/8 = 64, which is not divisible by the FP8 quantization block size block_n=128.
- **Triton attention backend required.** FlashInfer is incompatible with the head_dim=256 hybrid attention+GDN layers. Pass `--attention-backend triton`.
- **GDN layer constraint.** EAGLE3 auxiliary layers must be attention layers (every 4th), not GDN layers. The model code handles this automatically.
- **Temperature sensitivity.** Performance is best at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates.
- **Terminal-Bench regression at B=32.** The wide tree (topk=4) regresses Terminal-Bench to 0.89x. Use the narrow tree (topk=1) for mixed workloads.
- **Requires the SGLang fork.** Upstream SGLang does not yet include the Qwen3-Next EAGLE3 patches.
## License

This draft head is released under Apache 2.0, matching the [Qwen3-Coder-Next license](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct).
## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
config.json ADDED
{
  "architectures": [
    "LlamaForCausalLMEagle3"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "draft_vocab_size": 32000,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 16,
  "num_hidden_layers": 1,
  "num_key_value_heads": 4,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 1000000.0,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.3.0",
  "use_cache": true,
  "vocab_size": 151936
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8fa0d09b331bf195e4fa0079f3b4647ecd496487c40ac0dbdfa049bc5cc41a3e
size 290881376