Fix target model references: Qwen3-Next-80B-A3B-Instruct -> Qwen3-Coder-Next; fix narrow-tree Terminal-Bench EAGLE3 tok/s and the resulting mean; remove internal comment

Files changed (1)

README.md +7 -9

README.md CHANGED

@@ -3,7 +3,7 @@ library_name: transformers
 license: apache-2.0
 language:
   - en
-base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
 pipeline_tag: text-generation
 tags:
   - eagle3
@@ -17,11 +17,9 @@ tags:
   - code
 ---
-<!-- Internal: exp-e (gpu/qwen3-coder-next) -->
 # EAGLE3 Draft Head — Qwen3-Coder-Next
-A lightweight EAGLE3 draft head for [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.
 Qwen3-Coder-Next uses a hybrid layer design that interleaves standard multi-head attention with GDN (linear recurrence) layers. Only 12 of 48 layers are attention layers (every 4th: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from attention layers only — GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically, selecting layers 3, 23, 47 (first, middle, last attention layers).
@@ -39,7 +37,7 @@ Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for Qwen3-Coder-
 pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'
 python -m sglang.launch_server \
-    --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
     --speculative-num-steps 3 \
@@ -55,7 +53,7 @@ python -m sglang.launch_server \
 ```bash
 python -m sglang.launch_server \
-    --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
     --speculative-num-steps 5 \
@@ -133,10 +131,10 @@ GDN (linear recurrence) layers are excluded from auxiliary layer selection becau
 | Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
 |---------|-----------------|----------------|---------|
 | MT-Bench | 1,529.1 | 1,688.6 | **1.10x** |
-| Terminal-Bench | 2,310.5 | 1,785.4 | **1.03x** |
 | HumanEval | 1,740.2 | 1,756.3 | **1.01x** |
 | SWEBench-Verified | 2,010.4 | 1,998.7 | **1.00x** |
-| **Mean** | **1,897.5** | **1,807.3** | **1.03x** |
 *Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit `63291f7f51`.*
@@ -165,7 +163,7 @@ GDN (linear recurrence) layers are excluded from auxiliary layer selection becau
 ## License
-This draft head is released under Apache 2.0, matching the [Qwen3-Coder-Next license](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct).
 ## Citation

 license: apache-2.0
 language:
   - en
+base_model: Qwen/Qwen3-Coder-Next
 pipeline_tag: text-generation
 tags:
   - eagle3
   - code
 ---
 # EAGLE3 Draft Head — Qwen3-Coder-Next
+A lightweight EAGLE3 draft head for [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.
 Qwen3-Coder-Next uses a hybrid layer design that interleaves standard multi-head attention with GDN (linear recurrence) layers. Only 12 of 48 layers are attention layers (every 4th: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from attention layers only — GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically, selecting layers 3, 23, 47 (first, middle, last attention layers).
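The layer-selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `attention_layer_indices` and `pick_aux_layers` are hypothetical and are not taken from SpecForge or the model code, layers are assumed to be 0-indexed, and the "middle" layer is assumed to be the lower of the two central attention layers, since that choice reproduces the layer 23 quoted in the text.

```python
# Hypothetical sketch of the auxiliary-layer selection described in the README.
# Names and the exact "middle" rule are assumptions, not the SpecForge API.

NUM_LAYERS = 48          # total decoder layers (from the README)
ATTENTION_INTERVAL = 4   # every 4th layer is an attention layer (from the README)


def attention_layer_indices(num_layers=NUM_LAYERS, interval=ATTENTION_INTERVAL):
    """Return 0-indexed attention layers: 3, 7, 11, ..., 47 for the defaults."""
    return [i for i in range(num_layers) if (i + 1) % interval == 0]


def pick_aux_layers(attn_layers):
    """Pick the first, lower-middle, and last attention layers."""
    middle = attn_layers[(len(attn_layers) - 1) // 2]
    return [attn_layers[0], middle, attn_layers[-1]]


attn = attention_layer_indices()
aux = pick_aux_layers(attn)
print("attention layers:", attn)
print("auxiliary layers:", aux)
```

With 12 attention layers there is no single central element; picking the upper of the two central layers would give 27 instead of 23, so the lower-middle rule here is inferred from the README's numbers rather than confirmed by the source.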
 pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'
 python -m sglang.launch_server \
+    --model-path Qwen/Qwen3-Coder-Next \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
     --speculative-num-steps 3 \
 ```bash
 python -m sglang.launch_server \
+    --model-path Qwen/Qwen3-Coder-Next \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
     --speculative-num-steps 5 \
 | Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
 |---------|-----------------|----------------|---------|
 | MT-Bench | 1,529.1 | 1,688.6 | **1.10x** |
+| Terminal-Bench | 2,310.5 | 2,379.8 | **1.03x** |
 | HumanEval | 1,740.2 | 1,756.3 | **1.01x** |
 | SWEBench-Verified | 2,010.4 | 1,998.7 | **1.00x** |
+| **Mean** | **1,897.5** | **1,955.9** | **1.03x** |
 *Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit `63291f7f51`.*
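As a sanity check on the corrected rows, the speedup column can be recomputed from the throughput columns. The sketch below assumes speedup is simply EAGLE3 tok/s divided by baseline tok/s and that the Mean row averages each throughput column before taking the ratio; neither definition is stated in the README, and the variable names are illustrative only.

```python
# Recompute the speedup column from the throughput values in the table.
# Assumption: speedup = EAGLE3 tok/s / baseline tok/s (not stated in the README).

rows = {
    "MT-Bench": (1529.1, 1688.6),
    "Terminal-Bench": (2310.5, 2379.8),
    "HumanEval": (1740.2, 1756.3),
    "SWEBench-Verified": (2010.4, 1998.7),
}


def speedup(baseline, eagle3):
    """Throughput ratio of EAGLE3 over the baseline."""
    return eagle3 / baseline


speedups = {name: speedup(b, e) for name, (b, e) in rows.items()}

# Column means, then the ratio of the means (assumed definition of the Mean row).
mean_baseline = sum(b for b, _ in rows.values()) / len(rows)
mean_eagle3 = sum(e for _, e in rows.values()) / len(rows)
mean_speedup = mean_eagle3 / mean_baseline

# The Terminal-Bench EAGLE3 value before this commit, for comparison.
old_terminal_bench = speedup(2310.5, 1785.4)

for name, s in speedups.items():
    print(f"{name}: {s:.3f}x")
print(f"mean: {mean_baseline:.2f} -> {mean_eagle3:.2f} ({mean_speedup:.3f}x)")
print(f"Terminal-Bench with old value: {old_terminal_bench:.3f}x")
```

Under these assumptions the previous Terminal-Bench value of 1,785.4 gives roughly 0.77x, which contradicts the listed 1.03x, while the corrected 2,379.8 matches it. The SWEBench-Verified ratio comes out near 0.994, slightly below the 1.00x shown, so that row may have been computed from unrounded throughputs.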
 ## License
+This draft head is released under Apache 2.0, matching the [Qwen3-Coder-Next license](https://huggingface.co/Qwen/Qwen3-Coder-Next).
 ## Citation