jianchen0311 committed on
Commit 0ef7624 · verified · 1 Parent(s): 485ed4f

Update README.md

Files changed (1): README.md +19 −9
README.md CHANGED
@@ -26,18 +26,33 @@ tags:
 
 ### Installation
 
+vLLM:
+```bash
+uv pip install vllm
+uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+```
+
+SGLang:
 ```bash
 uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
 ```
 
 ### Launch Server
 
-Use `--speculative-num-draft-tokens` to set the block size (8 or **16**).
+vLLM:
+```bash
+vllm serve Qwen/Qwen3-Coder-Next \
+  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3-Coder-Next-DFlash", "num_speculative_tokens": 15}' \
+  --attention-backend flash_attn \
+  --max-num-batched-tokens 32768
+```
 
+SGLang:
 ```bash
-export SGLANG_ENABLE_SPEC_V2=1
-export SGLANG_ENABLE_DFLASH_SPEC_V2=1
-export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+# Optional: enable schedule overlapping (experimental, may not be stable)
+# export SGLANG_ENABLE_SPEC_V2=1
+# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
 
 python -m sglang.launch_server \
   --model-path Qwen/Qwen3-Coder-Next \
@@ -50,7 +65,6 @@ python -m sglang.launch_server \
   --mamba-scheduler-strategy extra_buffer \
   --trust-remote-code
 ```
-
 > **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
 
 ### Usage
@@ -68,10 +82,6 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
-### vLLM
-
-Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.
-
 ## Acceptance Length
 
 - Max new tokens: 4096
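The vLLM launch command added in this commit passes `--speculative-config` as a JSON string. A minimal sketch of what that string decodes to, using only the three fields the diff actually sets (no other fields are claimed here):

```python
import json

# The exact JSON string from the `vllm serve` command in the diff above.
spec_config = json.loads(
    '{"method": "dflash", "model": "z-lab/Qwen3-Coder-Next-DFlash", '
    '"num_speculative_tokens": 15}'
)

print(spec_config["method"])                  # dflash
print(spec_config["num_speculative_tokens"])  # 15
```

Single quotes around the flag value matter in the shell: the JSON itself uses double quotes, so wrapping it in single quotes passes it through unmodified.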
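The "Acceptance Length" metric above comes from speculative decoding: per step, the drafter proposes a block of tokens (`num_speculative_tokens`), the target model verifies them, and the longest correct prefix plus one target-generated token is emitted. The toy sketch below is not the DFlash algorithm — both the "target" and the "drafter" are hypothetical deterministic stand-ins — but it shows how mean acceptance length is accounted:

```python
def target_next(seq):
    # Hypothetical deterministic "target model": next token = last + 1.
    return seq[-1] + 1

def drafter(seq, k):
    # Hypothetical drafter: mimics the target, but errs on every 4th position.
    out, cur = [], list(seq)
    for _ in range(k):
        tok = target_next(cur)
        if len(cur) % 4 == 0:
            tok += 100  # inject a draft error
        out.append(tok)
        cur.append(tok)
    return out

def spec_decode(prompt, k, steps):
    seq, accepted_per_step = list(prompt), []
    for _ in range(steps):
        draft, accepted, cur = drafter(seq, k), 0, list(seq)
        for tok in draft:
            if tok != target_next(cur):
                break  # first mismatch ends the accepted prefix
            cur.append(tok)
            accepted += 1
        # One verification pass always yields the accepted prefix
        # plus one token produced by the target itself.
        cur.append(target_next(cur))
        accepted_per_step.append(accepted + 1)
        seq = cur
    return seq, sum(accepted_per_step) / len(accepted_per_step)

seq, mean_accept = spec_decode([0], k=15, steps=8)
print(mean_accept)  # 4.0 tokens emitted per target forward pass
```

With a perfect drafter the acceptance length would approach `k + 1`; here the injected error every 4 tokens caps it at 4, illustrating why acceptance length, not `k`, governs the speedup.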
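The diff shows only fragments of the README's Usage section (`client.chat.completions.create(...)` and the final `print`). A minimal client sketch against the launched server's OpenAI-compatible endpoint — the port (30000), the `api_key` placeholder, and the `build_request` helper are assumptions, not from the README:

```python
def build_request(prompt):
    # Payload shape matching the chat.completions call in the README fragment.
    return {
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Write a hello-world in Python.")

if __name__ == "__main__":
    # Requires `pip install openai` and a running server; base_url is an
    # assumption -- match it to your launch flags.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    response = client.chat.completions.create(**req)
    print(response.choices[0].message.content)
```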