Thank you! Amazing
Thank you, it is great! Running with 32 GB of VRAM:
```bash
vllm serve \
  --max-num-seqs 1 \
  --pipeline-parallel-size 1 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --model /model/Qwopus3.5-27B-v3-AWQ-4bit \
  --served-model-name Qwen3.5-27B \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.925 \
  --max-model-len 131072 \
  --max-num-batched-tokens 10240 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --trust-remote-code
```
I turned the speculative config off because of a 0% draft acceptance rate, and things sped up considerably. Subagent tool calling is not reliable with the qwen3_coder tool parser: it sometimes fails to continue after a failed tool call and doesn't return results. I will probably try out hermes, or I've seen there is a qwen XML parser somewhere. Overall, it very reliably keeps moving even if a subagent call fails. Excellent Python code; it's rewriting my whole MCP tool server pretty well.
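For reference, trying the hermes parser is just a flag change on relaunch. A minimal sketch (the model path here is illustrative, and the set of available parsers depends on your vLLM version):

```shell
# Relaunch with the hermes tool-call parser instead of qwen3_coder;
# run `vllm serve --help` to list the parsers your build supports.
vllm serve /model/Qwopus3.5-27B-v3-AWQ-4bit \
  --served-model-name Qwen3.5-27B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```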
Thank you for using my model :)
My Qwopus3.5-27B-v3 quants do not have MTP layers, as the original Qwopus3.5-27B-v3 does not include an MTP implementation.
I am happy that this quant works well in some of your cases, but I would recommend the BF16-INT4 version, as it leaves the linear attention layers at BF16. Linear attention layers are notoriously prone to quantization error, and the 4-bit version quantizes them as well.
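You can usually see which layers a quant leaves at full precision by inspecting the checkpoint's `config.json`. A minimal sketch, using a hypothetical config excerpt; the real key names depend on the quantization toolkit (compressed-tensors uses an `ignore` list, AutoAWQ-style configs often use `modules_to_not_convert`):

```python
import json

# Hypothetical excerpt of a quantized checkpoint's config.json;
# the actual module names and keys vary by model and toolkit.
config_json = """
{
  "quantization_config": {
    "quant_method": "awq",
    "bits": 4,
    "modules_to_not_convert": ["linear_attn", "lm_head"]
  }
}
"""

config = json.loads(config_json)
qcfg = config["quantization_config"]
# Check both common key names for the exclusion list.
excluded = qcfg.get("modules_to_not_convert") or qcfg.get("ignore") or []

# Modules listed here stay at the checkpoint's original precision (e.g. BF16).
print(f"{qcfg['bits']}-bit {qcfg['quant_method']}, kept at full precision: {excluded}")
```

If the linear attention modules appear in that list, they were not quantized.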
Dear Owner,
Thanks for your contributions. I just noticed that you recommended the BF16-INT4 version in this discussion thread.
However, I found that this version cannot run on my RTX 5090 32 GB GPU when launched with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model ~/models/Qwen/cpatonn/Qwopus3.5-27B-v3-AWQ-BF16-INT4 \
    --dtype auto \
    --gpu-memory-utilization 0.92 \
    --max-model-len 24576 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 2048 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --port 8000 \
    --api-key $VLLM_API_KEY \
    --host 0.0.0.0 \
    --served-model-name Qwopus3.5-27B-v3-AWQ-BF16-INT4 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```
In this case, could you please advise which version you would recommend between the INT8-INT4 version and the 4-bit version?
Additionally, I noticed that only the INT8-INT4 version is listed with 19B parameters, while the other versions, including the 4-bit version, are listed with 28B parameters. Could you kindly explain the reason for this difference and how it should affect my choice?
Best regards
@PrincessGod thanks for using the model. What was the error that occurred? The parameter counts shown by Hugging Face (i.e., 19B and 28B) are usually incorrect for quantized models. In this case, the quantized model still has 27B parameters; they are just stored in a compacted form.
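To illustrate why naive element counts undercount packed weights (a generic sketch of 4-bit packing, not the exact storage layout these checkpoints use): eight 4-bit values fit in one 32-bit word, so a counter that tallies stored tensor elements sees one eighth of the true parameter count.

```python
# Generic illustration: packing 4-bit weights into 32-bit words.
# Eight 4-bit values fit in one word, so a naive element counter
# reports n/8 "parameters" for n actual weights.
def pack_int4(values):
    """Pack a list of 4-bit values (0..15) into 32-bit words."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words

weights = [7, 1, 15, 0, 3, 9, 2, 8] * 2   # 16 four-bit "parameters"
packed = pack_int4(weights)
print(len(weights), "weights packed into", len(packed), "words")
# A counter that tallies stored elements would report 2, not 16.
```

The same effect at scale is why a 27B-parameter checkpoint can show up as a much smaller "parameter" count on the model page, depending on how each version's tensors are stored.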