# Qwen3.5-122B-A10B-abliterated-FP8

An FP8-quantized derivative of wangzhang/Qwen3.5-122B-A10B-abliterated, which is itself derived from Qwen/Qwen3.5-122B-A10B.
This repository provides a modified derivative checkpoint for local inference and serving. The primary changes in this repository are FP8 quantization, weight repacking / export formatting, and serving compatibility adjustments.
## Model Details
| Property | Value |
|---|---|
| Intermediate Base Model | wangzhang/Qwen3.5-122B-A10B-abliterated |
| Original Base Model | Qwen/Qwen3.5-122B-A10B |
| Architecture | Qwen3.5 MoE (256 routed experts, 10B active) |
| Quantization | FP8 |
| Original Size | 228 GB (BF16) |
| Quantized Size | 116 GB |
| Format | safetensors |
## Quantization Method

This model was quantized from the abliterated BF16 checkpoint to FP8, roughly halving its memory footprint (228 GB → 116 GB) while preserving compatibility with modern inference stacks.

### What Is Quantized
| Component | Format | Notes |
|---|---|---|
| Expert weights | FP8 | Quantized for reduced memory footprint |
| Attention projections | FP8 | Quantized where supported |
| Selected sensitive components | BF16 | Kept at higher precision where needed for stability |
| Embeddings / norms / control tensors | BF16 | Preserved at full precision |
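The split above amounts to a per-tensor precision policy. As a minimal sketch, it can be expressed as name-based matching; the patterns below are illustrative assumptions, not the exact rules used to produce this checkpoint.

```python
# Illustrative sketch: choose a serving dtype per tensor, following the split
# in the table above. The name patterns are assumptions, not the exact rules
# used to produce this checkpoint.

FP8_PATTERNS = ("experts.", "q_proj", "k_proj", "v_proj", "o_proj")
BF16_PATTERNS = ("embed_tokens", "norm", "lm_head", "router", "gate.")

def target_dtype(tensor_name: str) -> str:
    """Return the serving dtype for a tensor, keeping sensitive parts in BF16."""
    if any(p in tensor_name for p in BF16_PATTERNS):
        return "bf16"
    if any(p in tensor_name for p in FP8_PATTERNS):
        return "fp8"
    return "bf16"  # default to higher precision when unsure

if __name__ == "__main__":
    for name in (
        "model.layers.0.mlp.experts.3.down_proj.weight",
        "model.layers.0.self_attn.q_proj.weight",
        "model.embed_tokens.weight",
        "model.layers.0.input_layernorm.weight",
    ):
        print(name, "->", target_dtype(name))
```

Defaulting to BF16 when a name matches neither list mirrors the stability-first choice in the table: anything not explicitly known to tolerate FP8 stays at full precision.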
## Serving with vLLM
This model is intended for vLLM-based inference and may require tensor parallelism depending on available memory.
### Quick Start

```bash
# 1. Download the model
huggingface-cli download bjk110/Qwen3.5-122B-A10B-abliterated-FP8

# 2. Serve with vLLM
vllm serve /path/to/model \
  --served-model-name Qwen3.5-122B-A10B-abliterated-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --reasoning-parser qwen3
```
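Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch using only the standard library, assuming the default port 8000 and the `--served-model-name` from the command above:

```python
# Minimal sketch of a request to the vLLM OpenAI-compatible endpoint started
# above. The base URL assumes vLLM's default port; adjust to your deployment.
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Build a chat-completions request for the served model."""
    payload = {
        "model": "Qwen3.5-122B-A10B-abliterated-FP8",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending it (requires the server to be running):
#   with urllib.request.urlopen(build_request("Hello!")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```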
### Docker Entrypoint Auto-Patch

Add to the beginning of your `entrypoint.sh`:

```bash
if [ -f /patches/patch_qwen35_moe_text.py ]; then
  python3 /patches/patch_qwen35_moe_text.py || true
fi
```
Mount the patches volume in `docker-compose.yml`:

```yaml
volumes:
  - ./vllm_patches:/patches:ro
```
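For context, a minimal `docker-compose.yml` service wiring the mount into a vLLM container might look like the sketch below; the image tag, model path, and command-line arguments are illustrative assumptions, not a tested configuration.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest   # illustrative tag; pin a version in practice
    command: >
      --model /models/Qwen3.5-122B-A10B-abliterated-FP8
      --tensor-parallel-size 2
      --trust-remote-code
    volumes:
      - ./models:/models:ro
      - ./vllm_patches:/patches:ro
```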
### What the Patch Does

| Issue | Cause | Fix |
|---|---|---|
| `Qwen3_5MoeForCausalLM` not recognized | Not in vLLM registry | Registers a `TextOnlyShim` class |
| Hybrid cache page-size error | Bug in text-only CausalLM path | Reuses the multimodal wrapper's cache spec |
| Vision encoder init failure | Wrapper forces vision init | Skips vision encoder |
| TP2 block_k=128 error | `vision_config.hidden_size=1152` | Injects a dummy vision config |
The patch will become unnecessary once vLLM adds native support for `qwen3_5_moe_text`.
## Hardware Requirements
| Config | GPU Memory | Notes |
|---|---|---|
| TP=1 | ~115 GB | Requires GB200 or similar |
| TP=2 | ~58 GB/GPU | DGX Spark, H100×2, A100 80GB×2 |
| TP=4 | ~29 GB/GPU | A100 40GB×4 |
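The per-GPU figures above are roughly the FP8 checkpoint size divided by the tensor-parallel degree; a quick sanity check (this ignores KV cache, activations, and CUDA graph overhead, which add to these numbers):

```python
# Back-of-the-envelope check of the per-GPU weight memory in the table above.
# Ignores KV cache, activation, and CUDA graph overhead, which add to these figures.

MODEL_SIZE_GB = 116  # FP8 checkpoint size from the Model Details table

def weight_memory_per_gpu(tp: int) -> float:
    """Weights are sharded roughly evenly across tensor-parallel ranks."""
    return MODEL_SIZE_GB / tp

for tp in (1, 2, 4):
    print(f"TP={tp}: ~{weight_memory_per_gpu(tp):.0f} GB/GPU")
```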
## Base Model
wangzhang/Qwen3.5-122B-A10B-abliterated — uncensored via Prometheus abliteration (refusal rate 0.5%, i.e. 1/200 prompts; KL divergence 0.0115).
## License
Follows the license of the base model.