Qwen3.5-397B-A17B-Uncensored-GGUF
The first comprehensive uncensored GGUF quantization suite for Qwen3.5-397B-A17B, Alibaba's flagship open-weight model that rivals GPT-5.2 on instruction-following benchmarks.
397B total parameters, 17B active per token (Mixture-of-Experts). Hybrid architecture: GatedDeltaNet linear attention + standard self-attention (every 4th layer) + 512 routed experts with 10 active per token. Natively multimodal (text + vision). 201 languages. Apache 2.0.
7 quantization levels from Q2_K to BF16. Single-file GGUFs: no splits, no merging required.
Why This Release
| | huihui-ai | timteh673 (this) |
|---|---|---|
| Quant levels | Q3_K only | BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
| File format | 21-part split (requires merge) | Single file per quant |
| Abliteration | Basic refusal removal | Custom residual + LoRA orthogonalization |
| Methodology | remove-refusals-with-transformers | Custom script targeting attn.o_proj + mlp.down_proj + shared experts |
Quantizations
| Quant | Size | BPW | RAM Required | Description | Use Case |
|---|---|---|---|---|---|
| BF16 | 739 GB | 16.01 | ~750 GB | Full precision | Reference, maximum quality |
| Q8_0 | 393 GB | 8.51 | ~400 GB | 8-bit | Best quality with compression |
| Q6_K | 304 GB | 6.57 | ~310 GB | 6-bit | High quality, good compression |
| Q5_K_M | 263 GB | 5.69 | ~270 GB | 5-bit mixed | Great balance |
| Q4_K_M | ~225 GB | ~4.85 | ~230 GB | 4-bit mixed | Recommended for most users |
| Q3_K_M | ~185 GB | ~4.0 | ~190 GB | 3-bit mixed | Memory-constrained setups |
| Q2_K | ~140 GB | ~3.0 | ~145 GB | 2-bit | Extreme compression |
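If you want to sanity-check these numbers or project a different quant, file size follows roughly from total parameter count × bits per weight. A minimal back-of-the-envelope sketch (an illustration only: it assumes ~397B quantized weights, takes the BPW values from the table above, and reports binary gigabytes, which is how the listed sizes appear to line up; GGUF metadata and the vision tensors are ignored):

```python
# Rough size estimate: total_params * bits_per_weight / 8, reported in GiB.
# Assumes ~397e9 quantized weights and the BPW figures from the table above;
# ignores GGUF metadata and non-quantized tensors, so results are approximate.
TOTAL_PARAMS = 397e9

def approx_size_gib(bits_per_weight: float) -> float:
    return TOTAL_PARAMS * bits_per_weight / 8 / 2**30

for name, bpw in [("Q8_0", 8.51), ("Q6_K", 6.57), ("Q4_K_M", 4.85), ("Q2_K", 3.0)]:
    print(f"{name}: ~{approx_size_gib(bpw):.0f} GiB")
```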
Architecture
- Type: Qwen3.5MoeForConditionalGeneration (hybrid GatedDeltaNet + MoE Transformer)
- Total Parameters: 397B
- Active Parameters: 17B per token
- Hidden Size: 4096
- Layers: 60
- Attention: 32 heads (GQA, 2 KV heads), head_dim 256
- Experts: 512 routed + shared expert, 10 active per token
- Expert FFN Size: 1024 (gate_up_proj → down_proj)
- Hybrid Attention: GatedDeltaNet linear attention + self-attention every 4th layer
- Linear Attention: 16 key heads (dim 128), 64 value heads (dim 128), conv kernel 4
- Context Length: 262,144 tokens
- Vocab Size: 248,320
- Multimodal: Native vision encoder (text + image + video)
- Languages: 201+ (en, zh, ja, ko, fr, de, es, pt, ru, ar, th, vi, id, ...)
- License: Apache 2.0
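To make the expert configuration above concrete, here is a toy sketch of top-k routing: each token's router logits pick 10 of the 512 experts, whose outputs are combined with renormalized weights. This is only an illustration, not the Qwen3.5 implementation; dimensions are shrunk so it runs instantly, and the shared expert, load balancing, and the packed expert tensor layout are omitted.

```python
import torch
import torch.nn.functional as F

# Toy top-k expert routing. Real Qwen3.5 shapes (hidden 4096, 512 experts,
# 10 active, expert FFN 1024) are shrunk here so the example runs instantly;
# only the routing logic is illustrated.
hidden_size, ffn_size, n_experts, top_k = 64, 16, 32, 4

router = torch.nn.Linear(hidden_size, n_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(hidden_size, ffn_size),
        torch.nn.SiLU(),
        torch.nn.Linear(ffn_size, hidden_size),
    )
    for _ in range(n_experts)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    logits = router(x)                                 # [tokens, n_experts]
    weights, idx = torch.topk(logits, top_k, dim=-1)   # top-k experts per token
    weights = F.softmax(weights, dim=-1)               # renormalize over the chosen k
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                        # naive per-token dispatch
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(3, hidden_size)).shape)  # torch.Size([3, 64])
```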
Abliteration Method
Custom abliteration pipeline executed on 8× NVIDIA H200 SXM5 GPUs (1.1 TB VRAM total):
- Model loaded in BF16 across 8 GPUs via `device_map="auto"` (~740 GB VRAM)
- LoRA adapters (rank 16, alpha 32) applied to 60 layers, targeting `out_proj`, `o_proj`, and `down_proj`
- 400 harmful + 400 harmless prompt pairs from mlabonne/harmful_behaviors and mlabonne/harmless_alpaca
- Residual computation via forward passes with batch_size=4 across all prompts
- Refusal direction identification per layer using mean difference of harmful vs harmless activations
- Orthogonal projection applied to remove refusal directions from weights
- Strength factor 20.0 for thorough refusal removal
- LoRA merged into base weights and saved as full-precision safetensors
Key challenge: Qwen3.5's packed expert tensor format (ffn_gate_exps [512, 1024, 4096], ffn_down_exps [512, 4096, 1024]) prevents standard per-expert abliteration. The shared expert (shared_expert.down_proj) was targeted directly. The packed routed experts may retain some refusal capacity; if you encounter persistent refusals, please open a discussion with example prompts.
Pipeline: custom abliteration → BF16 GGUF conversion (llama.cpp) → quantization cascade (Q8_0 → Q2_K)
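The per-layer direction-removal step described above can be sketched as follows. This illustrates the general mean-difference plus orthogonal-projection approach, not the custom script used for this release; the LoRA application, packed-expert handling, and strength-20 scaling are simplified, and with strength=1.0 the edit is an exact projection.

```python
import torch

# Illustrative core of directional ablation ("abliteration"), not this release's
# actual pipeline. Given per-layer activations collected from harmful and harmless
# prompts, compute a refusal direction as the mean difference and project it out
# of a target weight (e.g. o_proj, down_proj, or shared_expert.down_proj).
# `strength` loosely mirrors the strength factor mentioned above.

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # h_*: [n_prompts, hidden_size] activations at one layer
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def ablate_weight(W: torch.Tensor, direction: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    # W: [hidden_size, in_features]; rows index the dimension written to the residual stream.
    # Remove the component of each output that lies along the refusal direction.
    d = direction.to(W.dtype)
    return W - strength * torch.outer(d, d @ W)

# Tiny smoke test with random data
hidden = 8
dirn = refusal_direction(torch.randn(4, hidden), torch.randn(4, hidden))
W = torch.randn(hidden, hidden)
W_abl = ablate_weight(W, dirn, strength=1.0)
print((dirn @ W_abl).abs().max())  # ~0: the ablated weight no longer writes along the direction
```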
Usage
llama.cpp
# Recommended: Q4_K_M for balanced quality/memory
./llama-cli -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
-p "You are a helpful uncensored assistant." \
-n 512 --temp 0.7 --top-p 0.9
# With thinking mode enabled (default for Qwen3.5)
./llama-server -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
--port 8080 --host 0.0.0.0 -c 131072
LM Studio
Download the GGUF file and load it in LM Studio. The model supports both thinking and non-thinking modes via the enable_thinking parameter in the chat template.
Open WebUI / SillyTavern
Point your backend to a llama.cpp server running any of these quants. Full OpenAI-compatible API at /v1/chat/completions.
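As a quick smoke test of that endpoint, here is a minimal Python client sketch. The base URL and port are assumptions matching the llama-server example above, the API key is a placeholder, and the model name is typically cosmetic for a single-model llama-server.

```python
# Minimal client against a local llama-server instance (host/port assumed from
# the llama.cpp example above). Requires: pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.5-397B-A17B-Uncensored-Q4_K_M",  # name is usually not validated by llama-server
    messages=[{"role": "user", "content": "Summarize the GGUF file format in two sentences."}],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```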
Known Limitations
- Packed expert abliteration: The 512 routed experts use packed tensor format and were not individually abliterated. Some refusals may persist in edge cases.
- Vision: The multimodal vision encoder is preserved but untested post-abliteration. Text generation is the primary target.
- Thinking mode: The model retains `<think>` tag generation. Use `--reasoning-parser` in SGLang or strip the tags in post-processing if unwanted.
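If your frontend does not parse the reasoning tags, a simple post-processing pass is enough. A sketch, assuming the standard `<think>...</think>` format:

```python
import re

# Drop <think>...</think> reasoning blocks from a completion if they are unwanted.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_think("<think>working it out...</think>The answer is 42."))
# -> "The answer is 42."
```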
Model Provenance
- Base model: Qwen/Qwen3.5-397B-A17B (Apache 2.0)
- Abliteration technique: Inspired by FailSpy and Sumandora's remove-refusals-with-transformers, with custom extensions for Qwen3.5's hybrid MoE architecture
- Quantization: llama.cpp (build 8c60b8a)
- Hardware: 8× NVIDIA H200 SXM5, 1.1 TB VRAM
Disclaimer
⚠️ This model has had its safety alignment significantly reduced. It may generate content that is harmful, offensive, or inappropriate. Users are solely responsible for ensuring their use complies with applicable laws and ethical standards. This release is intended for research, testing, and controlled environments.
Support This Work
Every donation helps fund more open-weight model releases. ⚡ Forged on 8× NVIDIA H200 SXM5 | 1.1 TB VRAM
Crypto Donations
| Currency | Address |
|---|---|
| BTC | bc1p4q7vpwucvww2y3x4nhps4y4vekye8uwm9re5a0kx8l6u5nky5ucszm2qhh |
| ETH | 0xe5Aa16E53b141D42458ABeEDb00a157c3Fea2108 |
| SOL | 9CXwjG1mm9uLkxRevdMQiF61cr6TNHSiWtFRHmUEgzkG |
Enterprise & Custom Models
Need a custom 120B+ model aligned to your proprietary data? TIMTEH provides bespoke enterprise fine-tuning, abliteration, and deployment on 8× H200 SXM5.
- Custom fine-tuning on your data (up to 400B+ parameters)
- Private CARE abliteration (Phase 2 technique)
- Deployment architecture consulting (tensor parallelism, speculative decoding)
- Bespoke distillation datasets
Contact: tim@timlex.co
Part of the TIMTEH Cognitive Preservation Foundry: surgical capability preservation at scale. ⚡ Forged on 8× NVIDIA H200 SXM5 | 1.1 TB VRAM