Qwen3.5-397B-A17B-Uncensored-GGUF
The first comprehensive uncensored GGUF quantization suite for Qwen3.5-397B-A17B, Alibaba's flagship open-weight model that rivals GPT-5.2 on instruction-following benchmarks.
397B total parameters, 17B active per token (Mixture-of-Experts). Hybrid architecture: GatedDeltaNet linear attention + standard self-attention (every 4th layer) + 512 routed experts with 10 active per token. Natively multimodal (text + vision). 201 languages. Apache 2.0.
7 quantization levels from Q2_K to BF16. Single-file GGUFs: no splits, no merging required.
Why This Release
| | huihui-ai | timteh673 (this) |
|---|---|---|
| Quant levels | Q3_K only | BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
| File format | 21-part split (requires merge) | Single file per quant |
| Abliteration | Basic refusal removal | Custom residual + LoRA orthogonalization |
| Methodology | remove-refusals-with-transformers | Custom script targeting attn.o_proj + mlp.down_proj + shared experts |
Quantizations
| Quant | Size | BPW | RAM Required | Description | Use Case |
|---|---|---|---|---|---|
| BF16 | 739 GB | 16.01 | ~750 GB | Full precision | Reference, maximum quality |
| Q8_0 | 393 GB | 8.51 | ~400 GB | 8-bit | Best quality with compression |
| Q6_K | 304 GB | 6.57 | ~310 GB | 6-bit | High quality, good compression |
| Q5_K_M | 263 GB | 5.69 | ~270 GB | 5-bit mixed | Great balance |
| Q4_K_M | ~225 GB | ~4.85 | ~230 GB | 4-bit mixed | Recommended for most users |
| Q3_K_M | ~185 GB | ~4.0 | ~190 GB | 3-bit mixed | Memory-constrained setups |
| Q2_K | ~140 GB | ~3.0 | ~145 GB | 2-bit | Extreme compression |
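If you want to sanity-check these numbers or project a different quant, file size follows roughly from total parameter count × bits per weight. A minimal back-of-the-envelope sketch (an illustration only: it assumes ~397B quantized weights, takes the BPW values from the table above, and reports binary gigabytes, which is how the listed sizes appear to line up; GGUF metadata and the vision tensors are ignored):

```python
# Rough size estimate: total_params * bits_per_weight / 8, reported in GiB.
# Assumes ~397e9 quantized weights and the BPW figures from the table above;
# ignores GGUF metadata and non-quantized tensors, so results are approximate.
TOTAL_PARAMS = 397e9

def approx_size_gib(bits_per_weight: float) -> float:
    return TOTAL_PARAMS * bits_per_weight / 8 / 2**30

for name, bpw in [("Q8_0", 8.51), ("Q6_K", 6.57), ("Q4_K_M", 4.85), ("Q2_K", 3.0)]:
    print(f"{name}: ~{approx_size_gib(bpw):.0f} GiB")
```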
Architecture
- Type: Qwen3.5MoeForConditionalGeneration (hybrid GatedDeltaNet + MoE Transformer)
- Total Parameters: 397B
- Active Parameters: 17B per token
- Hidden Size: 4096
- Layers: 60
- Attention: 32 heads (GQA, 2 KV heads), head_dim 256
- Experts: 512 routed + shared expert, 10 active per token
- Expert FFN Size: 1024 (gate_up_proj → down_proj)
- Hybrid Attention: GatedDeltaNet linear attention + self-attention every 4th layer
- Linear Attention: 16 key heads (dim 128), 64 value heads (dim 128), conv kernel 4
- Context Length: 262,144 tokens
- Vocab Size: 248,320
- Multimodal: Native vision encoder (text + image + video)
- Languages: 201+ (en, zh, ja, ko, fr, de, es, pt, ru, ar, th, vi, id, ...)
- License: Apache 2.0
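To make the expert configuration above concrete, here is a toy sketch of top-k routing: each token's router logits pick 10 of the 512 experts, whose outputs are combined with renormalized weights. This is only an illustration, not the Qwen3.5 implementation; dimensions are shrunk so it runs instantly, and the shared expert, load balancing, and the packed expert tensor layout are omitted.

```python
import torch
import torch.nn.functional as F

# Toy top-k expert routing. Real Qwen3.5 shapes (hidden 4096, 512 experts,
# 10 active, expert FFN 1024) are shrunk here so the example runs instantly;
# only the routing logic is illustrated.
hidden_size, ffn_size, n_experts, top_k = 64, 16, 32, 4

router = torch.nn.Linear(hidden_size, n_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(hidden_size, ffn_size),
        torch.nn.SiLU(),
        torch.nn.Linear(ffn_size, hidden_size),
    )
    for _ in range(n_experts)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    logits = router(x)                                 # [tokens, n_experts]
    weights, idx = torch.topk(logits, top_k, dim=-1)   # top-k experts per token
    weights = F.softmax(weights, dim=-1)               # renormalize over the chosen k
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                        # naive per-token dispatch
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(3, hidden_size)).shape)  # torch.Size([3, 64])
```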
Abliteration Method
Custom abliteration pipeline executed on 8× NVIDIA H200 SXM5 GPUs (1.1 TB VRAM total):
- Model loaded in BF16 across 8 GPUs via `device_map="auto"` (~740 GB VRAM)
- LoRA adapters (rank 16, alpha 32) applied to 60 layers, targeting `out_proj`, `o_proj`, and `down_proj`
- 400 harmful + 400 harmless prompt pairs from mlabonne/harmful_behaviors and mlabonne/harmless_alpaca
- Residual computation via forward passes with batch_size=4 across all prompts
- Refusal direction identification per layer using mean difference of harmful vs harmless activations
- Orthogonal projection applied to remove refusal directions from weights
- Strength factor 20.0 for thorough refusal removal
- LoRA merged into base weights and saved as full-precision safetensors
Key challenge: Qwen3.5's packed expert tensor format (ffn_gate_exps [512, 1024, 4096], ffn_down_exps [512, 4096, 1024]) prevents standard per-expert abliteration. The shared expert (shared_expert.down_proj) was targeted directly. The packed routed experts may retain some refusal capacity; if you encounter persistent refusals, please open a discussion with example prompts.
Pipeline: custom abliteration → BF16 GGUF conversion (llama.cpp) → quantization cascade (Q8_0 → Q2_K)
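The per-layer direction-removal step described above can be sketched as follows. This illustrates the general mean-difference plus orthogonal-projection approach, not the custom script used for this release; the LoRA application, packed-expert handling, and strength-20 scaling are simplified, and with strength=1.0 the edit is an exact projection.

```python
import torch

# Illustrative core of directional ablation ("abliteration"), not this release's
# actual pipeline. Given per-layer activations collected from harmful and harmless
# prompts, compute a refusal direction as the mean difference and project it out
# of a target weight (e.g. o_proj, down_proj, or shared_expert.down_proj).
# `strength` loosely mirrors the strength factor mentioned above.

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # h_*: [n_prompts, hidden_size] activations at one layer
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def ablate_weight(W: torch.Tensor, direction: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    # W: [hidden_size, in_features]; rows index the dimension written to the residual stream.
    # Remove the component of each output that lies along the refusal direction.
    d = direction.to(W.dtype)
    return W - strength * torch.outer(d, d @ W)

# Tiny smoke test with random data
hidden = 8
dirn = refusal_direction(torch.randn(4, hidden), torch.randn(4, hidden))
W = torch.randn(hidden, hidden)
W_abl = ablate_weight(W, dirn, strength=1.0)
print((dirn @ W_abl).abs().max())  # ~0: the ablated weight no longer writes along the direction
```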
Usage
llama.cpp
# Recommended: Q4_K_M for balanced quality/memory
./llama-cli -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
-p "You are a helpful uncensored assistant." \
-n 512 --temp 0.7 --top-p 0.9
# With thinking mode enabled (default for Qwen3.5)
./llama-server -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
--port 8080 --host 0.0.0.0 -c 131072
LM Studio
Download the GGUF file and load it in LM Studio. The model supports both thinking and non-thinking modes via the enable_thinking parameter in the chat template.
Open WebUI / SillyTavern
Point your backend to a llama.cpp server running any of these quants. Full OpenAI-compatible API at /v1/chat/completions.
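As a quick smoke test of that endpoint, here is a minimal Python client sketch. The base URL and port are assumptions matching the llama-server example above, the API key is a placeholder, and the model name is typically cosmetic for a single-model llama-server.

```python
# Minimal client against a local llama-server instance (host/port assumed from
# the llama.cpp example above). Requires: pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.5-397B-A17B-Uncensored-Q4_K_M",  # name is usually not validated by llama-server
    messages=[{"role": "user", "content": "Summarize the GGUF file format in two sentences."}],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```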
Known Limitations
- Packed expert abliteration: The 512 routed experts use packed tensor format and were not individually abliterated. Some refusals may persist in edge cases.
- Vision: The multimodal vision encoder is preserved but untested post-abliteration. Text generation is the primary target.
- Thinking mode: The model retains `<think>` tag generation. Use `--reasoning-parser` in SGLang or strip the tags in post-processing if unwanted.
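If your frontend does not parse the reasoning tags, a simple post-processing pass is enough. A sketch, assuming the standard `<think>...</think>` format:

```python
import re

# Drop <think>...</think> reasoning blocks from a completion if they are unwanted.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_think("<think>working it out...</think>The answer is 42."))
# -> "The answer is 42."
```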
Model Provenance
- Base model: Qwen/Qwen3.5-397B-A17B (Apache 2.0)
- Abliteration technique: Inspired by FailSpy and Sumandora's remove-refusals-with-transformers, with custom extensions for Qwen3.5's hybrid MoE architecture
- Quantization: llama.cpp (build 8c60b8a)
- Hardware: 8× NVIDIA H200 SXM5, 1.1 TB VRAM
Disclaimer
⚠️ This model has had its safety alignment significantly reduced. It may generate content that is harmful, offensive, or inappropriate. Users are solely responsible for ensuring their use complies with applicable laws and ethical standards. This release is intended for research, testing, and controlled environments.
Support This Work
Every donation helps fund more open-weight model releases. ⚡ Forged on 8× NVIDIA H200 SXM5 | 1.1 TB VRAM
Crypto Donations
| Currency | Address |
|---|---|
| BTC | bc1p4q7vpwucvww2y3x4nhps4y4vekye8uwm9re5a0kx8l6u5nky5ucszm2qhh |
| ETH | 0xe5Aa16E53b141D42458ABeEDb00a157c3Fea2108 |
| SOL | 9CXwjG1mm9uLkxRevdMQiF61cr6TNHSiWtFRHmUEgzkG |
Enterprise & Custom Models
Need a custom 120B+ model aligned to your proprietary data? TIMTEH provides bespoke enterprise fine-tuning, abliteration, and deployment on 8× H200 SXM5.
- Custom fine-tuning on your data (up to 400B+ parameters)
- Private CARE abliteration (Phase 2 technique)
- Deployment architecture consulting (tensor parallelism, speculative decoding)
- Bespoke distillation datasets
Contact: tim@timlex.co
Part of the TIMTEH Cognitive Preservation Foundry: surgical capability preservation at scale. ⚡ Forged on 8× NVIDIA H200 SXM5 | 1.1 TB VRAM