# Qwen3.5-VL-397B CRACK REAP MLX 4-bit
Constrained Response Alignment Circuit Kill + Redundant Expert Automated Pruning
Real weight-level surgery on a hybrid SSM/Attention architecture, with the VL layers preserved.
No custom templates. No cheap jailbreaks. No pre-fill hacks. Pure mathematical weight surgery.
⚠️ Heretic and plain/standard abliteration DO NOT WORK on Qwen 3.5 397B: the hybrid SSM/Attention architecture routes around standard interventions via its SSM channels. This model was created with CRACK, an abliteration method researched specifically to account for the hybrid SSM pathways and the Vision-Language layers. It took several days of research and many failed experiments to find a working solution. I am not an ML researcher, just an amateur who spent several days and sleepless nights on this.
## What This Is
A truly abliterated and expert-pruned Qwen 3.5 VL 397B-A17B model — 4-bit quantized for Apple Silicon MLX.
To my knowledge, this is the largest abliterated MLX model available: 397B total parameters, 17B active per token, with full vision-language support. Built through more than 80 controlled experiments across 16 intervention paradigms.
- ✅ Real weight surgery — permanent modification of a small number of weight tensors (36 MB out of 175 GB), nothing else changed
- ✅ Full Vision-Language — processes images correctly, vision tower fully preserved
- ✅ Thinking ON/OFF — both modes work correctly, CoT reasoning fully preserved
- ✅ Full speed — ~38 tok/s on MLX (Mac Studio M3 Ultra 256GB)
- ✅ Expert-pruned — REAP removes 22.5% of experts (512→397), fits in 175 GB
- ✅ LM Studio compatible — works out of the box
- ✅ Standalone — no system prompts, no template tricks, just load and use
## What Does NOT Work on This Architecture
- ❌ Heretic-style abliteration — does not work on hybrid SSM/Attention
- ❌ Standard refusal vector projection on shared expert layers — kills CoT reasoning
- ❌ Chain-of-thought abliteration — hybrid SSM+CoT models maintain refusal through recurrent state
- ❌ Plain abliteration across all layers — the model routes around interventions via SSM channels
- ❌ Template tricks / pre-fill hacks — those are not real abliteration
- ❌ Direct Q4 weight surgery — quantization noise floor drowns the surgery signal
- ❌ Full model requantization after FP16 surgery — different Q4 grids lose the ablation signal entirely
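The last two points can be illustrated with a toy experiment. The sketch below (synthetic numbers only; `quantize_affine_4bit` is a hypothetical helper, not the actual MLX quantizer) compares the round-trip error of 4-bit affine quantization at group_size=64 against the magnitude of a small rank-1 weight edit of the kind directional abliteration applies. At these scales the quantization noise is larger per weight than the edit itself, which is why surgery must be done in FP16 and why the touched tensors cannot simply be re-quantized onto a fresh Q4 grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_affine_4bit(w, group_size=64):
    """Round-trip affine 4-bit quantization per group of 64 weights (toy version)."""
    w = w.reshape(-1, group_size)
    lo, hi = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    q = np.round((w - lo) / scale).clip(0, 15)
    return (q * scale + lo).reshape(-1)

# Toy weight matrix at a typical initialization scale (not real model weights).
d = 1024
W = rng.normal(0.0, 0.02, size=(d, d))

# Hypothetical rank-1 ablation edit: remove a unit direction v, i.e. W' = W - (W v) v^T.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
delta = -(W @ v)[:, None] * v[None, :]

# Compare the size of the edit against the quantization round-trip error.
W_q = quantize_affine_4bit(W.reshape(-1)).reshape(d, d)
quant_err = np.sqrt(np.mean((W - W_q) ** 2))
edit_mag = np.sqrt(np.mean(delta ** 2))
print(f"RMS quantization error: {quant_err:.2e}")
print(f"RMS ablation delta:     {edit_mag:.2e}")
```

With these toy parameters the quantization error comes out a few times larger than the per-weight edit, so an edit applied directly in Q4 is effectively buried in the noise floor.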
The CRACK method was developed through extensive research and specifically accounts for the hybrid SSM/Attention architecture (60 layers: 45 GatedDeltaNet SSM + 15 full attention) and the Vision-Language layers. It required understanding exactly which layers are responsible for refusal recall and how information flows between the SSM and full-attention pathways.
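For intuition, the 60-layer hybrid stack can be sketched as below. The uniform 3:1 interleave (one full-attention layer after every three GatedDeltaNet layers) is an assumption for illustration; the card only states the 45/15 split, not the exact layout.

```python
# Sketch of the hybrid stack described above: 45 GatedDeltaNet (SSM) layers
# and 15 full-attention layers. The 3:1 interleave is assumed, not confirmed.
NUM_LAYERS = 60
layout = [
    "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(NUM_LAYERS)
]

attention_layers = [i for i, kind in enumerate(layout) if kind == "full_attention"]
print(f"{layout.count('gated_deltanet')} SSM + {len(attention_layers)} attention layers")
```

Under this assumed layout, refusal signals that are suppressed in an attention layer can still be carried forward through the three SSM layers that follow it, which is consistent with the "routes around interventions via SSM channels" failure mode above.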
## Performance
| Metric | Value |
|---|---|
| Generation Speed | ~38 tok/s (M3 Ultra 256GB, MLX) |
| Prompt Processing | ~150-200 tok/s |
| Bits per Weight | 4-bit (group_size=64, affine) |
| Model Size | ~175 GB (93 shards) |
| Compliance | 5/5 tested prompts |
| Perplexity | 2.89 (+11.2% vs REAP baseline) |
| Thinking | ON/OFF both work |
| Vision | ✅ Full VL support |
| Experts | 397/layer (pruned from 512) |
## Usage with mlx-vlm

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("dealignai/Qwen3.5-VL-397B-A17B-REAP-CRACK")
config = load_config("dealignai/Qwen3.5-VL-397B-A17B-REAP-CRACK")

# Text generation (thinking ON by default)
prompt = apply_chat_template(processor, config, "Your prompt here")
output = generate(model, processor, prompt, max_tokens=500, verbose=True)

# Vision (with an image)
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
output = generate(model, processor, prompt, max_tokens=500, verbose=True, image=["path/to/image.png"])
```
## How This Model Was Modified
REAP (Redundant Expert Automated Pruning) removes the least-utilized 22.5% of MoE experts (512→397 per layer) using Cumulative Utilization Score profiling, reducing memory from ~220 GB to ~175 GB while preserving generation quality.
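The pruning step can be sketched as follows. This is a toy reconstruction on synthetic router outputs, not the REAP implementation: the exact definition of the Cumulative Utilization Score is not given here, so summed routing weight over profiled tokens is used as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS, KEEP = 512, 397      # drop 115 experts (~22.5%) per MoE layer
TOKENS, TOP_K = 10_000, 8         # profiling tokens; experts routed per token (assumed)

# Simulated router logits for one layer; the per-expert bias term makes some
# experts systematically more utilized than others, as in a real MoE.
logits = rng.gumbel(size=(TOKENS, NUM_EXPERTS)) + rng.normal(0, 2, NUM_EXPERTS)

# Top-k routing: each token sends weight to its TOP_K highest-scoring experts.
top = np.argsort(logits, axis=1)[:, -TOP_K:]
weights = np.take_along_axis(np.exp(logits), top, axis=1)
weights /= weights.sum(axis=1, keepdims=True)

# Stand-in for the Cumulative Utilization Score: total routing weight per expert.
cumulative_utilization = np.zeros(NUM_EXPERTS)
np.add.at(cumulative_utilization, top, weights)

# Keep the 397 most-utilized experts; the rest are removed from the checkpoint.
keep_ids = np.sort(np.argsort(cumulative_utilization)[-KEEP:])
print(f"kept {len(keep_ids)}/{NUM_EXPERTS} experts")
```

In the real pipeline the profiling pass runs on a calibration corpus, the router's expert index tables are remapped to the surviving experts, and the dropped experts' weight tensors are simply not written to the output shards, which is where the ~220 GB to ~175 GB reduction comes from.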
## Also Available
| Model | Type | Link |
|---|---|---|
| 397B REAP VL | 4-bit MLX, pruned, VL | dealignai/Qwen3.5-VL-397B-A17B-4bit-MLX-REAP |
| 397B REAP-CRACK (text-only) | 4-bit MLX, abliterated | dealignai/Qwen3.5-397B-A17B-REAP-CRACK |
| 397B REAP (text-only) | 4-bit MLX, pruned | dealignai/Qwen3.5-397B-A17B-4bit-MLX-REAP |
| 122B VL CRACK (4-bit) | 4-bit MLX, abliterated, VL | dealignai/Qwen3.5-VL-122B-A10B-4bit-CRACK |
## About
Built by Dealign.AI — independent research into MoE safety mechanisms and efficient inference on Apple Silicon.
See our research: Safety Generalization in Frontier MoE Models
Follow us: 𝕏 @dealignai
Base model: Qwen/Qwen3.5-VL-397B-A17B
## License
This model is released under the Apache License 2.0, consistent with the license of the original Qwen 3.5 VL base model. You are free to use, modify, and distribute it for both commercial and non-commercial purposes. It is provided "as-is" for research purposes, without warranty of any kind.
## Support dealignai
All models are built from original research and published for free. These models are specifically crafted to be excellent coders and general-purpose assistants.
Support us on Ko-fi — check out the Ko-fi membership for early access and extras.
Have questions or need help with a specific model? DM us — we help for free most of the time.
Ko-fi | X @dealignai | dealign.ai