# Step 3.5 Flash REAP-149B — CRACK Abliterated (4-bit MLX)
Step 3.5 Flash 149B (REAP-pruned) with refusal behavior removed via CRACK surgery.
## What Is This?
Step 3.5 Flash by StepFun, pruned to 149B via Cerebras REAP (25% expert reduction), with CRACK abliteration — safety guardrails permanently removed at the weight level.
This is the larger REAP variant with 216 experts (vs 121B's 173), retaining more of the original model's capacity. Best balance of quality and size — fits M4 Max 128GB and M3 Ultra 256GB.
| Spec | Detail |
|---|---|
| Architecture | Step 3.5 Flash MoE — 149B total, 216 experts (REAP-pruned from 288), 8 active |
| Active Parameters | ~11B per token |
| Quantization | 4-bit (group_size=64, router gates at 8-bit) |
| Disk Size | 78 GB |
| Speed | 48 tok/s on M4 Max 128GB |
| Abliteration | Permanent weight surgery via CRACK |
| RAM Required | 128 GB unified memory |
| Context | 262,144 tokens |
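The listed disk size follows from the quantization settings above. A back-of-envelope check (the per-group scale/bias overhead figure is an assumption, not from the model card):

```python
# Rough disk-size estimate for 4-bit quantization with group_size=64.
TOTAL_PARAMS = 149e9           # 149B total parameters
BITS_PER_WEIGHT = 4            # 4-bit quantization
GROUP_SIZE = 64                # per spec table
SCALE_BITS = 16                # assumed fp16 scale and bias per group

overhead = 2 * SCALE_BITS / GROUP_SIZE        # amortized bits per weight
size_gb = TOTAL_PARAMS * (BITS_PER_WEIGHT + overhead) / 8 / 1e9
print(f"{size_gb:.0f} GB")     # 84 GB — same ballpark as the listed 78 GB
```

The gap to 78 GB is plausibly explained by some tensors (e.g. router gates at 8-bit, embeddings) being quantized differently.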
Note: This model requires `trust_remote_code=True` due to the custom `step3p5` model architecture.
## Test Results
Tested with greedy decoding (temp=0) across 16 harmful + 16 harmless prompts from the HarmBench dataset.
| Category | Result |
|---|---|
| Compliance (16 harmful prompts) | ✅ 15/16 |
| Coherence (16 harmless prompts) | ✅ 16/16 |
| Chain-of-thought | ✅ <think> reasoning preserved |
| Code generation | ✅ Working implementations |
| Knowledge | ✅ Accurate factual responses |
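Compliance counts like those above are typically scored by flagging refusal boilerplate in each greedy-decoded response. A minimal sketch (the marker list is an illustrative assumption, not the actual evaluation harness):

```python
# Score responses as "complying" when no refusal boilerplate appears near
# the start of the output. Markers here are illustrative, not exhaustive.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(text: str) -> bool:
    head = text.lower()[:200]        # refusals usually open the response
    return any(m in head for m in REFUSAL_MARKERS)

def compliance(outputs: list[str]) -> int:
    """Count non-refusing responses."""
    return sum(not is_refusal(o) for o in outputs)

outputs = ["Sure, here are the steps...",
           "I'm sorry, but I can't help with that."]
print(compliance(outputs))   # 1 of the 2 responses complies
```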
## Features
- Full chain-of-thought: `<think>` tags for step-by-step reasoning (can be toggled)
- Dual attention: full attention plus a 512-token sliding window for efficient long context
- Sigmoid MoE routing: Smooth expert selection with learned bias
- SwiGLU activation clamping: Prevents output explosion in deep layers
- More experts: 216 experts (vs 121B's 173) — retains more of the original model's knowledge
- REAP pruning: 25% expert reduction (288→216) with minimal quality loss
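The sigmoid routing described above can be sketched as follows. Shapes and the renormalization step are assumptions about the `step3p5` architecture, not taken from its code:

```python
import numpy as np

# Sigmoid MoE routing with a learned per-expert bias: scores come from a
# sigmoid (not softmax), the bias steers load balancing during selection,
# and the top 8 of 216 experts are kept.
N_EXPERTS, TOP_K = 216, 8

def route(logits: np.ndarray, bias: np.ndarray, k: int = TOP_K):
    scores = 1.0 / (1.0 + np.exp(-logits))      # independent sigmoid scores
    top = np.argsort(scores + bias)[-k:]        # bias affects selection only
    weights = scores[top] / scores[top].sum()   # renormalize raw scores
    return top, weights

rng = np.random.default_rng(0)
experts, w = route(rng.normal(size=N_EXPERTS), np.zeros(N_EXPERTS))
print(len(experts), w.sum())   # 8 experts selected, weights sum to 1
```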
## Usage

### With mlx-lm
```python
import os
os.environ["TRUST_REMOTE_CODE"] = "1"

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
### With vMLX
Download and load directly in vMLX — no code needed.
## Method
CRACK (Controlled Refusal Ablation via Calibrated Knockout) identifies refusal-encoding directions in model weight space through contrastive probing, then surgically removes them using orthogonal projection. The surgery is applied to BF16 weights before quantization, ensuring the modification is baked into the final quantized model permanently.
No LoRA. No system prompts. No runtime hooks. Pure weight-level modification.
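The orthogonal-projection step can be sketched as follows. The matrix and refusal direction here are toy stand-ins, not CRACK's calibrated directions:

```python
import numpy as np

# Project a refusal direction r out of a weight matrix W: after the edit,
# W's outputs carry no component along r, i.e. W' = (I - r r^T) W.
def ablate(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    r = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r, r) @ W      # remove the component along r

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))          # toy weight matrix
r = rng.normal(size=16)                # toy refusal direction
W_ablated = ablate(W, r)

# Every column of the edited matrix is now orthogonal to r.
print(np.abs(r @ W_ablated).max() < 1e-6)   # True
```

Because the edit is a plain matrix subtraction applied before quantization, the quantized checkpoint never contains the original weights, which is why no runtime hook can restore the refusal behavior.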
## Model Family

### 149B Variants (this model)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 149B Q4 | 4-bit | 78 GB | 128 GB | ✅ |
| 149B Q6 | 6-bit | 113 GB | 256 GB | ✅ |
| 149B Q8 | 8-bit | 148 GB | 256 GB | ✅ |
### 121B Variants (lighter, faster)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 121B Q4 | 4-bit | 63 GB | 128 GB | ✅ |
| 121B Q6 | 6-bit | 92 GB | 256 GB | ✅ |
| 121B Q8 | 8-bit | 120 GB | 256 GB | ✅ |
## Credits
- StepFun — Step 3.5 Flash base model
- Cerebras — REAP expert pruning
- dealign.ai — CRACK abliteration surgery
Disclaimer: This model has safety guardrails removed. It will comply with requests that the original model would refuse. Users are responsible for how they use this model. Released for research purposes.