# Gemma 4 E2B-IT - Abliterated
Safety-alignment removed via surgical weight ablation for security research purposes.
This model is a modified version of google/gemma-4-E2B-it with the refusal/safety behavior surgically removed using activation-space analysis and targeted weight modification. It is intended exclusively for AI safety research, red-teaming, and understanding alignment vulnerabilities.
## Key Results
| Metric | Value |
|---|---|
| Refusal Rate | 0.0% (down from ~80-100% baseline) |
| Quality Preservation (QPS) | 101% |
| Elo Delta | +37.6 |
| Iterations to Converge | 1 |
| Ablation Scale | 1.50 |
## Model Details
- Base Model: google/gemma-4-E2B-it
- Parameters: ~2B
- Architecture: Dense
- Text Layers: 35
- Hidden Size: 1536
- Model Size: 10 GB (bf16)
## Ablation Methodology
This model was produced using a custom ablation pipeline that:
- Measures refusal directions -- Runs harmful and harmless prompts through the model, captures hidden states at every layer, and computes the per-layer refusal direction (mean difference vector)
- Identifies target layers -- Selects layers with the strongest refusal signal using statistical analysis (Gini coefficient, wall coherence, peak detection)
- Surgically ablates -- Removes the refusal direction from targeted weight matrices using orthogonal projection
- Techniques applied: multi-layer, norm-preserving, projected, adaptive-scaling
- Target layers: 24 of 35 total layers modified
- Weight targets: `o_proj`, `down_proj`
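The core of the method can be sketched in a few lines. This is a minimal NumPy illustration of the idea only: the actual pipeline operates on per-layer hidden states and torch weight tensors, and the function names here are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Per-layer refusal direction: the unit-normalized mean-difference
    vector between hidden states on harmful vs. harmless prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(W, d, scale=1.0):
    """Remove the refusal direction from a weight matrix's output space
    via orthogonal projection: W' = W - scale * d (d^T W)."""
    return W - scale * np.outer(d, d @ W)
```

With `scale=1.0` the refusal component of every column is exactly zeroed; scales above 1.0 (this card reports 1.50) over-subtract, pushing the weights past orthogonality with the refusal direction.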
## Visualizations

### Refusal Direction Analysis ("Security Perimeter")
The refusal signal magnitude at each layer -- red bars indicate where the model's safety behavior is concentrated.

### Ablation Target Map
Which layers were selected for ablation and why. Grey zones are protected (embedding/output); red bars are targets.

### Before/After Refusal Rate ("IDS Evasion Report")
Refusal rate comparison -- left is the original model, right is after ablation.

### Weight Surgery Map
Heatmap showing exactly which weight matrices in which layers were modified.

### Activation Space Analysis
PCA scatter plots showing harmful (red) vs. harmless (green) prompt clusters at different layer depths. The separation between the clusters *is* the refusal direction being removed.
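For intuition, this kind of view can be reproduced with a plain SVD-based PCA. A sketch only, not the pipeline's actual plotting code; synthetic data stands in for captured hidden states.

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Stand-in activations: "harmful" prompts shifted along one hidden dimension,
# mimicking a refusal signal concentrated in a single direction.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 64))
harmful = rng.normal(size=(50, 64))
harmful[:, 0] += 5.0

proj = pca_2d(np.vstack([harmful, harmless]))
# The offset between the two cluster centroids in PC space reflects the
# refusal direction that the ablation step projects out.
gap = np.linalg.norm(proj[:50].mean(axis=0) - proj[50:].mean(axis=0))
```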
### Latent Space Before/After
How the model's internal representation changes after ablation.

### Quality Preservation
LLM-as-judge evaluation comparing response quality across 14 task categories.

### Pairwise Win Rate
Head-to-head comparison: how often the abliterated model produces better responses than the original.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-E2B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use & Disclaimer
This model is released for security research and educational purposes only. It demonstrates the fragility of alignment in open-weight language models -- specifically, that safety behavior can be surgically removed without retraining, fine-tuning, or significant quality degradation.
This model should NOT be used for:
- Generating harmful, illegal, or unethical content
- Any production deployment
- Circumventing safety measures in deployed systems
**Key takeaway for defenders:** Internal alignment is a feature, not a security boundary. External safety layers (classifiers, guardrails, policy filters) are more robust than baking safety into model weights alone.
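As a minimal sketch of what such an external layer looks like (the classifier is a placeholder; `guarded_generate` and `classify_fn` are hypothetical names, not part of any library):

```python
def guarded_generate(prompt, generate_fn, classify_fn,
                     refusal="Request blocked by external policy filter."):
    """Enforce safety outside the model: a separate classifier screens the
    prompt and the output, independently of the model's own alignment --
    which, as this card demonstrates, can be stripped from the weights."""
    if classify_fn(prompt) == "unsafe":
        return refusal
    response = generate_fn(prompt)
    if classify_fn(response) == "unsafe":
        return refusal
    return response
```

Because the filter sits outside the weights, it keeps working even against an abliterated model.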
## Citation
Produced by WWT Cyber Lab. Standard pipeline ablation -- converged in 1 iteration.