Gemma 4 E2B-IT - Abliterated

Safety-alignment removed via surgical weight ablation for security research purposes.

This model is a modified version of google/gemma-4-E2B-it with the refusal/safety behavior surgically removed using activation-space analysis and targeted weight modification. It is intended exclusively for AI safety research, red-teaming, and understanding alignment vulnerabilities.

Key Results

| Metric | Value |
|---|---|
| Refusal Rate | 0.0% (down from ~80-100% baseline) |
| Quality Preservation (QPS) | 101% |
| Elo Delta | +37.6 |
| Iterations to Converge | 1 |
| Ablation Scale | 1.50 |
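
The refusal-rate metric above can be approximated with a simple keyword heuristic. This is a minimal sketch, not the actual evaluation harness (which is not published); `refusal_rate` and the marker list are illustrative assumptions, and real evaluations typically use an LLM judge instead of string matching.

```python
# Common refusal openers -- an illustrative, non-exhaustive list
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "as an ai")

def refusal_rate(responses):
    """Fraction of responses that begin with a common refusal phrase.
    A crude keyword heuristic; production pipelines use an LLM judge."""
    refused = sum(r.strip().lower().startswith(REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)

print(refusal_rate(["I cannot help with that.", "Sure, here is how..."]))  # 0.5
```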

Model Details

  • Base Model: google/gemma-4-E2B-it
  • Parameters: ~2B
  • Architecture: Dense
  • Text Layers: 35
  • Hidden Size: 1536
  • Model Size: 10 GB (bf16)

Ablation Methodology

This model was produced using a custom ablation pipeline that:

  1. Measures refusal directions -- Runs harmful and harmless prompts through the model, captures hidden states at every layer, and computes the per-layer refusal direction (mean difference vector)
  2. Identifies target layers -- Selects layers with the strongest refusal signal using statistical analysis (Gini coefficient, wall coherence, peak detection)
  3. Surgically ablates -- Removes the refusal direction from targeted weight matrices using orthogonal projection
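
The mean-difference computation in step 1 can be sketched as follows. This is a toy NumPy illustration on synthetic activations, assuming the card's 1536-dim hidden size; `refusal_direction` is a hypothetical helper, not the actual pipeline code.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Per-layer refusal direction: the unit-normalized mean-difference
    vector between harmful- and harmless-prompt hidden states."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Toy data: 32 prompts x 1536-dim hidden states (matching the card's hidden size)
rng = np.random.default_rng(0)
harmless = rng.normal(0.0, 1.0, size=(32, 1536))
harmful = harmless + 0.5  # harmful prompts shifted along a shared direction
d = refusal_direction(harmful, harmless)
print(d.shape)  # (1536,)
```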

Techniques applied: multi-layer, norm-preserving, projected, adaptive-scaling

Target layers: 24 of 35 total layers modified

Weight targets: o_proj, down_proj
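
The orthogonal projection in step 3 can be sketched like this. It is a minimal version of the core idea only: the card's norm-preserving and adaptive-scaling refinements are omitted, and `ablate_weight` is a hypothetical name. The sketch assumes the weight's rows live in the residual-stream (output) space, as they do for o_proj and down_proj.

```python
import numpy as np

def ablate_weight(W, d):
    """Project the refusal direction d out of weight matrix W.
    With W mapping inputs into the residual stream, W' = (I - d d^T) W
    leaves W unable to write any component along d."""
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d) @ W

rng = np.random.default_rng(1)
W = rng.normal(size=(1536, 1536))
d = rng.normal(size=1536)
W_ablated = ablate_weight(W, d)
# The ablated matrix can no longer produce output along d (up to float error):
print(np.abs(d @ W_ablated).max())
```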

Visualizations

Refusal Direction Analysis ("Security Perimeter")

The refusal signal magnitude at each layer -- red bars indicate where the model's safety behavior is concentrated.


Ablation Target Map

Which layers were selected for ablation and why. Grey zones are protected (embedding/output), red bars are targets.


Before/After Refusal Rate ("IDS Evasion Report")

Refusal rate comparison -- left is the original model, right is after ablation.


Weight Surgery Map

Heatmap showing exactly which weight matrices in which layers were modified.


Activation Space Analysis

PCA scatter plots showing harmful (red) vs. harmless (green) prompt clusters at different layer depths. The separation between the two clusters is exactly the refusal direction that ablation removes.
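
The cluster separation described above can be reproduced on synthetic activations. This sketch computes PCA via SVD in plain NumPy (rather than whatever tooling generated the actual plots) and checks that the top principal component separates the two prompt classes.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden = 1536
harmless = rng.normal(size=(64, hidden))
harmful = harmless + 0.6  # shifted along a shared "refusal" direction
X = np.vstack([harmless, harmful])

# PCA via SVD on centered activations
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T  # project onto the top two components

# The top component should cleanly separate the two clusters
gap = proj[64:, 0].mean() - proj[:64, 0].mean()
print(abs(gap) > 1.0)
```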


Latent Space Before/After

How the model's internal representation changes after ablation.


Quality Preservation

LLM-as-judge evaluation comparing response quality across 14 task categories.


Pairwise Win Rate

Head-to-head comparison: how often the abliterated model produces better responses than the original.


Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-E2B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Intended Use & Disclaimer

This model is released for security research and educational purposes only. It demonstrates the fragility of alignment in open-weight language models -- specifically, that safety behavior can be surgically removed without retraining, fine-tuning, or significant quality degradation.

This model should NOT be used for:

  • Generating harmful, illegal, or unethical content
  • Any production deployment
  • Circumventing safety measures in deployed systems

Key takeaway for defenders: Internal alignment is a feature, not a security boundary. External safety layers (classifiers, guardrails, policy filters) are more robust than baking safety into model weights alone.
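
The external-layering pattern above can be sketched as a thin wrapper. This is an illustrative toy only: `guarded_generate` and the keyword classifier are hypothetical stand-ins for a real moderation classifier or policy filter.

```python
def guarded_generate(generate_fn, classify_fn, prompt):
    """External guardrail: screen both the prompt and the response with an
    independent classifier instead of relying on in-weights alignment."""
    if classify_fn(prompt):
        return "[blocked by input filter]"
    response = generate_fn(prompt)
    if classify_fn(response):
        return "[blocked by output filter]"
    return response

# Toy stand-in classifier for illustration only
block = lambda text: "forbidden" in text.lower()
print(guarded_generate(lambda p: f"echo: {p}", block, "hello"))  # echo: hello
print(guarded_generate(lambda p: p, block, "forbidden thing"))   # [blocked by input filter]
```

Because the filters sit outside the model, they survive weight-level attacks like the ablation this card describes.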

Citation

Produced by WWT Cyber Lab. Standard pipeline ablation -- converged in 1 iteration.
