Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
This is a LoRA adapter that turns gemma-4-E2B-it into an activation oracle -- an LLM that can read and interpret the internal activations of other LLMs (or itself) in natural language.
An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them.
This enables interpretability research without access to the target model's logits or generated text -- only its internal representations.
Paper: Activation Oracles (arXiv:2512.15674)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-E2B-it-step-30000")
model.eval()
```
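To illustrate the steering-style injection mentioned above, here is a minimal sketch of adding a captured activation to a layer's output with a PyTorch forward hook. The toy module, the donor vector, and the chosen layer index are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer layer stack; a real oracle would inject
# into gemma's intermediate layers instead (assumption for illustration).
class TinyStack(nn.Module):
    def __init__(self, n_layers=4, d=8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = TinyStack()
# Pretend this came from the target model's hidden states.
captured_activation = torch.randn(1, 8)

def make_injection_hook(vec):
    # Returning a value from a forward hook replaces the layer's output,
    # so this adds the donor activation to the residual stream.
    def hook(module, inputs, output):
        return output + vec
    return hook

# Inject at 50% depth (layer index 2 of 4 in this toy model).
handle = model.layers[2].register_forward_hook(make_injection_hook(captured_activation))
out_injected = model(torch.zeros(1, 8))
handle.remove()
out_clean = model(torch.zeros(1, 8))
```

With the hook attached, the forward pass diverges from the clean one downstream of the injection layer, which is exactly the property the oracle relies on to "read" the donor activation.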
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Adapter | LoRA |
| Training tasks | LatentQA, classification, PastLens (next-token), SAE features |
| Activation injection | Steering vectors at intermediate layers |
| Layer coverage | 25%, 50%, 75% depth |
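The fractional depths in the table map to concrete layer indices in the obvious way; a quick sketch, where the layer count is illustrative rather than gemma-4-E2B-it's actual depth:

```python
# Convert fractional depths to layer indices.
# n_layers is an illustrative assumption, not the model's true layer count.
n_layers = 30
depths = [0.25, 0.50, 0.75]
injection_layers = [int(n_layers * d) for d in depths]
print(injection_layers)  # [7, 15, 22]
```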
The oracle is trained on a mixture of tasks:

- LatentQA
- Classification
- PastLens (next-token prediction)
- SAE features