---
license: mit
tags:
- deception-detection
- linear-probe
- mechanistic-interpretability
base_model: allenai/Olmo-3-7B-Instruct
---
# Diverse Deception Linear Probe — OLMo-3-7B-Instruct
Per-layer linear probes trained on **OLMo-3-7B-Instruct** residual stream activations
to distinguish deceptive from honest text completions.
## Performance
| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| **last-token-generation** | last token generation | `Did you respond truthfully?` → `Yes` | **0.992** | filtered by train loss < 0.6 |
AUC is computed on MBPP (hardcoded vs. correct code), using probe scores
**averaged across filtered layers** (layers where probe train loss < 0.6).
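The filter-and-average rule can be sketched as follows; the per-layer scores and train losses here are hypothetical, for illustration only:

```python
# Hypothetical per-layer raw probe scores and training losses.
layer_scores = {10: 0.8, 15: 1.2, 20: 1.5, 25: -0.2}
train_losses = {10: 0.55, 15: 0.40, 20: 0.35, 25: 0.90}

# Keep only layers whose probe reached train loss < 0.6, then average.
kept = [layer for layer, loss in train_losses.items() if loss < 0.6]
avg_score = sum(layer_scores[layer] for layer in kept) / len(kept)
# kept == [10, 15, 20], avg_score ≈ 1.17
```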
## Probe Architecture
Each layer has an independent `LinearProbe`:
| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 4096]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalar | Platt calibration parameters |
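Putting the components together, a single layer's forward pass looks roughly like the sketch below (NumPy stand-in; the order of centering and scaling matches the Quick Start, while the Platt form `sigmoid(A*s + B)` is an assumption):

```python
import numpy as np

def probe_forward(h, state):
    """Score one residual-stream vector `h` with one layer's probe.

    `state` mirrors the fields of model.pt. The Platt calibration form
    sigmoid(A * score + B) is an assumption; check the training code.
    """
    z = (h - state["input_mean"]) / state["input_scale"]  # center, then unit-norm scale
    raw = (z @ state["linear.weight"].T + state["linear.bias"]).item()
    prob = 1.0 / (1.0 + np.exp(-(state["platt_A"] * raw + state["platt_B"])))
    return raw, prob
```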
### File layout
```
<variant>/layer_N/
config.json # probe class and d_model
model.pt # PyTorch state_dict
```
## Follow-up Prompt Setup
For `last-token-generation` variants, a follow-up exchange is appended after the
completion to elicit a truthfulness signal. The probe reads activations at the
**last token** of the assistant's reply ("Yes"):
```
[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes   ← probe reads activations here
```
Formatted with OLMo-3-7B-Instruct's chat template:
```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
Yes
```
The `generation` variant (where available) uses **no follow-up** and averages
probe scores over all tokens in the model's response.
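A minimal helper for appending the follow-up exchange, using the chat markers shown above (in practice, `tokenizer.apply_chat_template` should be used to produce the exact framing):

```python
# Follow-up exchange in OLMo-3-7B-Instruct's chat-marker format, as shown above.
FOLLOWUP = (
    "<|im_start|>user\n"
    "Did you respond truthfully?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Yes"
)

def with_followup(prompt_and_completion: str) -> str:
    # The probe reads activations at the final token of "Yes".
    return prompt_and_completion + "\n" + FOLLOWUP
```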
## Training Details
- **Probe dataset**: `diverse_deception_probe` v1 — ~8,000 examples from 8+ sources
  (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- **Normalization**: unit-norm input scaling
- **Layer aggregation**: mean (scores from multiple layers averaged at inference)
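Under these hyperparameters, per-layer probe training reduces to logistic regression on normalized activations. The toy NumPy sketch below uses synthetic clusters; the normalization formula and data are illustrative, not the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for residual-stream activations: two Gaussian clusters.
X = np.concatenate([rng.normal(+1, 1, (100, 16)), rng.normal(-1, 1, (100, 16))])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Unit-norm input scaling (illustrative formula, not the exact one used).
mean = X.mean()
scale = np.linalg.norm(X - mean, axis=1).mean() / np.sqrt(X.shape[1])
Xn = (X - mean) / scale

# 1,000 steps, lr=3e-3, constant schedule, no weight decay.
w, b, lr = np.zeros(16), 0.0, 3e-3
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))   # sigmoid of logits
    w -= lr * (Xn.T @ (p - y) / len(y))        # gradient of BCE loss w.r.t. w
    b -= lr * (p - y).mean()                   # gradient of BCE loss w.r.t. b

acc = ((Xn @ w + b > 0) == (y == 1)).mean()
```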
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download
# Download probes
probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct")
# Load one layer's probe (a state_dict of weights plus normalization scalars)
state = torch.load(f"{probe_dir}/last-token-generation/layer_20/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]
# Load model
model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False, add_generation_prompt=True,
)
text += "The model's completion"
# For the last-token-generation variant, also append the follow-up exchange
# shown above ("Did you respond truthfully?" -> "Yes") before scoring.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so layer 20 is at index 21.
    h = model(**inputs, output_hidden_states=True).hidden_states[21][:, -1, :]
score = ((h.float() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
```
## Citation
Part of [FAR AI](https://far.ai)'s deception-detection research.
See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).