---
license: mit
tags:
- deception-detection
- linear-probe
- mechanistic-interpretability
base_model: allenai/Olmo-3-7B-Instruct
---

# Diverse Deception Linear Probe – OLMo-3-7B-Instruct
|
|
Per-layer linear probes trained on **OLMo-3-7B-Instruct** residual stream activations
to distinguish deceptive from honest text completions.
|
|
## Performance
|
|
| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| **last-token-generation** | last token of generation | `Did you respond truthfully?` → `Yes` | **0.992** | filtered by train loss < 0.6 |
|
|
|
|
AUC is computed on MBPP (hardcoded vs. correct code), using probe scores
**averaged across filtered layers** (layers where probe train loss < 0.6).
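
This filtered-mean aggregation can be sketched as follows. The function name and input format are illustrative, not taken from the released code:

```python
def aggregate_scores(scores_by_layer: dict, train_loss_by_layer: dict,
                     loss_threshold: float = 0.6) -> float:
    """Average per-layer probe scores over layers whose probe train loss
    is below the threshold (hypothetical helper, not the released API)."""
    kept = [layer for layer, loss in train_loss_by_layer.items()
            if loss < loss_threshold]
    return sum(scores_by_layer[layer] for layer in kept) / len(kept)
```

Layers whose probes fit the training data poorly are dropped entirely rather than down-weighted; only the surviving layers contribute to the final score.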
|
|
## Probe Architecture
|
|
Each layer has an independent `LinearProbe`:
|
|
| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 4096]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm input scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalars | Platt calibration parameters |
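
A minimal `LinearProbe` matching this state dict might look like the sketch below. The class body is reconstructed from the tensor shapes above, not the released implementation, and the Platt sign convention is an assumption:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical reconstruction of a per-layer probe from its state dict."""

    def __init__(self, d_model: int = 4096):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)
        # Scalars stored in the checkpoint: center and scale the input.
        self.register_buffer("input_mean", torch.tensor(0.0))
        self.register_buffer("input_scale", torch.tensor(1.0))
        # Platt calibration parameters (map raw scores to probabilities).
        self.register_buffer("platt_A", torch.tensor(1.0))
        self.register_buffer("platt_B", torch.tensor(0.0))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, d_model] residual-stream activations -> raw score [batch]
        return self.linear((h - self.input_mean) / self.input_scale).squeeze(-1)

    def probability(self, h: torch.Tensor) -> torch.Tensor:
        # Assumed Platt convention: p = sigmoid(A * score + B)
        return torch.sigmoid(self.platt_A * self.forward(h) + self.platt_B)
```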
|
|
### File layout
|
|
```
<variant>/layer_N/
  config.json   # probe class and d_model
  model.pt      # PyTorch state_dict
```
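
Given this layout, all layer probes for a variant can be gathered with a short helper. The function below is a sketch, not part of the repository:

```python
import json
import pathlib
import torch

def load_probes(variant_dir):
    """Load config.json and model.pt for every layer_N directory under variant_dir."""
    probes = {}
    for layer_dir in sorted(pathlib.Path(variant_dir).glob("layer_*")):
        layer = int(layer_dir.name.split("_")[1])
        config = json.loads((layer_dir / "config.json").read_text())
        state = torch.load(layer_dir / "model.pt", weights_only=False)
        probes[layer] = {"config": config, "state": state}
    return probes
```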
|
|
## Follow-up Prompt Setup
|
|
For `last-token-generation` variants, a follow-up exchange is appended after the
completion to elicit a truthfulness signal. The probe reads activations at the
**last token** of the assistant's reply ("Yes"):
|
|
```
[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes   ← probe reads activations here
```
|
|
Formatted with OLMo-3-7B-Instruct's chat template:
|
|
```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
Yes
```
|
|
The `generation` variant (where available) uses **no follow-up** and averages
probe scores over all tokens in the model's response.
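
The follow-up exchange can be built at the message level before applying the chat template. This helper is illustrative; the prompt string comes from the setup above:

```python
def append_followup(messages, completion):
    """Append the model's completion plus the truthfulness follow-up exchange.

    The resulting list can be rendered with the model's chat template; the
    probe then reads activations at the last token of the final "Yes".
    """
    return messages + [
        {"role": "assistant", "content": completion},
        {"role": "user", "content": "Did you respond truthfully?"},
        {"role": "assistant", "content": "Yes"},
    ]
```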
|
|
## Training Details
|
|
- **Probe dataset**: `diverse_deception_probe` v1 – ~8,000 examples from 8+ sources
  (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- **Normalization**: unit-norm input scaling
- **Layer aggregation**: mean (scores from multiple layers averaged at inference)
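
The recipe above amounts to logistic-regression training on normalized activations. A sketch under those hyperparameters follows; the exact normalization and optimizer are assumptions, and the released code may differ:

```python
import torch

def train_probe(acts, labels, steps=1000, lr=3e-3):
    """Train a linear probe with BCE loss, constant LR, and no weight decay.

    acts: [n, d_model] activations; labels: [n] with 1 = deceptive, 0 = honest.
    Inputs are centered and scaled (an assumed reading of "unit-norm input
    scaling": average activation norm is brought to roughly sqrt(d_model)).
    """
    mean = acts.mean()
    scale = acts.norm(dim=-1).mean() / acts.shape[-1] ** 0.5
    x = (acts - mean) / scale
    probe = torch.nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr, weight_decay=0.0)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(probe(x).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe
```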
|
|
## Quick Start
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

# Download probes
probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct")

# Load one layer's probe
state = torch.load(f"{probe_dir}/last-token-generation/layer_20/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False, add_generation_prompt=True,
)
text += "The model's completion"

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so layer 20 is index 21
    h = model(**inputs, output_hidden_states=True).hidden_states[21][:, -1, :]

score = ((h.float() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
```
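
The stored `platt_A` / `platt_B` parameters can map this raw score to a calibrated probability. The sign convention below (`p = sigmoid(A * score + B)`) is one common Platt-scaling form and is an assumption here, so verify it against the training code:

```python
import math

def platt_probability(score, platt_A, platt_B):
    """Map a raw probe score to a calibrated deception probability
    (assumed convention: p = sigmoid(platt_A * score + platt_B))."""
    return 1.0 / (1.0 + math.exp(-(platt_A * score + platt_B)))
```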
|
|
## Citation
|
|
Part of the [FAR AI](https://far.ai) deception detection research.
See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).
|
|