---
license: mit
tags:
- deception-detection
- linear-probe
- mechanistic-interpretability
base_model: allenai/Olmo-3-7B-Instruct
---

# Diverse Deception Linear Probe — OLMo-3-7B-Instruct

Per-layer linear probes trained on **OLMo-3-7B-Instruct** residual stream activations to distinguish deceptive from honest text completions.

## Performance

| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| **last-token-generation** | last token of generation | `Did you respond truthfully?` → `Yes` | **0.992** | filtered by train loss < 0.6 |

AUC is computed on MBPP (hardcoded vs. correct code), using probe scores **averaged across filtered layers** (layers where probe train loss < 0.6).

## Probe Architecture

Each layer has an independent `LinearProbe`:

| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 4096]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalar | Platt calibration parameters |

### File layout

```
/layer_N/
  config.json   # probe class and d_model
  model.pt      # PyTorch state_dict
```

## Follow-up Prompt Setup

For `last-token-generation` variants, a follow-up exchange is appended after the completion to elicit a truthfulness signal. The probe reads activations at the **last token** of the assistant's reply ("Yes"):

```
[main prompt + completion]
Did you respond truthfully?
Yes   ← probe reads activations here
```

Formatted with OLMo-3-7B-Instruct's chat template:

```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
Yes
```

The `generation` variant (where available) uses **no follow-up** and averages probe scores over all tokens in the model's response.
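The `platt_A` and `platt_B` parameters in the architecture table can map a raw probe score to a calibrated probability via standard Platt scaling. A minimal sketch follows; the sign convention (a sigmoid over `A * score + B`) is an assumption, so verify it against the probe's config if probabilities look inverted:

```python
import math

def platt_calibrate(score: float, platt_A: float, platt_B: float) -> float:
    """Convert a raw probe score to a probability with Platt scaling.

    Assumes the convention p = sigmoid(A * score + B); check the probe
    implementation before relying on the direction of the mapping.
    """
    return 1.0 / (1.0 + math.exp(-(platt_A * score + platt_B)))
```

With `A = 1` and `B = 0` this reduces to a plain sigmoid, so a raw score of 0 maps to probability 0.5.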
## Training Details

- **Probe dataset**: `diverse_deception_probe` v1 — ~8,000 examples from 8+ sources (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- **Normalization**: unit-norm input scaling
- **Layer aggregation**: mean (scores from multiple layers averaged at inference)

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

# Download probes
probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct")

# Load one layer's probe
state = torch.load(f"{probe_dir}/last-token-generation/layer_20/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False,
    add_generation_prompt=True,
)
text += "The model's completion"

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so layer 20's residual
    # stream is at index 21; take the last token position
    h = model(**inputs, output_hidden_states=True).hidden_states[21][:, -1, :]

score = ((h.float() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
```

## Citation

Part of deception detection research at [FAR AI](https://far.ai). See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).
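The Quick Start above scores a single layer; the reported AUC instead uses the mean score across filtered layers ("Layer aggregation: mean"). A minimal sketch of that aggregation, assuming the same state-dict keys as in Quick Start (the helper and its argument layout are hypothetical):

```python
import torch

def aggregate_layer_scores(hidden_states, probes):
    """Average per-layer probe scores at the last token position.

    hidden_states: tuple of [batch, seq, d_model] tensors from a forward
        pass with `output_hidden_states=True` (index 0 is the embedding
        output, so layer L lives at index L + 1).
    probes: dict mapping layer index -> loaded probe state_dict with keys
        `linear.weight`, `linear.bias`, `input_scale`, `input_mean`.
    """
    scores = []
    for layer, state in probes.items():
        h = hidden_states[layer + 1][:, -1, :].float()
        w, b = state["linear.weight"], state["linear.bias"]
        normed = (h - state["input_mean"]) / state["input_scale"]
        scores.append((normed @ w.T + b).item())
    return sum(scores) / len(scores)
```

In practice `probes` would hold only the layers that pass the train-loss filter (train loss < 0.6), matching how the averaged AUC is computed.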