---
license: mit
tags:
- deception-detection
- linear-probe
- mechanistic-interpretability
base_model: allenai/Olmo-3-7B-Instruct
---

# Diverse Deception Linear Probe – OLMo-3-7B-Instruct
|
|
Per-layer linear probes trained on **OLMo-3-7B-Instruct** residual stream activations
to distinguish deceptive from honest text completions.
|
|
## Performance
|
|
| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| **last-token-generation** | last token of generation | `Did you respond truthfully?` → `Yes` | **0.992** | filtered by train loss < 0.6 |
|
|
|
|
AUC is computed on MBPP (hardcoded vs. correct code), using probe scores
**averaged across filtered layers** (layers where probe train loss < 0.6).
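
This filtered-mean aggregation can be sketched as follows. The function name and input format are illustrative, not taken from the released code:

```python
def aggregate_scores(scores_by_layer: dict, train_loss_by_layer: dict,
                     loss_threshold: float = 0.6) -> float:
    """Average per-layer probe scores over layers whose probe train loss
    is below the threshold (hypothetical helper, not the released API)."""
    kept = [layer for layer, loss in train_loss_by_layer.items()
            if loss < loss_threshold]
    return sum(scores_by_layer[layer] for layer in kept) / len(kept)
```

Layers whose probes fit the training data poorly are dropped entirely rather than down-weighted; only the surviving layers contribute to the final score.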
|
|
## Probe Architecture
|
|
Each layer has an independent `LinearProbe`:
|
|
| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 4096]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm input scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalars | Platt calibration parameters |
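
A minimal `LinearProbe` matching this state dict might look like the sketch below. The class body is reconstructed from the tensor shapes above, not the released implementation, and the Platt sign convention is an assumption:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical reconstruction of a per-layer probe from its state dict."""

    def __init__(self, d_model: int = 4096):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)
        # Scalars stored in the checkpoint: center and scale the input.
        self.register_buffer("input_mean", torch.tensor(0.0))
        self.register_buffer("input_scale", torch.tensor(1.0))
        # Platt calibration parameters (map raw scores to probabilities).
        self.register_buffer("platt_A", torch.tensor(1.0))
        self.register_buffer("platt_B", torch.tensor(0.0))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, d_model] residual-stream activations -> raw score [batch]
        return self.linear((h - self.input_mean) / self.input_scale).squeeze(-1)

    def probability(self, h: torch.Tensor) -> torch.Tensor:
        # Assumed Platt convention: p = sigmoid(A * score + B)
        return torch.sigmoid(self.platt_A * self.forward(h) + self.platt_B)
```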
|
|
### File layout
|
|
```
<variant>/layer_N/
  config.json   # probe class and d_model
  model.pt      # PyTorch state_dict
```
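
Given this layout, all layer probes for a variant can be gathered with a short helper. The function below is a sketch, not part of the repository:

```python
import json
import pathlib
import torch

def load_probes(variant_dir):
    """Load config.json and model.pt for every layer_N directory under variant_dir."""
    probes = {}
    for layer_dir in sorted(pathlib.Path(variant_dir).glob("layer_*")):
        layer = int(layer_dir.name.split("_")[1])
        config = json.loads((layer_dir / "config.json").read_text())
        state = torch.load(layer_dir / "model.pt", weights_only=False)
        probes[layer] = {"config": config, "state": state}
    return probes
```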
|
|
## Follow-up Prompt Setup
|
|
For `last-token-generation` variants, a follow-up exchange is appended after the
completion to elicit a truthfulness signal. The probe reads activations at the
**last token** of the assistant's reply ("Yes"):
|
|
```
[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes   ← probe reads activations here
```
|
|
Formatted with OLMo-3-7B-Instruct's chat template:
|
|
```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
Yes
```
|
|
The `generation` variant (where available) uses **no follow-up** and averages
probe scores over all tokens in the model's response.
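
The follow-up exchange can be built at the message level before applying the chat template. This helper is illustrative; the prompt string comes from the setup above:

```python
def append_followup(messages, completion):
    """Append the model's completion plus the truthfulness follow-up exchange.

    The resulting list can be rendered with the model's chat template; the
    probe then reads activations at the last token of the final "Yes".
    """
    return messages + [
        {"role": "assistant", "content": completion},
        {"role": "user", "content": "Did you respond truthfully?"},
        {"role": "assistant", "content": "Yes"},
    ]
```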
|
|
## Training Details
|
|
- **Probe dataset**: `diverse_deception_probe` v1 – ~8,000 examples from 8+ sources
  (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- **Normalization**: unit-norm input scaling
- **Layer aggregation**: mean (scores from multiple layers averaged at inference)
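
The recipe above amounts to logistic-regression training on normalized activations. A sketch under those hyperparameters follows; the exact normalization and optimizer are assumptions, and the released code may differ:

```python
import torch

def train_probe(acts, labels, steps=1000, lr=3e-3):
    """Train a linear probe with BCE loss, constant LR, and no weight decay.

    acts: [n, d_model] activations; labels: [n] with 1 = deceptive, 0 = honest.
    Inputs are centered and scaled (an assumed reading of "unit-norm input
    scaling": average activation norm is brought to roughly sqrt(d_model)).
    """
    mean = acts.mean()
    scale = acts.norm(dim=-1).mean() / acts.shape[-1] ** 0.5
    x = (acts - mean) / scale
    probe = torch.nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr, weight_decay=0.0)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(probe(x).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe
```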
|
|
## Quick Start
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

# Download probes
probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct")

# Load one layer's probe
state = torch.load(f"{probe_dir}/last-token-generation/layer_20/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False, add_generation_prompt=True,
)
text += "The model's completion"

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so layer 20 is index 21
    h = model(**inputs, output_hidden_states=True).hidden_states[21][:, -1, :]

score = ((h.float() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
```
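
The stored `platt_A` / `platt_B` parameters can map this raw score to a calibrated probability. The sign convention below (`p = sigmoid(A * score + B)`) is one common Platt-scaling form and is an assumption here, so verify it against the training code:

```python
import math

def platt_probability(score, platt_A, platt_B):
    """Map a raw probe score to a calibrated deception probability
    (assumed convention: p = sigmoid(platt_A * score + platt_B))."""
    return 1.0 / (1.0 + math.exp(-(platt_A * score + platt_B)))
```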
|
|
## Citation
|
|
Part of the [FAR AI](https://far.ai) deception detection research.
See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).
|
|