taufeeque committed · verified
Commit 878c252 · 1 Parent(s): cf6d521

Upload diverse deception linear probes for OLMo-3-32B-Think

This view is limited to 50 files because it contains too many changes.
Files changed (50):
  1. README.md +111 -0
  2. averaged_over_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.txt +1 -0
  3. averaged_violin_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.png +0 -0
  4. f1_and_recall_plot_initial_detector.base_model.on_dataset.png +0 -0
  5. last-token-generation/layer_0/config.json +9 -0
  6. last-token-generation/layer_0/model.pt +3 -0
  7. last-token-generation/layer_1/config.json +9 -0
  8. last-token-generation/layer_1/model.pt +3 -0
  9. last-token-generation/layer_10/config.json +9 -0
  10. last-token-generation/layer_10/model.pt +3 -0
  11. last-token-generation/layer_11/config.json +9 -0
  12. last-token-generation/layer_11/model.pt +3 -0
  13. last-token-generation/layer_12/config.json +9 -0
  14. last-token-generation/layer_12/model.pt +3 -0
  15. last-token-generation/layer_13/config.json +9 -0
  16. last-token-generation/layer_13/model.pt +3 -0
  17. last-token-generation/layer_14/config.json +9 -0
  18. last-token-generation/layer_14/model.pt +3 -0
  19. last-token-generation/layer_15/config.json +9 -0
  20. last-token-generation/layer_15/model.pt +3 -0
  21. last-token-generation/layer_16/config.json +9 -0
  22. last-token-generation/layer_16/model.pt +3 -0
  23. last-token-generation/layer_17/config.json +9 -0
  24. last-token-generation/layer_17/model.pt +3 -0
  25. last-token-generation/layer_18/config.json +9 -0
  26. last-token-generation/layer_18/model.pt +3 -0
  27. last-token-generation/layer_19/config.json +9 -0
  28. last-token-generation/layer_19/model.pt +3 -0
  29. last-token-generation/layer_2/config.json +9 -0
  30. last-token-generation/layer_2/model.pt +3 -0
  31. last-token-generation/layer_20/config.json +9 -0
  32. last-token-generation/layer_20/model.pt +3 -0
  33. last-token-generation/layer_21/config.json +9 -0
  34. last-token-generation/layer_21/model.pt +3 -0
  35. last-token-generation/layer_22/config.json +9 -0
  36. last-token-generation/layer_22/model.pt +3 -0
  37. last-token-generation/layer_23/config.json +9 -0
  38. last-token-generation/layer_23/model.pt +3 -0
  39. last-token-generation/layer_24/config.json +9 -0
  40. last-token-generation/layer_24/model.pt +3 -0
  41. last-token-generation/layer_25/config.json +9 -0
  42. last-token-generation/layer_25/model.pt +3 -0
  43. last-token-generation/layer_26/config.json +9 -0
  44. last-token-generation/layer_26/model.pt +3 -0
  45. last-token-generation/layer_27/config.json +9 -0
  46. last-token-generation/layer_27/model.pt +3 -0
  47. last-token-generation/layer_28/config.json +9 -0
  48. last-token-generation/layer_28/model.pt +3 -0
  49. last-token-generation/layer_29/config.json +9 -0
  50. last-token-generation/layer_29/model.pt +3 -0
README.md ADDED
---
license: mit
tags:
- deception-detection
- linear-probe
- mechanistic-interpretability
base_model: allenai/Olmo-3-32B-Think
---

# Diverse Deception Linear Probe — OLMo-3-32B-Think

Per-layer linear probes trained on **OLMo-3-32B-Think** residual stream activations
to distinguish deceptive from honest text completions.

## Performance

| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| **last-token-generation** | last token generation | `Did you respond truthfully?` → `Yes` | **0.996** | filtered by train loss < 0.6 |

AUC is computed on MBPP (hardcoded vs. correct code), using probe scores
**averaged across filtered layers** (layers where probe train loss < 0.6).

> **Thinking model**: OLMo-3-32B-Think uses `<think>` tags. An empty thinking block
> is placed before completions so the model processes them as responses, not reasoning.
> The chat template handles this automatically with `enable_thinking=False`.

## Probe Architecture

Each layer has an independent `LinearProbe`:

| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 5120]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalar | Platt calibration parameters |

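Putting the table together, a single probe's forward pass can be sketched as below. This is a minimal sketch under stated assumptions, not the repository's `LinearProbe` implementation: the centering/scaling mirrors the Quick Start code, and the sign convention of the Platt step is an assumption.

```python
import torch


def probe_score(h: torch.Tensor, state: dict) -> float:
    """Score one hidden-state vector h (shape [d_model]) with a probe state_dict.

    Minimal sketch: normalization mirrors the Quick Start; the Platt sign
    convention is an assumption, not confirmed by the released code.
    """
    # Centre and scale the input (unit-norm normalization).
    x = (h.float() - state["input_mean"]) / state["input_scale"]
    # Linear classification head: weight [1, d_model], bias [1].
    z = (x @ state["linear.weight"].T + state["linear.bias"]).item()
    # Platt calibration maps the raw logit to a probability in [0, 1].
    return torch.sigmoid(torch.tensor(state["platt_A"] * z + state["platt_B"])).item()
```

With a zero weight vector and `platt_A=1, platt_B=0`, the calibrated score is exactly 0.5, which is a quick sanity check when wiring this up.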
### File layout

```
<variant>/layer_N/
    config.json   # probe class and d_model
    model.pt      # PyTorch state_dict
```

## Follow-up Prompt Setup

For `last-token-generation` variants, a follow-up exchange is appended after the
completion to elicit a truthfulness signal. The probe reads activations at the
**last token** of the assistant's reply ("Yes"):

```
[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes ← probe reads activations here
```

Formatted with OLMo-3-32B-Think's chat template:

```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
<think></think>
Yes
```

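At the string level, the rendered follow-up suffix can be appended as below. This is a hypothetical helper, not part of the released code: in practice the chat template (with `enable_thinking=False`) renders these turns, and the input is assumed to already end with the completion's end-of-turn marker.

```python
def with_truthfulness_followup(rendered: str) -> str:
    """Append the truthfulness follow-up exchange in the rendered format
    shown above. Hypothetical helper; assumes `rendered` already ends with
    the completion's end-of-turn marker."""
    return (
        rendered
        + "<|im_start|>user\n"
        + "Did you respond truthfully?<|im_end|>\n"
        + "<|im_start|>assistant\n"
        + "<think></think>\n"  # empty thinking block (see the note above)
        + "Yes"                # the probe reads activations at this final token
    )
```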
## Training Details

- **Probe dataset**: `diverse_deception_probe` v1 — ~8,000 examples from 8+ sources
  (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- **Normalization**: unit-norm input scaling
- **Layer aggregation**: mean (scores from multiple layers averaged at inference)

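The mean layer aggregation amounts to a plain average of per-layer probe scores over the filtered layers. A hypothetical helper (the filtered layer indices ship in `averaged_over_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.txt` in this repository):

```python
def average_over_layers(scores_by_layer: dict[int, float],
                        filtered_layers: list[int]) -> float:
    """Mean-aggregate per-layer probe scores over the filtered layers
    (those whose probe train loss was < 0.6). Hypothetical helper."""
    values = [scores_by_layer[layer] for layer in filtered_layers]
    return sum(values) / len(values)
```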
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-32b-think")

# Load one layer's probe weights (a state_dict of tensors and scalars).
state = torch.load(f"{probe_dir}/last-token-generation/layer_40/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]

model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-32B-Think", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False, add_generation_prompt=True,
)
text += "The model's completion"
# For the last-token-generation variant, also append the follow-up exchange
# ("Did you respond truthfully?" → "Yes") described above before scoring.

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so index 41 is layer 40's output.
    h = model(**inputs, output_hidden_states=True).hidden_states[41][:, -1, :]

score = (((h.float() - mean) / scale) @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
# For best results, average scores across the filtered layers (see Performance).
```

## Citation

Part of the deception-detection research at [FAR AI](https://far.ai).
See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).
averaged_over_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.txt ADDED
[19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]

averaged_violin_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.png ADDED
f1_and_recall_plot_initial_detector.base_model.on_dataset.png ADDED
last-token-generation/layer_0/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_0/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b665a37d20ce954a6da33438bb7b8c8799bcbd3ddb50745dd19f67a6f93b3390
size 23293

last-token-generation/layer_1/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_1/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2bfa89d24e4b04b00c4b63bf6010539e8cf3a5d7d31e978bf58b861158c8cddf
size 23293

last-token-generation/layer_10/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_10/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:267143d1044b5f8d24fb8fb63ad0d64b583cb19a2282cf048c64c4b8c8f5966f
size 23293

last-token-generation/layer_11/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_11/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a95b338ad54cc15766bd1c83c3d377b1cb7d363e19e1566d3fa05cb6b11ffc3e
size 23293

last-token-generation/layer_12/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_12/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:54b150357857ff30754f3fe7710acbf232224633de502f048b78f4b887ddbf66
size 23293

last-token-generation/layer_13/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_13/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3be471c1fc41a02d6b6018167ad0c341e4417e1e2d425b9b8f3af070fe1e4bf7
size 23293

last-token-generation/layer_14/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_14/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:63c49cd97c75335a10a4630ca1d573f74a4e09c256d18d6042eedb716e686d28
size 23293

last-token-generation/layer_15/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_15/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a646cf4be268a510dc6ca31af115d32198bfd693a124782c0afbc3f6f8e9c46e
size 23293

last-token-generation/layer_16/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_16/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:eb691975162d434390a2942e4943d73304cfc66d782c911b31fbad7ad222799e
size 23293

last-token-generation/layer_17/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_17/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2ec414418dd49914c877348aba527bcd3e1be876ac9d1d5e911605cbba389a32
size 23293

last-token-generation/layer_18/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_18/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:48f7a14035c2c69011298c5de15b50b2c013b79b6a66c20cbfa992355f85c898
size 23293

last-token-generation/layer_19/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_19/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f5a187e541d923f8aa95cb94944b299155162c543f6b054d0555a9726365f5c2
size 23293

last-token-generation/layer_2/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_2/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e8635e505cb9a4152cf9b3c2ad6c4e735b491d8bef044daa81fd18e4f60f83e3
size 23293

last-token-generation/layer_20/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_20/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8eb29391325c08e703948259fbd8d555e8d5d751f0c26620c8a0b45f3ecd1e5e
size 23293

last-token-generation/layer_21/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_21/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:06865771dc3c82751add800e44bc9e7f36914c5eed26b00fe22b543d71490631
size 23293

last-token-generation/layer_22/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_22/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c466f6c024b2e4ff35dab0f1914f9297a2cd20bbad7e158e66f29c84fc8a5bef
size 23293

last-token-generation/layer_23/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_23/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a07d5cb9c3e1e7f7294fdc8b43acf2e9a67bb724b7e84f2160325110d9c1719a
size 23293

last-token-generation/layer_24/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_24/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:24af68907c0e109cbfa4a0f822fd6277b58abd826c4a4abfbbfc17c2b9e72884
size 23293

last-token-generation/layer_25/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_25/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f107663c86d524a4c96b7f78c1690f409b506b356adddae357be2a29987b3931
size 23293

last-token-generation/layer_26/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_26/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2396a4da7ca2f2e491c7e8e4350531e7388c8850852f49d4a15e0f2ca47dae45
size 23293

last-token-generation/layer_27/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_27/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a113f0222806b5c643cb45e5252d70ed8db3eab8b1af1a50b905de41a48c63a4
size 23293

last-token-generation/layer_28/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_28/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f3f333fb2d900aca23ec8dc7d1c5ad68c3bfe80220bfaf6939ec376c62b1da0d
size 23293

last-token-generation/layer_29/config.json ADDED
{
  "class_name": "LinearProbe",
  "module": "deception.oa_backdoor.detectors.probe_archs",
  "init_args": {
    "d_model": 5120,
    "nhead": 1,
    "normalize_input": "unit_norm"
  }
}

last-token-generation/layer_29/model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5bb52245816bd256310c79e75eafca4c2e0a208817fde4299890830f974f28a5
size 23293