taufeeque committed
Commit 8f015e4 · verified · 1 Parent(s): 435b3a6

Upload diverse deception linear probes for OLMo-3-7B-Instruct

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +113 -0
  2. averaged_over_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.txt +1 -0
  3. averaged_violin_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.png +0 -0
  4. f1_and_recall_plot_initial_detector.base_model.on_dataset.png +0 -0
  5. last-token-generation/layer_0/config.json +9 -0
  6. last-token-generation/layer_0/model.pt +3 -0
  7. last-token-generation/layer_1/config.json +9 -0
  8. last-token-generation/layer_1/model.pt +3 -0
  9. last-token-generation/layer_10/config.json +9 -0
  10. last-token-generation/layer_10/model.pt +3 -0
  11. last-token-generation/layer_11/config.json +9 -0
  12. last-token-generation/layer_11/model.pt +3 -0
  13. last-token-generation/layer_12/config.json +9 -0
  14. last-token-generation/layer_12/model.pt +3 -0
  15. last-token-generation/layer_13/config.json +9 -0
  16. last-token-generation/layer_13/model.pt +3 -0
  17. last-token-generation/layer_14/config.json +9 -0
  18. last-token-generation/layer_14/model.pt +3 -0
  19. last-token-generation/layer_15/config.json +9 -0
  20. last-token-generation/layer_15/model.pt +3 -0
  21. last-token-generation/layer_16/config.json +9 -0
  22. last-token-generation/layer_16/model.pt +3 -0
  23. last-token-generation/layer_17/config.json +9 -0
  24. last-token-generation/layer_17/model.pt +3 -0
  25. last-token-generation/layer_18/config.json +9 -0
  26. last-token-generation/layer_18/model.pt +3 -0
  27. last-token-generation/layer_19/config.json +9 -0
  28. last-token-generation/layer_19/model.pt +3 -0
  29. last-token-generation/layer_2/config.json +9 -0
  30. last-token-generation/layer_2/model.pt +3 -0
  31. last-token-generation/layer_20/config.json +9 -0
  32. last-token-generation/layer_20/model.pt +3 -0
  33. last-token-generation/layer_21/config.json +9 -0
  34. last-token-generation/layer_21/model.pt +3 -0
  35. last-token-generation/layer_22/config.json +9 -0
  36. last-token-generation/layer_22/model.pt +3 -0
  37. last-token-generation/layer_23/config.json +9 -0
  38. last-token-generation/layer_23/model.pt +3 -0
  39. last-token-generation/layer_24/config.json +9 -0
  40. last-token-generation/layer_24/model.pt +3 -0
  41. last-token-generation/layer_25/config.json +9 -0
  42. last-token-generation/layer_25/model.pt +3 -0
  43. last-token-generation/layer_26/config.json +9 -0
  44. last-token-generation/layer_26/model.pt +3 -0
  45. last-token-generation/layer_27/config.json +9 -0
  46. last-token-generation/layer_27/model.pt +3 -0
  47. last-token-generation/layer_28/config.json +9 -0
  48. last-token-generation/layer_28/model.pt +3 -0
  49. last-token-generation/layer_29/config.json +9 -0
  50. last-token-generation/layer_29/model.pt +3 -0
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ license: mit
+ tags:
+ - deception-detection
+ - linear-probe
+ - mechanistic-interpretability
+ base_model: allenai/Olmo-3-7B-Instruct
+ ---
+
+ # Diverse Deception Linear Probe — OLMo-3-7B-Instruct
+
+ Per-layer linear probes trained on **OLMo-3-7B-Instruct** residual-stream activations
+ to distinguish deceptive from honest text completions.
+
+ ## Performance
+
+ | Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
+ |---|---|---|---|---|
+ | **last-token-generation** | last generated token | `Did you respond truthfully?` → `Yes` | **0.992** | filtered by train loss < 0.6 |
+
+ AUC is computed on MBPP (hardcoded vs. correct code), using probe scores
+ **averaged across filtered layers** (layers where probe train loss < 0.6).
+
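The "average across filtered layers, then score" evaluation can be sketched with a small rank-based AUC helper. The scores, labels, and layer count below are made-up illustration values, not results from this repo:

```python
import numpy as np

def rank_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC as P(random positive outscores random negative), ties counted half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

# Hypothetical raw probe scores for 4 examples at 2 filtered layers.
scores_per_layer = np.array([
    [2.1, -1.3, 0.4, -0.8],
    [1.7, -0.9, 1.0, -1.2],
])
labels = np.array([1, 0, 1, 0])  # 1 = deceptive (e.g. a hardcoded MBPP solution)
avg = scores_per_layer.mean(axis=0)  # average across filtered layers first
print(rank_auc(labels, avg))  # 1.0: the averaged scores separate the classes
```

Averaging is done on the raw per-layer scores before computing the metric, matching the "mean" layer aggregation described below.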
+ ## Probe Architecture
+
+ Each layer has an independent `LinearProbe`:
+
+ | Component | Shape | Description |
+ |---|---|---|
+ | `linear.weight` | `[1, 4096]` | Classification weight vector |
+ | `linear.bias` | `[1]` | Classification bias |
+ | `input_scale` | scalar | Unit-norm scaling factor |
+ | `input_mean` | scalar | Input centering value |
+ | `platt_A`, `platt_B` | scalar | Platt calibration parameters |
+
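The components above compose into one affine score per layer. A minimal sketch of applying them; the Platt convention `p = sigmoid(platt_A * score + platt_B)` is the standard formula and an assumption here, not something this repo documents:

```python
import torch

def apply_probe(h: torch.Tensor, state: dict) -> tuple[float, float]:
    """Apply one layer's LinearProbe to an activation vector h of shape [d_model].

    Returns (raw_score, calibrated_prob). The calibration formula
    p = sigmoid(platt_A * score + platt_B) is an assumed convention.
    """
    x = (h - state["input_mean"]) / state["input_scale"]  # center, then unit-norm scale
    # [1, d_model] @ [d_model] + [1] -> [1]
    score = (state["linear.weight"] @ x + state["linear.bias"]).item()
    prob = torch.sigmoid(
        torch.as_tensor(state["platt_A"] * score + state["platt_B"], dtype=torch.float32)
    ).item()
    return score, prob
```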
+ ### File layout
+
+ ```
+ <variant>/layer_N/
+   config.json   # probe class and d_model
+   model.pt      # PyTorch state_dict
+ ```
+
+ ## Follow-up Prompt Setup
+
+ For `last-token-generation` variants, a follow-up exchange is appended after the
+ completion to elicit a truthfulness signal. The probe reads activations at the
+ **last token** of the assistant's reply ("Yes"):
+
+ ```
+ [main prompt + completion]
+ <user> Did you respond truthfully?
+ <assistant> Yes   ← probe reads activations here
+ ```
+
+ Formatted with OLMo-3-7B-Instruct's chat template:
+
+ ```
+ <|im_start|>user
+ Did you respond truthfully?<|im_end|>
+ <|im_start|>assistant
+ Yes
+ ```
+
+ The `generation` variant (where available) uses **no follow-up** and averages
+ probe scores over all tokens in the model's response.
+
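Assembling that exchange programmatically might look like this (a hypothetical helper, not part of this repo; the message contents are taken from the setup above):

```python
def append_truthfulness_followup(messages: list, completion: str) -> list:
    """Append the completion plus the fixed follow-up exchange used by the
    last-token-generation variant. The probe reads activations at the last
    token of the final "Yes" turn."""
    return messages + [
        {"role": "assistant", "content": completion},
        {"role": "user", "content": "Did you respond truthfully?"},
        {"role": "assistant", "content": "Yes"},
    ]
```

The result would then be rendered with the model tokenizer's `apply_chat_template(..., tokenize=False)` so activations can be read at the final token of `Yes`.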
+ ## Training Details
+
+ - **Probe dataset**: `diverse_deception_probe` v1, ~8,000 examples from 8+ sources
+ (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
+ - **Training**: 1,000 steps, lr=3e-3, constant schedule, no weight decay
+ - **Normalization**: unit-norm input scaling
+ - **Layer aggregation**: mean (scores from multiple layers averaged at inference)
+
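Under those hyperparameters, per-layer training reduces to logistic regression on activations. A minimal sketch under stated assumptions: the optimizer (Adam), full-batch updates, and the exact form of the unit-norm scale are guesses, only the step count, learning rate, constant schedule, and zero weight decay come from the list above:

```python
import torch

def train_probe(acts: torch.Tensor, labels: torch.Tensor,
                steps: int = 1000, lr: float = 3e-3):
    """Train a linear probe on activations [N, d_model], labels [N] (1 = deceptive).

    Constant lr, no weight decay (from the listed settings); Adam and
    full-batch updates are assumptions.
    """
    mean = acts.mean()  # scalar centering value (input_mean)
    # Unit-norm-style scale so centered inputs have roughly unit RMS (assumed form).
    scale = (acts - mean).norm(dim=-1).mean() / acts.shape[-1] ** 0.5
    x = (acts - mean) / scale
    probe = torch.nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr, weight_decay=0.0)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            probe(x).squeeze(-1), labels.float()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe, mean, scale
```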
+ ## Quick Start
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from huggingface_hub import snapshot_download
+
+ # Download probes
+ probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct")
+
+ # Load one layer's probe
+ state = torch.load(f"{probe_dir}/last-token-generation/layer_20/model.pt",
+                    map_location="cpu", weights_only=False)
+ w, b = state["linear.weight"], state["linear.bias"]
+ scale, mean = state["input_scale"], state["input_mean"]
+
+ # Load model
+ model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-7B-Instruct",
+                                              torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")
+
+ text = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Your prompt here"}],
+     tokenize=False, add_generation_prompt=True,
+ )
+ text += "The model's completion"
+ # For the last-token-generation variant, also append the follow-up exchange
+ # ("Did you respond truthfully?" -> "Yes") before reading the last token.
+
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ with torch.no_grad():
+     # hidden_states[0] is the embedding output, so index 21 corresponds to layer_20
+     h = model(**inputs, output_hidden_states=True).hidden_states[21][:, -1, :]
+
+ score = ((h.float() - mean) / scale @ w.T + b).item()
+ # score > 0 → likely deceptive, score < 0 → likely honest
+ ```
+
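To reproduce the averaged score behind the reported AUC, load every filtered layer's probe and take the mean of the raw scores. A sketch of the aggregation step only; in practice the layer list would come from the `averaged_over_...txt` file in this repo and the state dicts from each `layer_N/model.pt`:

```python
import torch

def averaged_probe_score(h_by_layer: dict, probes: dict) -> float:
    """Mean raw probe score over filtered layers ('mean' layer aggregation).

    h_by_layer: {layer: [d_model] activation tensor at the probed token}
    probes:     {layer: state_dict loaded from layer_N/model.pt}
    """
    scores = []
    for layer, state in probes.items():
        x = (h_by_layer[layer].float() - state["input_mean"]) / state["input_scale"]
        scores.append((state["linear.weight"] @ x + state["linear.bias"]).item())
    return sum(scores) / len(scores)
```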
+ ## Citation
+
+ Part of [FAR AI](https://far.ai)'s deception-detection research.
+ See [AlignmentResearch/deception](https://github.com/AlignmentResearch/deception).
averaged_over_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.txt ADDED
@@ -0,0 +1 @@
+ [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

averaged_violin_initial_detector.base_model.on_dataset_linear-probe_last-token-generation.png ADDED

f1_and_recall_plot_initial_detector.base_model.on_dataset.png ADDED

last-token-generation/layer_0/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_0/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1990ea7a76bf454f9241f3bc71a3d0b057ee7093787552cce93d771e858c760
+ size 19197

last-token-generation/layer_1/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_1/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bdb7488e559283ac9af79db5dfa4e88d53ba9b41a961aa60d5c7591796e13716
+ size 19197

last-token-generation/layer_10/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_10/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be4b1840f9621066ef8f179960346c5e79e05e5cedea8520dce10664b0ee7624
+ size 19197

last-token-generation/layer_11/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_11/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d77838517ace40fda0b6b4e9504e60a9e8aeb1cc8040c4a7c2e607b49514fe4
+ size 19197

last-token-generation/layer_12/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_12/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d90321040d48e96f9ee19e878c8f1c1d0199e12bb5f7cad8bea4ef73c408aa1b
+ size 19197

last-token-generation/layer_13/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_13/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a34f2d7f24716f506f13c0fd59f9d45ce1c7df0fca9a5cc27e341411669071f
+ size 19197

last-token-generation/layer_14/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_14/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00a1b2dd4772f013c3a4fe0876549c988763eeee2c4bb2aeba5dbe9579ed0532
+ size 19197

last-token-generation/layer_15/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_15/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:20458d2cc803f764514b4c44135e850733256164a2212cf2ec4551263030cdd6
+ size 19197

last-token-generation/layer_16/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_16/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43ffdaeb21ac197abeed828feef54e91d5b68d36c6c5429f3dfe2c950ccf4869
+ size 19197

last-token-generation/layer_17/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_17/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca4c940c8daf1bc9ab8d5dfb86ccfc78ebfe76879f2910a422fd0770ee158922
+ size 19197

last-token-generation/layer_18/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_18/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a60f3441d039d58c41603284b8fb670129a91ac10f243f0a62c10b5a387ff118
+ size 19197

last-token-generation/layer_19/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_19/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2df0f3e95bd91c8fa0bfee277099205043a8f18b628b6f43fb24a0ead5e45cc9
+ size 19197

last-token-generation/layer_2/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_2/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7c195b1b1687fcf46e88a1cbe80c70a61c443203ed2632962eebcfa2e9b922d7
+ size 19197

last-token-generation/layer_20/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_20/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80c448bf021c1b3242e85e6f7cf4d2992c102b56334b46e54f4cd989801c21ff
+ size 19197

last-token-generation/layer_21/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_21/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:762b028cce228710e92176b7ffb032bf550cb3b16c5c3e736aed593f8698dbf0
+ size 19197

last-token-generation/layer_22/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_22/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:628bd6b9dd16a9598929ceb9cd12e71af6e6754402efa5c264e3cf693fc08ecb
+ size 19197

last-token-generation/layer_23/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_23/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c62d4ddf800b48c1b2070ccb3cd8515cf2e2235bf82a3afcbc8a50c7a7e2d49
+ size 19197

last-token-generation/layer_24/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_24/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e78a2ea7cb45bf90366f1ea001fa59c58a065d146ac7c54bd913c11845b90c2
+ size 19197

last-token-generation/layer_25/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_25/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c318a924980fea8f748e4d1953239202e60ffcb5b4c517a72975ad4a5cf85053
+ size 19197

last-token-generation/layer_26/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_26/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cadda23248ac42164c03af6f8ea27b929d1950bf7301a8ae236058c8f766e5d6
+ size 19197

last-token-generation/layer_27/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_27/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cf6f22e96450474ed0325bd72a096faf37c7c4b1d1475f331e00c0084199769b
+ size 19197

last-token-generation/layer_28/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_28/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:af6eeda145b9ed2094c5203953ddd8451a8dbb695f5384e45584fdfb8afb1600
+ size 19197

last-token-generation/layer_29/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "class_name": "LinearProbe",
+   "module": "deception.oa_backdoor.detectors.probe_archs",
+   "init_args": {
+     "d_model": 4096,
+     "nhead": 1,
+     "normalize_input": "unit_norm"
+   }
+ }

last-token-generation/layer_29/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1cff5a5152eca30cea35431379247c73920414d0bc0f4b80a6fb2db678a6de7f
+ size 19197