Commit 6ee95bc by lighteternal · verified · 1 Parent(s): ed8c6e3

Publish broader prepared-data compatibility model as stable public artifact

Files changed (4)
  1. README.md +244 -0
  2. best_model.pt +3 -0
  3. training_metadata.json +1 -0
  4. training_summary.json +1 -0
README.md ADDED
@@ -0,0 +1,244 @@
---
tags:
- bioassay
- chemistry
- drug-discovery
- ranking
- assay-conditioning
- rdkit
- qwen
library_name: pytorch
license: mit
---

# BioAssayAlign Qwen3-Embedding-0.6B Compatibility

BioAssayAlign is an assay-conditioned small-molecule ranking model.

Given:
- an assay description with optional metadata
- a list of candidate SMILES strings

it returns:
- one compatibility score per molecule
- a ranked shortlist for that assay

This is a ranking model. It is not a chatbot, not a generative chemistry model, and not a direct potency regressor.

## What This Model Is For

Use it when you already have a candidate list and want to answer:

> For this assay, which molecules should I screen first?

Reasonable uses:
- shortlist triage before wet-lab work
- compare candidate sets against the same assay
- add a ranking feature to a downstream discovery workflow

Not reasonable uses:
- treating the score as a probability of success
- predicting exact IC50 / EC50 / Ki
- using it as a stand-alone medicinal chemistry decision engine

## Model Design

This artifact uses:
- assay encoder: `Qwen/Qwen3-Embedding-0.6B` (frozen)
- molecule representation:
  - Morgan fingerprints, radii `2` and `3`, `2048` bits each
  - chirality-aware fingerprints
  - MACCS keys
  - `30` RDKit descriptors
- compatibility head:
  - trainable projection layers
  - metadata features on the assay side
  - scorer MLP

The final score comes from the learned compatibility head. It is not just a raw embedding dot product.

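The feature layout above can be sanity-checked against the sizes reported in `training_metadata.json` (`bit_dim: 4263`, `dense_dim: 30`, `molecule_dim: 4293`, `packed_bit_dim: 533`). The sketch below is only arithmetic. It assumes the MACCS block is the 167-bit key vector that RDKit produces and that bits are packed eight per byte; both are inferences from the reported sizes, not statements taken from the project code.

```python
import math

# Values stated in this README and in training_metadata.json.
FINGERPRINT_RADII = [2, 3]
FINGERPRINT_BITS = 2048
NUM_DESCRIPTORS = 30

# Assumption: RDKit-style MACCS keys, which are 167 bits long.
MACCS_BITS = 167


def molecule_feature_dims():
    """Reconstruct the molecule feature sizes from the listed components."""
    morgan_bits = len(FINGERPRINT_RADII) * FINGERPRINT_BITS
    bit_dim = morgan_bits + MACCS_BITS
    return {
        "bit_dim": bit_dim,
        "dense_dim": NUM_DESCRIPTORS,
        "molecule_dim": bit_dim + NUM_DESCRIPTORS,
        # Assumption: binary features are stored eight bits per byte.
        "packed_bit_dim": math.ceil(bit_dim / 8),
    }


dims = molecule_feature_dims()
print(dims)
```

Under these assumptions the totals agree with the metadata, which suggests chirality is a setting on the fingerprints rather than an extra feature block.
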
## Training Data

Training uses a frozen public assay-compound corpus derived from:
- PubChem BioAssay
- ChEMBL

The published artifact was trained on a prepared compatibility subset with:
- assays: `11,195`
- candidate-pool rows: `1,432,532`
- training groups: `508,216`

Held-out assay split:
- train assays: `8,967`
- validation assays: `1,117`
- test assays: `1,111`

Each training group contains:
- one assay
- one active compound
- multiple explicit same-assay inactive compounds

This means the model is trained for assay-conditioned ranking, not generic text retrieval.

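To make the group structure concrete, here is a minimal sketch of one training group together with a consistency check on the counts quoted above. The `TrainingGroup` class, its field names, and the placeholder SMILES are hypothetical and are not taken from the project code; only the counts come from this README.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingGroup:
    """Hypothetical container for one ranking example (names are illustrative)."""
    assay_id: str
    active_smiles: str
    inactive_smiles: list = field(default_factory=list)

    def candidates(self):
        # Label 1 marks the active, 0 marks each same-assay inactive.
        return [(self.active_smiles, 1)] + [(s, 0) for s in self.inactive_smiles]


# Counts quoted in the Training Data section.
SPLIT_COUNTS = {"train": 8967, "val": 1117, "test": 1111}
TOTAL_ASSAYS = 11195
CANDIDATE_POOL_ROWS = 1432532

total_from_splits = sum(SPLIT_COUNTS.values())
avg_candidates_per_assay = CANDIDATE_POOL_ROWS / TOTAL_ASSAYS

# Placeholder molecules, chosen only to show the shape of a group.
group = TrainingGroup("assay-0", "CCO", ["c1ccccc1", "CCN"])
print(total_from_splits, round(avg_candidates_per_assay, 2), group.candidates())
```

The three splits sum to the stated assay total, and the candidate pool averages about 128 rows per assay, in line with `avg_candidates_per_assay` in `training_metadata.json`.
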
## Main Results

Best validation checkpoint:
- validation mean AUPRC: `0.6214`
- best epoch: `9`

Held-out test metrics:
- test mean AUPRC: `0.6339`
- test random-baseline AUPRC: `0.2749`
- test hit@10: `0.9739`
- test mean AUROC: `0.7815`
- test mean nDCG@50: `0.7250`

Interpretation:
- test mean AUPRC is roughly 2.3 times the random-ranking baseline (`0.6339` vs `0.2749`)
- it is strongest as a relative ranking tool over the submitted candidate list

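For readers unfamiliar with the per-assay metrics, the sketch below shows one common way to compute average precision (the usual estimator of AUPRC) and hit@k for a single ranked candidate list, plus the lift implied by the reported test numbers. Whether the project's evaluation code uses exactly these definitions is an assumption; the function names are illustrative.

```python
def average_precision(labels_ranked):
    """AP for one assay; labels_ranked holds 0/1 labels sorted by descending score."""
    hits, precision_sum = 0, 0.0
    for position, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def hit_at_k(labels_ranked, k=10):
    """1.0 if at least one active appears in the top k, else 0.0."""
    return 1.0 if any(labels_ranked[:k]) else 0.0


# Toy ranking: actives at positions 1 and 3 of four candidates.
toy = [1, 0, 1, 0]
toy_ap = average_precision(toy)  # (1/1 + 2/3) / 2
toy_hit = hit_at_k(toy, k=10)

# Lift of the reported test AUPRC over the reported random baseline.
lift = 0.6339 / 0.2749
print(round(toy_ap, 4), toy_hit, round(lift, 2))
```

A "mean" metric would then be the average of these per-assay values over the held-out assays.
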
## Real Example Predictions

These were produced locally from the published weights and metadata only.

### Example 1: JAK2 cell-based assay

Assay:
- title: `JAK2 inhibition assay`
- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
- organism: `Homo sapiens`
- readout: `luminescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `O60674`

Candidate list ranked by the model:
1. `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` → `-16.94`
2. `c1ccccc1` → `-31.24`
3. `CCO` → `-41.44`
4. `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` → `-45.44`

### Example 2: ALDH1A1 fluorescence assay

Assay:
- title: `ALDH1A1 inhibition assay`
- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
- organism: `Homo sapiens`
- readout: `fluorescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `P00352`

Candidate list ranked by the model:
1. `CCOc1ccccc1` → `-30.87`
2. `CCN(CC)CCOc1ccccc1` → `-34.09`
3. `Cc1cc(=O)n(C)c(=O)[nH]1` → `-34.33`
4. `CCO` → `-37.07`

The raw values above are model scores. In practice, read them as list-relative ranking values, not calibrated probabilities.

## How To Run It Locally

### Minimal local check from this repo

This downloads only the model weights and metadata, not the raw assay dataset.

```bash
MODEL_REPO_ID='lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility' \
LOCAL_MODEL_DIR='data/hf_compat_model_check' \
bash scripts/score_compatibility_from_hf.sh \
  --assay-title 'JAK2 inhibition assay' \
  --description 'Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.' \
  --organism 'Homo sapiens' \
  --readout 'luminescence' \
  --assay-format 'cell-based' \
  --assay-type 'inhibition' \
  --target-uniprot O60674 \
  --smiles 'CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1' \
  --smiles 'c1ccccc1' \
  --smiles 'CCO'
```

### Python usage

```python
from bioassayalign.compat_inference import (
    AssayQuery,
    load_compatibility_model,
    rank_compounds,
    serialize_assay_query,
)

model = load_compatibility_model("/path/to/model_dir")
assay_text = serialize_assay_query(
    AssayQuery(
        title="JAK2 inhibition assay",
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
        organism="Homo sapiens",
        readout="luminescence",
        assay_format="cell-based",
        assay_type="inhibition",
        target_uniprot=["O60674"],
    )
)

results = rank_compounds(
    model,
    assay_text=assay_text,
    smiles_list=[
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
        "c1ccccc1",
        "CCO",
    ],
)
```

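The Python snippet expects a local model directory. One way to obtain it, sketched here as an assumption rather than the project's own download path, is `snapshot_download` from the `huggingface_hub` package. The helper name `download_model_files` is hypothetical, and it is not confirmed that `load_compatibility_model` needs nothing beyond these three files.

```python
MODEL_REPO_ID = "lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility"

# The three files published in this repo.
MODEL_FILES = ["best_model.pt", "training_metadata.json", "training_summary.json"]


def download_model_files(repo_id=MODEL_REPO_ID, local_dir="data/hf_compat_model_check"):
    """Fetch only the model files and return the local directory path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=MODEL_FILES,
    )


# Usage (requires network access):
#   model_dir = download_model_files()
#   model = load_compatibility_model(model_dir)
```
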
## Input Guidance

Best practice:
- provide structured assay fields
- use chemically sensible parent SMILES
- include target UniProt IDs when the assay is target-defined

Recommended assay fields:
- title
- description
- organism
- readout
- assay format
- assay type
- target UniProt IDs

The model is reasonably robust to wording changes, but missing metadata can still reduce quality.

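Since missing metadata can reduce quality, a small pre-flight check on the assay record may help before scoring. The sketch below is a hypothetical helper, not part of the `bioassayalign` package; the field names simply mirror the `AssayQuery` keyword arguments shown in the Python usage example.

```python
RECOMMENDED_FIELDS = (
    "title",
    "description",
    "organism",
    "readout",
    "assay_format",
    "assay_type",
    "target_uniprot",
)


def missing_assay_fields(assay):
    """Return the recommended fields that are absent or empty in an assay dict."""
    return [name for name in RECOMMENDED_FIELDS if not assay.get(name)]


# Example record with the readout left out and an empty target list.
assay = {
    "title": "JAK2 inhibition assay",
    "description": "Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
    "organism": "Homo sapiens",
    "assay_format": "cell-based",
    "assay_type": "inhibition",
    "target_uniprot": [],
}
missing = missing_assay_fields(assay)
print(missing)
```
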
## How To Read The Output

Use the output in this order:
1. relative ranking within your submitted list
2. top-K shortlist
3. chemistry context columns such as molecular weight, logP, TPSA
4. raw model score only for debugging

The score is:
- useful for comparing molecules within the same submitted list
- not meaningful as a global absolute “quality” number across unrelated lists

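The reading order above amounts to turning raw scores into within-list ranks and a top-K shortlist. A minimal sketch follows, using the raw scores from Example 1. The helper names are illustrative, and the actual structure returned by `rank_compounds` is not documented here, so plain `(smiles, score)` pairs are assumed.

```python
def rank_within_list(scored):
    """Sort (smiles, raw_score) pairs by descending score and attach 1-based ranks."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(rank, smiles, score) for rank, (smiles, score) in enumerate(ordered, start=1)]


def top_k(scored, k):
    """Return the SMILES of the k highest-scoring candidates."""
    return [smiles for _, smiles, _ in rank_within_list(scored)[:k]]


# Raw scores from Example 1 (JAK2), listed here in shuffled order.
example_scores = [
    ("CCO", -41.44),
    ("CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1", -16.94),
    ("CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1", -45.44),
    ("c1ccccc1", -31.24),
]
ranked = rank_within_list(example_scores)
shortlist = top_k(example_scores, k=2)
print(ranked)
print(shortlist)
```

Only the ordering carries over between runs on the same list; the ranks say nothing about how this list compares with a different one.
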
## Limitations

- The score is not a calibrated probability.
- The model does not predict exact potency.
- The benchmark is assay-held-out, not a full unseen-scaffold universal benchmark.
- Public assay data contains label noise and heterogeneous assay protocols.
- Some assays remain difficult and produce only moderate ranking quality.

## Provenance

Project code:
- `https://github.com/lighteternal/bioassayalign-private`

Model files in this repo:
- `best_model.pt`
- `training_metadata.json`
- `training_summary.json`

If the public assay-ranking model is updated later, this repo keeps the same public identity while the internal experiment lineage can continue separately.
best_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f
size 19601186
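
The pointer above records the SHA-256 digest and byte size of the real `best_model.pt`. After downloading the weights, they can be checked against those values with the standard library. This is a sketch, and the helper names are illustrative.

```python
import hashlib
import os

# Values copied from the Git LFS pointer for best_model.pt.
EXPECTED_OID = "e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f"
EXPECTED_SIZE = 19601186


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_pointer(path, expected_oid=EXPECTED_OID, expected_size=EXPECTED_SIZE):
    """True when both the byte size and the SHA-256 digest match the pointer."""
    return os.path.getsize(path) == expected_size and sha256_of_file(path) == expected_oid


# Usage: matches_lfs_pointer("/path/to/model_dir/best_model.pt")
```
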
training_metadata.json ADDED
@@ -0,0 +1 @@
+ {"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"assay_batch_size":64,"assay_metadata_dim":128,"assay_model_name":"Qwen/Qwen3-Embedding-0.6B","assay_task_description":"Given a bioassay description and metadata, represent the assay for ranking compatible small molecules.","batch_size":192,"dropout":0.12,"early_stopping_min_delta":0.001,"early_stopping_patience":5,"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"hidden_dim":1024,"learning_rate":0.0015,"log_every_steps":50,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","max_epochs":30,"max_train_groups":0,"molecule_transformer_batch_size":64,"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","output_dir":"outputs/hf-compatibility-20260309-082913","precomputed_dir":"data/hf_compatibility_precomputed/prepared/compat_a10_precomputed_v2_default","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","projection_dim":512,"seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_assay_metadata_features":true,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true,"weight_decay":0.0001},"created_at":"2026-03-09T06:35:22.517011+00:00","feature_counts":{"assays":11195,"molecule_dim":4293,"molecules":485576,"train_groups":508216},"framework":"pytorch_head_only_compatibility_ranking","manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","molecule_feature_spec":{"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.24538898468017
6,3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"precomputed_manifest":{"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","output_dir":"outputs/hf_compatibility_precomputed/20260309-002930","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_chirality":true,"use_maccs":true
,"use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true},"created_at":"2026-03-09T02:40:11.768169+00:00","feature_spec":{"bit_dim":4263,"dense_dim":30,"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.245388984680176,3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","packed_bit_dim":533,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"file_hashes":
{"compat_indexed_candidate_pools.parquet":"ffe57259ad1b1e95e8ebe44de4e68aadb31a3a120c99399b356114fcf45f3c42","compat_indexed_train_rows.parquet":"35a21bc5c90ea3742f665de5e224392bce5d4489854063798227e9171f52a54c","compat_molecule_bit_features_packed.npy":"108211cad5f1098d58c54b55b5bbf0a5786a895dee7ba1b5c01cb8f34a1db905","compat_molecule_dense_features.npy":"6f525724b255b0fa4fb43f3bb97ce04eef8bef04365f28313e0e4e6c7707cf8e","compat_molecules.parquet":"457618ae03f36f9dc34752294356ff25acd18f750a51f9761f3ae6530eb10794"},"framework":"compatibility_cpu_precompute_v1","prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ran
king_v1_sharded_merge","thresholds":{"max_actives_per_assay":48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}},"row_counts":{"indexed_candidate_pools":1432532,"indexed_train_rows":508216,"molecules":485576},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b"},"prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ranking_v1_sharded_merge","thresholds":{"max_actives_per_assa
y":48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}}}
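
The `descriptor_mean` and `descriptor_std` arrays in the metadata look like per-descriptor standardization statistics. Assuming they are applied as a plain z-score, which the metadata does not state explicitly, the transform would look like the sketch below. Only the first three descriptors are shown, with values rounded for readability, and the input molecule is made up.

```python
# (mean, std) for the first three entries of descriptor_names,
# rounded from training_metadata.json.
DESCRIPTOR_STATS = {
    "mol_wt": (393.9933, 153.4399),
    "logp": (3.3191, 1.6963),
    "tpsa": (81.4309, 56.9239),
}


def standardize(raw):
    """Z-score each descriptor: (value - mean) / std. Assumed, not confirmed."""
    return {
        name: (raw[name] - mean) / std
        for name, (mean, std) in DESCRIPTOR_STATS.items()
    }


# Illustrative (not measured) descriptor values for a hypothetical molecule:
# average weight and TPSA, logP one standard deviation above the mean.
z = standardize({"mol_wt": 393.9933, "logp": 5.0154, "tpsa": 81.4309})
print(z)
```
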
training_summary.json ADDED
@@ -0,0 +1 @@
+ {"best_epoch":9,"best_metrics":{"epoch":9,"event":"compatibility_eval","step":23823,"train_loss":0.07582188404882256,"val_assays":1117.0,"val_hit_at_10":0.97224709042077,"val_mean_auprc":0.6214489194406888,"val_mean_auroc":0.77668220123392,"val_mean_ndcg50":0.7140369202520261,"val_random_auprc_baseline":0.26780584983601613},"best_val_mean_auprc":0.6214489194406888,"created_at":"2026-03-09T07:00:12.341166+00:00","event":"compatibility_train_complete","metadata_path":"outputs/hf-compatibility-20260309-082913/training_metadata.json","output_dir":"outputs/hf-compatibility-20260309-082913","test_metrics":{"test_assays":1111.0,"test_hit_at_10":0.9738973897389739,"test_mean_auprc":0.6339480076718866,"test_mean_auroc":0.7814796702989308,"test_mean_ndcg50":0.7250422425217068,"test_random_auprc_baseline":0.27485643596723147},"train_groups":508216}