Publish broader prepared-data compatibility model as stable public artifact
- README.md +244 -0
- best_model.pt +3 -0
- training_metadata.json +1 -0
- training_summary.json +1 -0
README.md
ADDED
|
@@ -0,0 +1,244 @@
|
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- bioassay
|
| 4 |
+
- chemistry
|
| 5 |
+
- drug-discovery
|
| 6 |
+
- ranking
|
| 7 |
+
- assay-conditioning
|
| 8 |
+
- rdkit
|
| 9 |
+
- qwen
|
| 10 |
+
library_name: pytorch
|
| 11 |
+
license: mit
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
# BioAssayAlign Qwen3-Embedding-0.6B Compatibility
|
| 15 |
+
|
| 16 |
+
BioAssayAlign is an assay-conditioned small-molecule ranking model.
|
| 17 |
+
|
| 18 |
+
Given:
|
| 19 |
+
- an assay description with optional metadata
|
| 20 |
+
- a list of candidate SMILES strings
|
| 21 |
+
|
| 22 |
+
it returns:
|
| 23 |
+
- one compatibility score per molecule
|
| 24 |
+
- a ranked shortlist for that assay
|
| 25 |
+
|
| 26 |
+
This is a ranking model. It is not a chatbot, not a generative chemistry model, and not a direct potency regressor.
|
| 27 |
+
|
| 28 |
+
## What This Model Is For
|
| 29 |
+
|
| 30 |
+
Use it when you already have a candidate list and want to answer:
|
| 31 |
+
|
| 32 |
+
> For this assay, which molecules should I screen first?
|
| 33 |
+
|
| 34 |
+
Reasonable uses:
|
| 35 |
+
- shortlist triage before wet-lab work
|
| 36 |
+
- compare candidate sets against the same assay
|
| 37 |
+
- add a ranking feature to a downstream discovery workflow
|
| 38 |
+
|
| 39 |
+
Not reasonable uses:
|
| 40 |
+
- treating the score as a probability of success
|
| 41 |
+
- predicting exact IC50 / EC50 / Ki
|
| 42 |
+
- using it as a stand-alone medicinal chemistry decision engine
|
| 43 |
+
|
| 44 |
+
## Model Design
|
| 45 |
+
|
| 46 |
+
This artifact uses:
|
| 47 |
+
- assay encoder: `Qwen/Qwen3-Embedding-0.6B` (frozen)
|
| 48 |
+
- molecule representation:
|
| 49 |
+
  - Morgan fingerprints, radii `2` and `3`, `2048` bits each
|
| 50 |
+
  - chirality-aware fingerprints
|
| 51 |
+
  - MACCS keys
|
| 52 |
+
  - `30` RDKit descriptors
|
| 53 |
+
- compatibility head:
|
| 54 |
+
  - trainable projection layers
|
| 55 |
+
  - metadata features on the assay side
|
| 56 |
+
  - scorer MLP
|
| 57 |
+
|
| 58 |
+
The final score comes from the learned compatibility head. It is not just a raw embedding dot product.
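
The molecule feature sizes recorded in `training_metadata.json` (`bit_dim` 4263, `packed_bit_dim` 533, `dense_dim` 30, `molecule_dim` 4293) can be reconciled with the components listed above. The sketch below is only arithmetic, not the project's featurization code. It assumes the 167-bit MACCS vector that RDKit produces, that chirality changes how the Morgan bits are computed rather than adding dimensions, and that bits are packed 8 per byte; the function name is hypothetical.

```python
import math

FINGERPRINT_BITS = 2048      # bits per Morgan fingerprint (from the model card)
FINGERPRINT_RADII = (2, 3)   # one fingerprint per radius (from the model card)
MACCS_BITS = 167             # length of RDKit's MACCS key bit vector (assumption about the source of the keys)
NUM_DESCRIPTORS = 30         # dense RDKit descriptors (from the model card)


def molecule_feature_dims():
    """Return the feature sizes implied by the listed components."""
    bit_dim = FINGERPRINT_BITS * len(FINGERPRINT_RADII) + MACCS_BITS
    return {
        "bit_dim": bit_dim,
        # Assumption: binary features are stored 8 bits per byte.
        "packed_bit_dim": math.ceil(bit_dim / 8),
        "molecule_dim": bit_dim + NUM_DESCRIPTORS,
    }


dims = molecule_feature_dims()
print(dims)
```

Under these assumptions the three values agree with the ones stored in the metadata file.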
|
| 59 |
+
|
| 60 |
+
## Training Data
|
| 61 |
+
|
| 62 |
+
Training uses a frozen public assay-compound corpus derived from:
|
| 63 |
+
- PubChem BioAssay
|
| 64 |
+
- ChEMBL
|
| 65 |
+
|
| 66 |
+
The published artifact was trained on a prepared compatibility subset with:
|
| 67 |
+
- assays: `11,195`
|
| 68 |
+
- candidate-pool rows: `1,432,532`
|
| 69 |
+
- training groups: `508,216`
|
| 70 |
+
|
| 71 |
+
Held-out assay split:
|
| 72 |
+
- train assays: `8,967`
|
| 73 |
+
- validation assays: `1,117`
|
| 74 |
+
- test assays: `1,111`
|
| 75 |
+
|
| 76 |
+
Each training group contains:
|
| 77 |
+
- one assay
|
| 78 |
+
- one active compound
|
| 79 |
+
- multiple explicit same-assay inactive compounds
|
| 80 |
+
|
| 81 |
+
This means the model is trained for assay-conditioned ranking, not generic text retrieval.
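
This card does not state the training objective. As an illustration of how a group of one active and several inactives can drive a ranking model, here is a minimal sketch of a listwise softmax loss, which is one common choice and an assumption here, not the project's confirmed loss. The function name and the scores in the usage lines are hypothetical.

```python
import math


def listwise_softmax_loss(active_score, inactive_scores):
    """Negative log-probability of the active under a softmax over the group.

    Assumption: one active and its same-assay inactives form one softmax group.
    """
    scores = [active_score, *inactive_scores]
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - active_score


# The loss is small when the active outscores the inactives and large otherwise.
good_ranking_loss = listwise_softmax_loss(2.0, [-1.0, -3.0, -2.5])
bad_ranking_loss = listwise_softmax_loss(-3.0, [2.0, -1.0, -2.5])
print(good_ranking_loss, bad_ranking_loss)
```

A loss of this shape only rewards the active for outranking the inactives of the same assay, which matches the list-relative reading of the scores recommended elsewhere in this card.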
|
| 82 |
+
|
| 83 |
+
## Main Results
|
| 84 |
+
|
| 85 |
+
Best validation checkpoint:
|
| 86 |
+
- validation mean AUPRC: `0.6214`
|
| 87 |
+
- best epoch: `9`
|
| 88 |
+
|
| 89 |
+
Held-out test metrics:
|
| 90 |
+
- test mean AUPRC: `0.6339`
|
| 91 |
+
- test random-baseline AUPRC: `0.2749`
|
| 92 |
+
- test hit@10: `0.9739`
|
| 93 |
+
- test mean AUROC: `0.7815`
|
| 94 |
+
- test mean nDCG@50: `0.7250`
|
| 95 |
+
|
| 96 |
+
Interpretation:
|
| 97 |
+
- the model substantially beats the random ranking baseline: test mean AUPRC is `0.6339` versus `0.2749` for random ranking, about 2.3 times higher
|
| 98 |
+
- it is strongest as a relative ranking tool over the submitted candidate list
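
For readers who want to compute comparable numbers on their own ranked lists, the sketch below shows average precision (a common estimator of AUPRC) and hit@K for one assay. It is an illustration under that assumption; the project's evaluation code may differ in details such as tie handling. The function names are hypothetical.

```python
def average_precision(labels_in_ranked_order):
    """Average precision for one ranked list (1 = active, 0 = inactive)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_active in enumerate(labels_in_ranked_order, start=1):
        if is_active:
            hits += 1
            precision_sum += hits / rank  # precision at this active's rank
    return precision_sum / hits if hits else 0.0


def hit_at_k(labels_in_ranked_order, k=10):
    """True if at least one active appears in the top k positions."""
    return any(labels_in_ranked_order[:k])


# Toy ranked list: actives at ranks 1 and 3.
toy_ap = average_precision([1, 0, 1, 0])
toy_hit = hit_at_k([1, 0, 1, 0], k=2)

# Lift of the reported test AUPRC over the reported random baseline.
lift = 0.6339 / 0.2749
print(toy_ap, toy_hit, round(lift, 2))
```

The random baseline for average precision is roughly the fraction of actives in the candidate pool, so a lift well above 1 means actives are concentrated toward the top of the list.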
|
| 99 |
+
|
| 100 |
+
## Real Example Predictions
|
| 101 |
+
|
| 102 |
+
These were produced locally from the published weights and metadata only.
|
| 103 |
+
|
| 104 |
+
### Example 1: JAK2 cell-based assay
|
| 105 |
+
|
| 106 |
+
Assay:
|
| 107 |
+
- title: `JAK2 inhibition assay`
|
| 108 |
+
- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
|
| 109 |
+
- organism: `Homo sapiens`
|
| 110 |
+
- readout: `luminescence`
|
| 111 |
+
- assay format: `cell-based`
|
| 112 |
+
- assay type: `inhibition`
|
| 113 |
+
- target UniProt: `O60674`
|
| 114 |
+
|
| 115 |
+
Candidate list ranked by the model:
|
| 116 |
+
1. `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` → `-16.94`
|
| 117 |
+
2. `c1ccccc1` → `-31.24`
|
| 118 |
+
3. `CCO` → `-41.44`
|
| 119 |
+
4. `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` → `-45.44`
|
| 120 |
+
|
| 121 |
+
### Example 2: ALDH1A1 fluorescence assay
|
| 122 |
+
|
| 123 |
+
Assay:
|
| 124 |
+
- title: `ALDH1A1 inhibition assay`
|
| 125 |
+
- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
|
| 126 |
+
- organism: `Homo sapiens`
|
| 127 |
+
- readout: `fluorescence`
|
| 128 |
+
- assay format: `cell-based`
|
| 129 |
+
- assay type: `inhibition`
|
| 130 |
+
- target UniProt: `P00352`
|
| 131 |
+
|
| 132 |
+
Candidate list ranked by the model:
|
| 133 |
+
1. `CCOc1ccccc1` → `-30.87`
|
| 134 |
+
2. `CCN(CC)CCOc1ccccc1` → `-34.09`
|
| 135 |
+
3. `Cc1cc(=O)n(C)c(=O)[nH]1` → `-34.33`
|
| 136 |
+
4. `CCO` → `-37.07`
|
| 137 |
+
|
| 138 |
+
The raw values above are model scores. In practice, read them as list-relative ranking values, not calibrated probabilities.
|
| 139 |
+
|
| 140 |
+
## How To Run It Locally
|
| 141 |
+
|
| 142 |
+
### Minimal local check from this repo
|
| 143 |
+
|
| 144 |
+
This downloads only the model weights and metadata, not the raw assay dataset.
|
| 145 |
+
|
| 146 |
+
```bash
|
| 147 |
+
MODEL_REPO_ID='lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility' \
|
| 148 |
+
LOCAL_MODEL_DIR='data/hf_compat_model_check' \
|
| 149 |
+
bash scripts/score_compatibility_from_hf.sh \
|
| 150 |
+
  --assay-title 'JAK2 inhibition assay' \
|
| 151 |
+
  --description 'Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.' \
|
| 152 |
+
  --organism 'Homo sapiens' \
|
| 153 |
+
  --readout 'luminescence' \
|
| 154 |
+
  --assay-format 'cell-based' \
|
| 155 |
+
  --assay-type 'inhibition' \
|
| 156 |
+
  --target-uniprot O60674 \
|
| 157 |
+
  --smiles 'CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1' \
|
| 158 |
+
  --smiles 'c1ccccc1' \
|
| 159 |
+
  --smiles 'CCO'
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
### Python usage
|
| 163 |
+
|
| 164 |
+
```python
|
| 165 |
+
from bioassayalign.compat_inference import (
|
| 166 |
+
    AssayQuery,
|
| 167 |
+
    load_compatibility_model,
|
| 168 |
+
    rank_compounds,
|
| 169 |
+
    serialize_assay_query,
|
| 170 |
+
)
|
| 171 |
+
|
| 172 |
+
model = load_compatibility_model("/path/to/model_dir")
|
| 173 |
+
assay_text = serialize_assay_query(
|
| 174 |
+
    AssayQuery(
|
| 175 |
+
        title="JAK2 inhibition assay",
|
| 176 |
+
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
|
| 177 |
+
        organism="Homo sapiens",
|
| 178 |
+
        readout="luminescence",
|
| 179 |
+
        assay_format="cell-based",
|
| 180 |
+
        assay_type="inhibition",
|
| 181 |
+
        target_uniprot=["O60674"],
|
| 182 |
+
    )
|
| 183 |
+
)
|
| 184 |
+
|
| 185 |
+
results = rank_compounds(
|
| 186 |
+
    model,
|
| 187 |
+
    assay_text=assay_text,
|
| 188 |
+
    smiles_list=[
|
| 189 |
+
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
|
| 190 |
+
        "c1ccccc1",
|
| 191 |
+
        "CCO",
|
| 192 |
+
    ],
|
| 193 |
+
)
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
## Input Guidance
|
| 197 |
+
|
| 198 |
+
Best practice:
|
| 199 |
+
- provide structured assay fields
|
| 200 |
+
- use chemically sensible parent SMILES
|
| 201 |
+
- include target UniProt IDs when the assay is target-defined
|
| 202 |
+
|
| 203 |
+
Recommended assay fields:
|
| 204 |
+
- title
|
| 205 |
+
- description
|
| 206 |
+
- organism
|
| 207 |
+
- readout
|
| 208 |
+
- assay format
|
| 209 |
+
- assay type
|
| 210 |
+
- target UniProt IDs
|
| 211 |
+
|
| 212 |
+
The model is reasonably robust to wording changes, but missing metadata can still reduce quality.
|
| 213 |
+
|
| 214 |
+
## How To Read The Output
|
| 215 |
+
|
| 216 |
+
Use the output in this order:
|
| 217 |
+
1. relative ranking within your submitted list
|
| 218 |
+
2. top-K shortlist
|
| 219 |
+
3. chemistry context columns such as molecular weight, logP, TPSA
|
| 220 |
+
4. raw model score only for debugging
|
| 221 |
+
|
| 222 |
+
The score is:
|
| 223 |
+
- useful for comparing molecules within the same submitted list
|
| 224 |
+
- not meaningful as a global absolute “quality” number across unrelated lists
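
The reading order above can be sketched in a few lines: sort by score within one submitted list, keep the top K, and treat the raw number only as a tie to the ranking. The scores below are the JAK2 example values from this card; the `shortlist` helper is hypothetical and independent of the project's API.

```python
def shortlist(scores_by_smiles, top_k):
    """Rank molecules within one submitted list (higher score first) and keep the top_k."""
    ranked = sorted(scores_by_smiles.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, smiles, score)
        for rank, (smiles, score) in enumerate(ranked[:top_k], start=1)
    ]


# Raw scores from the JAK2 example, in arbitrary input order.
jak2_scores = {
    "CCO": -41.44,
    "c1ccccc1": -31.24,
    "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1": -16.94,
    "CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1": -45.44,
}

top2 = shortlist(jak2_scores, top_k=2)
for rank, smiles, score in top2:
    print(rank, smiles, score)
```

Ranks produced this way are comparable only within the list that was scored together, so two separate submissions should not be merged by raw score.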
|
| 225 |
+
|
| 226 |
+
## Limitations
|
| 227 |
+
|
| 228 |
+
- The score is not a calibrated probability.
|
| 229 |
+
- The model does not predict exact potency.
|
| 230 |
+
- The benchmark holds out assays, not scaffolds, so it is not a universal test of generalization to unseen scaffolds.
|
| 231 |
+
- Public assay data contains label noise and heterogeneous assay protocols.
|
| 232 |
+
- Some assays remain difficult and produce only moderate ranking quality.
|
| 233 |
+
|
| 234 |
+
## Provenance
|
| 235 |
+
|
| 236 |
+
Project code:
|
| 237 |
+
- `https://github.com/lighteternal/bioassayalign-private`
|
| 238 |
+
|
| 239 |
+
Model files in this repo:
|
| 240 |
+
- `best_model.pt`
|
| 241 |
+
- `training_metadata.json`
|
| 242 |
+
- `training_summary.json`
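
After downloading, the weights file can be checked against the Git LFS pointer committed for `best_model.pt` (SHA-256 `e836ad1a…a18f`, size `19601186` bytes). The sketch below uses only the Python standard library; the helper names and the local path in the usage comment are hypothetical.

```python
import hashlib
import os

# Values copied from the Git LFS pointer for best_model.pt in this commit.
EXPECTED_SHA256 = "e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f"
EXPECTED_SIZE = 19601186


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_pointer(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True if the file's size and SHA-256 both match the pointer values."""
    return os.path.getsize(path) == expected_size and sha256_of_file(path) == expected_sha256


# Usage (hypothetical local path):
# print(matches_lfs_pointer("data/hf_compat_model_check/best_model.pt"))
```

A mismatch usually means the download was truncated or that the LFS pointer text file was fetched instead of the actual weights.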
|
| 243 |
+
|
| 244 |
+
If the public assay-ranking model is updated later, this repo keeps the same public identity while the internal experiment lineage can continue separately.
|
best_model.pt
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f
|
| 3 |
+
size 19601186
|
training_metadata.json
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
{"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"assay_batch_size":64,"assay_metadata_dim":128,"assay_model_name":"Qwen/Qwen3-Embedding-0.6B","assay_task_description":"Given a bioassay description and metadata, represent the assay for ranking compatible small molecules.","batch_size":192,"dropout":0.12,"early_stopping_min_delta":0.001,"early_stopping_patience":5,"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"hidden_dim":1024,"learning_rate":0.0015,"log_every_steps":50,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","max_epochs":30,"max_train_groups":0,"molecule_transformer_batch_size":64,"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","output_dir":"outputs/hf-compatibility-20260309-082913","precomputed_dir":"data/hf_compatibility_precomputed/prepared/compat_a10_precomputed_v2_default","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","projection_dim":512,"seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_assay_metadata_features":true,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true,"weight_decay":0.0001},"created_at":"2026-03-09T06:35:22.517011+00:00","feature_counts":{"assays":11195,"molecule_dim":4293,"molecules":485576,"train_groups":508216},"framework":"pytorch_head_only_compatibility_ranking","manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","molecule_feature_spec":{"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.245388984680176,
3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"precomputed_manifest":{"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","output_dir":"outputs/hf_compatibility_precomputed/20260309-002930","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_chirality":true,"use_maccs":true,"
use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true},"created_at":"2026-03-09T02:40:11.768169+00:00","feature_spec":{"bit_dim":4263,"dense_dim":30,"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.245388984680176,3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","packed_bit_dim":533,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"file_hashes":{"
compat_indexed_candidate_pools.parquet":"ffe57259ad1b1e95e8ebe44de4e68aadb31a3a120c99399b356114fcf45f3c42","compat_indexed_train_rows.parquet":"35a21bc5c90ea3742f665de5e224392bce5d4489854063798227e9171f52a54c","compat_molecule_bit_features_packed.npy":"108211cad5f1098d58c54b55b5bbf0a5786a895dee7ba1b5c01cb8f34a1db905","compat_molecule_dense_features.npy":"6f525724b255b0fa4fb43f3bb97ce04eef8bef04365f28313e0e4e6c7707cf8e","compat_molecules.parquet":"457618ae03f36f9dc34752294356ff25acd18f750a51f9761f3ae6530eb10794"},"framework":"compatibility_cpu_precompute_v1","prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ranki
ng_v1_sharded_merge","thresholds":{"max_actives_per_assay":48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}},"row_counts":{"indexed_candidate_pools":1432532,"indexed_train_rows":508216,"molecules":485576},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b"},"prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ranking_v1_sharded_merge","thresholds":{"max_actives_per_assay"
:48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}}}
|
training_summary.json
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
{"best_epoch":9,"best_metrics":{"epoch":9,"event":"compatibility_eval","step":23823,"train_loss":0.07582188404882256,"val_assays":1117.0,"val_hit_at_10":0.97224709042077,"val_mean_auprc":0.6214489194406888,"val_mean_auroc":0.77668220123392,"val_mean_ndcg50":0.7140369202520261,"val_random_auprc_baseline":0.26780584983601613},"best_val_mean_auprc":0.6214489194406888,"created_at":"2026-03-09T07:00:12.341166+00:00","event":"compatibility_train_complete","metadata_path":"outputs/hf-compatibility-20260309-082913/training_metadata.json","output_dir":"outputs/hf-compatibility-20260309-082913","test_metrics":{"test_assays":1111.0,"test_hit_at_10":0.9738973897389739,"test_mean_auprc":0.6339480076718866,"test_mean_auroc":0.7814796702989308,"test_mean_ndcg50":0.7250422425217068,"test_random_auprc_baseline":0.27485643596723147},"train_groups":508216}
|