Publish broader prepared-data compatibility model as stable public artifact

Files changed (4)

README.md +244 -0
best_model.pt +3 -0
training_metadata.json +1 -0
training_summary.json +1 -0

README.md ADDED

	@@ -0,0 +1,244 @@

+---
+tags:
+- bioassay
+- chemistry
+- drug-discovery
+- ranking
+- assay-conditioning
+- rdkit
+- qwen
+library_name: pytorch
+license: mit
+---
+# BioAssayAlign Qwen3-Embedding-0.6B Compatibility
+BioAssayAlign is an assay-conditioned small-molecule ranking model.
+Given:
+- an assay description with optional metadata
+- a list of candidate SMILES strings
+it returns:
+- one compatibility score per molecule
+- a ranked shortlist for that assay
+This is a ranking model. It is not a chatbot, not a generative chemistry model, and not a direct potency regressor.
+## What This Model Is For
+Use it when you already have a candidate list and want to answer:
+> For this assay, which molecules should I screen first?
+Reasonable uses:
+- triage a shortlist before wet-lab work
+- compare candidate sets against the same assay
+- add a ranking feature to a downstream discovery workflow
+Not reasonable uses:
+- treating the score as a probability of success
+- predicting exact IC50 / EC50 / Ki
+- using it as a stand-alone medicinal chemistry decision engine
+## Model Design
+This artifact uses:
+- assay encoder: `Qwen/Qwen3-Embedding-0.6B` (frozen)
+- molecule representation:
+  - Morgan fingerprints, radii `2` and `3`, `2048` bits each
+  - chirality-aware fingerprints
+  - MACCS keys
+  - `30` RDKit descriptors
+- compatibility head:
+  - trainable projection layers
+  - metadata features on the assay side
+  - scorer MLP
+The final score comes from the learned compatibility head. It is not just a raw embedding dot product.
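The feature list above can be cross-checked against the `feature_spec` values recorded in `training_metadata.json` (`bit_dim` 4263, `dense_dim` 30, `molecule_dim` 4293, `packed_bit_dim` 533). The sketch below is arithmetic only and does not reproduce the project's featurization code. It assumes the MACCS keys are stored as RDKit's usual 167-bit vector, and it treats chirality as a flag on the Morgan fingerprints rather than an extra block of bits; that reading is an assumption, though it matches the recorded totals.

```python
import math

# Values stated in the README and in training_metadata.json.
FINGERPRINT_BITS = 2048
FINGERPRINT_RADII = [2, 3]
NUM_RDKIT_DESCRIPTORS = 30

# Assumption: MACCS keys are emitted as a 167-bit vector (RDKit's convention).
MACCS_BITS = 167


def molecule_feature_dims():
    """Return (bit_dim, dense_dim, molecule_dim, packed_bit_dim)."""
    bit_dim = FINGERPRINT_BITS * len(FINGERPRINT_RADII) + MACCS_BITS
    dense_dim = NUM_RDKIT_DESCRIPTORS
    molecule_dim = bit_dim + dense_dim
    # Bits packed eight to a byte, rounded up to a whole byte.
    packed_bit_dim = math.ceil(bit_dim / 8)
    return bit_dim, dense_dim, molecule_dim, packed_bit_dim


dims = molecule_feature_dims()
print(dims)  # (4263, 30, 4293, 533)
```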
+## Training Data
+Training uses a frozen public assay-compound corpus derived from:
+- PubChem BioAssay
+- ChEMBL
+The published artifact was trained on a prepared compatibility subset with:
+- assays: `11,195`
+- candidate-pool rows: `1,432,532`
+- training groups: `508,216`
+Held-out assay split:
+- train assays: `8,967`
+- validation assays: `1,117`
+- test assays: `1,111`
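As a quick sanity check on the counts above, the three assay splits should sum to the total number of assays, and the candidate-pool rows divided by the assay count should give the average pool size. The sketch below uses only the numbers quoted in this README; the variable names are illustrative.

```python
# Counts copied from the README (they also appear in the prepared manifest).
split_counts = {"train": 8967, "val": 1117, "test": 1111}
total_assays = 11195
candidate_pool_rows = 1432532

# The held-out split partitions the assays, so the parts must sum to the whole.
assert sum(split_counts.values()) == total_assays

fractions = {name: count / total_assays for name, count in split_counts.items()}
avg_candidates_per_assay = candidate_pool_rows / total_assays

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
print(f"average candidates per assay: {avg_candidates_per_assay:.2f}")
```

This works out to roughly an 80/10/10 split by assay, with about 128 candidates per assay.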
+Each training group contains:
+- one assay
+- one active compound
+- multiple explicit same-assay inactive compounds
+This means the model is trained for assay-conditioned ranking, not generic text retrieval.
+## Main Results
+Best validation checkpoint:
+- validation mean AUPRC: `0.6214`
+- best epoch: `9`
+Held-out test metrics:
+- test mean AUPRC: `0.6339`
+- test random-baseline AUPRC: `0.2749`
+- test hit@10: `0.9739`
+- test mean AUROC: `0.7815`
+- test mean nDCG@50: `0.7250`
+Interpretation:
+- on held-out test assays, mean AUPRC is about 2.3 times the random-ranking baseline (`0.6339` vs `0.2749`)
+- it is strongest as a relative ranking tool over the submitted candidate list
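To make the ranking metrics above concrete, here is a minimal sketch of average precision and hit@k on a toy ranked list. The evaluation code behind the reported numbers is not part of this repo, so both function definitions are assumptions: average precision stands in for AUPRC, hit@k is taken to mean "at least one active in the top k", and the toy labels are invented for illustration. Under random ranking, average precision is close to the fraction of actives in the list, which is why a random baseline is reported next to the model's AUPRC.

```python
def average_precision(labels_ranked):
    """Average precision for binary labels ordered from best to worst score."""
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def hit_at_k(labels_ranked, k):
    """Return 1.0 if at least one active appears in the top k, else 0.0."""
    return 1.0 if any(labels_ranked[:k]) else 0.0


# Hypothetical ranked list: 1 = active, 0 = inactive.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
ap = average_precision(toy_labels)
active_fraction = sum(toy_labels) / len(toy_labels)

print(f"average precision: {ap:.4f}")
print(f"active fraction (approximate random baseline): {active_fraction:.4f}")
print(f"hit@3: {hit_at_k(toy_labels, 3)}")

# Lift of the reported test AUPRC over the reported random baseline.
reported_lift = 0.6339 / 0.2749
print(f"reported lift: {reported_lift:.2f}x")
```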
+## Real Example Predictions
+These were produced locally from the published weights and metadata only.
+### Example 1: JAK2 cell-based assay
+Assay:
+- title: `JAK2 inhibition assay`
+- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
+- organism: `Homo sapiens`
+- readout: `luminescence`
+- assay format: `cell-based`
+- assay type: `inhibition`
+- target UniProt: `O60674`
+Candidate list ranked by the model:
+1. `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` → `-16.94`
+2. `c1ccccc1` → `-31.24`
+3. `CCO` → `-41.44`
+4. `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` → `-45.44`
+### Example 2: ALDH1A1 fluorescence assay
+Assay:
+- title: `ALDH1A1 inhibition assay`
+- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
+- organism: `Homo sapiens`
+- readout: `fluorescence`
+- assay format: `cell-based`
+- assay type: `inhibition`
+- target UniProt: `P00352`
+Candidate list ranked by the model:
+1. `CCOc1ccccc1` → `-30.87`
+2. `CCN(CC)CCOc1ccccc1` → `-34.09`
+3. `Cc1cc(=O)n(C)c(=O)[nH]1` → `-34.33`
+4. `CCO` → `-37.07`
+The raw values above are model scores. In practice, read them as list-relative ranking values, not calibrated probabilities.
+## How To Run It Locally
+### Minimal local check from this repo
+This downloads only the model weights and metadata, not the raw assay dataset.
+```bash
+MODEL_REPO_ID='lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility' \
+LOCAL_MODEL_DIR='data/hf_compat_model_check' \
+bash scripts/score_compatibility_from_hf.sh \
+  --assay-title 'JAK2 inhibition assay' \
+  --description 'Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.' \
+  --organism 'Homo sapiens' \
+  --readout 'luminescence' \
+  --assay-format 'cell-based' \
+  --assay-type 'inhibition' \
+  --target-uniprot O60674 \
+  --smiles 'CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1' \
+  --smiles 'c1ccccc1' \
+  --smiles 'CCO'
+```
+### Python usage
+```python
+from bioassayalign.compat_inference import (
+    AssayQuery,
+    load_compatibility_model,
+    rank_compounds,
+    serialize_assay_query,
+)
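+# Load the published weights and metadata from a local model directory.
+model = load_compatibility_model("/path/to/model_dir")
+# Serialize the structured assay fields into a single assay text.
+assay_text = serialize_assay_query(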
+    AssayQuery(
+        title="JAK2 inhibition assay",
+        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
+        organism="Homo sapiens",
+        readout="luminescence",
+        assay_format="cell-based",
+        assay_type="inhibition",
+        target_uniprot=["O60674"],
+    )
+)
+# Score every candidate SMILES against the serialized assay text.
+results = rank_compounds(
+    model,
+    assay_text=assay_text,
+    smiles_list=[
+        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
+        "c1ccccc1",
+        "CCO",
+    ],
+)
+```
+## Input Guidance
+Best practice:
+- provide structured assay fields
+- use chemically sensible parent SMILES
+- include target UniProt IDs when the assay is target-defined
+Recommended assay fields:
+- title
+- description
+- organism
+- readout
+- assay format
+- assay type
+- target UniProt IDs
+The model is reasonably robust to wording changes, but missing metadata can still reduce quality.
+## How To Read The Output
+Use the output in this order:
+1. relative ranking within your submitted list
+2. top-K shortlist
+3. chemistry context columns such as molecular weight, logP, and TPSA
+4. raw model score only for debugging
+The score is:
+- useful for comparing molecules within the same submitted list
+- not meaningful as a global absolute “quality” number across unrelated lists
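The reading order above can be sketched with the raw scores from Example 1. The snippet sorts one submitted list by raw score (higher is better, as in the examples), keeps a top-K shortlist, and attaches a within-list z-score. The z-score is a hypothetical presentation aid, not something the model outputs, and like the raw score it should not be compared across unrelated lists.

```python
from statistics import mean, pstdev

# Raw scores from Example 1 (JAK2) in this README.
scored = {
    "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1": -16.94,
    "c1ccccc1": -31.24,
    "CCO": -41.44,
    "CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1": -45.44,
}


def rank_within_list(scores, top_k):
    """Sort by raw score (higher is better) and attach list-relative z-scores."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values()) or 1.0  # guard against identical scores
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    rows = [
        {"rank": rank, "smiles": smiles, "raw": raw, "z": (raw - mu) / sigma}
        for rank, (smiles, raw) in enumerate(ranked, start=1)
    ]
    return rows[:top_k]


shortlist = rank_within_list(scored, top_k=2)
for row in shortlist:
    print(row["rank"], row["smiles"], f"z={row['z']:+.2f}")
```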
+## Limitations
+- The score is not a calibrated probability.
+- The model does not predict exact potency.
+- The benchmark holds out whole assays, not molecular scaffolds, so test compounds may share scaffolds with training compounds.
+- Public assay data contains label noise and heterogeneous assay protocols.
+- Some assays remain difficult and produce only moderate ranking quality.
+## Provenance
+Project code:
+- `https://github.com/lighteternal/bioassayalign-private`
+Model files in this repo:
+- `best_model.pt`
+- `training_metadata.json`
+- `training_summary.json`
+If the public assay-ranking model is updated later, this repo will keep the same public identity, and the internal experiment lineage can continue separately.

best_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f
+size 19601186
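In a Git LFS pointer, `oid sha256:` is the SHA-256 digest of the real file and `size` is its length in bytes, so a downloaded `best_model.pt` can be checked against the pointer above. The helper below is a minimal sketch using only the standard library; the function names and the local file path are illustrative assumptions.

```python
import hashlib
from pathlib import Path

# Values copied from the LFS pointer above.
EXPECTED_SHA256 = "e836ad1a7fa6dc405a46ca47edc712fd5121ad639dbc51632773802cdcf2a18f"
EXPECTED_SIZE = 19601186


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True when both the byte size and the SHA-256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256


# Hypothetical local path; the check is skipped when the file is absent.
local_checkpoint = Path("best_model.pt")
if local_checkpoint.exists():
    print("pointer match:", matches_pointer(local_checkpoint))
```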

training_metadata.json ADDED

	@@ -0,0 +1 @@

+ {"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"assay_batch_size":64,"assay_metadata_dim":128,"assay_model_name":"Qwen/Qwen3-Embedding-0.6B","assay_task_description":"Given a bioassay description and metadata, represent the assay for ranking compatible small molecules.","batch_size":192,"dropout":0.12,"early_stopping_min_delta":0.001,"early_stopping_patience":5,"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"hidden_dim":1024,"learning_rate":0.0015,"log_every_steps":50,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","max_epochs":30,"max_train_groups":0,"molecule_transformer_batch_size":64,"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","output_dir":"outputs/hf-compatibility-20260309-082913","precomputed_dir":"data/hf_compatibility_precomputed/prepared/compat_a10_precomputed_v2_default","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","projection_dim":512,"seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_assay_metadata_features":true,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true,"weight_decay":0.0001},"created_at":"2026-03-09T06:35:22.517011+00:00","feature_counts":{"assays":11195,"molecule_dim":4293,"molecules":485576,"train_groups":508216},"framework":"pytorch_head_only_compatibility_ranking","manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","molecule_feature_spec":{"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.24538898468017
6,3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"precomputed_manifest":{"compat_prepared_manifest_sha256":"d90827950b1bdb4389264593f537d114274a6803654a8e54895c8de2bab5f75a","config":{"fingerprint_bits":2048,"fingerprint_radii":[2,3],"hard_negative_fraction":0.5,"manifest_path":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2/DATASET_MANIFEST.json","output_dir":"outputs/hf_compatibility_precomputed/20260309-002930","prepared_dir":"data/hf_compatibility_prepared/prepared/compat_a10_full_v2","seed":3407,"train_negative_sets_per_positive":2,"train_negatives_per_example":15,"use_chirality":true,"use_maccs":true
,"use_rdkit_descriptors":true,"use_train_pool_hard_negatives":true},"created_at":"2026-03-09T02:40:11.768169+00:00","feature_spec":{"bit_dim":4263,"dense_dim":30,"descriptor_mean":[393.9932861328125,3.3190696239471436,81.43094635009766,27.66873550415039,1.4853101968765259,5.253507137298584,5.344574451446533,3.4271998405456543,2.537726402282715,0.889473557472229,0.5931677222251892,0.30719807744026184,7.426565647125244,0.8948280215263367,1.0,0.008519778959453106,13.359742164611816,0.0011409131111577153,0.6497437953948975,3.245388984680176,3.0564217567443848,0.4622983932495117,0.009246749803423882,0.37894171476364136,0.21943630278110504,0.04746939614415169,0.003896403359249234,13.851603507995605,0.026932962238788605,0.08059912174940109],"descriptor_names":["mol_wt","logp","tpsa","heavy_atoms","hbd","hba","rot_bonds","ring_count","aromatic_rings","aliphatic_rings","saturated_rings","fraction_csp3","heteroatoms","amide_bonds","fragments","formal_charge","max_atomic_num","metal_atom_count","halogen_count","nitrogen_count","oxygen_count","sulfur_count","phosphorus_count","fluorine_count","chlorine_count","bromine_count","iodine_count","aromatic_atom_count","spiro_atoms","bridgehead_atoms"],"descriptor_std":[153.43994140625,1.6963293552398682,56.923919677734375,10.904720306396484,1.9788898229599,2.6205861568450928,4.454139709472656,1.338579535484314,1.1400214433670044,1.0691639184951782,0.9289003610610962,0.19423091411590576,3.905120611190796,1.4490379095077515,1.0,0.09879818558692932,6.657544136047363,0.03842087462544441,1.1370999813079834,2.269305467605591,2.4383604526519775,0.6704627275466919,0.11380217224359512,0.9962579607963562,0.5424068570137024,0.23386727273464203,0.07272558659315109,5.9197611808776855,0.17290985584259033,0.5305002331733704],"fingerprint_bits":2048,"fingerprint_radii":[2,3],"molecule_transformer_max_length":128,"molecule_transformer_model_name":"","packed_bit_dim":533,"use_chirality":true,"use_maccs":true,"use_rdkit_descriptors":true},"file_hashes":
{"compat_indexed_candidate_pools.parquet":"ffe57259ad1b1e95e8ebe44de4e68aadb31a3a120c99399b356114fcf45f3c42","compat_indexed_train_rows.parquet":"35a21bc5c90ea3742f665de5e224392bce5d4489854063798227e9171f52a54c","compat_molecule_bit_features_packed.npy":"108211cad5f1098d58c54b55b5bbf0a5786a895dee7ba1b5c01cb8f34a1db905","compat_molecule_dense_features.npy":"6f525724b255b0fa4fb43f3bb97ce04eef8bef04365f28313e0e4e6c7707cf8e","compat_molecules.parquet":"457618ae03f36f9dc34752294356ff25acd18f750a51f9761f3ae6530eb10794"},"framework":"compatibility_cpu_precompute_v1","prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ran
king_v1_sharded_merge","thresholds":{"max_actives_per_assay":48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}},"row_counts":{"indexed_candidate_pools":1432532,"indexed_train_rows":508216,"molecules":485576},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b"},"prepared_manifest":{"file_hashes":{"DATASET_MANIFEST.json":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","compat_assays.parquet":"7810ab6f35cc5f2b4fbfe5bade1c1fcf4b59668eb54e0cf273d12cab0e75c54f","compat_candidate_pools.parquet":"dcd3c4ede17c2746be2b2951630561361168336ebc0efd5464fc9c04b874416e","compat_train_groups.parquet":"c9366522d29c4588fed22e4acd825b43ce7897ef3ab103ccd893ac5565a9d5dd"},"prepared_at":"2026-03-08T18:02:14.604717+00:00","row_counts":{"compat_assays":11195,"compat_candidate_pools":1432532,"compat_train_groups":508216},"seed":3407,"selection_report":{"avg_candidates_per_assay":127.96176864671729,"candidate_pool_rows":1432532,"dropped_after_conflicts_or_caps":1594,"eligible_after_count_thresholds":11195,"mean_train_groups_per_train_assay":1.0,"selected_assays":11195,"total_conflicting_compounds_removed":38920,"train_groups":508216},"sharding":{"merged_from":["shard-00-of-08","shard-01-of-08","shard-02-of-08","shard-03-of-08","shard-04-of-08","shard-05-of-08","shard-06-of-08","shard-07-of-08"],"num_shards":8},"source_dataset_hashes":{"assays_sha256":"4b220df37625a4b006bb5232871c956c150648fb7eac0448b17067f76a06b7b5","measurements_sha256":"1c7cb702e7d694f4c4750e139f09280419ec9d5bcd115f9d0311dfe4c2985ade"},"source_manifest_sha256":"e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b","split_counts":{"test":1111,"train":8967,"val":1117},"strategy":"compatibility_ranking_v1_sharded_merge","thresholds":{"max_actives_per_assa
y":48,"max_inactives_per_assay":192,"max_train_groups":0,"min_actives_per_assay":4,"min_inactives_per_assay":16,"negative_sets_per_positive":2,"negatives_per_example":7,"standardize_molecules":true},"train_group_split_counts":{"test":0,"train":508216,"val":0}}}

training_summary.json ADDED

	@@ -0,0 +1 @@

+ {"best_epoch":9,"best_metrics":{"epoch":9,"event":"compatibility_eval","step":23823,"train_loss":0.07582188404882256,"val_assays":1117.0,"val_hit_at_10":0.97224709042077,"val_mean_auprc":0.6214489194406888,"val_mean_auroc":0.77668220123392,"val_mean_ndcg50":0.7140369202520261,"val_random_auprc_baseline":0.26780584983601613},"best_val_mean_auprc":0.6214489194406888,"created_at":"2026-03-09T07:00:12.341166+00:00","event":"compatibility_train_complete","metadata_path":"outputs/hf-compatibility-20260309-082913/training_metadata.json","output_dir":"outputs/hf-compatibility-20260309-082913","test_metrics":{"test_assays":1111.0,"test_hit_at_10":0.9738973897389739,"test_mean_auprc":0.6339480076718866,"test_mean_auroc":0.7814796702989308,"test_mean_ndcg50":0.7250422425217068,"test_random_auprc_baseline":0.27485643596723147},"train_groups":508216}
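The rounded figures quoted in the README can be checked against this summary. The sketch below embeds the relevant subset of `training_summary.json` as a string so it runs on its own; in practice the file would be read from disk. It also recovers the implied number of test assays with a top-10 hit, under the assumption that hit@10 is the mean of a per-assay 0/1 indicator.

```python
import json

# Subset of training_summary.json, copied from the file above.
summary = json.loads("""
{"best_epoch": 9,
 "best_val_mean_auprc": 0.6214489194406888,
 "test_metrics": {
   "test_assays": 1111.0,
   "test_hit_at_10": 0.9738973897389739,
   "test_mean_auprc": 0.6339480076718866,
   "test_mean_auroc": 0.7814796702989308,
   "test_mean_ndcg50": 0.7250422425217068,
   "test_random_auprc_baseline": 0.27485643596723147}}
""")

# Four-decimal figures quoted in the README.
readme_test_metrics = {
    "test_hit_at_10": 0.9739,
    "test_mean_auprc": 0.6339,
    "test_mean_auroc": 0.7815,
    "test_mean_ndcg50": 0.7250,
    "test_random_auprc_baseline": 0.2749,
}

# A quoted value is consistent if it is within half a unit of the 4th decimal.
TOLERANCE = 5e-5
mismatches = {
    key: (quoted, summary["test_metrics"][key])
    for key, quoted in readme_test_metrics.items()
    if abs(summary["test_metrics"][key] - quoted) > TOLERANCE
}
print("mismatches:", mismatches)

# Assumption: hit@10 is a mean of per-assay 0/1 outcomes.
test = summary["test_metrics"]
assays_with_hit = round(test["test_hit_at_10"] * test["test_assays"])
print("test assays with a top-10 hit:", assays_with_hit, "of", int(test["test_assays"]))
```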