---
tags:
- bioassay
- chemistry
- drug-discovery
- ranking
- assay-conditioning
- rdkit
- qwen
library_name: pytorch
license: mit
---

# BioAssayAlign Qwen3-Embedding-0.6B Compatibility

<p align="center">
  <img src="./bioassayalign.png" alt="BioAssayAlign logo" width="280">
</p>

## What this model is

BioAssayAlign is an **assay-conditioned small-molecule ranking model**.

It takes:
- one assay definition
- a submitted list of candidate SMILES

and returns:
- one compatibility score per candidate
- a ranked shortlist for that assay

This model is designed to answer a practical question:

> Given this assay, which molecules in my current candidate list should I screen first?

It is **not**:
- a chatbot
- a generative chemistry model
- a direct potency regressor
- a calibrated probability model

## Companion dataset

Public dataset:
- [BioAssayAlign Assay-Compound Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)

The published model was trained on the prepared compatibility-ranking subset inside that dataset release.

## Intended use

Use this model when you already have a candidate set and want a ranking signal for one assay at a time.

Reasonable uses:
- shortlist triage before wet-lab screening
- retrospective ranking experiments
- assay-conditioned ranking features in a downstream workflow

Not reasonable uses:
- reading the raw score as a probability of success
- predicting exact IC50 / EC50 / Ki values
- comparing raw scores across unrelated runs as if they were globally calibrated

## How to run it locally

This repository is self-contained for inference. You do **not** need the original training codebase to run the published model.

### Install

```bash
python -m pip install -r requirements.txt
```

### Minimal local example

```python
from bioassayalign_compatibility import (
    AssayQuery,
    load_compatibility_model_from_hub,
    rank_compounds,
    serialize_assay_query,
)

# Downloads the published weights from the Hub.
model = load_compatibility_model_from_hub(
    "lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility"
)

# Serialize the structured assay definition into the model's text format.
assay_text = serialize_assay_query(
    AssayQuery(
        title="JAK2 inhibition assay",
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
        organism="Homo sapiens",
        readout="luminescence",
        assay_format="cell-based",
        assay_type="inhibition",
        target_uniprot=["O60674"],
    )
)

# Score and rank the candidate SMILES against that assay.
results = rank_compounds(
    model,
    assay_text=assay_text,
    smiles_list=[
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
        "c1ccccc1",
        "CCO",
    ],
)

for row in results:
    print(row)
```
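
Each returned row pairs a candidate with its compatibility score, so turning the results into a shortlist is just a descending sort. A self-contained sketch (the `shortlist` helper and the scores below are illustrative, not part of the published API):

```python
def shortlist(scored, k=10):
    """Sort (smiles, score) pairs by descending score and keep the top k."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Illustrative scores only -- not real model output.
scored = [
    ("CCO", -12.87),
    ("CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1", 6.26),
    ("c1ccccc1", -8.65),
]
top2 = shortlist(scored, k=2)  # highest-scoring candidates first
```
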
### What to provide

Best practice:
- provide structured assay fields rather than one free-form paragraph
- include target, readout, organism, and format when known
- submit one parent or cleaned SMILES per candidate

Recommended assay fields:
- `title`
- `description`
- `organism`
- `readout`
- `assay_format`
- `assay_type`
- `target_uniprot`

The model is reasonably robust to wording changes, but missing metadata can reduce ranking quality.
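
One common way to produce the "one parent or cleaned SMILES per candidate" input is RDKit's standardization utilities. A hedged sketch: this mirrors the recommendation above, not the model's internal preprocessing, which is not published.

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def clean_smiles(smiles: str) -> Optional[str]:
    """Parse, keep the parent (largest) fragment, and canonicalize.

    Returns None for unparseable input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    parent = rdMolStandardize.FragmentParent(mol)  # drops salts / counter-ions
    return Chem.MolToSmiles(parent)


print(clean_smiles("CCO.Cl"))  # hydrochloride stripped -> "CCO"
```
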
## Model details

Published artifact configuration:

| Component | Value |
|---|---|
| Assay encoder | `Qwen/Qwen3-Embedding-0.6B` |
| Assay encoder training | Frozen |
| Assay metadata features | Enabled, `128` dims |
| Molecule features | Morgan fingerprints (`r=2,3`, `2048` bits each), chirality, `MACCS`, `30` RDKit descriptors |
| Projection dimension | `512` |
| Hidden dimension | `1024` |
| Dropout | `0.12` |
| Final score | Learned compatibility head output |

Important:
- the published score is **not** a raw embedding dot product
- the ranking comes from the learned scorer head
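
The molecule-feature row can be approximated with standard RDKit calls. A sketch assuming the stated fingerprint settings; the exact 30-descriptor list is not published, so two common descriptors stand in as placeholders:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors, MACCSkeys


def featurize(smiles: str) -> np.ndarray:
    """Concatenate Morgan(r=2) + Morgan(r=3) bits, MACCS keys, and descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    parts = []
    for radius in (2, 3):
        fp = AllChem.GetMorganFingerprintAsBitVect(
            mol, radius, nBits=2048, useChirality=True
        )
        arr = np.zeros(2048, dtype=np.float32)
        DataStructs.ConvertToNumpyArray(fp, arr)
        parts.append(arr)
    maccs = np.zeros(167, dtype=np.float32)  # MACCS vectors are 167 bits
    DataStructs.ConvertToNumpyArray(MACCSkeys.GenMACCSKeys(mol), maccs)
    parts.append(maccs)
    # Placeholder descriptors; the published 30-descriptor list is unknown.
    parts.append(
        np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol)], dtype=np.float32)
    )
    return np.concatenate(parts)


vec = featurize("CCO")  # 2048 + 2048 + 167 + 2 = 4265 features
```
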
## Training data

The public artifact was trained on a frozen assay-compound corpus derived from:
- PubChem BioAssay
- ChEMBL

The published model uses the prepared compatibility-ranking subset from:
- [lighteternal/BioAssayAlign-Assay-Compound-Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)

### Prepared training dataset

| Field | Value |
|---|---:|
| Assays | `11,195` |
| Candidate-pool rows | `1,432,532` |
| Training groups | `508,216` |
| Train assays | `8,967` |
| Validation assays | `1,117` |
| Test assays | `1,111` |

### Preparation rules

| Rule | Value |
|---|---:|
| Minimum actives per assay | `4` |
| Minimum inactives per assay | `16` |
| Maximum actives per assay | `48` |
| Maximum inactives per assay | `192` |
| Molecule standardization | Enabled |
| Source manifest SHA256 | `e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b` |

Each training group contains:
- one assay
- one positive compound
- multiple explicit same-assay inactive compounds

This is a **ranking** setup, not a generic text-retrieval setup.
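
A group of this shape (one positive plus N same-assay inactives) is typically trained with a listwise softmax cross-entropy, where the positive must out-score its negatives. The model's actual loss is not published; this is an illustrative sketch:

```python
import torch
import torch.nn.functional as F


def listwise_ranking_loss(pos_score: torch.Tensor,
                          neg_scores: torch.Tensor) -> torch.Tensor:
    """Softmax cross-entropy over [positive | negatives] for each group.

    pos_score: (B,) scores for each group's positive compound.
    neg_scores: (B, N) scores for the same-assay inactives.
    The positive always sits at index 0, so the target class is 0.
    """
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (B, 1+N)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)


loss = listwise_ranking_loss(
    torch.tensor([2.0, 1.0]),            # positives' scores
    torch.tensor([[0.1, -0.5, 0.0],      # same-assay inactives, group 1
                  [1.5, 0.2, -1.0]]),    # same-assay inactives, group 2
)
```
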
## Training configuration

| Field | Value |
|---|---:|
| Framework | `pytorch_head_only_compatibility_ranking` |
| Learning rate | `1.5e-3` |
| Batch size | `192` |
| Weight decay | `1e-4` |
| Hard-negative fraction | `0.5` |
| Negatives per example | `15` |
| Negative sets per positive | `2` |
| Max epochs | `30` |
| Early stopping patience | `5` |
| Early stopping min delta | `0.001` |
| Best epoch | `9` |
## Results

### Main evaluation

| Split | Mean AUPRC | Random-baseline AUPRC | Hit@10 | Mean AUROC | Mean nDCG@50 |
|---|---:|---:|---:|---:|---:|
| Validation | `0.6214` | `0.2678` | `0.9722` | `0.7767` | `0.7140` |
| Test | `0.6339` | `0.2749` | `0.9739` | `0.7815` | `0.7250` |

Interpretation:
- the model materially beats the random ranking baseline
- it is strongest as a **within-list ranking tool**
- the main output to trust is the ranking order and shortlist separation, not the raw score magnitude
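
For context, the random-baseline AUPRC of a candidate list is simply its fraction of actives, which is what a random ordering achieves in expectation. A small pure-Python sketch of both quantities:

```python
def average_precision(labels, scores):
    """AUPRC via average precision: mean precision at each active's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
baseline = sum(labels) / len(labels)    # fraction of actives = 0.4
```
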
## Score interpretation

The raw output is a learned, logit-like **compatibility score**.

What it means:
- higher is better
- differences are meaningful **within the same submitted list**
- absolute values are **not** calibrated across unrelated runs

Example:
- candidate A score: `6.25`
- candidate B score: `-8.65`
- candidate C score: `-23.37`

This does **not** mean A has a literal probability or potency attached to it. It means A ranked substantially above B and C for that submitted assay and candidate set.

For user-facing interpretation, the recommended order is:
1. rank
2. relative shortlist score within the submitted list
3. chemistry context columns
4. raw model score only for debugging or export

If you want a normalized within-list view, you can compute:
- min-max scaling to `0–100`
- or softmax over the submitted list

Those are still **not** calibrated biological probabilities.
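
Both normalizations are one-liners. Neither turns scores into calibrated probabilities; they only make the within-list spread easier to read:

```python
import math


def minmax_0_100(scores):
    """Rescale within-list scores to 0-100; a constant list collapses to 50."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [50.0 for _ in scores]
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def softmax(scores):
    """Within-list softmax, stabilized by subtracting the max score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scaled = minmax_0_100([6.25, -8.65, -23.37])  # [100.0, ~49.7, 0.0]
probs = softmax([6.25, -8.65, -23.37])        # sums to 1.0 within the list
```
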
## Example predictions

These examples were produced from the published weights.

### Example: JAK2 cell assay

Assay:
- title: `JAK2 inhibition assay`
- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
- organism: `Homo sapiens`
- readout: `luminescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `O60674`

| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` | `6.2590` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-8.6542` |
| 3 | `CCO` | `-12.8678` |
| 4 | `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` | `-23.3741` |

### Example: ALDH1A1 fluorescence assay

Assay:
- title: `ALDH1A1 inhibition assay`
- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
- organism: `Homo sapiens`
- readout: `fluorescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `P00352`

| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CCOc1ccccc1` | `-26.9257` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-38.5073` |
| 3 | `CCN(CC)CCOc1ccccc1` | `-39.1753` |
| 4 | `CCO` | `-42.9016` |

## Limitations

- The score is not a calibrated probability.
- The model does not predict exact potency values.
- The benchmark is assay-held-out, not a universal unseen-scaffold benchmark.
- Public assay data is noisy and assay protocols are heterogeneous.
- Some assays remain difficult and yield only moderate separation.
- Use the model as a ranking aid, not as a stand-alone medicinal chemistry decision system.

## Repository contents

Files provided in this HF model repo:
- `best_model.pt`
- `training_metadata.json`
- `training_summary.json`
- `bioassayalign_compatibility.py`
- `requirements.txt`

## Interactive Space

Use the model in the companion Space:

- [BioAssayAlign Compatibility Explorer](https://huggingface.co/spaces/lighteternal/BioAssayAlign-Compatibility-Explorer)
|