---
tags:
- bioassay
- chemistry
- drug-discovery
- ranking
- assay-conditioning
- rdkit
- qwen
library_name: pytorch
license: mit
---
# BioAssayAlign Qwen3-Embedding-0.6B Compatibility
<p align="center">
<img src="./bioassayalign.png" alt="BioAssayAlign logo" width="280">
</p>
## What this model is
BioAssayAlign is an **assay-conditioned small-molecule ranking model**.
It takes:
- one assay definition
- a submitted list of candidate SMILES
and returns:
- one compatibility score per candidate
- a ranked shortlist for that assay
This model is designed to answer a practical question:
> Given this assay, which molecules in my current candidate list should I screen first?
It is **not**:
- a chatbot
- a generative chemistry model
- a direct potency regressor
- a calibrated probability model
## Companion dataset
Public dataset:
- [BioAssayAlign Assay-Compound Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)
The published model was trained on the prepared compatibility-ranking subset inside that dataset release.
## Intended use
Use this model when you already have a candidate set and want a ranking signal for one assay at a time.
Reasonable uses:
- shortlist triage before wet-lab screening
- retrospective ranking experiments
- assay-conditioned ranking features in a downstream workflow
Not reasonable uses:
- reading the raw score as a probability of success
- predicting exact IC50 / EC50 / Ki values
- comparing raw scores across unrelated runs as if they were globally calibrated
## How to run it locally
This repository is self-contained for inference. You do **not** need the original training codebase to run the published model.
### Install
```bash
python -m pip install -r requirements.txt
```
### Minimal local example
```python
from bioassayalign_compatibility import (
    AssayQuery,
    load_compatibility_model_from_hub,
    rank_compounds,
    serialize_assay_query,
)

model = load_compatibility_model_from_hub(
    "lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility"
)

assay_text = serialize_assay_query(
    AssayQuery(
        title="JAK2 inhibition assay",
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
        organism="Homo sapiens",
        readout="luminescence",
        assay_format="cell-based",
        assay_type="inhibition",
        target_uniprot=["O60674"],
    )
)

results = rank_compounds(
    model,
    assay_text=assay_text,
    smiles_list=[
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
        "c1ccccc1",
        "CCO",
    ],
)

for row in results:
    print(row)
```
### What to provide
Best practice:
- provide structured assay fields rather than one free-form paragraph
- include target, readout, organism, and format when known
- submit one parent or cleaned SMILES per candidate (a standardization sketch follows this section)
Recommended assay fields:
- `title`
- `description`
- `organism`
- `readout`
- `assay_format`
- `assay_type`
- `target_uniprot`
The model is reasonably robust to wording changes, but missing metadata can reduce ranking quality.
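If your candidates still carry salts or counterions, a quick RDKit pass can produce one parent SMILES per candidate. This is a minimal sketch, assuming RDKit's standard `rdMolStandardize` utilities; the repository's own standardization step (see `bioassayalign_compatibility.py`) may differ in detail.
```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def to_parent_smiles(smiles: str) -> Optional[str]:
    """Return one canonical parent SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Keep the largest fragment (drops salts and counterions).
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where chemically possible.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)

print(to_parent_smiles("CCO.Cl"))  # -> CCO
```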
## Model details
Published artifact configuration:
| Component | Value |
|---|---|
| Assay encoder | `Qwen/Qwen3-Embedding-0.6B` |
| Assay encoder training | Frozen |
| Assay metadata features | Enabled, `128` dims |
| Molecule features | Morgan fingerprints (`r=2,3`, `2048` bits each), chirality, `MACCS`, `30` RDKit descriptors |
| Projection dimension | `512` |
| Hidden dimension | `1024` |
| Dropout | `0.12` |
| Final score | Learned compatibility head output |
Important:
- the published score is **not** a raw embedding dot product
- the ranking comes from the learned scorer head
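The exact molecule featurization ships in `bioassayalign_compatibility.py`. As an illustration only, the fingerprint blocks named in the table can be reproduced with standard RDKit calls roughly as below; the 30-descriptor list is repo-specific, so the sketch uses a placeholder subset.
```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys, Descriptors

def molecule_features(smiles: str) -> np.ndarray:
    """Rough sketch of the feature blocks named in the table above."""
    mol = Chem.MolFromSmiles(smiles)
    # Morgan fingerprints at radius 2 and 3, 2048 bits each, with chirality.
    fp_r2 = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=2048, useChirality=True
    )
    fp_r3 = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=3, nBits=2048, useChirality=True
    )
    maccs = MACCSkeys.GenMACCSKeys(mol)  # 167-bit MACCS keys
    # Placeholder for the 30 RDKit descriptors; the repo defines its own list.
    descs = np.array(
        [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]
    )
    return np.concatenate(
        [np.array(fp_r2), np.array(fp_r3), np.array(maccs), descs]
    )
```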
## Training data
The public artifact was trained on a frozen assay-compound corpus derived from:
- PubChem BioAssay
- ChEMBL
The published model uses the prepared compatibility-ranking subset from:
- [lighteternal/BioAssayAlign-Assay-Compound-Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)
### Prepared training dataset
| Field | Value |
|---|---:|
| Assays | `11,195` |
| Candidate-pool rows | `1,432,532` |
| Training groups | `508,216` |
| Train assays | `8,967` |
| Validation assays | `1,117` |
| Test assays | `1,111` |
### Preparation rules
| Rule | Value |
|---|---:|
| Minimum actives per assay | `4` |
| Minimum inactives per assay | `16` |
| Maximum actives per assay | `48` |
| Maximum inactives per assay | `192` |
| Molecule standardization | Enabled |
| Source manifest SHA256 | `e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b` |
Each training group contains:
- one assay
- one positive compound
- multiple explicit same-assay inactive compounds
This is a **ranking** setup, not a generic text-retrieval setup.
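The training loss is not spelled out here, but a common choice for this one-positive-versus-many-inactives grouping is a listwise softmax cross-entropy over the per-group scores. A minimal PyTorch sketch under that assumption:
```python
import torch
import torch.nn.functional as F

def group_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """Listwise softmax cross-entropy for one training group.

    scores: shape (1 + num_negatives,), where index 0 is the positive
    compound's compatibility score and the rest are same-assay inactives.
    """
    target = torch.zeros(1, dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(scores.unsqueeze(0), target)

# Toy group: one positive plus 15 negatives (cf. "Negatives per example" below).
scores = torch.cat([torch.tensor([4.0]), torch.randn(15) - 2.0])
print(group_ranking_loss(scores).item())
```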
## Training configuration
| Field | Value |
|---|---:|
| Framework | `pytorch_head_only_compatibility_ranking` |
| Learning rate | `1.5e-3` |
| Batch size | `192` |
| Weight decay | `1e-4` |
| Hard-negative fraction | `0.5` |
| Negatives per example | `15` |
| Negative sets per positive | `2` |
| Max epochs | `30` |
| Early stopping patience | `5` |
| Early stopping min delta | `0.001` |
| Best epoch | `9` |
## Results
### Main evaluation
| Split | Mean AUPRC | Random-baseline AUPRC | Hit@10 | Mean AUROC | Mean nDCG@50 |
|---|---:|---:|---:|---:|---:|
| Validation | `0.6214` | `0.2678` | `0.9722` | `0.7767` | `0.7140` |
| Test | `0.6339` | `0.2749` | `0.9739` | `0.7815` | `0.7250` |
Interpretation:
- the model materially beats the random ranking baseline (see the metric sketch after this list)
- it is strongest as a **within-list ranking tool**
- the main output to trust is the ranking order and shortlist separation, not the raw score magnitude
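For context on the baseline column: the random-ranking AUPRC of a list equals its positive prevalence, so per-assay metrics like those above can be reproduced with scikit-learn (assumed here as an extra dependency) along these lines:
```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def assay_metrics(labels: np.ndarray, scores: np.ndarray) -> dict:
    """Per-assay metrics; averaging over assays gives table-style values."""
    return {
        "auprc": average_precision_score(labels, scores),
        # Random-baseline AUPRC equals the positive prevalence of the list.
        "random_auprc": labels.mean(),
        "auroc": roc_auc_score(labels, scores),
    }

labels = np.array([1, 0, 0, 1, 0, 0, 0, 0])
scores = np.array([6.3, -8.7, -12.9, 2.1, -23.4, -5.0, -9.9, -14.2])
print(assay_metrics(labels, scores))
```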
## Score interpretation
The raw output is a learned **compatibility logit-like score**.
What it means:
- higher is better
- differences are meaningful **within the same submitted list**
- absolute values are **not** calibrated across unrelated runs
Example:
- candidate A score: `6.25`
- candidate B score: `-8.65`
- candidate C score: `-23.37`
This does **not** mean A has a literal probability or potency attached to it. It means A ranked substantially above B and C for that submitted assay and candidate set.
For user-facing interpretation, the recommended order is:
1. rank
2. relative shortlist score within the submitted list
3. chemistry context columns
4. raw model score only for debugging or export
If you want a normalized within-list view, you can compute either of the following (sketched after this list):
- min-max scaling to `0–100`
- or softmax over the submitted list
Those are still **not** calibrated biological probabilities.
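A minimal sketch of both within-list views, using the JAK2 example scores from the tables below:
```python
import numpy as np

def minmax_0_100(scores: np.ndarray) -> np.ndarray:
    """Scale raw within-list scores to 0-100 (best candidate maps to 100)."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.full_like(scores, 50.0)
    return 100.0 * (scores - lo) / (hi - lo)

def list_softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax over the submitted list; sums to 1 but is NOT calibrated."""
    z = scores - scores.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([6.2590, -8.6542, -12.8678, -23.3741])  # JAK2 example
print(minmax_0_100(scores))
print(list_softmax(scores))
```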
## Example predictions
These examples were produced from the published weights.
### Example: JAK2 cell assay
Assay:
- title: `JAK2 inhibition assay`
- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
- organism: `Homo sapiens`
- readout: `luminescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `O60674`
| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` | `6.2590` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-8.6542` |
| 3 | `CCO` | `-12.8678` |
| 4 | `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` | `-23.3741` |
### Example: ALDH1A1 fluorescence assay
Assay:
- title: `ALDH1A1 inhibition assay`
- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
- organism: `Homo sapiens`
- readout: `fluorescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `P00352`
| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CCOc1ccccc1` | `-26.9257` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-38.5073` |
| 3 | `CCN(CC)CCOc1ccccc1` | `-39.1753` |
| 4 | `CCO` | `-42.9016` |
## Limitations
- The score is not a calibrated probability.
- The model does not predict exact potency values.
- The benchmark is assay-held-out, not a universal unseen-scaffold benchmark.
- Public assay data is noisy and assay protocols are heterogeneous.
- Some assays remain difficult and yield only moderate separation.
- Use the model as a ranking aid, not as a stand-alone medicinal chemistry decision system.
## Repository contents
Files provided in this HF model repo:
- `best_model.pt`
- `training_metadata.json`
- `training_summary.json`
- `bioassayalign_compatibility.py`
- `requirements.txt`
## Interactive Space
Use the model in the companion Space:
- [BioAssayAlign Compatibility Explorer](https://huggingface.co/spaces/lighteternal/BioAssayAlign-Compatibility-Explorer)