---
tags:
- bioassay
- chemistry
- drug-discovery
- ranking
- assay-conditioning
- rdkit
- qwen
library_name: pytorch
license: mit
---

# BioAssayAlign Qwen3-Embedding-0.6B Compatibility

<p align="center">
  <img src="./bioassayalign.png" alt="BioAssayAlign logo" width="280">
</p>

## What this model is

BioAssayAlign is an **assay-conditioned small-molecule ranking model**.

It takes:
- one assay definition
- a submitted list of candidate SMILES

and returns:
- one compatibility score per candidate
- a ranked shortlist for that assay

This model is designed to answer a practical question:

> Given this assay, which molecules in my current candidate list should I screen first?

It is **not**:
- a chatbot
- a generative chemistry model
- a direct potency regressor
- a calibrated probability model

## Companion dataset

Public dataset:
- [BioAssayAlign Assay-Compound Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)

The published model was trained on the prepared compatibility-ranking subset inside that dataset release.

## Intended use

Use this model when you already have a candidate set and want a ranking signal for one assay at a time.

Reasonable uses:
- shortlist triage before wet-lab screening
- retrospective ranking experiments
- assay-conditioned ranking features in a downstream workflow

Not reasonable uses:
- reading the raw score as a probability of success
- predicting exact IC50 / EC50 / Ki values
- comparing raw scores across unrelated runs as if they were globally calibrated

## How to run it locally

This repository is self-contained for inference. You do **not** need the original training codebase to run the published model.

### Install

```bash
python -m pip install -r requirements.txt
```

### Minimal local example

```python
from bioassayalign_compatibility import (
    AssayQuery,
    load_compatibility_model_from_hub,
    rank_compounds,
    serialize_assay_query,
)

# Downloads the published weights from the Hub.
model = load_compatibility_model_from_hub(
    "lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility"
)

# Serialize the structured assay definition into the model's text format.
assay_text = serialize_assay_query(
    AssayQuery(
        title="JAK2 inhibition assay",
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
        organism="Homo sapiens",
        readout="luminescence",
        assay_format="cell-based",
        assay_type="inhibition",
        target_uniprot=["O60674"],
    )
)

# Score and rank the candidate SMILES against that assay.
results = rank_compounds(
    model,
    assay_text=assay_text,
    smiles_list=[
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
        "c1ccccc1",
        "CCO",
    ],
)

for row in results:
    print(row)
```
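
Each returned row pairs a candidate with its compatibility score, so turning the results into a shortlist is just a descending sort. A self-contained sketch (the `shortlist` helper and the scores below are illustrative, not part of the published API):

```python
def shortlist(scored, k=10):
    """Sort (smiles, score) pairs by descending score and keep the top k."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Illustrative scores only -- not real model output.
scored = [
    ("CCO", -12.87),
    ("CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1", 6.26),
    ("c1ccccc1", -8.65),
]
top2 = shortlist(scored, k=2)  # highest-scoring candidates first
```
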
### What to provide

Best practice:
- provide structured assay fields rather than one free-form paragraph
- include target, readout, organism, and format when known
- submit one parent or cleaned SMILES per candidate

Recommended assay fields:
- `title`
- `description`
- `organism`
- `readout`
- `assay_format`
- `assay_type`
- `target_uniprot`

The model is reasonably robust to wording changes, but missing metadata can reduce ranking quality.
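
One common way to produce the "one parent or cleaned SMILES per candidate" input is RDKit's standardization utilities. A hedged sketch: this mirrors the recommendation above, not the model's internal preprocessing, which is not published.

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def clean_smiles(smiles: str) -> Optional[str]:
    """Parse, keep the parent (largest) fragment, and canonicalize.

    Returns None for unparseable input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    parent = rdMolStandardize.FragmentParent(mol)  # drops salts / counter-ions
    return Chem.MolToSmiles(parent)


print(clean_smiles("CCO.Cl"))  # hydrochloride stripped -> "CCO"
```
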
## Model details

Published artifact configuration:

| Component | Value |
|---|---|
| Assay encoder | `Qwen/Qwen3-Embedding-0.6B` |
| Assay encoder training | Frozen |
| Assay metadata features | Enabled, `128` dims |
| Molecule features | Morgan fingerprints (`r=2,3`, `2048` bits each), chirality, `MACCS`, `30` RDKit descriptors |
| Projection dimension | `512` |
| Hidden dimension | `1024` |
| Dropout | `0.12` |
| Final score | Learned compatibility head output |

Important:
- the published score is **not** a raw embedding dot product
- the ranking comes from the learned scorer head
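
The molecule-feature row can be approximated with standard RDKit calls. A sketch assuming the stated fingerprint settings; the exact 30-descriptor list is not published, so two common descriptors stand in as placeholders:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors, MACCSkeys


def featurize(smiles: str) -> np.ndarray:
    """Concatenate Morgan(r=2) + Morgan(r=3) bits, MACCS keys, and descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    parts = []
    for radius in (2, 3):
        fp = AllChem.GetMorganFingerprintAsBitVect(
            mol, radius, nBits=2048, useChirality=True
        )
        arr = np.zeros(2048, dtype=np.float32)
        DataStructs.ConvertToNumpyArray(fp, arr)
        parts.append(arr)
    maccs = np.zeros(167, dtype=np.float32)  # MACCS vectors are 167 bits
    DataStructs.ConvertToNumpyArray(MACCSkeys.GenMACCSKeys(mol), maccs)
    parts.append(maccs)
    # Placeholder descriptors; the published 30-descriptor list is unknown.
    parts.append(
        np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol)], dtype=np.float32)
    )
    return np.concatenate(parts)


vec = featurize("CCO")  # 2048 + 2048 + 167 + 2 = 4265 features
```
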
## Training data

The public artifact was trained on a frozen assay-compound corpus derived from:
- PubChem BioAssay
- ChEMBL

The published model uses the prepared compatibility-ranking subset from:
- [lighteternal/BioAssayAlign-Assay-Compound-Data](https://huggingface.co/datasets/lighteternal/BioAssayAlign-Assay-Compound-Data)

### Prepared training dataset

| Field | Value |
|---|---:|
| Assays | `11,195` |
| Candidate-pool rows | `1,432,532` |
| Training groups | `508,216` |
| Train assays | `8,967` |
| Validation assays | `1,117` |
| Test assays | `1,111` |

### Preparation rules

| Rule | Value |
|---|---:|
| Minimum actives per assay | `4` |
| Minimum inactives per assay | `16` |
| Maximum actives per assay | `48` |
| Maximum inactives per assay | `192` |
| Molecule standardization | Enabled |
| Source manifest SHA256 | `e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b` |

Each training group contains:
- one assay
- one positive compound
- multiple explicit same-assay inactive compounds

This is a **ranking** setup, not a generic text-retrieval setup.
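
A group of this shape (one positive plus N same-assay inactives) is typically trained with a listwise softmax cross-entropy, where the positive must out-score its negatives. The model's actual loss is not published; this is an illustrative sketch:

```python
import torch
import torch.nn.functional as F


def listwise_ranking_loss(pos_score: torch.Tensor,
                          neg_scores: torch.Tensor) -> torch.Tensor:
    """Softmax cross-entropy over [positive | negatives] for each group.

    pos_score: (B,) scores for each group's positive compound.
    neg_scores: (B, N) scores for the same-assay inactives.
    The positive always sits at index 0, so the target class is 0.
    """
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (B, 1+N)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)


loss = listwise_ranking_loss(
    torch.tensor([2.0, 1.0]),            # positives' scores
    torch.tensor([[0.1, -0.5, 0.0],      # same-assay inactives, group 1
                  [1.5, 0.2, -1.0]]),    # same-assay inactives, group 2
)
```
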
## Training configuration

| Field | Value |
|---|---:|
| Framework | `pytorch_head_only_compatibility_ranking` |
| Learning rate | `1.5e-3` |
| Batch size | `192` |
| Weight decay | `1e-4` |
| Hard-negative fraction | `0.5` |
| Negatives per example | `15` |
| Negative sets per positive | `2` |
| Max epochs | `30` |
| Early stopping patience | `5` |
| Early stopping min delta | `0.001` |
| Best epoch | `9` |
## Results

### Main evaluation

| Split | Mean AUPRC | Random-baseline AUPRC | Hit@10 | Mean AUROC | Mean nDCG@50 |
|---|---:|---:|---:|---:|---:|
| Validation | `0.6214` | `0.2678` | `0.9722` | `0.7767` | `0.7140` |
| Test | `0.6339` | `0.2749` | `0.9739` | `0.7815` | `0.7250` |

Interpretation:
- the model materially beats the random ranking baseline
- it is strongest as a **within-list ranking tool**
- the main output to trust is the ranking order and shortlist separation, not the raw score magnitude
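
For context, the random-baseline AUPRC of a candidate list is simply its fraction of actives, which is what a random ordering achieves in expectation. A small pure-Python sketch of both quantities:

```python
def average_precision(labels, scores):
    """AUPRC via average precision: mean precision at each active's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
baseline = sum(labels) / len(labels)    # fraction of actives = 0.4
```
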
## Score interpretation

The raw output is a learned, logit-like **compatibility score**.

What it means:
- higher is better
- differences are meaningful **within the same submitted list**
- absolute values are **not** calibrated across unrelated runs

Example:
- candidate A score: `6.25`
- candidate B score: `-8.65`
- candidate C score: `-23.37`

This does **not** mean A has a literal probability or potency attached to it. It means A ranked substantially above B and C for that submitted assay and candidate set.

For user-facing interpretation, the recommended order is:
1. rank
2. relative shortlist score within the submitted list
3. chemistry context columns
4. raw model score only for debugging or export

If you want a normalized within-list view, you can compute:
- min-max scaling to `0–100`
- or softmax over the submitted list

Those are still **not** calibrated biological probabilities.
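
Both normalizations are one-liners. Neither turns scores into calibrated probabilities; they only make the within-list spread easier to read:

```python
import math


def minmax_0_100(scores):
    """Rescale within-list scores to 0-100; a constant list collapses to 50."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [50.0 for _ in scores]
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def softmax(scores):
    """Within-list softmax, stabilized by subtracting the max score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scaled = minmax_0_100([6.25, -8.65, -23.37])  # [100.0, ~49.7, 0.0]
probs = softmax([6.25, -8.65, -23.37])        # sums to 1.0 within the list
```
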
## Example predictions

These examples were produced from the published weights.

### Example: JAK2 cell assay

Assay:
- title: `JAK2 inhibition assay`
- description: `Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.`
- organism: `Homo sapiens`
- readout: `luminescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `O60674`

| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1` | `6.2590` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-8.6542` |
| 3 | `CCO` | `-12.8678` |
| 4 | `CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1` | `-23.3741` |

### Example: ALDH1A1 fluorescence assay

Assay:
- title: `ALDH1A1 inhibition assay`
- description: `Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.`
- organism: `Homo sapiens`
- readout: `fluorescence`
- assay format: `cell-based`
- assay type: `inhibition`
- target UniProt: `P00352`

| Rank | Candidate SMILES | Raw score |
|---:|---|---:|
| 1 | `CCOc1ccccc1` | `-26.9257` |
| 2 | `Cc1cc(=O)n(C)c(=O)[nH]1` | `-38.5073` |
| 3 | `CCN(CC)CCOc1ccccc1` | `-39.1753` |
| 4 | `CCO` | `-42.9016` |

## Limitations

- The score is not a calibrated probability.
- The model does not predict exact potency values.
- The benchmark is assay-held-out, not a universal unseen-scaffold benchmark.
- Public assay data is noisy and assay protocols are heterogeneous.
- Some assays remain difficult and yield only moderate separation.
- Use the model as a ranking aid, not as a stand-alone medicinal chemistry decision system.

## Repository contents

Files provided in this HF model repo:
- `best_model.pt`
- `training_metadata.json`
- `training_summary.json`
- `bioassayalign_compatibility.py`
- `requirements.txt`

## Interactive Space

Use the model in the companion Space:

- [BioAssayAlign Compatibility Explorer](https://huggingface.co/spaces/lighteternal/BioAssayAlign-Compatibility-Explorer)
|