CRISPR-BERT
CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in [0, 1], where higher values indicate stronger evidence that the position belongs to a CRISPR array.
The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.
Model Details
| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | best.h5 |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape (batch, 1000, 1) |
| Tokenization | PAD/OOV=0, A=1, C=2, G=3, T=4, ambiguous IUPAC bases as 5 |
| Architecture (per local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |
The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.
Intended Use
Use this model to:
- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.
This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.
Training Data
The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.
Evaluation
The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:
| Quantity | Value |
|---|---|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |
Metrics from the available benchmark report:
| Metric | Value |
|---|---|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |
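The figures in the two tables above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the reported numbers, assuming every test window is exactly 1000 bp; it does not re-run the benchmark.

```python
# Values copied from the benchmark tables above.
windows = 2840
window_len = 1000            # assumption: every test window is 1000 bp
positive_bases = 375_163
precision, recall = 0.9809, 0.8828

total_bases = windows * window_len
prevalence = positive_bases / total_bases
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(total_bases)           # 2840000
print(f"{prevalence:.2%}")   # 13.21%
print(f"{f1:.4f}")           # 0.9293
```

For context, a classifier that scores positions at random has an expected AUPRC roughly equal to the positive-base prevalence, so 13.21% is the baseline against which the reported AUPRC should be read.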
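Threshold choice affects sensitivity and specificity: at 0.5 the model is precise (0.9809) but misses some positive bases (recall 0.8828), while the best F1 on the threshold grid (0.9543) occurs at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.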
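One way to turn per-position scores into candidate regions is to threshold them and merge consecutive positions into intervals. The sketch below assumes NumPy; `call_regions` and its `min_len` parameter are hypothetical names for illustration and are not taken from the companion app, which may post-process differently.

```python
import numpy as np

def call_regions(probs, threshold=0.3, min_len=1):
    """Return half-open (start, end) intervals where probs >= threshold.

    Hypothetical helper: the name, the default threshold, and min_len
    are illustrative assumptions, not the companion app's API.
    """
    probs = np.asarray(probs, dtype=float).ravel()
    above = probs >= threshold
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)    # first position of each run
    ends = np.flatnonzero(edges == -1)     # one past the last position of each run
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

scores = [0.1, 0.4, 0.9, 0.2, 0.35, 0.1]
print(call_regions(scores))              # [(1, 3), (4, 5)]
print(call_regions(scores, min_len=2))   # [(1, 3)]
```

A minimum length filter is a common way to discard isolated high-scoring positions; a suitable value depends on the repeat and spacer lengths expected in the data.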
Usage
The easiest way to run the model is through the companion Hugging Face Space:
https://huggingface.co/spaces/genomenet/crispr-array-detection
For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:
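import tensorflow as tf
from huggingface_hub import hf_hub_download
# Custom layer definitions from the companion inference code.
from inference.custom_layers import get_custom_objects

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False skips restoring the training configuration, which inference does not need.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)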
Input sequences should be converted to integer tokens using the same tokenizer used during training:
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
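A minimal sketch of an encoder built on the mapping above. The function name `encode` is hypothetical, and two behaviours are assumptions rather than documented facts: the input is upper-cased before lookup, and every character other than A, C, G, T is mapped to 5. The authoritative tokenizer is the one in the companion inference code.

```python
import numpy as np

TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = 5  # unknown and ambiguous IUPAC bases

def encode(seq):
    """Map a DNA string to an int32 token array of the same length.

    Assumptions (not confirmed by the model card): lower-case input is
    upper-cased, and any non-ACGT character is treated as ambiguous (5).
    """
    return np.asarray(
        [TOKEN.get(base, AMBIGUOUS) for base in seq.upper()],
        dtype=np.int32,
    )

print(encode("ACGTN").tolist())  # [1, 2, 3, 4, 5]
```

The sketch deliberately does not pad: the value 0 is reserved for padding/OOV, but how the model behaves on padded windows is not described here.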
For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
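The sliding-window step can be sketched as follows. Averaging overlapping predictions is one reasonable aggregation choice and is an assumption here, as are the 500 bp stride and the names `window_starts` and `predict_sequence`. `predict_fn` stands in for a call such as `model.predict` that returns an array of shape (batch, window, 1).

```python
import numpy as np

def window_starts(seq_len, window=1000, stride=500):
    """Start offsets of windows that together cover the whole sequence."""
    if not 1 <= stride <= window:
        raise ValueError("stride must be between 1 and the window length")
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    starts = list(range(0, seq_len - window + 1, stride))
    # Add a final window flush with the sequence end if the stride left a tail.
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts

def predict_sequence(tokens, predict_fn, window=1000, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    tokens = np.asarray(tokens)
    starts = window_starts(len(tokens), window, stride)
    batch = np.stack([tokens[s:s + window] for s in starts])
    # Drop the trailing channel axis: (batch, window, 1) -> (batch, window).
    preds = np.asarray(predict_fn(batch)).reshape(len(starts), window)
    total = np.zeros(len(tokens), dtype=np.float64)
    count = np.zeros(len(tokens), dtype=np.float64)
    for s, p in zip(starts, preds):
        total[s:s + window] += p
        count[s:s + window] += 1
    return total / count

# Stand-in predictor for demonstration only: scores 1.0 wherever the token is 1.
fake_predict = lambda batch: (batch == 1).astype(float)[..., None]
tokens = np.tile([1, 2, 3, 4], 575)           # 2300 tokens
scores = predict_sequence(tokens, fake_predict)
print(scores.shape)                           # (2300,)
```

Stacking all windows into a single batch keeps the sketch short; for whole genomes the windows would need to be fed to the model in smaller chunks to bound memory use.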
Limitations
- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported but may reduce confidence if frequent.
- Evaluation metrics depend on the benchmark split and annotation quality.
Citation and Acknowledgements
If you use this model, please cite or acknowledge:
- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.
Contact
For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.