CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in [0, 1], where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

Model Details

Property	Value
Model type	BERT-like transformer for DNA sequence labeling
Framework	TensorFlow / Keras
Checkpoint	`best.h5`
Artifact size	5.15 GB, stored with Git LFS
Input	1000 bp DNA window
Output	Per-position CRISPR probability, shape `(batch, 1000, 1)`
Tokenization	`PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5`
Architecture (from local training reports)	24 transformer blocks, hidden size 600, about 300.7M parameters
Fine-tuning objective	Per-position binary CRISPR array detection

The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.

Intended Use

Use this model to:

Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
Detect candidate CRISPR array regions using sliding windows and thresholding.
Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

Evaluation

The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:

Quantity	Value
Windows	2,840
Total bases	2,840,000
Positive bases	375,163
Positive-base prevalence	13.21%

Metrics from the available benchmark report:

Metric	Value
Micro AUPRC	0.9802
Micro AUROC	0.9910
Precision at threshold 0.5	0.9809
Recall at threshold 0.5	0.8828
F1 at threshold 0.5	0.9293
Best F1 over threshold grid	0.9543
Best threshold over grid	0.31
Window-level detection F1 at threshold 0.5	0.9255
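
As a quick consistency check on the tables above, the following sketch recomputes positive-base prevalence from the base counts and the F1 at threshold 0.5 from the reported precision and recall. It uses only the numbers listed in this card. The variable names are illustrative.

```python
# Values copied from the benchmark tables in this card.
total_bases = 2_840_000
positive_bases = 375_163
precision = 0.9809  # at threshold 0.5
recall = 0.8828     # at threshold 0.5

# Prevalence is the fraction of bases labeled as CRISPR array.
prevalence = positive_bases / total_bases

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"prevalence = {100 * prevalence:.2f}%")
print(f"F1 at 0.5  = {f1:.4f}")
```

Both values match the table entries (13.21% and 0.9293) after rounding.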

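Threshold choice trades sensitivity against specificity: a lower threshold recovers more CRISPR positions (higher recall) at the cost of more false positives. On this benchmark, F1 rises from 0.9293 at threshold 0.5 to 0.9543 at the best grid threshold of 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.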
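
A minimal sketch of how per-position scores might be turned into candidate regions by thresholding. This is not the companion app's implementation. The function name `call_regions`, the `>=` comparison, and the `min_len` filter are assumptions made for illustration.

```python
import numpy as np

def call_regions(probs, threshold=0.3, min_len=1):
    """Return half-open (start, end) intervals where probs >= threshold.

    `min_len` is a hypothetical filter that drops runs shorter than
    the given number of positions.
    """
    mask = np.asarray(probs) >= threshold
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

scores = [0.1, 0.9, 0.8, 0.2, 0.4, 0.1]
print(call_regions(scores, threshold=0.3))  # two runs above 0.3
print(call_regions(scores, threshold=0.5))  # only the stronger run remains
```

Coordinates are 0-based and end-exclusive, so a region `(1, 3)` covers positions 1 and 2. Raising the threshold from 0.3 to 0.5 drops the weaker second run in this toy example, which mirrors the recall loss described above.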

Usage

The easiest way to run the model is through the companion Hugging Face Space:

https://huggingface.co/spaces/genomenet/crispr-array-detection

For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

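import tensorflow as tf
from huggingface_hub import hf_hub_download

# Custom Keras layer definitions ship with the companion inference code,
# so its `inference` package must be importable.
from inference.custom_layers import get_custom_objects

# Download the checkpoint from the Hub (cached locally after the first call).
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False skips restoring the optimizer and loss; it is enough for inference.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)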

Input sequences should be converted to integer tokens using the same tokenizer used during training:

TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
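
The mapping above can be wrapped in a small encoder, sketched here under stated assumptions: input is upper-cased before lookup, every symbol other than A/C/G/T is encoded as 5, and sequences shorter than 1000 bp are right-padded with 0. The name `encode_window`, the `int32` dtype, and the padding side are illustrative choices, not confirmed details of the training tokenizer. This card does not state how the model behaves on padded input.

```python
import numpy as np

WINDOW = 1000  # model input length in bp, from this card

# Token ids as listed in this card.
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS_ID = 5  # unknown and ambiguous IUPAC bases (e.g. N, R, Y)
PAD_ID = 0

def encode_window(seq: str, window: int = WINDOW) -> np.ndarray:
    """Encode one DNA string as token ids, right-padded to `window`."""
    if len(seq) > window:
        raise ValueError(f"sequence has {len(seq)} bp; expected at most {window}")
    ids = np.full(window, PAD_ID, dtype=np.int32)
    # Assumption: lower-case input is treated the same as upper-case.
    ids[: len(seq)] = [TOKEN.get(base, AMBIGUOUS_ID) for base in seq.upper()]
    return ids

ids = encode_window("ACGTN")
print(ids[:8])    # first five ids are the encoded bases, the rest padding
print(ids.shape)  # one window of length WINDOW
```

A batch for the model would then be built by stacking such windows, for example with `np.stack`.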

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
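
One way to do this is sketched below, assuming a stride of 500 bp and mean aggregation over overlapping windows. Both are illustrative choices, and the companion app may use different settings. `predict_long` and `predict_fn` are hypothetical names. `predict_fn` stands in for any callable that maps a `(batch, 1000)` token array to `(batch, 1000, 1)` scores, such as the loaded model's `predict` method.

```python
import numpy as np

def predict_long(token_ids, predict_fn, window=1000, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    token_ids = np.asarray(token_ids)
    n = len(token_ids)
    if n < window:
        raise ValueError(f"sequence has {n} tokens; need at least {window}")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # extra window flush with the sequence end
    batch = np.stack([token_ids[s : s + window] for s in starts])
    # Drop the trailing singleton axis: (batch, window, 1) -> (batch, window).
    probs = np.asarray(predict_fn(batch)).reshape(len(starts), window)
    total = np.zeros(n, dtype=np.float64)
    count = np.zeros(n, dtype=np.float64)
    for s, p in zip(starts, probs):
        total[s : s + window] += p
        count[s : s + window] += 1
    return total / count  # every position is covered by at least one window

# Stand-in predictor for demonstration only: scores 1.0 wherever the token is A.
def fake_predict(batch):
    return (batch == 1).astype(np.float32)[..., None]

rng = np.random.default_rng(0)
tokens = rng.integers(1, 5, size=2300)
scores = predict_long(tokens, fake_predict)
print(scores.shape)  # one score per input position
```

This version builds all windows in a single batch, which is simple but memory-hungry for whole genomes. Feeding `predict_fn` in chunks would be the natural next step.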

Limitations

The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
Ambiguous bases are supported but may reduce confidence if frequent.
Evaluation metrics depend on the benchmark split and annotation quality.

Citation and Acknowledgements

If you use this model, please cite or acknowledge:

Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
BMBF de.NBI / GenomeNet.

Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.
