mRNABERT
Weights and tokenizer for mRNABERT (Xiong et al., Nature Communications 2025), loaded with the bug-fixed model code from Taykhoom/MosaicBERT-updated.
mRNABERT is a language model pre-trained on 18 million mRNA sequences. Its pre-training incorporates contrastive learning to integrate semantic features of amino acids.
This repo contains only weights and tokenizer files. The model code is loaded automatically
from Taykhoom/MosaicBERT-updated via trust_remote_code=True. See that repo for the full list
of bugs fixed relative to the original MosaicBERT implementation.
Architecture
mRNABERT uses the MosaicBERT architecture with an mRNA-specific vocabulary.
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Vocabulary size | 74 (5 special + 5 single-nt + 64 codons) |
| Positional encoding | ALiBi (no position embeddings) |
| Attention | Flash Attention (packed QKV) |
| FFN | Gated Linear Units (GeGLU) |
| Padding | Unpadding (tokens concatenated, no padding overhead) |
| Max sequence length | 1024 tokens |
| Parameters | ~114M |
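The ALiBi row above means that attention scores receive a distance-dependent bias instead of learned position embeddings. The following is a minimal sketch of how per-head slopes are commonly derived, following the scheme from the ALiBi reference implementation. Whether the loaded model code computes its slopes in exactly this way is an assumption, and the names `alibi_slopes` and `alibi_bias` are hypothetical helpers, not part of the repo.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes (reference scheme; assumed, not read from the repo)."""

    def power_of_two_slopes(n: int) -> list[float]:
        # Geometric sequence starting at 2^(-8/n) with the same ratio.
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if n_heads & (n_heads - 1) == 0:  # n_heads is a power of two
        return power_of_two_slopes(n_heads)

    # Otherwise: slopes for the closest lower power of two, plus every
    # second slope from the next power of two to fill the remaining heads.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Symmetric bias for a bidirectional encoder: -slope * |i - j| (assumed form)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)        # 12 attention heads, as in the table above
bias = alibi_bias(slopes[0], 3)  # bias added to one head's attention scores
```

Because the bias depends only on token distance, no position embedding table is needed.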
Vocabulary
The tokenizer uses BertTokenizer with a hybrid vocabulary. Sequences are encoded in the
DNA alphabet (T, not U) even though the model is trained on mRNA.
| Range | Tokens | Use |
|---|---|---|
| 0-4 | [PAD] [UNK] [CLS] [SEP] [MASK] | Special tokens |
| 5-9 | A T C G N | Single nucleotides (UTR regions) |
| 10-73 | AAA ... GGG | All 64 codons (CDS regions) |
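As a rough illustration of the layout in the table, the 74 entries can be enumerated as follows. This is a sketch only: the authoritative mapping is the vocabulary file shipped with the tokenizer, and the codon ordering used here (all triplets over A, T, C, G in that order) is an assumption that happens to start at AAA and end at GGG.

```python
from itertools import product

SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # ids 0-4
NUCLEOTIDES = ["A", "T", "C", "G", "N"]                   # ids 5-9

# Assumption: codons enumerated over A, T, C, G in that order.
# The IDs of codons between AAA and GGG may differ in the real vocabulary.
CODONS = ["".join(triplet) for triplet in product("ATCG", repeat=3)]  # ids 10-73

vocab = {token: idx for idx, token in enumerate(SPECIAL + NUCLEOTIDES + CODONS)}
```

Note that N appears only as a single nucleotide; codons containing N are not part of the 64-codon block.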
Pretraining
- Objective: Masked Language Modeling + contrastive learning (amino-acid semantic features)
- Data: 18 million curated mRNA sequences
- Source checkpoint: pytorch_model.bin from YYLY66/mRNABERT
Parity Verification
Hidden states match the original implementation with a maximum absolute difference
below 2.4e-05 at all 13 representation levels (embedding output + 12 transformer layers).
Both models use flash_attn_varlen_qkvpacked_func; the small numerical differences
come from flash attention rounding and do not indicate a correctness issue.
SDPA vs. eager max abs diff = 1.81e-05. Verified on GPU with PyTorch 2.7 / CUDA 12.9.
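The comparison described above boils down to a per-level maximum absolute difference checked against a tolerance. The sketch below shows that metric in dependency-free Python on nested lists. The helper names `max_abs_diff` and `check_parity` are hypothetical, and this is not the script that produced the reported numbers; in practice the same check would be run on tensors.

```python
def max_abs_diff(a, b):
    """Largest elementwise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def check_parity(ref_levels, new_levels, tol=2.4e-5):
    """Compare hidden states level by level; tol is the bound quoted above."""
    diffs = [max_abs_diff(r, n) for r, n in zip(ref_levels, new_levels)]
    return diffs, all(d < tol for d in diffs)


# Toy data: 2 "levels", each with shape (1 token, 2 features).
ref = [[[0.0, 1.0]], [[2.0, 3.0]]]
new = [[[0.0, 1.00001]], [[2.0, 3.0]]]
diffs, ok = check_parity(ref, new)
```

For the real check, each list would hold the 13 hidden-state levels from the original and the updated model for the same input.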
Usage
mRNABERT requires CDS-aware preprocessing: UTR regions must be single-nucleotide
space-separated and CDS regions must be codon space-separated. The tokenizer handles
this automatically via batch_encode_with_cds() when a CDS track is available, or
you can pass pre-formatted strings directly for simple use cases.
Sequences use T (not U).
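To make the expected input format concrete, here is a minimal sketch that turns a raw sequence plus a per-nucleotide CDS track into the space-separated form described above. It follows the convention that a non-zero track value marks the first position of a codon. `format_with_cds` is a hypothetical helper written for illustration; it is not the tokenizer's batch_encode_with_cds() and does not reproduce any chunking that method may perform.

```python
def format_with_cds(seq, cds):
    """Space-separate a sequence: codons where cds marks a start, single nt elsewhere."""
    seq = seq.upper().replace("U", "T")  # the model expects T, not U
    if len(seq) != len(cds):
        raise ValueError("sequence and CDS track must have the same length")

    tokens = []
    i = 0
    while i < len(seq):
        # A non-zero track value marks the first nucleotide of a codon.
        if cds[i] != 0 and i + 3 <= len(seq):
            tokens.append(seq[i:i + 3])
            i += 3
        else:
            # UTR position (or an incomplete trailing codon): single nucleotide.
            tokens.append(seq[i])
            i += 1
    return " ".join(tokens)


formatted = format_with_cds("ATCGATGTTTCCC", [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
```

The resulting string can be passed to the tokenizer in the same way as any other pre-formatted input.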
Embedding generation with CDS tracks (recommended)
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model.eval()
# Raw sequences (T not U) + per-nucleotide CDS track
# cds[i] != 0 marks the start of a codon at position i
sequences = ["ATCGATGTTTCCC", "AATGCCC"]
cds_tracks = [
    np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]),  # CDS starts at pos 3
    np.array([0, 1, 0, 0, 1, 0, 0]),  # CDS starts at pos 1
]
enc, chunk_counts = tokenizer.batch_encode_with_cds(
    sequences, cds_tracks, return_tensors="pt", padding=True
)
with torch.no_grad():
    out = model(**enc)
mask = enc["attention_mask"].unsqueeze(-1).float()
mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (batch, 768)
Embedding generation without CDS tracks
Pass pre-formatted space-separated strings directly when no CDS annotation is available:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model.eval()
# Space-separated: single nt for UTRs, codons for CDS; use T not U
sequences = [
    "A T C G G A GGG CCC TTT AAA",  # mixed UTR + CDS
    "ATG TTT CCC GAC TAA",  # CDS only
]
enc = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**enc)
mask = enc["attention_mask"].unsqueeze(-1).float()
mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (batch, 768)
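When formatting strings by hand, it can help to check that every whitespace-separated token is one the vocabulary table lists before tokenizing. The sketch below does this with plain Python. `invalid_tokens` is a hypothetical helper, and the token rules are inferred from the vocabulary table (single nucleotides A/T/C/G/N, codons over A/T/C/G, plus [MASK]) rather than read from the tokenizer itself.

```python
SINGLE_NT = set("ATCGN")      # single-nucleotide tokens (UTR regions)
CODON_LETTERS = set("ATCG")   # codon tokens contain no N


def invalid_tokens(formatted):
    """Return the tokens of a pre-formatted string that fall outside the vocabulary."""
    bad = []
    for tok in formatted.split():
        is_nt = len(tok) == 1 and tok in SINGLE_NT
        is_codon = len(tok) == 3 and set(tok) <= CODON_LETTERS
        if not (is_nt or is_codon or tok == "[MASK]"):
            bad.append(tok)
    return bad


problems = invalid_tokens("AUG TT NNN A")  # U, a 2-mer, and an N-codon are rejected
```

An empty result means every token is either a single nucleotide or a codon in the DNA alphabet.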
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
model.eval()
enc = tokenizer(["A T C G [MASK] CCC TTT"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 74)
Attention implementation
# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True,
                                  attn_implementation="sdpa")
# Flash Attention 2 (requires: pip install flash-attn --no-build-isolation)
model = AutoModel.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True,
                                  attn_implementation="flash_attention_2")
Fine-tuning
import torch.nn as nn
from transformers import AutoModel
class mRNABERTClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("Taykhoom/mRNABERT", trust_remote_code=True)
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled)
Citation
@article{xiong2025_mrnabert,
  title = {{mRNABERT}: advancing {mRNA} sequence design with a universal language model and comprehensive dataset},
  author = {Xiong, Ying and Wang, Aowen and Kang, Yu and Shen, Chao and Hsieh, Chang-Yu and Hou, Tingjun},
  journal = {Nature Communications},
  volume = {16},
  number = {1},
  pages = {10371},
  year = {2025},
  doi = {10.1038/s41467-025-65340-8}
}
Credits
Original mRNABERT model and weights by Xiong et al. Source: GitHub. Bug-fixed model code by Taykhoom/MosaicBERT-updated, authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
Apache 2.0, following the original repository.