BERT-updated

Standard BERT architecture with flash_attention_2 and sdpa support added.

This is a shared code repository — it contains no pretrained weights. It is used as the code backend for biological sequence models that share the vanilla BERT architecture (post-LN transformer, learned absolute position embeddings) but have model-specific vocabularies and hyperparameters:

Each of those repos stores weights, tokenizer, and config; their auto_map in config.json points here for the modeling code.
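To illustrate the `auto_map` indirection, here is a minimal sketch. It assumes the `repo_id--module.ClassName` reference form that transformers accepts for remote code. The repo id, module file name, and class name below are hypothetical placeholders, not values taken from any actual `config.json`.

```python
import json

# Hypothetical placeholders: the real repo id, module file, and class name
# are whatever the model repo's config.json actually lists.
CODE_REPO = "OWNER/BERT-updated"
auto_map = {"AutoModel": f"{CODE_REPO}--modeling_bert.BertModel"}


def split_auto_map_entry(entry: str):
    """Split 'repo_id--module.ClassName' into (repo_id, module, class_name).

    repo_id is None when the entry has no 'repo_id--' prefix, i.e. the code
    lives in the same repo as the config.
    """
    repo_id, _, reference = entry.rpartition("--")
    module, _, class_name = reference.rpartition(".")
    return (repo_id or None), module, class_name


# What the relevant fragment of a model repo's config.json might look like.
print(json.dumps({"auto_map": auto_map}, indent=2))
print(split_auto_map_entry(auto_map["AutoModel"]))
```

With an entry of this shape, loading a model repo with `trust_remote_code=True` fetches the modeling code from the shared code repo while the weights, tokenizer, and config come from the model repo.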

What was changed from stock `transformers.BertModel`

The standard HF BertModel (transformers 4.57.6) supports sdpa but not flash_attention_2. This repo adds a complete attn_implementation dispatch:

| Backend | Class | Notes |
|---|---|---|
| `eager` | `BertSelfAttention` | Standard scaled dot-product, identical to original BERT |
| `sdpa` | `BertSdpaSelfAttention` | `F.scaled_dot_product_attention`, bool mask -> additive float mask |
| `flash_attention_2` | `BertFlashSelfAttention` | `flash_attn_varlen_func` for padded inputs, `flash_attn_func` for unpadded |

The rest of the architecture (embeddings, FFN, pooler, weight layout) is unchanged.
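The "bool mask -> additive float mask" step noted for the `sdpa` backend can be sketched as follows. This is an illustration under assumed shapes, not the repo's actual code: the helper name `to_additive_mask` and the toy tensor sizes are made up for the example.

```python
import torch
import torch.nn.functional as F


def to_additive_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Turn a (batch, seq_len) keep-mask (1/True = real token) into an
    additive (batch, 1, 1, seq_len) mask: 0 where kept, dtype-min where padded."""
    keep = attention_mask.to(torch.bool)[:, None, None, :]
    additive = torch.zeros(keep.shape, dtype=dtype, device=attention_mask.device)
    return additive.masked_fill(~keep, torch.finfo(dtype).min)


# Toy inputs: batch of 2, 2 heads, 4 tokens, head dimension 8.
batch, heads, seq_len, head_dim = 2, 2, 4, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# The first sequence has one padding token at the end.
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])

mask = to_additive_mask(attention_mask, q.dtype)
# The mask broadcasts over heads and query positions.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

The additive mask is added to the attention scores before the softmax, so padded key positions receive effectively zero weight.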

Usage

Do not load this repo directly. Load one of the model repos listed above:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNABERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNABERT", trust_remote_code=True)

# Flash Attention 2 (requires the flash-attn package and a CUDA GPU)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2")
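Because `flash_attention_2` needs the `flash-attn` package, a CUDA device, and half-precision weights, it can help to choose the backend at runtime. The sketch below is one possible approach, not part of this repo: the helper names `pick_attn_implementation` and `load_with_best_backend` are hypothetical, and falling back to `sdpa` is an assumed preference.

```python
import importlib.util
from typing import Optional


def pick_attn_implementation(cuda_available: bool,
                             flash_attn_installed: Optional[bool] = None) -> str:
    """Return "flash_attention_2" when it looks usable, otherwise "sdpa"."""
    if flash_attn_installed is None:
        # Check for the flash-attn package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


def load_with_best_backend(repo_id: str):
    """Load a model repo (e.g. "Taykhoom/RNABERT") with the chosen backend."""
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    from transformers import AutoModel

    impl = pick_attn_implementation(torch.cuda.is_available())
    use_flash = impl == "flash_attention_2"
    model = AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,
        attn_implementation=impl,
        # Flash Attention kernels only run in fp16/bf16.
        torch_dtype=torch.bfloat16 if use_flash else torch.float32,
    )
    return model.to("cuda") if use_flash else model
```

Calling `load_with_best_backend("Taykhoom/RNABERT")` would then use Flash Attention 2 where available and `sdpa` elsewhere.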

Credits

Modeling code authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0.
