BERT-updated

Standard BERT architecture with flash_attention_2 and sdpa support added.

This is a shared code repository: it contains no pretrained weights. It serves as the code backend for biological sequence models that share the vanilla BERT architecture (post-LN transformer, learned absolute position embeddings) but have model-specific vocabularies and hyperparameters, such as Taykhoom/RNABERT and Taykhoom/UTRBERT-3mer (the two repos loaded in the Usage section).

Each of those repos stores weights, tokenizer, and config; their auto_map in config.json points here for the modeling code.
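As a rough illustration of the auto_map mechanism, the sketch below parses a hypothetical config.json excerpt. The repo id `example-user/BERT-updated` and the module name `modeling_bert` are placeholders, not the actual values used by the model repos; only the general `repo--module.Class` reference shape is assumed from how transformers resolves remote code.

```python
import json

# Hypothetical config.json excerpt: the repo id and module name are placeholders.
CONFIG_JSON = """
{
  "model_type": "bert",
  "auto_map": {
    "AutoModel": "example-user/BERT-updated--modeling_bert.BertModel"
  }
}
"""


def split_auto_map_entry(entry):
    """Split 'repo--module.Class' into (repo, module, class_name).

    repo is None when the entry has no 'repo--' prefix, i.e. the code
    lives in the same repo as the config.
    """
    repo, sep, ref = entry.rpartition("--")
    module, _, class_name = ref.rpartition(".")
    return (repo if sep else None), module, class_name


config = json.loads(CONFIG_JSON)
repo, module, class_name = split_auto_map_entry(config["auto_map"]["AutoModel"])
print(f"code repo: {repo}, module: {module}, class: {class_name}")
```

An entry with a `repo--` prefix is what lets a weights repo delegate its modeling code to a separate code repo such as this one.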

What was changed from stock transformers.BertModel

The standard HF BertModel (transformers 4.57.6) supports sdpa but not flash_attention_2. This repo adds a complete attn_implementation dispatch:

| Backend | Class | Notes |
| --- | --- | --- |
| eager | BertSelfAttention | Standard scaled dot-product, identical to original BERT |
| sdpa | BertSdpaSelfAttention | F.scaled_dot_product_attention, bool mask -> additive float mask |
| flash_attention_2 | BertFlashSelfAttention | flash_attn_varlen_func for padded inputs, flash_attn_func for unpadded |
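The "bool mask -> additive float mask" conversion noted for the sdpa backend can be sketched as follows. This is a plain-Python illustration of the arithmetic for a single row of attention scores, not the repo's tensor code; the function names are hypothetical, and it assumes every row keeps at least one real token.

```python
import math


def to_additive_mask(keep_mask):
    """Map a boolean keep-mask (True = real token) to additive scores.

    Kept positions add 0.0; padded positions add -inf, so they receive
    zero probability after the softmax.
    """
    return [0.0 if keep else float("-inf") for keep in keep_mask]


def masked_softmax(scores, keep_mask):
    """Softmax over one row of attention scores after adding the mask.

    Assumes at least one position is kept; otherwise the row maximum is
    -inf and the result is undefined.
    """
    additive = to_additive_mask(keep_mask)
    shifted = [s + a for s, a in zip(scores, additive)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Four key positions, the last two are padding.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
print(probs)
```

Note that the padded position with the largest raw score (3.0) still ends up with zero weight, which is the whole point of adding the mask before the softmax.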

The rest of the architecture (embeddings, FFN, pooler, weight layout) is unchanged.
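For the padded-input path of the flash_attention_2 backend, flash_attn_varlen_func works on tokens packed without padding, together with cumulative sequence lengths and a maximum sequence length. The sketch below shows how that bookkeeping could be derived from a padding mask. It uses plain Python lists instead of tensors, the helper name `unpad_metadata` is hypothetical, and it is not a copy of how this repo implements the step.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Derive varlen bookkeeping from a padding mask.

    attention_mask: list of equal-length rows of 1/0 (1 = real token).
    Returns (indices, cu_seqlens, max_seqlen) where
      indices    - flat row-major positions of real tokens in (batch * seq_len)
      cu_seqlens - cumulative sequence lengths, starting at 0
      max_seqlen - length of the longest unpadded sequence
    """
    seq_len = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]
    indices = [
        b * seq_len + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row)
        if keep
    ]
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


# Batch of three sequences with lengths 3, 2, and 4, padded to length 4.
mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
print(indices)     # [0, 1, 2, 4, 5, 8, 9, 10, 11]
print(cu_seqlens)  # [0, 3, 5, 9]
print(max_seqlen)  # 4
```

Sequence i occupies the packed slice cu_seqlens[i]:cu_seqlens[i + 1], so no attention compute is spent on padding tokens.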

Usage

Do not load this repo directly. Load one of the model repos listed above:

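from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True lets transformers fetch the modeling code that the
# model repo's auto_map points to.
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNABERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNABERT", trust_remote_code=True)

# Flash Attention 2 (needs the flash-attn package; its kernels run on GPU in fp16/bf16)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2")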
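When the same script has to run on machines with and without flash-attn installed, the backend name can be chosen at runtime. The helper below is a hypothetical convenience, not part of this repo: it only checks whether the flash_attn package is importable and whether the caller reports a CUDA device, and otherwise falls back to sdpa.

```python
import importlib.util


def pick_attn_implementation(has_cuda, prefer_flash=True):
    """Return a value for the attn_implementation argument.

    Hypothetical helper: picks "flash_attention_2" only when the caller
    reports a CUDA device and the flash_attn package can be found;
    otherwise returns "sdpa".
    """
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and has_cuda and flash_installed:
        return "flash_attention_2"
    return "sdpa"


# Example: a CPU-only machine always gets the sdpa backend.
print(pick_attn_implementation(has_cuda=False))
```

In practice `has_cuda` would typically come from `torch.cuda.is_available()`, and the returned string is passed as `attn_implementation=` to `AutoModel.from_pretrained`.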

Credits

Modeling code authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0.
