# GenomeClip: CLIP-style Dual-Encoder for DNA Sequence and Text Embeddings

GenomeClip is a contrastive learning model that aligns DNA sequence embeddings (from AlphaGenome) with text embeddings (from GenePT / OpenAI text-embedding-3-large) in a shared 512-dimensional space.
## Model Overview
| Property | Value |
|---|---|
| Architecture | Dual-tower (CLIP-style) |
| Sequence encoder | Linear projection → Transformer encoder → mean pooling → 512-d |
| Text encoder | Linear projection → MLP → 512-d |
| Input sequence dim | 3072 (AlphaGenome embeddings_128bp) |
| Input text dim | 3072 (OpenAI text-embedding-3-large) |
| Output dim | 512 (L2-normalized) |
| Parameters | 58.8M |
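
The table pins down the projection shapes but not the internal hyperparameters. Below is a minimal PyTorch sketch of the dual-tower layout; `nhead`, `num_layers`, and the MLP width are assumptions for illustration, not the released configuration, and this code does not load the actual weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceTower(nn.Module):
    """Linear projection -> Transformer encoder -> masked mean pooling -> 512-d."""
    def __init__(self, in_dim=3072, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, lengths=None):
        # x: (batch, L, 3072), already L2-normalized per token
        if lengths is None:
            lengths = torch.full((x.size(0),), x.size(1), device=x.device)
        # True for real tokens, False for padding
        mask = torch.arange(x.size(1), device=x.device)[None, :] < lengths[:, None]
        h = self.encoder(self.proj(x), src_key_padding_mask=~mask)
        # Mean-pool over valid positions only
        pooled = (h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return F.normalize(pooled, dim=-1)  # (batch, 512), unit norm

class TextTower(nn.Module):
    """Linear projection -> MLP -> 512-d."""
    def __init__(self, in_dim=3072, d_model=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # (batch, 512), unit norm
```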
## How It Works
GenomeClip is a second-stage alignment model. It does not process raw DNA sequences or raw text directly. Instead, it takes pre-computed embeddings from upstream foundation models and projects them into a shared space:
```text
DNA sequence "ATCG..."
  → AlphaGenome (embeddings_128bp)
  → per-gene tokens (L, 3072)
  → L2-normalize each token   ← important preprocessing step
  → GenomeClip sequence encoder
  → (512,) L2-normalized embedding
        ↕ cosine similarity
Gene description "This gene encodes..."
  → OpenAI text-embedding-3-large
  → (3072,) vector
  → GenomeClip text encoder
  → (512,) L2-normalized embedding
```
The two encoders are completely independent (no shared weights, no cross-attention), so you can encode sequences and text separately and compare them later via cosine similarity.
## Quick Start
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/GenomeClip-v1",
    trust_remote_code=True,
)
model.eval()

# Encode DNA sequence embeddings (from AlphaGenome)
# seq_emb: (batch, num_tokens, 3072), L2-normalized per token
# seq_lengths: (batch,), number of valid tokens per sample
seq_emb = torch.randn(2, 50, 3072)
seq_emb = torch.nn.functional.normalize(seq_emb, dim=-1)  # L2-norm each token
seq_lengths = torch.tensor([50, 35])

with torch.no_grad():
    seq_repr = model.encode_sequence(seq_emb, seq_lengths)  # (2, 512)

# Encode text embeddings (from OpenAI text-embedding-3-large)
text_emb = torch.randn(2, 3072)
with torch.no_grad():
    text_repr = model.encode_text(text_emb)  # (2, 512)

# Cross-modal similarity
similarity = seq_repr @ text_repr.t()  # (2, 2) cosine similarity matrix
print(similarity)
```
## Input Format

### Sequence Embeddings

- Source: AlphaGenome `embeddings_128bp` (3072-dim per 128 bp window)
- Shape: `(batch, L, 3072)` where `L = ceil(gene_length_bp / 128)`; genes differ in length, so batches must be padded (see the sketch below)
- Preprocessing: L2-normalize each token before feeding to GenomeClip: `seq_emb = torch.nn.functional.normalize(seq_emb, dim=-1)`
- Also accepts pooled input of shape `(batch, 3072)` (auto-expanded to `L = 1`)
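
One way to pad variable-length per-gene token matrices into a batch; the `pad_batch` helper is illustrative, not part of the model API:

```python
import torch
import torch.nn.functional as F

def pad_batch(token_mats):
    """Pad a list of (L_i, 3072) tensors to (batch, max_L, 3072) plus lengths."""
    lengths = torch.tensor([m.size(0) for m in token_mats])
    batch = torch.zeros(len(token_mats), int(lengths.max()), token_mats[0].size(1))
    for i, m in enumerate(token_mats):
        batch[i, : m.size(0)] = m
    # L2-normalize each token; zero padding rows stay zero
    return F.normalize(batch, dim=-1), lengths

# e.g. two genes of 50 and 35 AlphaGenome tokens
seq_emb, seq_lengths = pad_batch([torch.randn(50, 3072), torch.randn(35, 3072)])
```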
### Text Embeddings

- Source: OpenAI `text-embedding-3-large` applied to NCBI gene summaries (following the GenePT methodology)
- Shape: `(batch, 3072)`
- No preprocessing needed: use the embedding as-is from the API
### Sequence Lengths

- Shape: `(batch,)`, the number of valid (non-padding) tokens per sample
- Optional: if not provided, all positions are assumed valid
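
For intuition, lengths presumably map to a boolean padding mask inside the encoder; the internal implementation may differ, but the equivalent construction is:

```python
import torch

seq_lengths = torch.tensor([50, 35])  # (batch,)
max_len = 50
# True for real tokens, False for padding positions
valid = torch.arange(max_len)[None, :] < seq_lengths[:, None]  # (batch, max_len)
```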
## Usage Patterns

### Encode sequence only

```python
seq_repr = model.encode_sequence(seq_emb, seq_lengths)  # (B, 512)
```

### Encode text only

```python
text_repr = model.encode_text(text_emb)  # (B, 512)
```
### Encode both and compute contrastive loss

```python
out = model(
    seq_embeddings=seq_emb,
    text_embeddings=text_emb,
    seq_lengths=seq_lengths,
)
# out.seq_repr:  (B, 512)
# out.text_repr: (B, 512)
# out.loss:      scalar (symmetric InfoNCE)
# out.logits:    (B, B) similarity matrix
```
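
For reference, a symmetric InfoNCE loss like the one reported in `out.loss` can be computed as below; the temperature value, and whether the model learns it, are assumptions:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(seq_repr, text_repr, temperature=0.07):
    # Both inputs are L2-normalized, so the matmul is cosine similarity
    logits = seq_repr @ text_repr.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)                # sequence -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)            # text -> sequence
    return (loss_s2t + loss_t2s) / 2
```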
### Cross-modal retrieval

```python
# Pre-compute all gene embeddings (do this once)
all_seq_reprs = []
for batch in seq_dataloader:
    with torch.no_grad():
        all_seq_reprs.append(model.encode_sequence(batch["seq"], batch["lengths"]))
all_seq_reprs = torch.cat(all_seq_reprs)  # (N_genes, 512)

# Query: find genes matching a text description
query_text_repr = model.encode_text(query_text_emb)  # (1, 512)
similarities = query_text_repr @ all_seq_reprs.t()   # (1, N_genes)
top_matches = similarities.argsort(descending=True)[0, :10]
```
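
Retrieval also works in the opposite direction. Assuming text embeddings have been cached the same way into a matrix (`all_text_reprs`, `gene_emb`, and `gene_lengths` below are hypothetical names):

```python
# Reverse direction: describe one gene by ranking cached text embeddings
gene_repr = model.encode_sequence(gene_emb, gene_lengths)  # (1, 512)
scores = gene_repr @ all_text_reprs.t()                    # (1, N_texts)
top_texts = scores.argsort(descending=True)[0, :10]
```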
## Upstream Model Setup

### AlphaGenome (for DNA sequence embeddings)

```bash
pip install alphagenome-research
```

```python
# See https://huggingface.co/google/alphagenome-all-folds for full setup
from alphagenome_research.model.one_hot_encoder import DNAOneHotEncoder

encoder = DNAOneHotEncoder()
one_hot = encoder.encode(dna_sequence)  # (seq_len, 4)
# ... run AlphaGenome model ...
# Extract: result.embeddings_128bp → (L, 3072)
```
### GenePT text embeddings (via OpenAI API)

```python
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="BRCA1 DNA repair associated. This gene encodes a nuclear phosphoprotein...",
    model="text-embedding-3-large",
)
text_emb = response.data[0].embedding  # list of 3072 floats
```
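
The API returns a plain Python list, while `encode_text` expects a `(batch, 3072)` float tensor, so a small conversion step bridges the two:

```python
import torch

text_emb = torch.tensor(response.data[0].embedding).unsqueeze(0)  # (1, 3072)
with torch.no_grad():
    text_repr = model.encode_text(text_emb)  # (1, 512)
```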
## Citation

If you use GenomeClip in your research, please cite:

```bibtex
@misc{genomeclip2025,
  title={GenomeClip: Contrastive Alignment of DNA Sequence and Text Embeddings},
  year={2025},
}
```
## License
Apache 2.0