GenomeClip: CLIP-style Dual-Encoder for DNA Sequence and Text Embeddings

GenomeClip is a contrastive learning model that aligns DNA sequence embeddings (from AlphaGenome) with text embeddings (from GenePT / OpenAI text-embedding-3-large) into a shared 512-dimensional space.

Model Overview

Property             Value
Architecture         Dual-tower (CLIP-style)
Sequence encoder     Linear projection → Transformer encoder → mean pooling → 512-d
Text encoder         Linear projection → MLP → 512-d
Input sequence dim   3072 (AlphaGenome embeddings_128bp)
Input text dim       3072 (OpenAI text-embedding-3-large)
Output dim           512 (L2-normalized)
Parameters           58.8M

How It Works

GenomeClip is a second-stage alignment model. It does not process raw DNA sequences or raw text directly. Instead, it takes pre-computed embeddings from upstream foundation models and projects them into a shared space:

DNA sequence "ATCG..."
    → AlphaGenome (embeddings_128bp)
    → per-gene tokens (L, 3072)
    → L2-normalize each token               ← important preprocessing step
    → GenomeClip sequence encoder
    → (512,) L2-normalized embedding
                                              ↕ cosine similarity
Gene description "This gene encodes..."
    → OpenAI text-embedding-3-large
    → (3072,) vector
    → GenomeClip text encoder
    → (512,) L2-normalized embedding

The two encoders are completely independent (no shared weights, no cross-attention), so you can encode sequences and text separately and compare them later via cosine similarity.

Quick Start

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/GenomeClip-v1",
    trust_remote_code=True,
)
model.eval()

# Encode DNA sequence embeddings (from AlphaGenome)
# seq_emb: (batch, num_tokens, 3072), L2-normalized per token
# seq_lengths: (batch,), number of valid tokens per sample
seq_emb = torch.randn(2, 50, 3072)
seq_emb = torch.nn.functional.normalize(seq_emb, dim=-1)  # L2-norm each token
seq_lengths = torch.tensor([50, 35])

with torch.no_grad():
    seq_repr = model.encode_sequence(seq_emb, seq_lengths)  # (2, 512)

# Encode text embeddings (from OpenAI text-embedding-3-large)
text_emb = torch.randn(2, 3072)

with torch.no_grad():
    text_repr = model.encode_text(text_emb)  # (2, 512)

# Cross-modal similarity
similarity = seq_repr @ text_repr.t()  # (2, 2) cosine similarity matrix
print(similarity)

Input Format

Sequence Embeddings

  • Source: AlphaGenome embeddings_128bp (3072-dim per 128bp window)
  • Shape: (batch, L, 3072) where L = ceil(gene_length_bp / 128)
  • Preprocessing: L2-normalize each token before feeding to GenomeClip
    seq_emb = torch.nn.functional.normalize(seq_emb, dim=-1)
    
  • Also accepts pooled input (batch, 3072) (auto-expanded to L=1)

Text Embeddings

  • Source: OpenAI text-embedding-3-large applied to NCBI gene summaries (following the GenePT methodology)
  • Shape: (batch, 3072)
  • No preprocessing needed; use the embedding as-is from the API

Sequence Lengths

  • Shape: (batch,), the number of valid (non-padding) tokens per sample
  • Optional. If omitted, all positions are treated as valid.
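When genes have different lengths, the per-gene token arrays must be padded into one batch and the true lengths recorded. A minimal sketch of that preprocessing, assuming per-gene AlphaGenome outputs of shape (L_i, 3072); the helper name `pad_gene_batch` is illustrative, not part of the model API:

```python
import torch
import torch.nn.functional as F

def pad_gene_batch(gene_embs):
    """Pad variable-length per-gene token arrays (L_i, 3072) into one batch.

    Returns seq_emb of shape (B, L_max, 3072) and seq_lengths of shape (B,).
    """
    lengths = torch.tensor([g.shape[0] for g in gene_embs])
    batch = torch.zeros(len(gene_embs), int(lengths.max()), gene_embs[0].shape[1])
    for i, g in enumerate(gene_embs):
        # L2-normalize each 3072-d token, as GenomeClip expects; padding stays zero.
        batch[i, : g.shape[0]] = F.normalize(g, dim=-1)
    return batch, lengths

genes = [torch.randn(50, 3072), torch.randn(35, 3072)]
seq_emb, seq_lengths = pad_gene_batch(genes)
```

The zero padding is ignored by the model as long as the matching `seq_lengths` is passed alongside.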

Usage Patterns

Encode sequence only

seq_repr = model.encode_sequence(seq_emb, seq_lengths)  # (B, 512)

Encode text only

text_repr = model.encode_text(text_emb)  # (B, 512)

Encode both and compute contrastive loss

out = model(
    seq_embeddings=seq_emb,
    text_embeddings=text_emb,
    seq_lengths=seq_lengths,
)
# out.seq_repr:  (B, 512)
# out.text_repr: (B, 512)
# out.loss:      scalar (symmetric InfoNCE)
# out.logits:    (B, B) similarity matrix
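The symmetric InfoNCE term behind `out.loss` can be sketched as follows. This is an illustration, not the model's exact code; the fixed temperature of 0.07 is an assumption (the card does not specify how the temperature is handled):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(seq_repr, text_repr, temperature=0.07):
    # Inputs are L2-normalized, so the matmul is a (B, B) cosine-similarity matrix.
    logits = seq_repr @ text_repr.t() / temperature
    targets = torch.arange(logits.shape[0])  # matching pairs sit on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)      # sequence -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)  # text -> sequence direction
    return (loss_s2t + loss_t2s) / 2

seq = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
loss = symmetric_infonce(seq, txt)
```

Averaging the two cross-entropy directions is what makes the loss symmetric: each sequence must pick out its paired text among the batch, and vice versa.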

Cross-modal retrieval

# Pre-compute all gene embeddings (do this once)
all_seq_reprs = []
for batch in seq_dataloader:
    with torch.no_grad():
        all_seq_reprs.append(model.encode_sequence(batch["seq"], batch["lengths"]))
all_seq_reprs = torch.cat(all_seq_reprs)  # (N_genes, 512)

# Query: find genes matching a text description
query_text_repr = model.encode_text(query_text_emb)  # (1, 512)
similarities = query_text_repr @ all_seq_reprs.t()    # (1, N_genes)
top_matches = similarities.argsort(descending=True)[0, :10]
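Retrieval also works in the reverse direction, ranking candidate descriptions for a single gene. A self-contained sketch with random L2-normalized stand-ins for the encoder outputs (in practice these come from `encode_sequence` and `encode_text`):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for encoder outputs (illustration only).
gene_repr = F.normalize(torch.randn(1, 512), dim=-1)         # one gene
all_text_reprs = F.normalize(torch.randn(100, 512), dim=-1)  # 100 candidate descriptions

scores = (gene_repr @ all_text_reprs.t()).squeeze(0)  # (100,) cosine similarities
top_texts = torch.topk(scores, k=10)                  # 10 best-matching descriptions
```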

Upstream Model Setup

AlphaGenome (for DNA sequence embeddings)

# pip install alphagenome-research
# See https://huggingface.co/google/alphagenome-all-folds for full setup
from alphagenome_research.model.one_hot_encoder import DNAOneHotEncoder

encoder = DNAOneHotEncoder()
one_hot = encoder.encode(dna_sequence)  # (seq_len, 4)
# ... run AlphaGenome model ...
# Extract: result.embeddings_128bp → (L, 3072)

GenePT text embeddings (via OpenAI API)

from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="BRCA1 DNA repair associated. This gene encodes a nuclear phosphoprotein...",
    model="text-embedding-3-large",
)
text_emb = response.data[0].embedding  # list of 3072 floats
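The API returns each embedding as a plain Python list, so a small conversion step produces the (batch, 3072) tensor that `encode_text` expects. The placeholder values below stand in for real API output:

```python
import torch

# Each API call yields a list of 3072 floats (response.data[i].embedding).
# Placeholder values stand in for real API output here.
api_embeddings = [[0.0] * 3072, [0.1] * 3072]

text_emb = torch.tensor(api_embeddings, dtype=torch.float32)  # (2, 3072)
# Pass directly to model.encode_text(text_emb); no extra normalization is needed.
```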

Citation

If you use GenomeClip in your research, please cite:

@misc{genomeclip2025,
  title={GenomeClip: Contrastive Alignment of DNA Sequence and Text Embeddings},
  year={2025},
}

License

Apache 2.0
