# mmbert-rerank-32k-2d-matryoshka

A 32K-context multilingual reranker with 2D Matryoshka support, enabling flexible layer and dimension reduction for efficient inference.
## Model Description

This model is a cross-encoder reranker built on mmbert-32k-yarn, a ModernBERT model extended to a 32K context via YaRN position interpolation and trained on 1800+ languages from the Glot500 corpus.
## Key Features
| Feature | Value |
|---|---|
| Context Length | 32,768 tokens (64× longer than typical 512-token rerankers) |
| Languages | 1800+ (Glot500 multilingual) |
| Parameters | 308M |
| Architecture | ModernBERT + 2D Matryoshka Cross-Encoder |
| Flash Attention | ✅ Supported |
## 2D Matryoshka Architecture

This model implements 2D Matryoshka Representation Learning, letting you trade quality for speed along two axes:

- **Fewer layers:** 3, 6, 11, or 22 (full)
- **Fewer dimensions:** 64, 128, 256, 512, or 768 (full)

Together these give 20 quality-speed configurations from a single model.
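The 20 operating points are just the Cartesian product of the two axes; a quick standalone sketch (plain Python, no model needed):

```python
from itertools import product

# Layer and dimension choices exposed by the 2D Matryoshka heads
LAYERS = [3, 6, 11, 22]          # 22 = full depth
DIMS = [64, 128, 256, 512, 768]  # 768 = full hidden size

# Every (layer, dim) pair is a valid quality/speed operating point
configs = list(product(LAYERS, DIMS))
print(len(configs))              # 20
print(configs[0], configs[-1])   # (3, 64) (22, 768) -- cheapest vs. full quality
```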
## Benchmark Results

### Long Document Reranking (Key Advantage)

We evaluate on synthetic long documents (~1,500 tokens) with the relevant answer placed at different positions, testing whether each model can use its full context window.
| Test Case | mmbert-rerank | BGE-reranker-v2-m3 | Qwen3-Reranker-0.6B |
|---|---|---|---|
| Answer at START | 100% | 100% | 100% |
| Answer at END | 100% | 33% ❌ | 100% |
**Key finding:** BGE truncates inputs at 512 tokens and therefore misses relevant information beyond that limit; mmbert-rerank maintains 100% accuracy with its 32K context window.
### Latency Benchmark

Measured on an AMD MI300X GPU:
| Metric | mmbert-rerank | BGE-reranker-v2-m3 | Qwen3-Reranker-0.6B |
|---|---|---|---|
| Short doc (~15 tokens) | 19.0ms | 8.3ms | 21.5ms |
| Long doc (~1500 tokens) | 24.0ms | 11.9ms* | 28.8ms |
| Throughput (batch 10) | 506 q/s | 1193 q/s | 422 q/s |
\*BGE truncates at 512 tokens, so it never processes the full long document.
### Multilingual Performance

Tested on 8 languages, including the low-resource African languages Swahili, Yoruba, Amharic, Hausa, and Igbo:
| Language Type | mmbert-rerank | BGE-reranker-v2-m3 | Qwen3-Reranker-0.6B |
|---|---|---|---|
| High-resource (EN/ZH/ES) | 100% | 100% | 100% |
| Low-resource (SW/YO/AM/HA/IG) | 100% | 100% | 100% |
### BEIR Benchmark (Short Documents)

On standard BEIR datasets with short documents (where BGE's 512-token limit is sufficient):
| Dataset | Avg Tokens | mmbert MRR | BGE MRR |
|---|---|---|---|
| SciFact | 279 | 94.9 | 95.6 |
| NFCorpus | 304 | 87.2 | 89.7 |
| HotpotQA | 61 | 100.0 | 100.0 |
| FiQA | 173 | 93.9 | 96.0 |
On short documents, BGE scores slightly higher, consistent with its training focus on shorter contexts.
## When to Use This Model
| Scenario | Recommendation |
|---|---|
| Short docs (<512 tokens) | Use BGE (faster, slightly better) |
| Long docs (>512 tokens) | Use mmbert (BGE fails) |
| Documents with key info at end | Use mmbert (BGE truncates) |
| Low-resource languages | Use mmbert (1800+ languages) |
| Need speed/quality tradeoff | Use mmbert (2D Matryoshka) |
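The table above collapses into a simple routing rule. The helper below is a hypothetical sketch (the function name, the 512-token threshold, and the language set are ours, drawn from the table, not part of this repo):

```python
# Hypothetical router based on the recommendation table above.
HIGH_RESOURCE = {"en", "zh", "es"}

def choose_reranker(doc_tokens: int, lang: str = "en") -> str:
    """Pick a reranker per the guidance table (sketch, not shipped code)."""
    if doc_tokens > 512:
        return "mmbert"          # BGE truncates past 512 tokens
    if lang not in HIGH_RESOURCE:
        return "mmbert"          # broader language coverage (1800+)
    return "bge"                 # short, high-resource docs: BGE is faster

print(choose_reranker(2000))       # mmbert
print(choose_reranker(200, "sw"))  # mmbert
print(choose_reranker(200, "en"))  # bge
```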
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
import sys

import torch
from transformers import AutoTokenizer

# The Matryoshka2DReranker class ships with this repo's training scripts,
# not with transformers, so it must be importable from your local checkout.
sys.path.append("path/to/scripts")
from train_rerank import Matryoshka2DReranker

model = Matryoshka2DReranker.from_pretrained("llm-semantic-router/mmbert-rerank-32k-2d-matryoshka")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-rerank-32k-2d-matryoshka")
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Score query-passage pairs (higher = more relevant)
pairs = [
    ("What is machine learning?", "Machine learning is a subset of AI that enables systems to learn from data."),
    ("What is machine learning?", "The weather is sunny today."),
]
scores = model.compute_score(pairs, tokenizer, normalize=True)
# Output: [0.9686, 0.0017]
```
### Using Reduced Layers/Dimensions (2D Matryoshka)

```python
# Full model (22 layers, 768 dims): highest quality
scores = model.compute_score(pairs, tokenizer, normalize=True)

# Reduced layers (layer 11 is a good quality/speed balance)
scores = model.compute_score(pairs, tokenizer, layer_idx=11, normalize=True)

# Reduced dimensions
scores = model.compute_score(pairs, tokenizer, dim_idx=256, normalize=True)

# Both reduced (fastest)
scores = model.compute_score(pairs, tokenizer, layer_idx=11, dim_idx=256, normalize=True)
```
### Reranking Pipeline

```python
def rerank(query: str, passages: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Rerank passages by relevance to the query."""
    pairs = [(query, p) for p in passages]
    scores = model.compute_score(pairs, tokenizer, normalize=True)
    # Sort by score, descending
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Example
query = "How does photosynthesis work?"
passages = [
    "Photosynthesis is the process by which plants convert sunlight into energy.",
    "The stock market closed higher today.",
    "Plants use chlorophyll to absorb light during photosynthesis.",
    "Python is a popular programming language.",
]
results = rerank(query, passages, top_k=2)
# [("Photosynthesis is the process...", 0.95), ("Plants use chlorophyll...", 0.87)]
```
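Since all 20 heads live in one model, the pipeline above extends naturally to a two-stage cascade: score every candidate with a cheap head (e.g. `layer_idx=6, dim_idx=128`), then rescore only the shortlist at full quality. The sketch below stubs out both scorers with toy token-overlap functions so it runs standalone; with the real model they would be calls to `model.compute_score` with and without the reduction arguments:

```python
def cascade_rerank(query, passages, cheap_score, full_score, shortlist=3, top_k=2):
    """Two-stage rerank: a cheap pass narrows the pool, a full pass orders it."""
    by_cheap = sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)
    pool = by_cheap[:shortlist]  # only these pay the full-quality cost
    return sorted(pool, key=lambda p: full_score(query, p), reverse=True)[:top_k]

# Toy scorers standing in for the reduced and full Matryoshka heads
def cheap(q, p):
    return len(set(q.lower().split()) & set(p.lower().split()))

def full(q, p):
    return 2 * cheap(q, p)  # stub: same ranking as cheap, finer scale

docs = [
    "photosynthesis is how plants convert sunlight",
    "the stock market closed higher",
    "chlorophyll absorbs light during photosynthesis",
]
top = cascade_rerank("how does photosynthesis work", docs, cheap, full)
print(top)  # the two photosynthesis passages, most relevant first
```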
## Training Details

### Training Data

- **Dataset:** cfli/bge-m3-data
- **Samples:** 1,569,864 query-passage pairs
- **Languages:** 100+ languages from multilingual sources
### Training Configuration
| Parameter | Value |
|---|---|
| Base Model | llm-semantic-router/mmbert-32k-yarn |
| Epochs | 1 |
| Batch Size | 16 |
| Gradient Accumulation | 2 |
| Effective Batch Size | 32 |
| Learning Rate | 2e-5 |
| Max Length | 32,768 |
| Precision | BFloat16 |
| Optimizer | AdamW |
| Hardware | AMD MI300X (196GB) |
| Training Time | 6h 57m |
| Final Loss | 0.3173 |
### 2D Matryoshka Configuration
| Parameter | Value |
|---|---|
| Layer Indices | 3, 6, 11, 22 |
| Dimension Indices | 64, 128, 256, 512, 768 |
| Classification Heads | 20 (4 layers × 5 dims) |
| Loss | Binary Cross-Entropy (averaged across all heads) |
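The loss row above can be written out concretely: one binary cross-entropy term per (layer, dim) head, averaged over all 20 heads. This is an illustrative sketch of the objective on dummy logits, not the repo's training code:

```python
import math

def bce(logit: float, label: float) -> float:
    """Binary cross-entropy on a single logit (numerically naive sketch)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def matryoshka_2d_loss(head_logits: dict, label: float) -> float:
    """Average BCE over every (layer, dim) head, as in the table above."""
    losses = [bce(logit, label) for logit in head_logits.values()]
    return sum(losses) / len(losses)

# One positive pair scored by all 20 heads (dummy logits)
logits = {(l, d): 2.0 for l in (3, 6, 11, 22) for d in (64, 128, 256, 512, 768)}
loss = matryoshka_2d_loss(logits, label=1.0)
print(round(loss, 4))  # BCE(2.0, 1) ~= 0.1269 for every head, so the mean matches
```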
## Comparison with Competitors
| Attribute | BGE-reranker-v2-m3 | Qwen3-Reranker-0.6B | mmbert-rerank-32k |
|---|---|---|---|
| Parameters | 568M | 600M | 308M (smallest) |
| Context Length | 512 | 8,192 | 32,768 (longest) |
| Architecture | Encoder | Decoder | Encoder (fast) |
| Long Doc Quality | ❌ Fails | ✅ Good | ✅ Good |
| Short Doc Speed | Fastest | Slowest | Middle |
| 2D Matryoshka | ❌ No | ❌ No | ✅ Yes |
| Languages | 100+ | ~100 | 1800+ |
## Model Architecture

```text
Matryoshka2DReranker
├── encoder: ModernBertModel (308M params)
│   ├── 22 transformer layers
│   ├── hidden_size: 768
│   └── YaRN position interpolation (32K context)
├── final_norm: LayerNorm (applied to intermediate layers)
└── layer_heads: ModuleDict
    ├── "3":  {"64": Linear, "128": Linear, ..., "768": Linear}
    ├── "6":  {"64": Linear, "128": Linear, ..., "768": Linear}
    ├── "11": {"64": Linear, "128": Linear, ..., "768": Linear}
    └── "22": {"64": Linear, "128": Linear, ..., "768": Linear}
```
## Limitations

- **Training data:** While the model supports 32K tokens, training was primarily on shorter passages from the BGE-M3 data; performance on very long passages relies on the base model's YaRN-extended context.
- **Domain-specific data:** The model was trained on general web data and may require fine-tuning for specialized domains.
- **Layer reduction speedup:** Because all hidden states are still computed, layer reduction provides minimal speedup without early-exit modifications.
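The early-exit caveat can be seen in a toy forward pass: selecting `layer_idx=11` only saves compute if the loop actually stops at layer 11. A minimal sketch (pure Python; the lambda layers are stand-ins for ModernBERT's 22 transformer blocks):

```python
def forward_with_exit(x, encoder_layers, exit_at=None):
    """Apply layers in order, stopping early when exit_at is set.

    Without early exit, requesting layer_idx=11 still runs all 22 layers
    and merely reads an intermediate hidden state (hence minimal speedup).
    """
    ran = 0
    for i, layer in enumerate(encoder_layers, start=1):
        x = layer(x)
        ran += 1
        if exit_at is not None and i == exit_at:
            break
    return x, ran

layers = [lambda h: h + 1 for _ in range(22)]  # stand-ins for transformer blocks
_, full_ran = forward_with_exit(0, layers)               # runs all 22
_, early_ran = forward_with_exit(0, layers, exit_at=11)  # stops after 11
print(full_ran, early_ran)  # 22 11
```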
## License

Apache 2.0
## Citation

```bibtex
@misc{mmbert-rerank-2025,
  title={mmbert-rerank-32k-2d-matryoshka: A Long-Context Multilingual Reranker with 2D Matryoshka},
  author={LLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-rerank-32k-2d-matryoshka}
}
```
## Acknowledgments
- ModernBERT for the base architecture
- BGE-M3 for training data
- Matryoshka Representation Learning for the dimensional reduction approach
## Model Tree

Base model: jhu-clsp/mmBERT-base