# llama-embed-mamba2-7b
llama-embed-mamba2-7b is a general-purpose text embedding model that reproduces the e5-mistral-7b-instruct training procedure on top of Codestral Mamba2 7B, a recurrent (state space model) architecture. It achieves embedding quality competitive with e5-mistral-7b-instruct and similarly trained transformer-based models while offering linear-time computational complexity and constant-memory inference through a novel vertically chunked inference strategy. This makes it substantially more efficient for long sequences and suitable for memory-constrained environments.
A smaller variant is available as llama-embed-mamba2-1.3b.
Built with Llama. This model was partially trained on synthetic data generated with Llama 3.1 70B. See License for details.
## Performance
Aggregated task-mean benchmark results compared to the original e5-mistral-7b-instruct. For detailed results, see our paper.
| Model | MTEB(Multilingual, v2) | MTEB(eng, v2) | LongEmbed |
|---|---|---|---|
| llama-embed-mamba2-1.3b | 55.2 | 64.3 | 40.8 |
| llama-embed-mamba2-7b (this model) | 59.4 | 65.2 | 44.5 |
| e5-mistral-7b-instruct | 60.2 | 68.0 | 43.7 |
## Vertically Chunked Inference
This model supports vertically chunked inference, which achieves constant memory usage regardless of input sequence length. Instead of processing the full sequence through one layer at a time (standard horizontal inference), this approach:
- Partitions the input into fixed-size vertical chunks
- Processes each chunk through all model layers before advancing to the next
- Maintains the recurrent states across chunks, preserving full context
Parallelization benefits saturate at comparatively small chunk sizes, so processing in fixed-size blocks incurs negligible runtime overhead while dramatically reducing memory consumption. This makes the model particularly advantageous for long documents, resource-constrained environments, and large-batch processing where transformer memory requirements become prohibitive. Additional information can be found in our paper.
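The chunking scheme above can be illustrated with a toy recurrence. The sketch below (an illustration only, not the model's actual layer code) uses a running sum as a stand-in for a recurrent layer: horizontal inference pushes the full sequence through each layer in turn, while vertical inference pushes one chunk through all layers and carries only a single per-layer state forward, so memory stays constant in sequence length. Both paths produce identical final outputs:

```python
import numpy as np

def run_layer(x, state):
    # Toy recurrent "layer": a running sum along the sequence axis.
    # Real Mamba2 layers carry a richer state, but the chunking logic is the same.
    out = state + np.cumsum(x, axis=0)
    return out, out[-1]

def encode_horizontal(x, n_layers=3):
    # Standard inference: each layer sees the full sequence at once.
    for _ in range(n_layers):
        x, _ = run_layer(x, 0.0)
    return x[-1]

def encode_vertical(x, n_layers=3, chunk=4):
    # Vertically chunked inference: each chunk flows through all layers,
    # and only one state per layer survives between chunks.
    states = [0.0] * n_layers
    for start in range(0, len(x), chunk):
        h = x[start:start + chunk]
        for i in range(n_layers):
            h, states[i] = run_layer(h, states[i])
    return h[-1]

x = np.arange(12, dtype=np.float64)
assert np.allclose(encode_horizontal(x), encode_vertical(x))
```

Because the carried state makes each chunk's layer output identical to the corresponding slice of the full-sequence output, the two strategies are exactly equivalent; only the peak memory differs.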
## Usage
Requires `transformers>=5.5.0` due to a breaking change to the Mamba2 cache introduced in `v5.5.0` (transformers#44950).
To utilize the efficient kernel implementations enabling vertically chunked inference, the following libraries must be installed in addition to sentence-transformers/transformers:
```shell
pip install kernels einops
```
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dynatrace-oss/llama-embed-mamba2-7b", trust_remote_code=True)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is a state space model?",
]
documents = [
    "A state space model (SSM) is a mathematical framework that describes a system using state variables.",
    "Embedding models map text to dense vector representations that capture semantic meaning.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```
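Since this model reproduces the e5-mistral-7b-instruct training procedure, queries carry a task instruction while documents are embedded as-is. A small helper (the name `get_detailed_instruct` is our suggestion, following the e5 convention, not part of the model's API) keeps the format consistent:

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Prepend the task instruction to a query; documents need no instruction.
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "What is a state space model?")]
```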
For long inputs, pass `vertical_chunk_size` to enable constant-memory inference. The value should be a multiple of 256 (the intra-layer chunk size). Recommended: the largest value that fits in memory for single sequences; 256 or 512 for larger batches.
```python
embeddings = model.encode(
    ["Your long document text here..."],
    vertical_chunk_size=512,
)
```
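Because `vertical_chunk_size` must be a multiple of 256, a tiny helper (hypothetical, not part of the sentence-transformers API) can round a per-chunk token budget down to the nearest valid value:

```python
def valid_chunk_size(token_budget: int, base: int = 256) -> int:
    # Round the token budget down to a multiple of `base`,
    # never going below `base` itself.
    if token_budget < base:
        return base
    return (token_budget // base) * base

print(valid_chunk_size(1000))  # 768: largest multiple of 256 within the budget
```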
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dynatrace-oss/llama-embed-mamba2-7b")
model = AutoModel.from_pretrained(
    "dynatrace-oss/llama-embed-mamba2-7b", trust_remote_code=True, dtype=torch.bfloat16
).eval().cuda()

def encode(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Last-token pooling: gather the final non-padding position of each
    # sequence so that padded batch entries are pooled correctly.
    last_idx = inputs["attention_mask"].sum(dim=1) - 1
    batch_idx = torch.arange(last_idx.size(0), device=last_idx.device)
    embeddings = outputs.last_hidden_state[batch_idx, last_idx]
    return F.normalize(embeddings, p=2, dim=-1)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is a state space model?",
]
documents = [
    "A state space model (SSM) is a mathematical framework that describes a system using state variables.",
    "Embedding models map text to dense vector representations that capture semantic meaning.",
]

query_embeddings = encode(queries)
document_embeddings = encode(documents)

similarities = query_embeddings @ document_embeddings.T
print(similarities)
```
For long inputs, process the input in vertical chunks to maintain constant memory usage. The `vertical_chunk_size` must be a multiple of 256 (the intra-layer chunk size).
```python
text = "Your long document text here..."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

vertical_chunk_size = 512
cache_params = None
for chunk_start in range(0, inputs["input_ids"].size(1), vertical_chunk_size):
    chunk_end = chunk_start + vertical_chunk_size
    chunk_inputs = {
        k: v[:, chunk_start:chunk_end]
        for k, v in inputs.items() if k != "attention_mask"
    }
    chunk_inputs["attention_mask"] = inputs["attention_mask"][:, :chunk_end]
    with torch.no_grad():
        outputs = model(**chunk_inputs, use_cache=True, cache_params=cache_params)
    cache_params = outputs.cache_params

embedding = outputs.last_hidden_state[:, -1, :]
embedding = F.normalize(embedding, p=2, dim=-1)
```
## Open Source Integration Roadmap
Our goal is to integrate all necessary changes to simplify the adoption of vertically chunked inference for other models:
⚪ Planned | 🟡 In Progress | 🟢 Integrated
- ⚪ causal-conv1d: Enable simultaneous `seq_idx` + `initial_states` (required for recurrent processing with left padding)
- ⚪ mamba-ssm: Use `seq_idx` + `initial_states` in `mamba_split_conv1d_scan_combined` and export final states
- ⚪ kernels-community: Propagate changes in `causal-conv1d` and `mamba-ssm` to their kernel hub equivalents in the `kernels-community` repositories
- ⚪ transformers: Use updated `mamba_split_conv1d_scan_combined` with cache params during inference (currently only used during training, not configurable, problems with left padding)
- ⚪ sentence-transformers: Native vertical chunking support
This list will be updated as integration progresses.
## Code Repositories
The custom model and inference code used by this model is maintained in two separate repositories so that it can be shared between model checkpoints:
- dynatrace-oss/chunkable-mamba2: Adjusted Mamba2 kernel and model & configuration classes with recurrent chunked inference support.
- dynatrace-oss/chunkable-sentence-transformer: Sentence Transformers modules (`ChunkableTransformer`, `LastIndexPooling`) enabling constant-memory encoding.
## License
This model is licensed under the Apache License 2.0. The training dataset (open-synthetic-embeddings) is licensed under MIT.
Llama 3.1 Community License: This model was partially trained on synthetic data generated with Llama 3.1 70B. Use of this model is subject to the Llama 3.1 Community License Agreement and the Llama 3.1 Acceptable Use Policy. A copy of the Llama 3.1 license is included in the LLAMA_LICENSE file. Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
## Usage Considerations
This model is intended for the generation of text embeddings for retrieval, clustering, classification, and similar tasks.
Embedding models can reflect biases present in their pretraining and fine-tuning data. Users should evaluate the model for fairness in their specific use case and combine model outputs with human verification for applications where embedding quality could materially impact individuals or decisions. This model must not be used for any application prohibited by the Llama 3.1 Acceptable Use Policy or in any manner that violates applicable laws and regulations.
## Citation
If you find this model useful, consider citing our paper:
```bibtex
@article{grantner2026linear,
  title={Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models},
  author={Tobias Grantner and Emanuel Sallinger and Martin Flechl},
  year={2026},
  eprint={2604.18199},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2604.18199},
}
```
Base model: mistralai/Mamba-Codestral-7B-v0.1