llama-embed-mamba2-7b

llama-embed-mamba2-7b is a general-purpose text embedding model that reproduces the e5-mistral-7b-instruct training procedure on top of Codestral Mamba2 7B, a recurrent (state space model) architecture. It achieves embedding quality competitive with e5-mistral-7b-instruct and similarly trained transformer-based models while offering linear-time computational complexity and constant-memory inference through a novel vertically chunked inference strategy. This makes it substantially more efficient for long sequences and suitable for memory-constrained environments.

A smaller variant is available as llama-embed-mamba2-1.3b.

Built with Llama. This model was partially trained on synthetic data generated with Llama 3.1 70B. See License for details.

Performance

Aggregated task-mean benchmark results compared to the original e5-mistral-7b-instruct. For detailed results, see our paper.

| Model | MTEB(Multilingual, v2) | MTEB(eng, v2) | LongEmbed |
|---|---|---|---|
| llama-embed-mamba2-1.3b | 55.2 | 64.3 | 40.8 |
| llama-embed-mamba2-7b (this model) | 59.4 | 65.2 | 44.5 |
| e5-mistral-7b-instruct | 60.2 | 68.0 | 43.7 |

Vertically Chunked Inference

This model supports vertically chunked inference, which achieves constant memory usage regardless of input sequence length. Instead of processing the full sequence through one layer at a time (standard horizontal inference), this approach:

  1. Partitions the input into fixed-size vertical chunks
  2. Processes each chunk through all model layers before advancing to the next
  3. Maintains the recurrent states across chunks, preserving full context

Parallelization benefits saturate at comparatively small chunk sizes, so processing in fixed-size blocks incurs negligible runtime overhead while dramatically reducing memory consumption. This makes the model particularly advantageous for long documents, resource-constrained environments, and large-batch processing where transformer memory requirements become prohibitive. Additional information can be found in our paper.
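As a toy illustration (not the model's actual kernels), the two loop orders can be sketched with a stack of simple recurrent layers; because each layer's state is carried across chunks, vertical chunking produces exactly the same output as a full horizontal pass:

```python
def layer_step(xs, state, decay=0.9):
    """Toy recurrent layer: exponentially decayed running sum."""
    out = []
    for x in xs:
        state = decay * state + x
        out.append(state)
    return out, state

def horizontal(xs, n_layers=3):
    """Standard inference: the full sequence passes through one layer at a time."""
    for _ in range(n_layers):
        xs, _ = layer_step(xs, 0.0)
    return xs

def vertical(xs, n_layers=3, chunk=4):
    """Vertically chunked inference: each chunk passes through all layers
    before the next chunk; per-layer recurrent states are carried across
    chunks, so only one chunk is ever resident in memory."""
    states = [0.0] * n_layers
    out = []
    for start in range(0, len(xs), chunk):
        h = xs[start:start + chunk]
        for i in range(n_layers):
            h, states[i] = layer_step(h, states[i])
        out.extend(h)
    return out

seq = [float(t) for t in range(10)]
assert horizontal(seq) == vertical(seq)  # identical outputs, bounded memory
```

The per-element recurrence is evaluated in the same order either way, so the chunked variant is exact, not an approximation.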

Usage

Requires transformers>=5.5.0 due to a breaking change to the Mamba2 cache introduced in v5.5.0 (transformers#44950).

To use the efficient kernel implementations that enable vertically chunked inference, install the following libraries in addition to sentence-transformers/transformers:

pip install kernels einops

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dynatrace-oss/llama-embed-mamba2-7b", trust_remote_code=True)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is a state space model?",
]
documents = [
    "A state space model (SSM) is a mathematical framework that describes a system using state variables.",
    "Embedding models map text to dense vector representations that capture semantic meaning.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
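The query prompt above follows the e5-mistral-7b-instruct convention of prepending a one-line task instruction; a small helper (the function name is ours, not part of the model's API) keeps the format consistent. Documents are encoded without any prefix:

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Queries carry a one-line task instruction; documents are encoded as-is.
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "What is a state space model?")]
```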

For long inputs, pass vertical_chunk_size to enable constant-memory inference. The value must be a multiple of 256 (the intra-layer chunk size). Recommended: for single sequences, the largest value that fits in memory; for larger batches, 256 or 512.

embeddings = model.encode(
    ["Your long document text here..."],
    vertical_chunk_size=512,
)
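If the chunk size comes from a config value or a memory heuristic, a tiny helper (ours, not part of the model's API) keeps it a valid multiple of 256:

```python
INTRA_LAYER_CHUNK = 256  # vertical_chunk_size must be a multiple of this

def pick_vertical_chunk_size(target: int) -> int:
    """Round a desired chunk size down to the nearest valid multiple of 256."""
    if target < INTRA_LAYER_CHUNK:
        return INTRA_LAYER_CHUNK
    return (target // INTRA_LAYER_CHUNK) * INTRA_LAYER_CHUNK
```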

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dynatrace-oss/llama-embed-mamba2-7b")
model = AutoModel.from_pretrained(
    "dynatrace-oss/llama-embed-mamba2-7b", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()

def encode(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Last-token pooling: assumes the tokenizer pads on the left,
    # so position -1 holds the last real token of every sequence.
    embeddings = outputs.last_hidden_state[:, -1, :]
    return F.normalize(embeddings, p=2, dim=-1)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is a state space model?",
]
documents = [
    "A state space model (SSM) is a mathematical framework that describes a system using state variables.",
    "Embedding models map text to dense vector representations that capture semantic meaning.",
]

query_embeddings = encode(queries)
document_embeddings = encode(documents)

similarities = query_embeddings @ document_embeddings.T
print(similarities)

For long inputs, process the input in vertical chunks to maintain constant memory usage. The vertical_chunk_size must be a multiple of 256 (the intra-layer chunk size).

text = "Your long document text here..."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

vertical_chunk_size = 512  # must be a multiple of 256
cache_params = None
for chunk_start in range(0, inputs["input_ids"].size(1), vertical_chunk_size):
    chunk_end = chunk_start + vertical_chunk_size
    # Slice the token ids for this chunk; the attention mask must cover
    # the full prefix seen so far (up to chunk_end), not just the chunk.
    chunk_inputs = {
        k: v[:, chunk_start:chunk_end]
        for k, v in inputs.items() if k != "attention_mask"
    }
    chunk_inputs["attention_mask"] = inputs["attention_mask"][:, :chunk_end]
    with torch.no_grad():
        # Carry the recurrent states across chunks to preserve full context.
        outputs = model(**chunk_inputs, use_cache=True, cache_params=cache_params)
    cache_params = outputs.cache_params

# The final hidden state of the last chunk embeds the full sequence.
embedding = outputs.last_hidden_state[:, -1, :]
embedding = F.normalize(embedding, p=2, dim=-1)

Open Source Integration Roadmap

Our goal is to integrate all necessary changes to simplify the adoption of vertically chunked inference for other models:

⚪ Planned | 🟡 In Progress | 🟢 Integrated

  • causal-conv1d: Enable simultaneous seq_idx + initial_states (required for recurrent processing with left padding)
  • mamba-ssm: Use seq_idx + initial_states in mamba_split_conv1d_scan_combined and export final states
  • kernels-community: Propagate changes in causal-conv1d and mamba-ssm to their kernel hub equivalents in the kernels-community repositories
  • transformers: Use the updated mamba_split_conv1d_scan_combined with cache params during inference (currently only used during training; not configurable; issues with left padding)
  • sentence-transformers: Native vertical chunking support

This list will be updated as integration progresses.

Code Repositories

The custom modeling and inference code used by this model is maintained in two separate repositories so that it can be shared across model checkpoints:

License

This model is licensed under the Apache License 2.0. The training dataset (open-synthetic-embeddings) is licensed under MIT.

Llama 3.1 Community License: This model was partially trained on synthetic data generated with Llama 3.1 70B. Use of this model is subject to the Llama 3.1 Community License Agreement and the Llama 3.1 Acceptable Use Policy. A copy of the Llama 3.1 license is included in the LLAMA_LICENSE file. Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

Usage Considerations

This model is intended for the generation of text embeddings for retrieval, clustering, classification, and similar tasks.

Embedding models can reflect biases present in their pretraining and fine-tuning data. Users should evaluate the model for fairness in their specific use case and combine model outputs with human verification for applications where embedding quality could materially impact individuals or decisions. This model must not be used for any application prohibited by the Llama 3.1 Acceptable Use Policy or in any manner that violates applicable laws and regulations.

Citation

If you find this model useful, consider citing our paper:

@article{grantner2026linear,
    title={Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models}, 
    author={Tobias Grantner and Emanuel Sallinger and Martin Flechl},
    year={2026},
    eprint={2604.18199},
    archivePrefix={arXiv},
    url={https://arxiv.org/abs/2604.18199},
}