bge-small-rrf-lme-v1: chat-memory specialist embedding (33M params, 384-dim)

A drop-in replacement for BAAI/bge-small-en-v1.5 specialized for chat-memory retrieval -- finding the right past conversation session given a user's natural-language question about their own history. Same size, same speed, same API; higher quality on conversational long-term memory.

If you are building:

  • a long-running chat assistant that needs to recall facts across sessions ("what did the user tell me last week about X?"),
  • an LLM agent's memory layer over conversational history,
  • a RAG pipeline over a user's own past chats / notes / journal,
  • a personal memory tool that retrieves prior interactions,

this model is tuned for that distribution. For general-purpose BEIR retrieval (claim verification, biomedical QA, etc.), use Stffens/bge-small-rrf-v3 instead.

Trained through vstash's eval-gated labeled-retrain mode using 398 real LongMemEval question/gold-session pairs (no synthetic queries, no LLM labelers, no human annotators added on top of the LongMemEval gold).

Looking for a model of this kind? Quick router

| Your use case | Recommended model |
| --- | --- |
| Chat memory / conversation recall | this model (Stffens/bge-small-rrf-lme-v1) |
| BEIR / scientific / general-purpose retrieval | Stffens/bge-small-rrf-v3 |
| Vanilla starting point, no domain adaptation | BAAI/bge-small-en-v1.5 |
| Your own domain (legal, clinical, code, ...) | Train your own via vstash retrain; this model is the worked example, the gated retrain pipeline is the deployable artifact |

Why this model exists

vstash's BEIR fine-tunes (bge-small-rrf-v2 / bge-small-rrf-v3) beat vanilla BGE-small on every BEIR dataset but lose to vanilla on chat-memory retrieval (LongMemEval-s holdout NDCG@10 0.5898 vs 0.6143, -2.45pp). A model that wins on BEIR does not necessarily win on chat memory. This model exists to recover -- and exceed -- vanilla BGE-small on the chat-memory distribution that the BEIR fine-tunes regress on.

Eval numbers

Evaluated on a deterministic 80/20 stratified split of LongMemEval-s (seed 42, experiments/lme_prepare_retrain.py:stratified_split), with vstash's hybrid retrieval pipeline (RRF + adaptive IDF + MMR dedup). All headline claims rest on the 102-query disjoint holdout.
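
For orientation, here is a minimal sketch of Reciprocal Rank Fusion, the "RRF" in that pipeline. It is illustrative only: vstash's actual pipeline layers adaptive IDF weighting and MMR dedup on top, which are not shown here.

```python
# Minimal RRF sketch: fuse ranked lists from several retrievers by summing
# 1 / (k + rank) per document. Illustrative only; vstash's real pipeline
# also applies adaptive IDF and MMR dedup (not shown).
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:                       # e.g. [dense_ranking, lexical_ranking]
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a dense (embedding) ranking with a lexical ranking of session ids:
fused = rrf_fuse([["s3", "s1", "s7"], ["s1", "s9", "s3"]])
```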

NDCG@10 on the holdout (eval-gate metric)

| Model | NDCG@10 | Delta vs vanilla |
| --- | --- | --- |
| BAAI/bge-small-en-v1.5 (vanilla) | 0.6143 | -- |
| Stffens/bge-small-rrf-v2 (BEIR-tuned) | 0.5898 | -0.0245 |
| bge-small-rrf-lme-v1 (this model) | 0.6878 | +0.0735 |

The chat fine-tune lifts NDCG@10 by 11.85% relative (+7.35pp absolute) over vanilla. The eval gate would have correctly rejected the BEIR-tuned v2 (regression) and auto-promoted this model.
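
The gate itself is a one-line decision. A sketch of the assumed logic (the real check lives inside vstash retrain, keyed on --min-gain):

```python
# Assumed shape of the eval gate: save/promote the candidate model only if
# its holdout NDCG@10 beats the baseline by at least --min-gain (default 0.0).
def eval_gate(candidate_ndcg: float, baseline_ndcg: float, min_gain: float = 0.0) -> bool:
    return (candidate_ndcg - baseline_ndcg) >= min_gain

eval_gate(0.5898, 0.6143)  # BEIR-tuned v2 -> False: refused (regression)
eval_gate(0.6878, 0.6143)  # this model    -> True: promoted
```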

Recall@K with bootstrap confidence intervals

Paired bootstrap, B=1000 resamples with replacement, seed 42, n=102 holdout queries.

| K | base BGE | this model | Delta | 95% CI | Note |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.5209 | 0.5552 | +0.0343 | [-0.0049, +0.0784] | directional |
| 3 | 0.8418 | 0.8716 | +0.0297 | [+0.0085, +0.0529] | |
| 5 | 0.8905 | 0.9284 | +0.0379 | [+0.0172, +0.0619] | |
| 10 | 0.9634 | 0.9658 | +0.0025 | [-0.0123, +0.0172] | saturated |

R@3 and R@5 lifts are statistically significant on the n=102 holdout. R@1 trends positive but the CI narrowly crosses zero; we report it as directional rather than significant. R@10 is saturated (both arms already retrieve the gold session in the top-10 on ~96% of queries).
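
A minimal sketch of the paired bootstrap behind these CIs (assumed shape; the packaged script is experiments/lme_holdout_bootstrap, run in step 4 of the reproduction below):

```python
import numpy as np

# `base` and `tuned` are NumPy arrays of per-query Recall@K indicators (0/1)
# over the same 102 holdout queries; resampling query indices keeps the pairing.
def paired_bootstrap_ci(base, tuned, n_boot=1000, seed=42, alpha=0.05):
    rng = np.random.default_rng(seed)
    n = len(base)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                  # resample with replacement
        deltas[b] = tuned[idx].mean() - base[idx].mean()  # paired: same idx both arms
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return tuned.mean() - base.mean(), (lo, hi)           # point delta, 95% CI
```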

Per question-type R@5 on the same holdout

| Type | n | base | this model |
| --- | --- | --- | --- |
| single-session-user | 14 | 0.929 | 0.929 (saturated) |
| single-session-assistant | 12 | 1.000 | 1.000 |
| single-session-preference | 6 | 1.000 | 1.000 |
| multi-session | 27 | 0.869 | 0.938 (+6.9pp) |
| knowledge-update | 16 | 0.969 | 1.000 (+3.1pp) |
| temporal-reasoning | 27 | 0.774 | 0.829 (+5.6pp) |

Gains concentrate on multi-session and temporal-reasoning -- the two categories where queries reference cross-session entities and dates that surface form alone does not match.

Latency

Measured on a 2024 Apple Silicon Mac with FastEmbed CPU inference, per-question 50-doc store: 27 ms median, 70 ms P99 search latency. No latency regression vs vanilla BGE-small (21 ms median; the +6 ms is within run-to-run noise).

Usage

Drop-in via sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-lme-v1")

# Encode the user's question and a candidate past session
embeddings = model.encode(
    [
        "What did I tell you about my flight to Tokyo last week?",
        "[USER] By the way, my flight to Tokyo got pushed to next Tuesday "
        "so I can finally make the team dinner. [ASSISTANT] Glad to hear "
        "that worked out! Tuesday it is.",
    ],
    normalize_embeddings=True,
)
similarity = embeddings[0] @ embeddings[1]
```

As a chat-memory backbone in vstash

```bash
vstash reindex --model Stffens/bge-small-rrf-lme-v1
```

Inside any RAG / retrieval stack

Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity, instruction-free encoding, normalize before cosine. Drop into any retrieval stack built around the base model.
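
If you are not using sentence-transformers, the same encoding can be reproduced with raw transformers. This is a sketch, not an official snippet, and it assumes the model keeps the base's CLS pooling (consistent with "same API as bge-small-en-v1.5"):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Stffens/bge-small-rrf-lme-v1")
model = AutoModel.from_pretrained("Stffens/bge-small-rrf-lme-v1")
model.eval()

texts = [
    "What did I tell you about my flight to Tokyo last week?",
    "[USER] my flight to Tokyo got pushed to next Tuesday ...",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
emb = F.normalize(hidden[:, 0], p=2, dim=1)  # CLS pooling + L2 norm, as in the base model
similarity = (emb[0] @ emb[1]).item()        # dot product == cosine after normalization
```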

Reproducing this fine-tune

End-to-end reproduction (corpus prep ~9 min on a Mac, training ~12 min on a Colab T4, full-set scoring ~9 min on a Mac):

```bash
# 1. Stratified split + ingest LongMemEval-s into a vstash store
python -m experiments.lme_prepare_retrain \
    --output-db    experiments/lme_retrain_full.db \
    --output-train experiments/results/lme_train.jsonl \
    --output-eval  experiments/results/lme_eval.jsonl \
    --output-meta  experiments/results/lme_retrain_meta.json

# 2. Train via the eval-gated labeled retrain mode (Colab T4)
VSTASH_DB_PATH=experiments/lme_retrain_full.db \
vstash retrain \
    --training-queries experiments/results/lme_train.jsonl \
    --eval-queries     experiments/results/lme_eval.jsonl \
    --base-model       BAAI/bge-small-en-v1.5 \
    --output           ~/.vstash/models/bge-small-rrf-lme-v1 \
    --bulk-mine --bulk-mine-device cuda

# 3. Score on the full LongMemEval-s set (Mac local)
python -m experiments.longmemeval_retrieval --all \
    --model ~/.vstash/models/bge-small-rrf-lme-v1 \
    --output experiments/results/lme_full_500_lme-v1.json

# 4. Bootstrap holdout CIs
python -m experiments.lme_holdout_bootstrap
```
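
The schema of lme_train.jsonl / lme_eval.jsonl is whatever experiments.lme_prepare_retrain emits. Purely as a hypothetical illustration (the field names here are invented for this sketch, not vstash's documented format), each record pairs a question with its gold session id(s):

```json
{"question_id": "q_0042", "query": "what did I say about my Tokyo flight?", "gold_session_ids": ["session_137"]}
```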

Hyperparameters

| Key | Value |
| --- | --- |
| Base model | BAAI/bge-small-en-v1.5 |
| Loss | MultipleNegativesRankingLoss |
| Training triples | 5,000 (capped; mined from 398 labeled queries) |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 64 |
| Mixed precision | FP16 (AMP on) |
| Seed | 42 |
| Training hardware | NVIDIA T4 (Colab) |
| Training time | ~77 s (12 min wall incl. mining) |
| Eval gate | refuse-to-save if delta NDCG@10 < --min-gain (default 0.0) |
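
For orientation, a sketch of what this configuration looks like in plain sentence-transformers. The triple mining itself is vstash-internal, so train_examples below is a stand-in for the ~5,000 mined pairs, not the actual mining code:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Stand-in for the ~5,000 mined (query, positive session) pairs; with
# MultipleNegativesRankingLoss, the other in-batch positives act as negatives.
train_examples = [
    InputExample(texts=["what did I say about my Tokyo flight?",
                        "[USER] my flight to Tokyo got pushed to next Tuesday ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    optimizer_params={"lr": 3e-6},
    use_amp=True,  # FP16 mixed precision
)
model.save("bge-small-rrf-lme-v1")
```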

Limitations

  • English-only. The base model and training corpus are English. Cross-lingual retrieval is not validated.
  • Specialist model. Do not deploy this on BEIR-style retrieval workloads; it was selected on chat-memory NDCG, not generic retrieval quality. Use bge-small-rrf-v3 for BEIR.
  • R@1 ceiling. Macro holdout R@1 sits at 73% of the structural ceiling of 0.75 (multi-gold question types in LongMemEval cap macro R@1 by construction). Closing the remaining gap likely requires a cross-encoder reranker over the top-10 (sketched after this list), not further embedding refinement.
  • R@1 lift is directional, not statistically significant at 95% on the n=102 holdout (CI [-0.005, +0.078] crosses zero). The headline significance-grade lifts are R@3 and R@5.
  • Single-benchmark scope. All chat-memory eval is on LongMemEval-s. Generalization to LoCoMo, MultiSession-Chat, or other chat-memory benchmarks is not validated. If you have a different chat-memory corpus, the deployable artifact is the eval-gated retrain pipeline, not this exact model: train your own specialist on your own labels.
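
On the R@1 point, a hedged sketch of the rerank step suggested above: rescore the bi-encoder's top-10 sessions with a pairwise cross-encoder. The reranker checkpoint named here is a generic public example, not something this card ships or validated:

```python
from sentence_transformers import CrossEncoder

# Generic MS MARCO reranker as a placeholder; any pairwise cross-encoder works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_top10(query, sessions):
    """Re-order the bi-encoder's top-10 sessions by cross-encoder score."""
    scores = reranker.predict([(query, s) for s in sessions])
    order = sorted(range(len(sessions)), key=lambda i: scores[i], reverse=True)
    return [sessions[i] for i in order]
```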

Keywords

Chat memory embedding, conversational memory retrieval, agent memory model, LLM long-term memory, multi-session conversation retrieval, chat history embedding, conversational RAG embedding, LongMemEval embedding, BGE chat-memory fine-tune, domain-adapted embedding for chat assistants, eval-gated retrieval fine-tune.

Citation

@software{vstash_bge_small_rrf_lme_v1_2026,
  author  = {Steffens, Jay},
  title   = {bge-small-rrf-lme-v1: chat-memory specialist embedding via vstash's eval-gated labeled retrain},
  year    = {2026},
  url     = {https://huggingface.co/Stffens/bge-small-rrf-lme-v1}
}

For the underlying paper:

@misc{vstash_v2_2026,
  author  = {Steffens, Jay},
  title   = {vstash v2: Eval-Gated Domain Adaptation for Local-First LLM Memory},
  year    = {2026},
  note    = {arXiv preprint, paper v2 of arXiv:2604.15484}
}

For LongMemEval (the evaluation benchmark used here):

@article{wu2024longmemeval,
  title  = {LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yunsheng and Chang, Kai-Wei and Yu, Dong},
  journal = {arXiv preprint arXiv:2410.10813},
  year   = {2024}
}

For vstash itself:

@software{vstash_2026,
  author  = {Steffens, Jay},
  title   = {vstash: local-first document memory with instant semantic search},
  year    = {2026},
  url     = {https://github.com/stffns/vstash}
}