bge-small-rrf-lme-v1: chat-memory specialist embedding (33M params, 384-dim)
A drop-in replacement for BAAI/bge-small-en-v1.5 specialized
for chat-memory retrieval -- finding the right past conversation
session given a user's natural-language question about their own
history. Same size, same speed, same API; higher quality on
conversational long-term memory.
If you are building:
- a long-running chat assistant that needs to recall facts across sessions ("what did the user tell me last week about X?"),
- an LLM agent's memory layer over conversational history,
- a RAG pipeline over a user's own past chats / notes / journal,
- a personal memory tool that retrieves prior interactions,
this model is tuned for that distribution. General-purpose BEIR
retrieval (claim verification, biomedical QA, etc.) should use
Stffens/bge-small-rrf-v3 instead.
Trained through vstash's eval-gated labeled-retrain mode using 398 real LongMemEval question/gold-session pairs (no synthetic queries, no LLM labelers, no human annotators added on top of the LongMemEval gold).
Looking for a model of this kind? Quick router
| Your use case | Recommended model |
|---|---|
| Chat memory / conversation recall | this model (Stffens/bge-small-rrf-lme-v1) |
| BEIR / scientific / general-purpose retrieval | Stffens/bge-small-rrf-v3 |
| Vanilla starting point, no domain adaptation | BAAI/bge-small-en-v1.5 |
| Your own domain (legal, clinical, code, ...) | Train your own via vstash retrain -- this model is the worked example, the gated retrain pipeline is the deployable artifact |
Why this model exists
vstash's BEIR fine-tunes (bge-small-rrf-v2 / bge-small-rrf-v3)
beat vanilla BGE-small on every BEIR dataset but lose to
vanilla on chat-memory retrieval (LongMemEval-s holdout NDCG@10
0.5898 vs 0.6143, -2.45pp). A model that wins on BEIR does not
necessarily win on chat memory. This model exists to recover --
and exceed -- vanilla BGE-small on the chat-memory distribution
that the BEIR fine-tunes regress on.
Eval numbers
Evaluated on a deterministic 80/20 stratified split of LongMemEval-s
(seed 42, experiments/lme_prepare_retrain.py:stratified_split),
with vstash's hybrid retrieval pipeline (RRF + adaptive IDF + MMR
dedup). All headline claims rest on the 102-query disjoint
holdout.
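For context, here is a minimal sketch of a seeded, per-type 80/20 split of this shape. It is not the actual stratified_split implementation in experiments/lme_prepare_retrain.py; the grouping field name (question_type) and the seed/ratio are taken from the description above, and per-group rounding is why the holdout ends up at 102 rather than exactly 100 queries.

```python
import random
from collections import defaultdict

def stratified_split(questions, eval_frac=0.2, seed=42):
    """80/20 split, stratified by question type, deterministic via seed."""
    by_type = defaultdict(list)
    for q in questions:
        by_type[q["question_type"]].append(q)  # field name assumed for illustration
    rng = random.Random(seed)
    train, holdout = [], []
    for qtype in sorted(by_type):  # sorted iteration keeps the split reproducible
        group = by_type[qtype]
        rng.shuffle(group)
        n_eval = round(len(group) * eval_frac)  # per-group rounding -> 102 total
        holdout.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, holdout
```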
NDCG@10 on the holdout (eval-gate metric)
| Model | NDCG@10 | Delta vs vanilla |
|---|---|---|
| BAAI/bge-small-en-v1.5 (vanilla) | 0.6143 | -- |
| Stffens/bge-small-rrf-v2 (BEIR-tuned) | 0.5898 | -0.0245 |
| bge-small-rrf-lme-v1 (this model) | 0.6878 | +0.0735 |
The chat fine-tune lifts NDCG@10 by +7.35pp absolute over vanilla, a roughly 12% relative gain. The eval gate would have correctly rejected the BEIR-tuned v2 (regression) and auto-promoted this model.
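The gate itself is a simple threshold on the holdout metric. A sketch of the decision rule described here, using the numbers from the table (function and variable names are illustrative, not vstash internals):

```python
def eval_gate(baseline_ndcg: float, candidate_ndcg: float, min_gain: float = 0.0) -> bool:
    """Refuse to save/promote unless the candidate beats baseline by at least min_gain."""
    return (candidate_ndcg - baseline_ndcg) >= min_gain

eval_gate(0.6143, 0.5898)  # False: BEIR-tuned v2 regresses on chat memory, rejected
eval_gate(0.6143, 0.6878)  # True: this model clears the gate, auto-promoted
```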
Recall@K with bootstrap confidence intervals
Paired bootstrap, B=1000 resamples with replacement, seed 42, n=102 holdout queries.
| K | base BGE | this model | Delta | 95% CI | Note |
|---|---|---|---|---|---|
| 1 | 0.5209 | 0.5552 | +0.0343 | [-0.0049, +0.0784] | directional |
| 3 | 0.8418 | 0.8716 | +0.0297 | [+0.0085, +0.0529] | -- |
| 5 | 0.8905 | 0.9284 | +0.0379 | [+0.0172, +0.0619] | -- |
| 10 | 0.9634 | 0.9658 | +0.0025 | [-0.0123, +0.0172] | saturated |
R@3 and R@5 lifts are statistically significant on the n=102 holdout. R@1 trends positive but the CI narrowly crosses zero; we report it as directional rather than significant. R@10 is saturated (both arms already retrieve the gold session in the top-10 on ~96% of queries).
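A minimal sketch of the paired bootstrap described above, assuming per-query 0/1 recall arrays for both arms (this is not the actual experiments/lme_holdout_bootstrap code):

```python
import numpy as np

def paired_bootstrap_ci(base_hits, tuned_hits, B=1000, seed=42, alpha=0.05):
    """95% CI on mean(tuned - base) by resampling the n queries with replacement."""
    base = np.asarray(base_hits, dtype=float)    # per-query recall@K, base model
    tuned = np.asarray(tuned_hits, dtype=float)  # per-query recall@K, this model
    deltas = tuned - base                        # paired per-query differences
    rng = np.random.default_rng(seed)
    n = len(deltas)
    resampled = np.array([deltas[rng.integers(0, n, n)].mean() for _ in range(B)])
    lo, hi = np.percentile(resampled, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return deltas.mean(), (lo, hi)  # CI crossing zero -> "directional", not significant
```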
Per question-type R@5 on the same holdout
| Type | n | base | this model |
|---|---|---|---|
| single-session-user | 14 | 0.929 | 0.929 (saturated) |
| single-session-assistant | 12 | 1.000 | 1.000 |
| single-session-preference | 6 | 1.000 | 1.000 |
| multi-session | 27 | 0.869 | 0.938 (+6.9pp) |
| knowledge-update | 16 | 0.969 | 1.000 (+3.1pp) |
| temporal-reasoning | 27 | 0.774 | 0.829 (+5.6pp) |
Gains concentrate on multi-session and temporal-reasoning -- the two categories where queries reference cross-session entities and dates that surface form alone does not match.
Latency
Measured on a 2024 Apple Silicon Mac with FastEmbed CPU inference, per-question 50-doc store: 27 ms median, 70 ms P99 search latency. No latency regression vs vanilla BGE-small (21 ms median; the +6 ms is within run-to-run noise).
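A rough way to sanity-check query-encoding latency on your own hardware. Note the numbers above were measured with FastEmbed inside the full vstash hybrid pipeline; this snippet uses sentence-transformers and times only the encode call, so expect different absolute values:

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-lme-v1")
model.encode(["warmup"])  # first call pays model/tokenizer warmup cost

timings = []
for _ in range(200):
    t0 = time.perf_counter()
    model.encode(["What did I tell you about my flight to Tokyo last week?"])
    timings.append((time.perf_counter() - t0) * 1000)  # milliseconds

print(f"median {np.median(timings):.1f} ms, p99 {np.percentile(timings, 99):.1f} ms")
```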
Usage
Drop-in via sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-lme-v1")

# Encode the user's question and a candidate past session
embeddings = model.encode(
    [
        "What did I tell you about my flight to Tokyo last week?",
        "[USER] By the way, my flight to Tokyo got pushed to next Tuesday "
        "so I can finally make the team dinner. [ASSISTANT] Glad to hear "
        "that worked out! Tuesday it is.",
    ],
    normalize_embeddings=True,
)
similarity = embeddings[0] @ embeddings[1]
```
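Because normalize_embeddings=True produces unit-length vectors, the dot product above is exactly cosine similarity; scores closer to 1.0 mean the candidate session is a better match for the question.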
As a chat-memory backbone in vstash
```bash
vstash reindex --model Stffens/bge-small-rrf-lme-v1
```
Inside any RAG / retrieval stack
Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity,
instruction-free encoding, normalize before cosine. Drop into any
retrieval stack built around the base model.
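For example, in a LangChain-based stack (assuming the langchain-huggingface package; any framework that accepts a Hugging Face model name works the same way), the swap is one argument:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="Stffens/bge-small-rrf-lme-v1",     # was: BAAI/bge-small-en-v1.5
    encode_kwargs={"normalize_embeddings": True},  # normalize before cosine
)
```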
Reproducing this fine-tune
End-to-end reproduction (corpus prep ~9 min Mac, training ~12 min Colab T4, full-set scoring ~9 min Mac):
```bash
# 1. Stratified split + ingest LongMemEval-s into a vstash store
python -m experiments.lme_prepare_retrain \
    --output-db experiments/lme_retrain_full.db \
    --output-train experiments/results/lme_train.jsonl \
    --output-eval experiments/results/lme_eval.jsonl \
    --output-meta experiments/results/lme_retrain_meta.json

# 2. Train via the eval-gated labeled retrain mode (Colab T4)
VSTASH_DB_PATH=experiments/lme_retrain_full.db \
vstash retrain \
    --training-queries experiments/results/lme_train.jsonl \
    --eval-queries experiments/results/lme_eval.jsonl \
    --base-model BAAI/bge-small-en-v1.5 \
    --output ~/.vstash/models/bge-small-rrf-lme-v1 \
    --bulk-mine --bulk-mine-device cuda

# 3. Score on the full LongMemEval-s set (Mac local)
python -m experiments.longmemeval_retrieval --all \
    --model ~/.vstash/models/bge-small-rrf-lme-v1 \
    --output experiments/results/lme_full_500_lme-v1.json

# 4. Bootstrap holdout CIs
python -m experiments.lme_holdout_bootstrap
```
Hyperparameters
| Key | Value |
|---|---|
| Base model | BAAI/bge-small-en-v1.5 |
| Loss | MultipleNegativesRankingLoss |
| Training triples | 5,000 (capped, mined from 398 labeled queries) |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 64 |
| Mixed precision | FP16 (AMP on) |
| Seed | 42 |
| Training hardware | NVIDIA T4 (Colab) |
| Training time | ~77 s (12 min wall incl. mining) |
| Eval gate | refuse-to-save if delta NDCG@10 < --min-gain (default 0.0) |
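A minimal sketch of a fine-tune with this recipe using the classic sentence-transformers fit API. This is not vstash's internal training code: mining and the eval gate are omitted, and the two-element `triples` list merely stands in for the 5,000 mined (query, positive, hard-negative) triples.

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

torch.manual_seed(42)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Illustrative stand-in for the 5,000 mined triples
triples = [
    (
        "What did I tell you about my flight to Tokyo?",
        "[USER] My flight to Tokyo got pushed to next Tuesday. [ASSISTANT] Noted!",
        "[USER] I adopted a cat named Miso. [ASSISTANT] Congrats!",
    ),
]

train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triples]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives + mined hard negative

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    optimizer_params={"lr": 3e-6},
    use_amp=True,  # FP16 mixed precision
)
model.save("bge-small-rrf-lme-v1")
```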
Limitations
- English-only. The base model and training corpus are English. Cross-lingual retrieval is not validated.
- Specialist model. Do not deploy this on BEIR-style retrieval workloads; it was selected on chat-memory NDCG, not generic retrieval quality. Use bge-small-rrf-v3 for BEIR.
- R@1 ceiling. Macro holdout R@1 sits at 73% of the structural ceiling of 0.75 (multi-gold question types in LongMemEval cap macro R@1 by construction). Closing the remaining gap likely requires a cross-encoder reranker over the top-10 (see the sketch after this list), not further embedding refinement.
- R@1 lift is directional, not statistically significant at 95% on the n=102 holdout (CI [-0.005, +0.078] crosses zero). The headline significance-grade lifts are R@3 and R@5.
- Single-benchmark scope. All chat-memory eval is on LongMemEval-s. Generalization to LoCoMo, MultiSession-Chat, or other chat-memory benchmarks is not validated. If you have a different chat-memory corpus, the deployable artifact is the eval-gated retrain pipeline, not this exact model: train your own specialist on your own labels.
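A hedged sketch of that reranking step: rescore the bi-encoder's top-10 with a cross-encoder and reorder. The checkpoint here is a generic MS MARCO reranker, not something shipped with or validated for vstash:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What did I tell you about my flight to Tokyo last week?"
top10 = ["...candidate session 1...", "...candidate session 2..."]  # bi-encoder top-10

# Score every (query, session) pair jointly, then sort descending by score
scores = reranker.predict([(query, session) for session in top10])
reranked = [s for _, s in sorted(zip(scores, top10), key=lambda p: -p[0])]
```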
Keywords
Chat memory embedding, conversational memory retrieval, agent memory model, LLM long-term memory, multi-session conversation retrieval, chat history embedding, conversational RAG embedding, LongMemEval embedding, BGE chat-memory fine-tune, domain-adapted embedding for chat assistants, eval-gated retrieval fine-tune.
Citation
```bibtex
@software{vstash_bge_small_rrf_lme_v1_2026,
  author = {Steffens, Jay},
  title  = {bge-small-rrf-lme-v1: chat-memory specialist embedding via vstash's eval-gated labeled retrain},
  year   = {2026},
  url    = {https://huggingface.co/Stffens/bge-small-rrf-lme-v1}
}
```
For the underlying paper:
```bibtex
@misc{vstash_v2_2026,
  author = {Steffens, Jay},
  title  = {vstash v2: Eval-Gated Domain Adaptation for Local-First LLM Memory},
  year   = {2026},
  note   = {arXiv preprint, paper v2 of arXiv:2604.15484}
}
```
For LongMemEval (the evaluation benchmark used here):
```bibtex
@article{wu2024longmemeval,
  title   = {LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author  = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yunsheng and Chang, Kai-Wei and Yu, Dong},
  journal = {arXiv preprint arXiv:2410.10813},
  year    = {2024}
}
```
For vstash itself:
```bibtex
@software{vstash_2026,
  author = {Steffens, Jay},
  title  = {vstash: local-first document memory with instant semantic search},
  year   = {2026},
  url    = {https://github.com/stffns/vstash}
}
```