bge-small-rrf-lme-v1: chat-memory specialist embedding (33M params, 384-dim)

A drop-in replacement for BAAI/bge-small-en-v1.5 specialized for chat-memory retrieval -- finding the right past conversation session given a user's natural-language question about their own history. Same size, same speed, same API; higher quality on conversational long-term memory.

If you are building:

  • a long-running chat assistant that needs to recall facts across sessions ("what did the user tell me last week about X?"),
  • an LLM agent's memory layer over conversational history,
  • a RAG pipeline over a user's own past chats / notes / journal,
  • a personal memory tool that retrieves prior interactions,

this model is tuned for that distribution. For general-purpose BEIR retrieval (claim verification, biomedical QA, etc.), use Stffens/bge-small-rrf-v3 instead.

Trained through vstash's eval-gated labeled-retrain mode using 398 real LongMemEval question/gold-session pairs (no synthetic queries, no LLM labelers, no human annotators added on top of the LongMemEval gold).

Looking for a model of this kind? Quick router

| Your use case | Recommended model |
| --- | --- |
| Chat memory / conversation recall | this model (Stffens/bge-small-rrf-lme-v1) |
| BEIR / scientific / general-purpose retrieval | Stffens/bge-small-rrf-v3 |
| Vanilla starting point, no domain adaptation | BAAI/bge-small-en-v1.5 |
| Your own domain (legal, clinical, code, ...) | Train your own via vstash retrain; this model is the worked example, the gated retrain pipeline is the deployable artifact |

Why this model exists

vstash's BEIR fine-tunes (bge-small-rrf-v2 / bge-small-rrf-v3) beat vanilla BGE-small on every BEIR dataset but lose to vanilla on chat-memory retrieval (LongMemEval-s holdout NDCG@10 0.5898 vs 0.6143, -2.45pp). A model that wins on BEIR does not necessarily win on chat memory. This model exists to recover -- and exceed -- vanilla BGE-small on the chat-memory distribution that the BEIR fine-tunes regress on.

Eval numbers

Evaluated on a deterministic 80/20 stratified split of LongMemEval-s (seed 42, experiments/lme_prepare_retrain.py:stratified_split), with vstash's hybrid retrieval pipeline (RRF + adaptive IDF + MMR dedup). All headline claims rest on the 102-query disjoint holdout.
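
For orientation, here is a minimal sketch of Reciprocal Rank Fusion, the "RRF" in that pipeline. It is illustrative only: vstash's actual pipeline layers adaptive IDF weighting and MMR dedup on top, which are not shown here.

```python
# Minimal RRF sketch: fuse ranked lists from several retrievers by summing
# 1 / (k + rank) per document. Illustrative only; vstash's real pipeline
# also applies adaptive IDF and MMR dedup (not shown).
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:                       # e.g. [dense_ranking, lexical_ranking]
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a dense (embedding) ranking with a lexical ranking of session ids:
fused = rrf_fuse([["s3", "s1", "s7"], ["s1", "s9", "s3"]])
```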

NDCG@10 on the holdout (eval-gate metric)

| Model | NDCG@10 | Delta vs vanilla |
| --- | --- | --- |
| BAAI/bge-small-en-v1.5 (vanilla) | 0.6143 | -- |
| Stffens/bge-small-rrf-v2 (BEIR-tuned) | 0.5898 | -0.0245 |
| bge-small-rrf-lme-v1 (this model) | 0.6878 | +0.0735 |

The chat fine-tune lifts NDCG@10 by 11.85% relative (+7.35pp absolute) over vanilla. The eval gate would have correctly rejected the BEIR-tuned v2 (regression) and auto-promoted this model.
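
The gate itself is a one-line decision. A sketch of the assumed logic (the real check lives inside vstash retrain, keyed on --min-gain):

```python
# Assumed shape of the eval gate: save/promote the candidate model only if
# its holdout NDCG@10 beats the baseline by at least --min-gain (default 0.0).
def eval_gate(candidate_ndcg: float, baseline_ndcg: float, min_gain: float = 0.0) -> bool:
    return (candidate_ndcg - baseline_ndcg) >= min_gain

eval_gate(0.5898, 0.6143)  # BEIR-tuned v2 -> False: refused (regression)
eval_gate(0.6878, 0.6143)  # this model    -> True: promoted
```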

Recall@K with bootstrap confidence intervals

Paired bootstrap, B=1000 resamples with replacement, seed 42, n=102 holdout queries.

| K | base BGE | this model | Delta | 95% CI | Note |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.5209 | 0.5552 | +0.0343 | [-0.0049, +0.0784] | directional |
| 3 | 0.8418 | 0.8716 | +0.0297 | [+0.0085, +0.0529] | |
| 5 | 0.8905 | 0.9284 | +0.0379 | [+0.0172, +0.0619] | |
| 10 | 0.9634 | 0.9658 | +0.0025 | [-0.0123, +0.0172] | saturated |

R@3 and R@5 lifts are statistically significant on the n=102 holdout. R@1 trends positive but the CI narrowly crosses zero; we report it as directional rather than significant. R@10 is saturated (both arms already retrieve the gold session in the top-10 on ~96% of queries).
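
A minimal sketch of the paired bootstrap behind these CIs (assumed shape; the packaged script is experiments/lme_holdout_bootstrap, run in step 4 of the reproduction below):

```python
import numpy as np

# `base` and `tuned` are NumPy arrays of per-query Recall@K indicators (0/1)
# over the same 102 holdout queries; resampling query indices keeps the pairing.
def paired_bootstrap_ci(base, tuned, n_boot=1000, seed=42, alpha=0.05):
    rng = np.random.default_rng(seed)
    n = len(base)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                  # resample with replacement
        deltas[b] = tuned[idx].mean() - base[idx].mean()  # paired: same idx both arms
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return tuned.mean() - base.mean(), (lo, hi)           # point delta, 95% CI
```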

Per question-type R@5 on the same holdout

| Type | n | base | this model |
| --- | --- | --- | --- |
| single-session-user | 14 | 0.929 | 0.929 (saturated) |
| single-session-assistant | 12 | 1.000 | 1.000 |
| single-session-preference | 6 | 1.000 | 1.000 |
| multi-session | 27 | 0.869 | 0.938 (+6.9pp) |
| knowledge-update | 16 | 0.969 | 1.000 (+3.1pp) |
| temporal-reasoning | 27 | 0.774 | 0.829 (+5.6pp) |

Gains concentrate on multi-session and temporal-reasoning -- the two categories where queries reference cross-session entities and dates that surface form alone does not match.

Latency

Measured on a 2024 Apple Silicon Mac with FastEmbed CPU inference, per-question 50-doc store: 27 ms median, 70 ms P99 search latency. No latency regression vs vanilla BGE-small (21 ms median; the +6 ms is within run-to-run noise).

Usage

Drop-in via sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-lme-v1")

# Encode the user's question and a candidate past session
embeddings = model.encode(
    [
        "What did I tell you about my flight to Tokyo last week?",
        "[USER] By the way, my flight to Tokyo got pushed to next Tuesday "
        "so I can finally make the team dinner. [ASSISTANT] Glad to hear "
        "that worked out! Tuesday it is.",
    ],
    normalize_embeddings=True,
)
similarity = embeddings[0] @ embeddings[1]
```

As a chat-memory backbone in vstash

```bash
vstash reindex --model Stffens/bge-small-rrf-lme-v1
```

Inside any RAG / retrieval stack

Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity, instruction-free encoding, normalize before cosine. Drop into any retrieval stack built around the base model.
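
If you are not using sentence-transformers, the same encoding can be reproduced with raw transformers. This is a sketch, not an official snippet, and it assumes the model keeps the base's CLS pooling (consistent with "same API as bge-small-en-v1.5"):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Stffens/bge-small-rrf-lme-v1")
model = AutoModel.from_pretrained("Stffens/bge-small-rrf-lme-v1")
model.eval()

texts = [
    "What did I tell you about my flight to Tokyo last week?",
    "[USER] my flight to Tokyo got pushed to next Tuesday ...",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
emb = F.normalize(hidden[:, 0], p=2, dim=1)  # CLS pooling + L2 norm, as in the base model
similarity = (emb[0] @ emb[1]).item()        # dot product == cosine after normalization
```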

Reproducing this fine-tune

End-to-end reproduction (corpus prep ~9 min on a Mac, training ~12 min on a Colab T4, full-set scoring ~9 min on a Mac):

```bash
# 1. Stratified split + ingest LongMemEval-s into a vstash store
python -m experiments.lme_prepare_retrain \
    --output-db    experiments/lme_retrain_full.db \
    --output-train experiments/results/lme_train.jsonl \
    --output-eval  experiments/results/lme_eval.jsonl \
    --output-meta  experiments/results/lme_retrain_meta.json

# 2. Train via the eval-gated labeled retrain mode (Colab T4)
VSTASH_DB_PATH=experiments/lme_retrain_full.db \
vstash retrain \
    --training-queries experiments/results/lme_train.jsonl \
    --eval-queries     experiments/results/lme_eval.jsonl \
    --base-model       BAAI/bge-small-en-v1.5 \
    --output           ~/.vstash/models/bge-small-rrf-lme-v1 \
    --bulk-mine --bulk-mine-device cuda

# 3. Score on the full LongMemEval-s set (Mac local)
python -m experiments.longmemeval_retrieval --all \
    --model ~/.vstash/models/bge-small-rrf-lme-v1 \
    --output experiments/results/lme_full_500_lme-v1.json

# 4. Bootstrap holdout CIs
python -m experiments.lme_holdout_bootstrap
```
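
The schema of lme_train.jsonl / lme_eval.jsonl is whatever experiments.lme_prepare_retrain emits. Purely as a hypothetical illustration (the field names here are invented for this sketch, not vstash's documented format), each record pairs a question with its gold session id(s):

```json
{"question_id": "q_0042", "query": "what did I say about my Tokyo flight?", "gold_session_ids": ["session_137"]}
```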

Hyperparameters

| Key | Value |
| --- | --- |
| Base model | BAAI/bge-small-en-v1.5 |
| Loss | MultipleNegativesRankingLoss |
| Training triples | 5,000 (capped; mined from 398 labeled queries) |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 64 |
| Mixed precision | FP16 (AMP on) |
| Seed | 42 |
| Training hardware | NVIDIA T4 (Colab) |
| Training time | ~77 s (12 min wall incl. mining) |
| Eval gate | refuse-to-save if delta NDCG@10 < --min-gain (default 0.0) |
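
For orientation, a sketch of what this configuration looks like in plain sentence-transformers. The triple mining itself is vstash-internal, so train_examples below is a stand-in for the ~5,000 mined pairs, not the actual mining code:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Stand-in for the ~5,000 mined (query, positive session) pairs; with
# MultipleNegativesRankingLoss, the other in-batch positives act as negatives.
train_examples = [
    InputExample(texts=["what did I say about my Tokyo flight?",
                        "[USER] my flight to Tokyo got pushed to next Tuesday ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    optimizer_params={"lr": 3e-6},
    use_amp=True,  # FP16 mixed precision
)
model.save("bge-small-rrf-lme-v1")
```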

Limitations

  • English-only. The base model and training corpus are English. Cross-lingual retrieval is not validated.
  • Specialist model. Do not deploy this on BEIR-style retrieval workloads; it was selected on chat-memory NDCG, not generic retrieval quality. Use bge-small-rrf-v3 for BEIR.
  • R@1 ceiling. Macro holdout R@1 sits at 73% of the structural ceiling of 0.75 (multi-gold question types in LongMemEval cap macro R@1 by construction). Closing the remaining gap likely requires a cross-encoder reranker over the top-10 (sketched after this list), not further embedding refinement.
  • R@1 lift is directional, not statistically significant at 95% on the n=102 holdout (CI [-0.005, +0.078] crosses zero). The headline significance-grade lifts are R@3 and R@5.
  • Single-benchmark scope. All chat-memory eval is on LongMemEval-s. Generalization to LoCoMo, MultiSession-Chat, or other chat-memory benchmarks is not validated. If you have a different chat-memory corpus, the deployable artifact is the eval-gated retrain pipeline, not this exact model: train your own specialist on your own labels.
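
On the R@1 point, a hedged sketch of the rerank step suggested above: rescore the bi-encoder's top-10 sessions with a pairwise cross-encoder. The reranker checkpoint named here is a generic public example, not something this card ships or validated:

```python
from sentence_transformers import CrossEncoder

# Generic MS MARCO reranker as a placeholder; any pairwise cross-encoder works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_top10(query, sessions):
    """Re-order the bi-encoder's top-10 sessions by cross-encoder score."""
    scores = reranker.predict([(query, s) for s in sessions])
    order = sorted(range(len(sessions)), key=lambda i: scores[i], reverse=True)
    return [sessions[i] for i in order]
```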

Keywords

Chat memory embedding, conversational memory retrieval, agent memory model, LLM long-term memory, multi-session conversation retrieval, chat history embedding, conversational RAG embedding, LongMemEval embedding, BGE chat-memory fine-tune, domain-adapted embedding for chat assistants, eval-gated retrieval fine-tune.

Citation

@software{vstash_bge_small_rrf_lme_v1_2026,
  author  = {Steffens, Jay},
  title   = {bge-small-rrf-lme-v1: chat-memory specialist embedding via vstash's eval-gated labeled retrain},
  year    = {2026},
  url     = {https://huggingface.co/Stffens/bge-small-rrf-lme-v1}
}

For the underlying paper:

@misc{vstash_v2_2026,
  author  = {Steffens, Jay},
  title   = {vstash v2: Eval-Gated Domain Adaptation for Local-First LLM Memory},
  year    = {2026},
  note    = {arXiv preprint, paper v2 of arXiv:2604.15484}
}

For LongMemEval (the evaluation benchmark used here):

@article{wu2024longmemeval,
  title  = {LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yunsheng and Chang, Kai-Wei and Yu, Dong},
  journal = {arXiv preprint arXiv:2410.10813},
  year   = {2024}
}

For vstash itself:

@software{vstash_2026,
  author  = {Steffens, Jay},
  title   = {vstash: local-first document memory with instant semantic search},
  year    = {2026},
  url     = {https://github.com/stffns/vstash}
}