Integrate with Sentence Transformers v5.4

#2
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate this model using a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with a Normalize module, producing 768-dimensional normalized embeddings via CLIP's projection layers (get_text_features/get_image_features). The model supports text, image, and composed image+text inputs.

Because this model supports composed image+text retrieval (by summing the projected text and image embeddings), I've included a small custom BGEVLCLIPTransformer module (bge_vl_clip_transformer.py) that subclasses Sentence Transformers' Transformer. For the ("image", "text") compound modality, it runs text and image through their respective forward paths and sums the resulting embeddings. Text-only and image-only inputs are handled directly by the parent class. This requires trust_remote_code=True when loading the model with Sentence Transformers.
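The late-fusion step itself is simple. Here's a minimal sketch of the idea in plain PyTorch; `fuse_text_image` is a hypothetical helper for illustration only (in the actual pipeline, the custom module does the summing and the separate Normalize module does the L2-normalization):

```python
import torch
import torch.nn.functional as F

def fuse_text_image(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # Late fusion: sum the projected text and image embeddings
    # (the custom module's job), then L2-normalize the result
    # (handled by the pipeline's Normalize module).
    return F.normalize(text_emb + image_emb, p=2, dim=-1)
```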

The custom module also overrides load to force trust_remote_code=False for the underlying AutoModel, since the repo's custom modeling_MMRet_CLIP.py has a non-persistent position_ids buffer issue on transformers v5+. The standard CLIPModel loads these weights fine.

Added files:

  • modules.json: pipeline: BGEVLCLIPTransformer & Normalize
  • sentence_bert_config.json: feature-extraction task, multimodal config with get_text_features/get_image_features
  • config_sentence_transformers.json: cosine similarity
  • bge_vl_clip_transformer.py: custom Transformer subclass for composed image+text late fusion
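For reference, the modules.json for a two-module pipeline like this typically looks as follows. The exact "type" value for the custom module is an assumption on my part; it points at the class inside bge_vl_clip_transformer.py:

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "bge_vl_clip_transformer.BGEVLCLIPTransformer"},
  {"idx": 1, "name": "1", "path": "1_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```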

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/BGE-VL-large", trust_remote_code=True, revision="refs/pr/2")

query_image = "https://huggingface.co/BAAI/BGE-VL-large/resolve/main/assets/cir_query.png"
candidate_1 = "https://huggingface.co/BAAI/BGE-VL-large/resolve/main/assets/cir_candi_1.png"
candidate_2 = "https://huggingface.co/BAAI/BGE-VL-large/resolve/main/assets/cir_candi_2.png"

# Encode text
text_embeddings = model.encode(["A dog sitting on a bench", "A cat sleeping on a couch"])
print(text_embeddings.shape)
# (2, 768)

# Encode images
image_embeddings = model.encode([query_image, candidate_1])
print(image_embeddings.shape)
# (2, 768)

# Composed image retrieval: encode image+text query, compare with image candidates
query_embeddings = model.encode([{
    "image": query_image,
    "text": "Make the background dark, as if the camera has taken the photo at night",
}])
candidate_embeddings = model.encode([candidate_1, candidate_2])
scores = model.similarity(query_embeddings, candidate_embeddings)
print(scores)
# tensor([[0.3696, 0.1714]])

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run this model in a familiar and common format.

  • Tom Aarsen
tomaarsen changed pull request status to open

cc @JUNJIE99 @ZiyiXia

I wanted to reach out with a bit of extra context, so you understand the reason for these PRs. I'm not sure what the best way to reach you is, so I'll message you here.
Tomorrow, I'll be releasing multimodality support for Sentence Transformers, and I want to use that opportunity to integrate important multimodal models from the community into a common interface. This should also be useful for MTEB and its related projects (for example, MTEB is currently trying to integrate some of these with custom code: https://github.com/embeddings-benchmark/mteb/pull/4310). Today, I've done the integration work for the BGE-VL-base/large and BGE-VL-MLLM-S1/S2 models.

As you'll notice in the PRs, the changes are exclusively additive. This means there won't be any breakage of existing functionality or integrations, just a new, simple interface for accessing these models. Sentence Transformers will recognize the user's input modalities and preprocess them for the chat template. The filled-in chat template is then tokenized, matching the output of your current models with transformers & trust_remote_code. The BGE-VL-base/large models will still require trust_remote_code in Sentence Transformers for the text+image path, while the MLLM models don't need trust_remote_code at all.

Here are all the PRs ready for review:

If you have preferences for other models that I should aim to integrate, please let me know. If you're able to merge these PRs, I'd be glad to mention them in my release blogpost that's going live tomorrow.

Happy to answer any questions!

  • Tom Aarsen
JUNJIE99 changed pull request status to merged
Beijing Academy of Artificial Intelligence org

Hi Tom,

Thanks a lot for the integration — it makes BGE-VL much easier to use. Really appreciate your effort.

I’ve merged the PR. Looking forward to the Sentence Transformers v5.4 release. If there’s anything else we can do to help, please feel free to @ me.

Best,
Junjie

Hello Junjie,

Gladly! The release went live yesterday, and I've now added these models to my release blogpost (https://huggingface.co/blog/multimodal-sentence-transformers#supported-multimodal-embedding-models), which should be a good way to remind people of the models.
I know you have more BGE-VL models, but I'm not very familiar with what distinguishes each, e.g.:

How do these compare to the other models?
My understanding is:

Can you help clarify? Thank you!

  • Tom Aarsen
Beijing Academy of Artificial Intelligence org

Hi Tom,

Congrats on the release, and thanks for adding the models to the blogpost — really appreciate it.

Yes, your summary is basically right — let me clarify a few details.

First, BGE-VL-base, BGE-VL-large, and BGE-VL-MLLM-S1 were trained solely on MegaPairs data. These models are mainly geared toward zero-shot composed image retrieval—that is, image-to-image retrieval with text modification (IT2I)—on benchmarks such as CIRR and CIRCO.

Starting from BGE-VL-MLLM-S1, we further fine-tuned on the MMEB v1 training set to obtain BGE-VL-MLLM-S2. This model is also described in Table 3 of the MegaPairs paper.

As for the v1.5 models, they were updated about half a year after the MegaPairs paper. During that period, we observed substantial progress in MLLM-based multimodal embedding models across the community on broader multi-task benchmarks, which motivated this update.

More specifically, BGE-VL-v1.5-zs is initialized from BGE-VL-MLLM-S1, but it is trained without using any MMEB v1 training data. Instead, we trained it on several million synthetic samples, including part of MegaPairs as well as synthetic image classification, VQA, and image captioning data. On top of BGE-VL-v1.5-zs, we then further fine-tuned on the MMEB v1 training set to obtain BGE-VL-v1.5-mmeb.

Regarding BGE-VL-Screenshot: the first version reported in UniSE was indeed based on Qwen2-VL, which corresponds to UniSE-MLLM:
https://huggingface.co/marsh123/UniSE-MLLM

BGE-VL-Screenshot is a later iteration built on the data resources and training methodology introduced in the UniSE paper, released around three months after the paper. It is based on Qwen2.5-VL, with multilingual support as a major addition.

Sorry for the confusion caused by the naming and versioning. I hope this helps clarify things.

Best,
Junjie

That clears things up a lot! Thanks a bunch.
That means BGE-VL-v1.5-zs and BGE-VL-v1.5-mmeb are nice and easy to integrate: they just require the same changes as MLLM-S1/S2. I've opened two PRs:

While testing with Transformers v4.x, I noticed that the old v4.x processor code didn't expect the image_processor key in processor_config.json, which caused loading to fail. Removing this key makes both old and new versions work, as the same information is also present in preprocessor_config.json. So I've made two PRs for the MLLM models so they still work nicely in the old way:
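The fix itself amounts to deleting one key from the JSON file. A small sketch of that edit, assuming a local clone of the model repo (`strip_image_processor` is a hypothetical helper, not part of either library):

```python
import json
from pathlib import Path

def strip_image_processor(path: str) -> None:
    """Remove the 'image_processor' key that Transformers v4.x does not expect.

    The same information remains available in preprocessor_config.json,
    so newer Transformers versions are unaffected by the removal.
    """
    config = json.loads(Path(path).read_text())
    config.pop("image_processor", None)  # no-op if the key is already absent
    Path(path).write_text(json.dumps(config, indent=2))
```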

Apologies for that.
When these are all merged, I can let the MTEB team know they can revive their effort to benchmark BGE-VL, as this should make it a bit simpler for them: https://github.com/embeddings-benchmark/mteb/pull/4310

  • Tom Aarsen
Beijing Academy of Artificial Intelligence org

Thank you so much for your efforts!

All PRs have been merged.
