ColQwen3.5-0.8B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval model adapter fine-tuned on top of Qwen/Qwen3.5-0.8B.
0.8B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning
Description
Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
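The late interaction MaxSim scoring mentioned above can be sketched as follows: each query token is matched against its best (max cosine) document patch, and those maxima are summed over the query tokens. This is a minimal illustration of the general ColBERT-style scoring rule, not this repo's implementation; shapes and names are illustrative.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (Lq, D) token embeddings, doc_emb: (Ld, D) patch embeddings,
    both L2-normalized. For each query token, take its best-matching (max
    cosine) document patch, then sum those maxima over the query tokens."""
    sim = query_emb @ doc_emb.T          # (Lq, Ld) cosine similarities
    return sim.max(dim=-1).values.sum()  # max over patches, sum over tokens

# Toy example with random unit vectors (Lq=4 query tokens, Ld=9 patches, D=128)
torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(9, 128), dim=-1)
print(maxsim_score(q, d))
```

Because the max is taken per query token, the score rewards documents that cover every part of the query, rather than only overall similarity.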
This is v2, an improvement over v1 that adds hard negative mining: it was trained on a pre-built triplet dataset (LlamaIndex Multilingual, English only) with 3 hard negatives per sample (mined using voyage-3) to improve fine-grained discrimination between visually similar but semantically different pages.
Evaluations
All numbers are NDCG@5.
ViDoRe v1
Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).
| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ArxivQA | 0.8176 | 0.8322 | 0.8418 | 0.8462 |
| DocVQA | 0.5421 | 0.5588 | 0.5640 | 0.5675 |
| InfoVQA | 0.8782 | 0.8885 | 0.8940 | 0.9034 |
| Shift Project | 0.8617 | 0.8875 | 0.8881 | 0.9116 |
| Synth AI | 0.9906 | 0.9832 | 0.9832 | 0.9869 |
| Synth Energy | 0.9752 | 0.9702 | 0.9776 | 0.9776 |
| Synth Gov | 0.9598 | 0.9602 | 0.9530 | 0.9529 |
| Synth Health | 0.9776 | 0.9863 | 0.9863 | 0.9826 |
| TabFQuAD | 0.8441 | 0.8739 | 0.8978 | 0.9078 |
| TAT-DQA | 0.7595 | 0.7788 | 0.7801 | 0.7857 |
| Average | 0.8606 | 0.8720 | 0.8766 | 0.8822 |
ViDoRe v2
Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).
v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score 1 = partially answerable, score 2 = fully answerable).
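For reference, the NDCG@5 metric reported in these tables can be sketched for graded qrels like v2's. The gain formula below is the common `2^rel - 1` formulation; ViDoRe's exact variant may differ, so treat this as illustrative.

```python
import math

def ndcg_at_k(ranked_rels, k=5):
    """ranked_rels: graded relevance of retrieved results, in ranked order
    (e.g. 2 = fully answerable, 1 = partially answerable, 0 = irrelevant)."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Query with three relevant pages ranked at positions 1, 3, and 4
print(round(ndcg_at_k([2, 0, 1, 2, 0]), 4))
```

With graded qrels, a system is rewarded both for retrieving the fully answerable pages and for ranking them above the partially answerable ones, which is why v2 scores are markedly lower than v1's single-relevant-document setting.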
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5533 | 0.5789 | 0.5783 | 0.5737 |
| Economics Reports | 452 | 232 | 0.5685 | 0.5628 | 0.5422 | 0.5161 |
| ESG Reports | 1538 | 228 | 0.4927 | 0.5225 | 0.5162 | 0.5178 |
| ESG Reports (Human) | 3076 | 104 | 0.3431 | 0.3870 | 0.4372 | 0.4375 |
| Average | | | 0.4894 | 0.5128 | 0.5185 | 0.5113 |
Combined Average (v1 + v2 macro)
| Benchmark | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ViDoRe v1 avg | 0.8606 | 0.8720 | 0.8766 | 0.8822 |
| ViDoRe v2 avg | 0.4894 | 0.5128 | 0.5185 | 0.5113 |
| Overall avg | 0.6750 | 0.6924 | 0.6976 | 0.6968 |
Evaluations on other benchmarks are forthcoming.
Limitations
Training data and training process limitations: v2 is continually fine-tuned on LlamaIndex Multilingual (English only) for 1 epoch with hard negatives (mined using voyage-3). While hard negatives improve negative discrimination, training on a single language for a single epoch may underutilize the potential of hard-negative learning. Future work should include:
- Multilingual data (including FR, DE, ES, IT, etc.) to improve cross-language robustness
- Broader document types beyond academic/financial reports
Language-centric training data: The model is fine-tuned using English only. Performance on non-English documents (e.g., Vietnamese, French) is expected to degrade without multilingual fine-tuning.
Hard-negative mining bias: Hard negatives are mined using voyage-3-large, which may introduce bias toward that model's embedding space. The quality of hard negatives depends on voyage-3's ranking accuracy; weakly-ranked negatives or false negatives (pages whose content overlaps with the positive but are labeled negative) may mislead training.
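The mining step described above can be sketched as follows: rank the corpus per query with an auxiliary embedder and keep the top-ranked non-positive pages as hard negatives. This is a generic sketch, not the actual mining pipeline; skipping the very nearest neighbours is one common mitigation for the false-negative issue noted above.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positive_ids, k=3, skip_top=1):
    """query_vecs: (Q, D), doc_vecs: (N, D), both L2-normalized.
    positive_ids[i] is the gold document index for query i.
    skip_top discards the very nearest neighbours, which are often
    unlabeled positives (false negatives)."""
    sims = query_vecs @ doc_vecs.T                 # (Q, N) cosine similarities
    hard_negs = []
    for i, pos in enumerate(positive_ids):
        order = np.argsort(-sims[i])               # most similar first
        negs = [j for j in order if j != pos][skip_top:skip_top + k]
        hard_negs.append(negs)
    return hard_negs

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(10, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(mine_hard_negatives(q, d, positive_ids=[0, 1], k=3))
```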
Usage
Requirements
```
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```
Example
```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-0.8B",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-0.8B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    # {"text": "Pet owner training dog outdoors near water."},
    # {"text": "Woman surfing on waves during a sunny day."},
    # {"text": "City skyline view from a high-rise building at night."},
    # {"text": "A cat"}
]

documents = [
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"image": "https://catprotection.com.au/wp-content/uploads/2024/04/27A7107-1-e1713776321570.webp"}
]

# Embed queries and documents as multi-vector token/patch sequences
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# Late interaction MaxSim scoring; scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx + 1} vs D{d_idx + 1}: {scores[q_idx, d_idx].item():.4f}")
```
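Since the model is trained with Matryoshka representation learning, embeddings can also be truncated to a shorter prefix and re-normalized at inference, trading accuracy for memory as the benchmark tables show. A minimal sketch, assuming the adapter outputs 1024-dim vectors (`full_emb` below is a random stand-in for the model's per-token output):

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` features and restore unit L2 norm."""
    return F.normalize(emb[..., :dim], dim=-1)

# Toy stand-in for a (num_tokens, 1024) embedding sequence
full_emb = F.normalize(torch.randn(12, 1024), dim=-1)
for dim in (128, 256, 512):
    small = truncate_matryoshka(full_emb, dim)
    print(dim, tuple(small.shape), round(small.norm(dim=-1).mean().item(), 4))
```

Re-normalizing after slicing matters: without it, truncated vectors have norm below 1 and MaxSim scores are no longer comparable across dimensions.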
Training Details
v1
| Config | Values |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Training data | vidore/colpali_train_set |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims: 128, 256, 512, 1024) - equal weights |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
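The Matryoshka MaxSim loss in the table above can be sketched as an in-batch contrastive (InfoNCE) loss over MaxSim scores, averaged with equal weight across the four truncated dimensions. This is an assumed formulation based on the table, not the repo's exact implementation.

```python
import torch
import torch.nn.functional as F

def batch_maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """q: (B, Lq, D) queries, d: (B, Ld, D) docs -> (B, B) score matrix."""
    sim = torch.einsum("qld,kmd->qklm", q, d)   # token-patch similarities
    return sim.max(dim=-1).values.sum(dim=-1)   # MaxSim per (query, doc) pair

def matryoshka_maxsim_loss(q, d, dims=(128, 256, 512, 1024)):
    """Cross-entropy with in-batch negatives, one loss per truncated dim."""
    labels = torch.arange(q.shape[0])           # positives on the diagonal
    losses = []
    for dim in dims:
        qd = F.normalize(q[..., :dim], dim=-1)  # truncate + renormalize
        dd = F.normalize(d[..., :dim], dim=-1)
        losses.append(F.cross_entropy(batch_maxsim(qd, dd), labels))
    return torch.stack(losses).mean()           # equal weights

# Toy batch: B=4 query/doc pairs, 6 query tokens, 20 patches, 1024 dims
q = F.normalize(torch.randn(4, 6, 1024), dim=-1)
d = F.normalize(torch.randn(4, 20, 1024), dim=-1)
print(matryoshka_maxsim_loss(q, d))
```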
v2 (Current)
The training configuration is the same as v1, but with the addition of hard negative mining using the LlamaIndex Multilingual dataset (English only) and a switch to the AugmentedMaxSimLoss (which incorporates hard negatives) instead of MatryoshkaMaxSimLoss.
| Config | Values |
|---|---|
| Training data | LlamaIndex Multilingual (English only) |
| Epochs | 1 |
| Learning rate | 3e-5 (cosine, 2.5% warmup) |
| LoRA rank | r=32, α=32 |
| Loss | AugmentedMaxSimLoss (incorporates hard negatives) |
| Resume from | v1 checkpoint (colqwen3_5_lora/ColQwen3.5-0.8B-Embedding) |
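An AugmentedMaxSimLoss-style objective can be sketched as the in-batch InfoNCE above with extra logit columns for each query's mined hard negatives. The name matches the table, but the exact formulation in the repo may differ; shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def augmented_maxsim_loss(q, d_pos, d_neg):
    """q: (B, Lq, D); d_pos: (B, Ld, D); d_neg: (B, H, Ld, D),
    with H mined hard negatives per query."""
    B = q.shape[0]
    # In-batch scores: query i vs every positive doc j -> (B, B)
    in_batch = torch.einsum("qld,kmd->qklm", q, d_pos).max(-1).values.sum(-1)
    # Hard-negative scores: query i vs its own H negatives -> (B, H)
    hard = torch.einsum("bld,bhmd->bhlm", q, d_neg).max(-1).values.sum(-1)
    logits = torch.cat([in_batch, hard], dim=1)   # (B, B + H)
    labels = torch.arange(B)                      # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: B=3 pairs, H=3 hard negatives each, 4 query tokens, 5 patches
q = F.normalize(torch.randn(3, 4, 64), dim=-1)
d_pos = F.normalize(torch.randn(3, 5, 64), dim=-1)
d_neg = F.normalize(torch.randn(3, 3, 5, 64), dim=-1)
print(augmented_maxsim_loss(q, d_pos, d_neg))
```

Appending hard negatives as extra columns makes the softmax compete not only against other in-batch documents but also against pages the miner found deceptively similar, which is what sharpens fine-grained discrimination.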
Further Contributions
Contributions, experiments, and extensions are welcome.
Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶
License
Apache 2.0 (inherits from base model)