ColQwen3.5-0.8B-Embedding (LoRA Adapter)

A ColBERT-style multi-vector document retrieval model adapter fine-tuned on top of Qwen/Qwen3.5-0.8B.

0.8B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning

Description

Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
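The late-interaction scoring described above can be sketched as a standalone function: for each query token, take the maximum similarity over all document patches, then sum over query tokens. The function name `maxsim_score` below is illustrative, not this repo's API; it assumes L2-normalized embeddings so the dot product equals cosine similarity.

```python
import torch

def maxsim_score(q_emb, d_emb, q_mask, d_mask):
    """Late-interaction MaxSim scoring.
    q_emb: (Nq, Lq, D), d_emb: (Nd, Ld, D); masks mark real (non-pad) tokens."""
    # Pairwise token-patch similarities: (Nq, Nd, Lq, Ld)
    sim = torch.einsum("qld,nmd->qnlm", q_emb, d_emb)
    # Exclude padded document patches from the max
    sim = sim.masked_fill(~d_mask[None, :, None, :].bool(), -1e4)
    sim = sim.max(dim=-1).values          # max over doc patches: (Nq, Nd, Lq)
    sim = sim * q_mask[:, None, :]        # zero out padded query tokens
    return sim.sum(dim=-1)                # sum over query tokens: (Nq, Nd)
```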

This is v2, an improvement over v1 that adds hard negative mining. It is trained on a pre-built triplet dataset (LlamaIndex Multilingual, English only) with 3 hard negatives per sample, mined using voyage-3, to improve fine-grained discrimination between visually similar but semantically different pages.

Evaluations

All numbers are NDCG@5.
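For reference, a minimal single-query implementation of the metric (standard NDCG with exponential gain; the function name `ndcg_at_k` is illustrative):

```python
import math

def ndcg_at_k(ranked_rels, k=5):
    """NDCG@k for one query. ranked_rels: graded relevance of the
    retrieved docs in ranked order (0 = irrelevant)."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# v1 setting: a single relevant doc, here retrieved at rank 2
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))  # 0.6309
```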

ViDoRe v1

Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).

| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ArxivQA | 0.8176 | 0.8322 | 0.8418 | 0.8462 |
| DocVQA | 0.5421 | 0.5588 | 0.5640 | 0.5675 |
| InfoVQA | 0.8782 | 0.8885 | 0.8940 | 0.9034 |
| Shift Project | 0.8617 | 0.8875 | 0.8881 | 0.9116 |
| Synth AI | 0.9906 | 0.9832 | 0.9832 | 0.9869 |
| Synth Energy | 0.9752 | 0.9702 | 0.9776 | 0.9776 |
| Synth Gov | 0.9598 | 0.9602 | 0.9530 | 0.9529 |
| Synth Health | 0.9776 | 0.9863 | 0.9863 | 0.9826 |
| TabFQuAD | 0.8441 | 0.8739 | 0.8978 | 0.9078 |
| TAT-DQA | 0.7595 | 0.7788 | 0.7801 | 0.7857 |
| **Average** | 0.8606 | 0.8720 | 0.8766 | 0.8822 |

ViDoRe v2

Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).

v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score 1 = partially answerable, score 2 = fully answerable).
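In BEIR format, graded qrels are typically a mapping from query id to `{doc_id: grade}`; the ids below are made up for illustration. With exponential gain, a fully answerable page (grade 2) contributes three times the gain of a partially answerable one (grade 1) to NDCG.

```python
# BEIR-style graded qrels: query_id -> {doc_id: relevance grade}
# (1 = partially answerable, 2 = fully answerable, as in ViDoRe v2)
qrels = {
    "q1": {"page_017": 2, "page_042": 1, "page_108": 1},  # ~3 relevant pages
}

# Exponential gain used in NDCG: gain = 2**grade - 1
gains = {doc: 2**grade - 1 for doc, grade in qrels["q1"].items()}
print(gains)  # {'page_017': 3, 'page_042': 1, 'page_108': 1}
```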

| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5533 | 0.5789 | 0.5783 | 0.5737 |
| Economics Reports | 452 | 232 | 0.5685 | 0.5628 | 0.5422 | 0.5161 |
| ESG Reports | 1538 | 228 | 0.4927 | 0.5225 | 0.5162 | 0.5178 |
| ESG Reports (Human) | 3076 | 104 | 0.3431 | 0.3870 | 0.4372 | 0.4375 |
| **Average** | | | 0.4894 | 0.5128 | 0.5185 | 0.5113 |

Combined Average (v1 + v2 macro)

| | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ViDoRe v1 avg | 0.8606 | 0.8720 | 0.8766 | 0.8822 |
| ViDoRe v2 avg | 0.4894 | 0.5128 | 0.5185 | 0.5113 |
| Overall avg | 0.6750 | 0.6924 | 0.6976 | 0.6968 |

Evaluations on other benchmarks are forthcoming.

Limitations

  • Training data and training process limitations: v2 is continually fine-tuned on LlamaIndex Multilingual (English only) for one epoch with hard negatives mined by voyage-3. While hard negatives improve discrimination between confusable pages, training on a single language for a single epoch may underutilize the potential of hard-negative learning. Future work should include:

    • Multilingual data (including FR, DE, ES, IT, etc.) to improve cross-language robustness
    • Broader document types beyond academic/financial reports

  • Language-centric training data: The model is fine-tuned on English only. Performance on non-English documents (e.g., Vietnamese, French) is expected to degrade without multilingual fine-tuning.

  • Hard-negative mining bias: Hard negatives are mined using voyage-3, which may bias training toward that model's embedding space. The quality of hard negatives depends on voyage-3's ranking accuracy; weakly ranked negatives or false positives (pages with overlapping content) may mislead training.
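The mining step described above can be sketched generically. This is an illustration with a stand-in embedding model, not the actual voyage-3 pipeline; `mine_hard_negatives`, `skip_top`, and the input shapes are assumptions for the sketch.

```python
import numpy as np

def mine_hard_negatives(q_emb, corpus_emb, positive_ids, k=3, skip_top=1):
    """For each query, rank the corpus by cosine similarity and keep the
    top-k non-positive pages as hard negatives. Embeddings are assumed
    L2-normalized. skip_top drops the very top ranks, a common guard
    against unlabeled false positives."""
    sims = q_emb @ corpus_emb.T                  # (num_queries, corpus_size)
    ranked = np.argsort(-sims, axis=1)           # best-to-worst doc ids
    negatives = []
    for qi, order in enumerate(ranked):
        negs = [int(d) for d in order[skip_top:] if d not in positive_ids[qi]][:k]
        negatives.append(negs)
    return negatives
```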

Usage

Requirements

```
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```

Example

```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-0.8B",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-0.8B-Embedding",
    embed_dim=128
)

queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    # {"text": "Pet owner training dog outdoors near water."},
    # {"text": "Woman surfing on waves during a sunny day."},
    # {"text": "City skyline view from a high-rise building at night."},
    # {"text": "A cat"}
]

documents = [
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"image": "https://catprotection.com.au/wp-content/uploads/2024/04/27A7107-1-e1713776321570.webp"}
]

qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance score:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
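Because the model is trained with Matryoshka representation learning, full-dimension embeddings can also be shortened at inference by keeping the leading components and re-normalizing. A minimal sketch; `truncate_matryoshka` is illustrative, not part of the repo's API:

```python
import torch

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` components of each token embedding and
    re-normalize, per Matryoshka representation learning.
    emb: (batch, seq_len, full_dim) with full_dim >= dim."""
    out = emb[..., :dim]
    return torch.nn.functional.normalize(out, p=2, dim=-1)
```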

Training Details

v1

| Config | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Training data | vidore/colpali_train_set |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims: 128, 256, 512, 1024), equal weights |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
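The card does not publish the loss implementation; the sketch below shows what an equal-weight Matryoshka MaxSim objective could look like, assuming an in-batch contrastive (InfoNCE-style) loss over MaxSim scores at each truncation dim. The function name, the temperature `tau`, and the in-batch-negatives assumption are all illustrative.

```python
import torch
import torch.nn.functional as F

def matryoshka_maxsim_loss(q_emb, d_emb, dims=(128, 256, 512, 1024), tau=0.02):
    """In-batch contrastive loss over MaxSim scores, averaged with equal
    weight across Matryoshka truncation dims. q_emb: (B, Lq, D),
    d_emb: (B, Ld, D); d_emb[i] is the positive page for q_emb[i]."""
    losses = []
    for dim in dims:
        q = F.normalize(q_emb[..., :dim], dim=-1)   # truncate + re-normalize
        d = F.normalize(d_emb[..., :dim], dim=-1)
        # MaxSim score matrix: (B, B), positives on the diagonal
        sim = torch.einsum("bld,nmd->bnlm", q, d).max(dim=-1).values.sum(dim=-1)
        labels = torch.arange(q.size(0))
        losses.append(F.cross_entropy(sim / tau, labels))
    return torch.stack(losses).mean()
```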

v2 (Current)

The training configuration is the same as v1, except that hard negatives are added (mined from the LlamaIndex Multilingual dataset, English only) and the loss switches from MatryoshkaMaxSimLoss to AugmentedMaxSimLoss, which incorporates those hard negatives.

| Config | Value |
|---|---|
| Training data | LlamaIndex Multilingual (English only) |
| Epochs | 1 |
| Learning rate | 3e-5 (cosine, 2.5% warmup) |
| LoRA rank | r=32, α=32 |
| Loss | AugmentedMaxSimLoss (incorporates hard negatives) |
| Resume from | v1 checkpoint (colqwen3_5_lora/ColQwen3.5-0.8B-Embedding) |
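The AugmentedMaxSimLoss implementation is not published; a plausible sketch is an in-batch contrastive loss whose candidate set is augmented with the mined hard-negative pages. Everything below (function name, shapes, temperature) is an assumption for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

def augmented_maxsim_loss(q_emb, pos_emb, neg_emb, tau=0.02):
    """Contrastive loss where each query scores its in-batch positives
    plus mined hard-negative pages. q_emb: (B, Lq, D); pos_emb: (B, L, D);
    neg_emb: (B*H, L, D) for H hard negatives per query. All L2-normalized."""
    # Candidate pages: B positives followed by B*H hard negatives
    docs = torch.cat([pos_emb, neg_emb], dim=0)
    # MaxSim scores against every candidate: (B, B + B*H)
    sim = torch.einsum("bld,nmd->bnlm", q_emb, docs).max(dim=-1).values.sum(dim=-1)
    labels = torch.arange(q_emb.size(0))   # query i's positive sits at index i
    return F.cross_entropy(sim / tau, labels)
```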

Further Contributions

Contributions, experiments, and extensions are welcome.

Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶

License

Apache 2.0 (inherits from base model)
