ColQwen3.5-0.8B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval model adapter fine-tuned on top of Qwen/Qwen3.5-0.8B.
0.8B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning
Description
Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
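The late interaction MaxSim scoring mentioned above can be sketched as follows: each query token is matched against its best (max cosine) document patch, and those maxima are summed over the query tokens. This is a minimal illustration of the general ColBERT-style scoring rule, not this repo's implementation; shapes and names are illustrative.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (Lq, D) token embeddings, doc_emb: (Ld, D) patch embeddings,
    both L2-normalized. For each query token, take its best-matching (max
    cosine) document patch, then sum those maxima over the query tokens."""
    sim = query_emb @ doc_emb.T          # (Lq, Ld) cosine similarities
    return sim.max(dim=-1).values.sum()  # max over patches, sum over tokens

# Toy example with random unit vectors (Lq=4 query tokens, Ld=9 patches, D=128)
torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(9, 128), dim=-1)
print(maxsim_score(q, d))
```

Because the max is taken per query token, the score rewards documents that cover every part of the query, rather than only overall similarity.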
This is v2, an improvement over v1 that adds hard negative mining: it was trained on a pre-built triplet dataset (LlamaIndex Multilingual, English only) with 3 hard negatives per sample (mined using voyage-3) to improve fine-grained discrimination between visually similar but semantically different pages.
Evaluations
All numbers are NDCG@5.
ViDoRe v1
Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).
| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ArxivQA | 0.8176 | 0.8322 | 0.8418 | 0.8462 |
| DocVQA | 0.5421 | 0.5588 | 0.5640 | 0.5675 |
| InfoVQA | 0.8782 | 0.8885 | 0.8940 | 0.9034 |
| Shift Project | 0.8617 | 0.8875 | 0.8881 | 0.9116 |
| Synth AI | 0.9906 | 0.9832 | 0.9832 | 0.9869 |
| Synth Energy | 0.9752 | 0.9702 | 0.9776 | 0.9776 |
| Synth Gov | 0.9598 | 0.9602 | 0.9530 | 0.9529 |
| Synth Health | 0.9776 | 0.9863 | 0.9863 | 0.9826 |
| TabFQuAD | 0.8441 | 0.8739 | 0.8978 | 0.9078 |
| TAT-DQA | 0.7595 | 0.7788 | 0.7801 | 0.7857 |
| Average | 0.8606 | 0.8720 | 0.8766 | 0.8822 |
ViDoRe v2
Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).
v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score 1 = partially answerable, score 2 = fully answerable).
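For reference, the NDCG@5 metric reported in these tables can be sketched for graded qrels like v2's. The gain formula below is the common `2^rel - 1` formulation; ViDoRe's exact variant may differ, so treat this as illustrative.

```python
import math

def ndcg_at_k(ranked_rels, k=5):
    """ranked_rels: graded relevance of retrieved results, in ranked order
    (e.g. 2 = fully answerable, 1 = partially answerable, 0 = irrelevant)."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Query with three relevant pages ranked at positions 1, 3, and 4
print(round(ndcg_at_k([2, 0, 1, 2, 0]), 4))
```

With graded qrels, a system is rewarded both for retrieving the fully answerable pages and for ranking them above the partially answerable ones, which is why v2 scores are markedly lower than v1's single-relevant-document setting.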
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5533 | 0.5789 | 0.5783 | 0.5737 |
| Economics Reports | 452 | 232 | 0.5685 | 0.5628 | 0.5422 | 0.5161 |
| ESG Reports | 1538 | 228 | 0.4927 | 0.5225 | 0.5162 | 0.5178 |
| ESG Reports (Human) | 3076 | 104 | 0.3431 | 0.3870 | 0.4372 | 0.4375 |
| Average | | | 0.4894 | 0.5128 | 0.5185 | 0.5113 |
Combined Average (v1 + v2 macro)
| Benchmark | 128-dim | 256-dim | 512-dim | 1024-dim |
|---|---|---|---|---|
| ViDoRe v1 avg | 0.8606 | 0.8720 | 0.8766 | 0.8822 |
| ViDoRe v2 avg | 0.4894 | 0.5128 | 0.5185 | 0.5113 |
| Overall avg | 0.6750 | 0.6924 | 0.6976 | 0.6968 |
Evaluations on other benchmarks are forthcoming.
Limitations
Training data and training process limitations: v2 is continually fine-tuned on LlamaIndex Multilingual (English only) for 1 epoch with hard negatives (mined using voyage-3). While hard negatives improve negative discrimination, training on a single language for a single epoch may underutilize the potential of hard-negative learning. Future work should include:
- Multilingual data (including FR, DE, ES, IT, etc.) to improve cross-language robustness
- Broader document types beyond academic/financial reports
Language-centric training data: The model is fine-tuned using English only. Performance on non-English documents (e.g., Vietnamese, French) is expected to degrade without multilingual fine-tuning.
Hard-negative mining bias: Hard negatives are mined using voyage-3-large, which may introduce bias toward that model's embedding space. The quality of hard negatives depends on voyage-3's ranking accuracy; weakly-ranked negatives or false negatives (pages whose content overlaps with the positive but are labeled negative) may mislead training.
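The mining step described above can be sketched as follows: rank the corpus per query with an auxiliary embedder and keep the top-ranked non-positive pages as hard negatives. This is a generic sketch, not the actual mining pipeline; skipping the very nearest neighbours is one common mitigation for the false-negative issue noted above.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positive_ids, k=3, skip_top=1):
    """query_vecs: (Q, D), doc_vecs: (N, D), both L2-normalized.
    positive_ids[i] is the gold document index for query i.
    skip_top discards the very nearest neighbours, which are often
    unlabeled positives (false negatives)."""
    sims = query_vecs @ doc_vecs.T                 # (Q, N) cosine similarities
    hard_negs = []
    for i, pos in enumerate(positive_ids):
        order = np.argsort(-sims[i])               # most similar first
        negs = [j for j in order if j != pos][skip_top:skip_top + k]
        hard_negs.append(negs)
    return hard_negs

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(10, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(mine_hard_negatives(q, d, positive_ids=[0, 1], k=3))
```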
Usage
Requirements
```
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```
Example
```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-0.8B",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-0.8B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    # {"text": "Pet owner training dog outdoors near water."},
    # {"text": "Woman surfing on waves during a sunny day."},
    # {"text": "City skyline view from a high-rise building at night."},
    # {"text": "A cat"}
]

documents = [
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    # {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"image": "https://catprotection.com.au/wp-content/uploads/2024/04/27A7107-1-e1713776321570.webp"}
]

# Embed queries and documents as multi-vector token/patch sequences
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# Late interaction MaxSim scoring; scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx + 1} vs D{d_idx + 1}: {scores[q_idx, d_idx].item():.4f}")
```
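Since the model is trained with Matryoshka representation learning, embeddings can also be truncated to a shorter prefix and re-normalized at inference, trading accuracy for memory as the benchmark tables show. A minimal sketch, assuming the adapter outputs 1024-dim vectors (`full_emb` below is a random stand-in for the model's per-token output):

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` features and restore unit L2 norm."""
    return F.normalize(emb[..., :dim], dim=-1)

# Toy stand-in for a (num_tokens, 1024) embedding sequence
full_emb = F.normalize(torch.randn(12, 1024), dim=-1)
for dim in (128, 256, 512):
    small = truncate_matryoshka(full_emb, dim)
    print(dim, tuple(small.shape), round(small.norm(dim=-1).mean().item(), 4))
```

Re-normalizing after slicing matters: without it, truncated vectors have norm below 1 and MaxSim scores are no longer comparable across dimensions.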
Training Details
v1
| Config | Values |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Training data | vidore/colpali_train_set |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims: 128, 256, 512, 1024) - equal weights |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
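The Matryoshka MaxSim loss in the table above can be sketched as an in-batch contrastive (InfoNCE) loss over MaxSim scores, averaged with equal weight across the four truncated dimensions. This is an assumed formulation based on the table, not the repo's exact implementation.

```python
import torch
import torch.nn.functional as F

def batch_maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """q: (B, Lq, D) queries, d: (B, Ld, D) docs -> (B, B) score matrix."""
    sim = torch.einsum("qld,kmd->qklm", q, d)   # token-patch similarities
    return sim.max(dim=-1).values.sum(dim=-1)   # MaxSim per (query, doc) pair

def matryoshka_maxsim_loss(q, d, dims=(128, 256, 512, 1024)):
    """Cross-entropy with in-batch negatives, one loss per truncated dim."""
    labels = torch.arange(q.shape[0])           # positives on the diagonal
    losses = []
    for dim in dims:
        qd = F.normalize(q[..., :dim], dim=-1)  # truncate + renormalize
        dd = F.normalize(d[..., :dim], dim=-1)
        losses.append(F.cross_entropy(batch_maxsim(qd, dd), labels))
    return torch.stack(losses).mean()           # equal weights

# Toy batch: B=4 query/doc pairs, 6 query tokens, 20 patches, 1024 dims
q = F.normalize(torch.randn(4, 6, 1024), dim=-1)
d = F.normalize(torch.randn(4, 20, 1024), dim=-1)
print(matryoshka_maxsim_loss(q, d))
```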
v2 (Current)
The training configuration is the same as v1, but with the addition of hard negative mining using the LlamaIndex Multilingual dataset (English only) and a switch to the AugmentedMaxSimLoss (which incorporates hard negatives) instead of MatryoshkaMaxSimLoss.
| Config | Values |
|---|---|
| Training data | LlamaIndex Multilingual (English only) |
| Epochs | 1 |
| Learning rate | 3e-5 (cosine, 2.5% warmup) |
| LoRA rank | r=32, α=32 |
| Loss | AugmentedMaxSimLoss (incorporates hard negatives) |
| Resume from | v1 checkpoint (colqwen3_5_lora/ColQwen3.5-0.8B-Embedding) |
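An AugmentedMaxSimLoss-style objective can be sketched as the in-batch InfoNCE above with extra logit columns for each query's mined hard negatives. The name matches the table, but the exact formulation in the repo may differ; shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def augmented_maxsim_loss(q, d_pos, d_neg):
    """q: (B, Lq, D); d_pos: (B, Ld, D); d_neg: (B, H, Ld, D),
    with H mined hard negatives per query."""
    B = q.shape[0]
    # In-batch scores: query i vs every positive doc j -> (B, B)
    in_batch = torch.einsum("qld,kmd->qklm", q, d_pos).max(-1).values.sum(-1)
    # Hard-negative scores: query i vs its own H negatives -> (B, H)
    hard = torch.einsum("bld,bhmd->bhlm", q, d_neg).max(-1).values.sum(-1)
    logits = torch.cat([in_batch, hard], dim=1)   # (B, B + H)
    labels = torch.arange(B)                      # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: B=3 pairs, H=3 hard negatives each, 4 query tokens, 5 patches
q = F.normalize(torch.randn(3, 4, 64), dim=-1)
d_pos = F.normalize(torch.randn(3, 5, 64), dim=-1)
d_neg = F.normalize(torch.randn(3, 3, 5, 64), dim=-1)
print(augmented_maxsim_loss(q, d_pos, d_neg))
```

Appending hard negatives as extra columns makes the softmax compete not only against other in-batch documents but also against pages the miner found deceptively similar, which is what sharpens fine-grained discrimination.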
Further Contributions
Contributions, experiments, and extensions are welcome.
Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶
License
Apache 2.0 (inherits from base model)