[Model] ColQwen3.5: fix retrieval correctness (bias + bidirectional) by athrael-soju · Pull Request #46108 · vllm-project/vllm

athrael-soju · 2026-06-18T22:09:39Z

Summary

Follow-up to #36887. The in-tree ColQwen3_5 model deviates from the native
colpali ColQwen3_5Processor inference pipeline in three silent ways, none caught
by the current sanity-level tests. Measured cost: about 2.5 ndcg@10 points on
Vidore3. This affects every ColQwen3.5 checkpoint, e.g.
athrael-soju/colqwen3.5-4.5B-v3 and athrael-soju/VultronRetrieverPrime-Qwen3.5-8B,
both of which MTEB loads through a single ColQwen3_5Wrapper. This PR fixes the
shared class so both load correctly on vanilla vLLM with their published
architectures: ["ColQwen3_5"].

1. Dropped projection bias

colpali defines custom_text_proj = nn.Linear(hidden, dim), which has bias=True by
default, and the trained checkpoints ship a custom_text_proj.bias. The in-tree
class built the projection with bias=False, so load_weights silently skipped the
trained bias. The projection is now built with a zero-initialized bias: a
bias-less checkpoint behaves exactly as before, and a trained bias is loaded
instead of dropped.
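
To illustrate why the dropped bias matters, here is a minimal pure-Python sketch (not the vLLM or colpali code). The helpers `linear` and `l2_normalize` and the toy weights are hypothetical. It shows that a zero bias reproduces the bias-free output exactly, while a non-zero bias changes the direction of the embedding even after per-token L2 normalization.

```python
import math


def linear(x, weight, bias=None):
    """Toy projection y = W x (+ b) for a single token vector x."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def l2_normalize(v):
    """Scale v to unit length, as done per token before MaxSim scoring."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2x2 projection
token = [3.0, 4.0]                 # hypothetical hidden state

no_bias = l2_normalize(linear(token, weight))
zero_bias = l2_normalize(linear(token, weight, bias=[0.0, 0.0]))
trained_bias = l2_normalize(linear(token, weight, bias=[5.0, 0.0]))

# Cosine between the bias-free and biased embeddings (both are unit vectors).
cosine = sum(a * b for a, b in zip(no_bias, trained_bias))
print(no_bias == zero_bias, round(cosine, 4))
```

Normalization removes only the length of the projected vector, not the shift in its direction, so dropping a trained bias can change MaxSim rankings.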

2. Causal attention (should be bidirectional)

ColQwen3.5 retrieval encodes bidirectionally, but the Qwen3-Next full_attention
layers built Attention(...) with the default AttentionType.DECODER and never
read config.is_causal. This PR:

- adds ColQwen3_5Config (a VerifyAndUpdateConfig subclassing the Qwen3.5 one so
  mamba-cache handling is preserved) setting is_causal=False;
- has Qwen3NextAttention pass AttentionType.ENCODER_ONLY when is_causal is
  False (mirrors qwen3.py; generation arches leave is_causal unset → unchanged;
  the GatedDeltaNet linear_attention layers are untouched);
- makes Attention.get_kv_cache_spec return None for ENCODER_ONLY/ENCODER
  (encoder attention keeps no autoregressive KV cache; the hybrid runner iterates
  every attention module for its KV spec, so it reaches this for the now-encoder
  full_attention layers; previously it hit assert attn_type == DECODER).
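
The wiring described above can be sketched in standalone Python. This is an illustration under stated assumptions, not the vLLM implementation: `AttnKind`, `pick_attn_kind`, `visible_positions`, and `needs_kv_cache` are hypothetical stand-ins for `AttentionType`, the `is_causal` check in `Qwen3NextAttention`, the attention mask, and the `get_kv_cache_spec` decision.

```python
from enum import Enum
from types import SimpleNamespace


class AttnKind(Enum):
    """Hypothetical stand-in for vLLM's AttentionType."""
    DECODER = "decoder"
    ENCODER_ONLY = "encoder_only"


def pick_attn_kind(config):
    # Configs that leave is_causal unset (generation arches) stay causal.
    is_causal = getattr(config, "is_causal", True)
    return AttnKind.DECODER if is_causal else AttnKind.ENCODER_ONLY


def visible_positions(kind, seq_len):
    """For each query position, the key positions it may attend to."""
    if kind is AttnKind.DECODER:
        return [list(range(i + 1)) for i in range(seq_len)]  # causal
    return [list(range(seq_len)) for _ in range(seq_len)]    # bidirectional


def needs_kv_cache(kind):
    # Encoder-only attention keeps no autoregressive KV cache.
    return kind is AttnKind.DECODER


generation_cfg = SimpleNamespace()                 # is_causal unset
retrieval_cfg = SimpleNamespace(is_causal=False)   # set by the config hook

for cfg in (generation_cfg, retrieval_cfg):
    kind = pick_attn_kind(cfg)
    print(kind.value, visible_positions(kind, 3), needs_kv_cache(kind))
```

In the causal case the first token sees only itself; in the encoder-only case every token sees the full sequence.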

3. Processor budget + prompt contract (docs)

The rerank example documents the visual-token budget
(max_num_visual_tokens=1792 → max_pixels=1792*32^2=1835008, min_pixels=65536,
via --mm-processor-kwargs) and the image instruction template + query
augmentation suffix that ColQwen3_5Processor applies, needed to match the native
pipeline.
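
The budget arithmetic can be checked with a short Python sketch. The helper `pixel_budget` is hypothetical, and reading the factor 32 as the pixel side covered by one visual token is an assumption drawn from the `32^2` term above. The resulting JSON string is the kind of value that would be passed to `--mm-processor-kwargs`.

```python
import json

PIXELS_PER_TOKEN_SIDE = 32  # assumption: taken from the 32^2 factor above


def pixel_budget(num_visual_tokens, side=PIXELS_PER_TOKEN_SIDE):
    """Convert a visual-token budget into a pixel budget."""
    return num_visual_tokens * side ** 2


mm_processor_kwargs = {
    "max_pixels": pixel_budget(1792),  # 1792 * 1024
    "min_pixels": 65536,               # value quoted in the PR description
}

# JSON string suitable as the argument to --mm-processor-kwargs.
cli_value = json.dumps(mm_processor_kwargs)
print(cli_value)
```

Under the same assumption, the quoted `min_pixels=65536` corresponds to 64 visual tokens.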

Validation

B200, vLLM 0.22.1, vs a transformers+colpali reference on Vidore3HrRetrieval
(full corpus, 6 languages), canonical fp32 per-query MaxSim:

| path | ndcg@10 | Δ vs reference |
| --- | --- | --- |
| transformers + `ColQwen3_5Processor` (reference) | 0.6478 | n/a |
| vLLM in-tree (before) | 0.6231 | −0.0247 |
| vLLM + this PR | 0.6558 | +0.0080 |

The ~2.5pt deficit is eliminated; the residual +0.8pt is cross-stack numerical
noise (transformers+SDPA vs vLLM+FlashInfer, bf16).
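
For readers unfamiliar with the scoring step, here is a minimal pure-Python sketch of per-query MaxSim (late interaction). It is not the evaluation harness used for the numbers above. The function names and toy embeddings are hypothetical, and token embeddings are assumed to be L2-normalized already.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best-matching document token score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank(query_tokens, docs):
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim(query_tokens, toks) for doc_id, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical unit-length token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[1.0, 0.0], [0.6, 0.8]],
}

order, scores = rank(query, docs)
print(order, scores)
```

Because each query token contributes only its best match, a small change in embedding direction (for example from a dropped bias or a causal mask) can reorder documents with close scores.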

Notes

- No new architecture: both checkpoints keep architectures: ["ColQwen3_5"] and
  load via the corrected class, consistent with how MTEB loads them.
- Adds a no-GPU test for the bidirectional config wiring; lists the 8B checkpoint
  in the token-embed docs and example.
- The edits were validated via the equivalent out-of-tree plugin on B200/0.22.1;
  please run pytest tests/models/multimodal/pooling/test_colqwen3_5.py and a
  Vidore3 sweep in CI before merge.
- Backbone hybrid-FLA correctness (#38643, "[Bug]: Qwen3.5
  (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch
  causes gibberish output") is orthogonal; it was not affected on B200/bf16 in
  this validation.

mergify · 2026-06-18T22:10:15Z

Documentation preview: https://vllm--46108.org.readthedocs.build/en/46108/

The in-tree ColQwen3_5 model (added in vllm-project#36887) deviates from the native colpali ColQwen3_5Processor inference pipeline in three silent ways, none caught by the existing sanity-level tests. Measured cost: ~2.5 ndcg@10 on Vidore3 retrieval. This affects every ColQwen3.5 checkpoint, e.g. athrael-soju/colqwen3.5-4.5B-v3 and athrael-soju/VultronRetrieverPrime-Qwen3.5-8B.

1. Dropped projection bias. colpali defines `custom_text_proj = nn.Linear(hidden, dim)` (bias=True by default) and the trained checkpoints ship a `custom_text_proj.bias`, but the class built the projection with bias=False, so `load_weights` silently skipped the trained bias (it survives per-token L2-norm as a direction change, shifting MaxSim ranking). Build with bias and zero-init: a bias-less checkpoint behaves identically to bias=False, while a trained bias is loaded instead of dropped.

2. Causal attention. ColQwen3.5 retrieval encodes bidirectionally, but the Qwen3-Next `full_attention` layers built `Attention(...)` with the default `AttentionType.DECODER` and never read `config.is_causal`, so `--hf-overrides is_causal` was inert. Add a `ColQwen3_5Config` (VerifyAndUpdateConfig, subclassing the Qwen3.5 one so mamba-cache handling is preserved) that sets `is_causal=False`, and have `Qwen3NextAttention` pass `AttentionType.ENCODER_ONLY` when `is_causal` is False (mirrors qwen3.py; the generation arches leave `is_causal` unset and are unaffected; the GatedDeltaNet `linear_attention` layers are untouched). Encoder-only attention keeps no autoregressive KV cache, so `Attention.get_kv_cache_spec` returns None for ENCODER_ONLY/ENCODER (the hybrid runner iterates every attention module for its KV spec, so it reaches this for the now-encoder full_attention layers; previously it hit `assert attn_type == DECODER`).

3. Processor budget + prompt contract. The rerank example documents the visual-token budget (`max_num_visual_tokens=1792` -> `max_pixels=1792*32^2`, `min_pixels=65536`, via `--mm-processor-kwargs`) and the image instruction template + query augmentation suffix that `ColQwen3_5Processor` applies, needed to match the native pipeline.

Validation (B200, vLLM 0.22.1) against a transformers+colpali reference on Vidore3HrRetrieval (full corpus, 6 languages): with all three, vLLM scores 0.6558 vs the reference 0.6478 (Δ +0.0080), up from 0.6231 (Δ −0.0247) on the stock path. The native pipeline value is reproduced within cross-stack numerical noise.

No new architecture: both ColQwen3.5 checkpoints load via the existing `ColQwen3_5` arch with the published config, matching how MTEB loads them through `ColQwen3_5Wrapper`. Adds a no-GPU test for the bidirectional config wiring and lists the 8B checkpoint in the docs/example.

Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

athrael-soju · 2026-06-18T22:24:54Z

@noooop are you able to add the ready label for the build to resume? Thanks

The 8B Prime checkpoint moved from athrael-soju/ to vultr/. Update the model docstring, online rerank example, and supported-models table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>

noooop

thanks

…llm-project#46108) Signed-off-by: Athrael Soju <athrael.soju@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…llm-project#46108) Signed-off-by: Athrael Soju <athrael.soju@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

athrael-soju requested review from LucasWilkinson, MatthewBonanni, noooop, sighingnow and vadiklyutiy as code owners June 18, 2026 22:09

mergify Bot added documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) qwen Related to Qwen models labels Jun 18, 2026

athrael-soju force-pushed the fix-colqwen3_5-retrieval-correctness branch from 61be703 to ac29129 Compare June 18, 2026 22:20

athrael-soju force-pushed the fix-colqwen3_5-retrieval-correctness branch from 33ca8be to 39ebd66 Compare June 19, 2026 16:01

Merge branch 'main' into fix-colqwen3_5-retrieval-correctness

7e0a6cd

noooop added ready ONLY add when PR is ready to merge/full CI is needed build-docs labels Jun 22, 2026

noooop approved these changes Jun 22, 2026

DarkLight1337 merged commit 3c8e495 into vllm-project:main Jun 22, 2026
95 of 96 checks passed

DarkLight1337 merged 3 commits into vllm-project:main from athrael-soju:fix-colqwen3_5-retrieval-correctness
