[Model] ColQwen3.5: fix retrieval correctness (bias + bidirectional) #46108

Merged
DarkLight1337 merged 3 commits into vllm-project:main from
athrael-soju:fix-colqwen3_5-retrieval-correctness
Jun 22, 2026

@athrael-soju

Contributor

Summary

Follow-up to #36887. The in-tree ColQwen3_5 model deviates from the native
colpali ColQwen3_5Processor inference pipeline in three silent ways, none caught
by the current sanity-level tests. Measured cost: ~2.5 ndcg@10 points (0.6231 vs
a 0.6478 reference) on Vidore3. This affects every ColQwen3.5 checkpoint, e.g.
athrael-soju/colqwen3.5-4.5B-v3 and athrael-soju/VultronRetrieverPrime-Qwen3.5-8B,
both of which MTEB loads through the same ColQwen3_5Wrapper. This PR fixes the
shared class so both load correctly on vanilla vLLM with their published
architectures: ["ColQwen3_5"].

1. Dropped projection bias

colpali's custom_text_proj = nn.Linear(hidden, dim) uses bias=True by default and
the trained checkpoints ship a custom_text_proj.bias, but the in-tree class built
the projection with bias=False, so load_weights silently skipped the trained bias.
The projection is now built with a zero-initialized bias: a bias-less checkpoint
behaves exactly as before, and a trained bias is loaded instead of dropped.
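
The effect of the dropped bias can be sketched with a toy projection. This is a minimal pure-Python illustration, not the vLLM or colpali code: the 2x2 weight, the input vector, and the bias values are made up for the example, and `project` and `l2_normalize` are hypothetical helpers. It shows that a zero bias reproduces the bias=False output, while a non-zero bias changes the direction of the embedding even after L2 normalization.

```python
import math


def project(x, weight, bias=None):
    # y = W x (+ b), with W given as a list of rows.
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def l2_normalize(v):
    # Scale the vector to unit length, as done per token before scoring.
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


# Toy values, chosen only for illustration.
weight = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, 4.0]

no_bias = l2_normalize(project(x, weight))
zero_bias = l2_normalize(project(x, weight, bias=[0.0, 0.0]))
trained_bias = l2_normalize(project(x, weight, bias=[2.0, -1.0]))

# Cosine similarity between the bias-free and biased embeddings.
cosine = sum(a * b for a, b in zip(no_bias, trained_bias))

print(no_bias == zero_bias)   # zero-initialized bias leaves the output unchanged
print(cosine < 1.0)           # a non-zero bias rotates the normalized embedding
```

Because normalization removes only the length of the vector, the rotation introduced by the bias carries through to any dot-product-based ranking.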

2. Causal attention (should be bidirectional)

ColQwen3.5 retrieval encodes bidirectionally, but the Qwen3-Next full_attention
layers built Attention(...) with the default AttentionType.DECODER and never
read config.is_causal. This PR:

  • adds ColQwen3_5Config (a VerifyAndUpdateConfig subclassing the Qwen3.5 one so
    mamba-cache handling is preserved) setting is_causal=False;
  • has Qwen3NextAttention pass AttentionType.ENCODER_ONLY when is_causal is
    False (mirrors qwen3.py; generation arches leave is_causal unset → unchanged;
    the GatedDeltaNet linear_attention layers are untouched);
  • makes Attention.get_kv_cache_spec return None for ENCODER_ONLY/ENCODER
    (encoder attention keeps no autoregressive KV cache; the hybrid runner iterates
    every attention module for its KV spec, so it reaches this for the now-encoder
    full_attention layers — previously it hit assert attn_type == DECODER).
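
The wiring described in the bullets above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `AttnType` is a hypothetical stand-in for vLLM's `AttentionType`, and `select_attn_type`, `needs_kv_cache`, and `attention_mask` are invented helper names. Only the decision logic (read `is_causal`, default to causal when it is unset, skip the KV cache for encoder attention) follows the description.

```python
from enum import Enum
from types import SimpleNamespace


class AttnType(Enum):
    """Hypothetical stand-in for the attention types named in the PR."""
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_ONLY = "encoder_only"


def select_attn_type(config):
    # Configs that never set is_causal (generation architectures) stay causal.
    is_causal = getattr(config, "is_causal", True)
    return AttnType.DECODER if is_causal else AttnType.ENCODER_ONLY


def needs_kv_cache(attn_type):
    # Encoder attention processes the whole sequence in one pass, so there
    # is no autoregressive state to cache between steps.
    return attn_type is AttnType.DECODER


def attention_mask(seq_len, attn_type):
    # mask[i][j] is True when query position i may attend to key position j.
    causal = attn_type is AttnType.DECODER
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]


generation_cfg = SimpleNamespace()                # is_causal unset
retrieval_cfg = SimpleNamespace(is_causal=False)  # the value ColQwen3_5Config sets

causal_rows = attention_mask(3, select_attn_type(generation_cfg))
bidirectional_rows = attention_mask(3, select_attn_type(retrieval_cfg))

print(causal_rows[0])         # first token sees only itself
print(bidirectional_rows[0])  # first token sees every position
```

Defaulting to `True` when the attribute is missing is what keeps the generation architectures on their existing path in this sketch.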

3. Processor budget + prompt contract (docs)

The rerank example documents the visual-token budget
(max_num_visual_tokens=1792 -> max_pixels=1792*32^2=1835008, min_pixels=65536,
via --mm-processor-kwargs) and the image instruction template + query
augmentation suffix that ColQwen3_5Processor applies, needed to match the native
pipeline.
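
The budget arithmetic can be checked with a few lines of Python. This is a sketch only: the constant names are made up here, and reading the 32^2 factor as "pixels covered by one visual token" is an assumption inferred from the formula above. The `min_pixels` and `max_pixels` keys are the ones quoted in the description; the printed JSON is the kind of value that would be passed to `--mm-processor-kwargs`.

```python
import json

# Assumption: each visual token covers a 32x32 pixel area, per the 32^2 factor.
PIXELS_PER_TOKEN_SIDE = 32
MAX_NUM_VISUAL_TOKENS = 1792

max_pixels = MAX_NUM_VISUAL_TOKENS * PIXELS_PER_TOKEN_SIDE ** 2
min_pixels = 65536  # value quoted in the PR description

# Under the same assumption, the lower bound corresponds to this many tokens.
min_num_visual_tokens = min_pixels // PIXELS_PER_TOKEN_SIDE ** 2

mm_processor_kwargs = {"min_pixels": min_pixels, "max_pixels": max_pixels}
print(json.dumps(mm_processor_kwargs))
# prints {"min_pixels": 65536, "max_pixels": 1835008}
```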

Validation

B200, vLLM 0.22.1, vs a transformers+colpali reference on Vidore3HrRetrieval
(full corpus, 6 languages), canonical fp32 per-query MaxSim:

| path | ndcg@10 | Δ vs reference |
|---|---|---|
| transformers + ColQwen3_5Processor (reference) | 0.6478 | |
| vLLM in-tree (before) | 0.6231 | −0.0247 |
| vLLM + this PR | 0.6558 | +0.0080 |
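
For readers unfamiliar with the metric, the snippet below sketches one common formulation of ndcg@k (linear gain, log2 rank discount) and re-derives the Δ column from the reported scores. The exact variant used by the benchmark harness is not stated here, so treat the formula as an assumption; `dcg_at_k` and `ndcg_at_k` are illustrative helpers, and the input is assumed to be the relevance labels of the full ranking.

```python
import math


def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance divided by log2(rank + 1),
    # with ranks starting at 1 (enumerate starts at 0, hence the +2).
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending relevance) ordering.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0])   # relevant document ranked first
second = ndcg_at_k([0, 1, 0])    # relevant document ranked second

# Deltas relative to the reference, from the reported scores.
reference, before, after = 0.6478, 0.6231, 0.6558
delta_before = round(before - reference, 4)
delta_after = round(after - reference, 4)
print(delta_before, delta_after)
```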

The ~2.5pt deficit is eliminated; the residual +0.8pt is cross-stack numerical
noise (transformers+SDPA vs vLLM+FlashInfer, bf16).
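
The per-query MaxSim scoring named in the validation setup can be sketched in a few lines. This is a generic late-interaction scorer in pure Python, not the evaluation code used for the numbers above; the toy vectors are invented, and `dot` and `maxsim` are illustrative helpers. For each query token it takes the best-matching document token and sums those maxima.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    # Late interaction: each query token picks its best document token,
    # and the per-token maxima are summed into one relevance score.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy unit-length token embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]

scores = {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the score is built from per-token dot products, any change in token embedding direction (for example from a dropped projection bias or a different attention mask) can reorder the ranking.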

Notes

  • No new architecture: both checkpoints keep architectures: ["ColQwen3_5"] and
    load via the corrected class — consistent with how MTEB loads them.
  • Adds a no-GPU test for the bidirectional config wiring; lists the 8B checkpoint
    in the token-embed docs and example.
  • The edits were validated via the equivalent out-of-tree plugin on B200/0.22.1;
    please run pytest tests/models/multimodal/pooling/test_colqwen3_5.py and a
    Vidore3 sweep in CI before merge.
  • Backbone hybrid-FLA correctness (#38643, "[Bug]: Qwen3.5
    (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch
    causes gibberish output") is orthogonal; not affected on B200/bf16 in this
    validation.
@mergify

mergify Bot commented Jun 18, 2026

Contributor
@mergify mergify Bot added documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) qwen Related to Qwen models labels Jun 18, 2026
The in-tree ColQwen3_5 model (added in vllm-project#36887) deviates from the native colpali
ColQwen3_5Processor inference pipeline in three silent ways, none caught by the
existing sanity-level tests. Measured cost: ~2.5 ndcg@10 on Vidore3 retrieval.
This affects every ColQwen3.5 checkpoint, e.g. athrael-soju/colqwen3.5-4.5B-v3 and
athrael-soju/VultronRetrieverPrime-Qwen3.5-8B.

1. Dropped projection bias. colpali defines `custom_text_proj = nn.Linear(hidden,
   dim)` (bias=True by default) and the trained checkpoints ship a
   `custom_text_proj.bias`, but the class built the projection with bias=False, so
   `load_weights` silently skipped the trained bias (it survives per-token L2-norm
   as a direction change, shifting MaxSim ranking). Build with bias and zero-init:
   a bias-less checkpoint behaves identically to bias=False, while a trained bias
   is loaded instead of dropped.

2. Causal attention. ColQwen3.5 retrieval encodes bidirectionally, but the
   Qwen3-Next `full_attention` layers built `Attention(...)` with the default
   `AttentionType.DECODER` and never read `config.is_causal`, so
   `--hf-overrides is_causal` was inert. Add a `ColQwen3_5Config`
   (VerifyAndUpdateConfig, subclassing the Qwen3.5 one so mamba-cache handling is
   preserved) that sets `is_causal=False`, and have `Qwen3NextAttention` pass
   `AttentionType.ENCODER_ONLY` when `is_causal` is False (mirrors qwen3.py; the
   generation arches leave `is_causal` unset and are unaffected; the GatedDeltaNet
   `linear_attention` layers are untouched). Encoder-only attention keeps no
   autoregressive KV cache, so `Attention.get_kv_cache_spec` returns None for
   ENCODER_ONLY/ENCODER (the hybrid runner iterates every attention module for its
   KV spec, so it reaches this for the now-encoder full_attention layers —
   previously it hit `assert attn_type == DECODER`).

3. Processor budget + prompt contract. The rerank example documents the
   visual-token budget (`max_num_visual_tokens=1792` -> `max_pixels=1792*32^2`,
   `min_pixels=65536`, via `--mm-processor-kwargs`) and the image instruction
   template + query augmentation suffix that `ColQwen3_5Processor` applies, needed
   to match the native pipeline.

Validation (B200, vLLM 0.22.1) against a transformers+colpali reference on
Vidore3HrRetrieval (full corpus, 6 languages): with all three, vLLM scores 0.6558
vs the reference 0.6478 (Δ +0.0080), up from 0.6231 (Δ −0.0247) on the stock path.
The native pipeline value is reproduced within cross-stack numerical noise.

No new architecture: both ColQwen3.5 checkpoints load via the existing
`ColQwen3_5` arch with the published config, matching how MTEB loads them through
`ColQwen3_5Wrapper`. Adds a no-GPU test for the bidirectional config wiring and
lists the 8B checkpoint in the docs/example.

Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@athrael-soju athrael-soju force-pushed the fix-colqwen3_5-retrieval-correctness branch from 61be703 to ac29129 Compare June 18, 2026 22:20
@athrael-soju

Contributor Author

@noooop are you able to add the ready label for the build to resume? Thanks

The 8B Prime checkpoint moved from athrael-soju/ to vultr/. Update the
model docstring, online rerank example, and supported-models table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
@athrael-soju athrael-soju force-pushed the fix-colqwen3_5-retrieval-correctness branch from 33ca8be to 39ebd66 Compare June 19, 2026 16:01
@noooop noooop added ready ONLY add when PR is ready to merge/full CI is needed build-docs labels Jun 22, 2026

@noooop noooop left a comment

Collaborator

thanks

@DarkLight1337 DarkLight1337 merged commit 3c8e495 into vllm-project:main Jun 22, 2026
95 of 96 checks passed
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…llm-project#46108)

Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…llm-project#46108)

Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Labels

build-docs documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

3 participants