[Model] ColQwen3.5: fix retrieval correctness (bias + bidirectional) #46108
Merged
DarkLight1337 merged 3 commits intoJun 22, 2026
The in-tree ColQwen3_5 model (added in vllm-project#36887) deviates from the native colpali ColQwen3_5Processor inference pipeline in three silent ways, none caught by the existing sanity-level tests. Measured cost: ~2.5 ndcg@10 on Vidore3 retrieval. This affects every ColQwen3.5 checkpoint, e.g. athrael-soju/colqwen3.5-4.5B-v3 and athrael-soju/VultronRetrieverPrime-Qwen3.5-8B.

1. Dropped projection bias. colpali defines `custom_text_proj = nn.Linear(hidden, dim)` (bias=True by default) and the trained checkpoints ship a `custom_text_proj.bias`, but the class built the projection with bias=False, so `load_weights` silently skipped the trained bias (it survives per-token L2-norm as a direction change, shifting MaxSim ranking). Build with bias and zero-init: a bias-less checkpoint behaves identically to bias=False, while a trained bias is loaded instead of dropped.

2. Causal attention. ColQwen3.5 retrieval encodes bidirectionally, but the Qwen3-Next `full_attention` layers built `Attention(...)` with the default `AttentionType.DECODER` and never read `config.is_causal`, so `--hf-overrides is_causal` was inert. Add a `ColQwen3_5Config` (VerifyAndUpdateConfig, subclassing the Qwen3.5 one so mamba-cache handling is preserved) that sets `is_causal=False`, and have `Qwen3NextAttention` pass `AttentionType.ENCODER_ONLY` when `is_causal` is False (mirrors qwen3.py; the generation arches leave `is_causal` unset and are unaffected; the GatedDeltaNet `linear_attention` layers are untouched). Encoder-only attention keeps no autoregressive KV cache, so `Attention.get_kv_cache_spec` returns None for ENCODER_ONLY/ENCODER (the hybrid runner iterates every attention module for its KV spec, so it reaches this for the now-encoder full_attention layers; previously it hit `assert attn_type == DECODER`).

3. Processor budget + prompt contract. The rerank example documents the visual-token budget (`max_num_visual_tokens=1792` -> `max_pixels=1792*32^2`, `min_pixels=65536`, via `--mm-processor-kwargs`) and the image instruction template + query augmentation suffix that `ColQwen3_5Processor` applies, needed to match the native pipeline.

Validation (B200, vLLM 0.22.1) against a transformers+colpali reference on Vidore3HrRetrieval (full corpus, 6 languages): with all three, vLLM scores 0.6558 vs the reference 0.6478 (Δ +0.0080), up from 0.6231 (Δ −0.0247) on the stock path. The native pipeline value is reproduced within cross-stack numerical noise.

No new architecture: both ColQwen3.5 checkpoints load via the existing `ColQwen3_5` arch with the published config, matching how MTEB loads them through `ColQwen3_5Wrapper`.

Adds a no-GPU test for the bidirectional config wiring and lists the 8B checkpoint in the docs/example.

Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@noooop are you able to add the |
The 8B Prime checkpoint moved from athrael-soju/ to vultr/. Update the model docstring, online rerank example, and supported-models table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
nkzhenhua
pushed a commit
to nkzhenhua/vllm
that referenced
this pull request
Jun 24, 2026
…llm-project#46108) Signed-off-by: Athrael Soju <athrael.soju@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
qli88
pushed a commit
to qli88/vllm
that referenced
this pull request
Jun 26, 2026
…llm-project#46108) Signed-off-by: Athrael Soju <athrael.soju@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Summary
Follow-up to #36887. The in-tree `ColQwen3_5` model deviates from the native colpali `ColQwen3_5Processor` inference pipeline in three silent ways, none caught by the current sanity-level tests. Measured cost: ~2.5 ndcg@10 on Vidore3.

This affects every ColQwen3.5 checkpoint, e.g. `athrael-soju/colqwen3.5-4.5B-v3` and `athrael-soju/VultronRetrieverPrime-Qwen3.5-8B`, both of which MTEB loads through the one `ColQwen3_5Wrapper`. This fixes the shared class so both load correctly on vanilla vLLM with their published `architectures: ["ColQwen3_5"]`.

1. Dropped projection bias
colpali's `custom_text_proj = nn.Linear(hidden, dim)` is bias=True by default and the trained checkpoints ship a `custom_text_proj.bias`, but the class built it with `bias=False`, so `load_weights` silently skipped the trained bias. The projection is now built with a bias and zero-initialized: a bias-less checkpoint is unchanged; a trained bias is loaded instead of dropped.
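To illustrate why zero-initializing is safe and why a dropped bias still matters after normalization, here is a minimal pure-Python sketch with toy numbers (not values from any checkpoint). The helpers `linear`, `l2_normalize`, and `cosine` are hypothetical stand-ins for `nn.Linear` and the per-token L2 normalization; they are not vLLM or colpali code.

```python
import math


def linear(x, weight, bias=None):
    """Compute y = W x (+ b), with `weight` given as a list of rows."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        y = [yi + bi for yi, bi in zip(y, bias)]
    return y


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy values chosen for illustration only.
weight = [[1.0, 0.0], [0.0, 1.0]]
token = [1.0, 0.0]
trained_bias = [0.0, 1.0]

no_bias = linear(token, weight)                       # the old bias=False path
zero_bias = linear(token, weight, bias=[0.0, 0.0])    # bias-less checkpoint, zero-init
with_bias = linear(token, weight, bias=trained_bias)  # checkpoint that ships a bias

# A zero-initialized bias reproduces the bias=False output exactly.
same_as_before = no_bias == zero_bias

# A non-zero bias rotates the embedding, so L2 normalization cannot undo it.
direction_overlap = cosine(no_bias, with_bias)

print(same_as_before)
print(round(direction_overlap, 4))
```

In this toy case the biased and unbiased embeddings point in clearly different directions, which is the kind of shift that can reorder MaxSim rankings.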
2. Causal attention (should be bidirectional)
ColQwen3.5 retrieval encodes bidirectionally, but the Qwen3-Next `full_attention` layers built `Attention(...)` with the default `AttentionType.DECODER` and never read `config.is_causal`. This PR:

- adds `ColQwen3_5Config` (a `VerifyAndUpdateConfig` subclassing the Qwen3.5 one so mamba-cache handling is preserved) setting `is_causal=False`;
- has `Qwen3NextAttention` pass `AttentionType.ENCODER_ONLY` when `is_causal` is False (mirrors `qwen3.py`; generation arches leave `is_causal` unset → unchanged; the GatedDeltaNet `linear_attention` layers are untouched);
- makes `Attention.get_kv_cache_spec` return `None` for `ENCODER_ONLY`/`ENCODER` (encoder attention keeps no autoregressive KV cache; the hybrid runner iterates every attention module for its KV spec, so it reaches this for the now-encoder `full_attention` layers, where it previously hit `assert attn_type == DECODER`).

3. Processor budget + prompt contract (docs)
The rerank example documents the visual-token budget (`max_num_visual_tokens=1792` → `max_pixels=1792*32^2=1835008`, `min_pixels=65536`, via `--mm-processor-kwargs`) and the image instruction template + query augmentation suffix that `ColQwen3_5Processor` applies, needed to match the native pipeline.
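The budget arithmetic above can be checked with a short sketch. The constant name `PIXELS_PER_TOKEN_EDGE` is hypothetical and simply encodes the `32^2` factor from the text, and the JSON keys are assumed to match the `max_pixels` / `min_pixels` names quoted above; treat the resulting string as the general shape of a `--mm-processor-kwargs` value rather than a verified invocation.

```python
import json

# Assumption: each visual token covers a 32x32 pixel area (the 32^2 factor above).
PIXELS_PER_TOKEN_EDGE = 32
MAX_NUM_VISUAL_TOKENS = 1792
MIN_PIXELS = 65536

pixels_per_token = PIXELS_PER_TOKEN_EDGE ** 2

# Upper bound on image area implied by the visual-token budget.
max_pixels = MAX_NUM_VISUAL_TOKENS * pixels_per_token

# The lower bound expressed in visual tokens, for comparison.
min_tokens = MIN_PIXELS // pixels_per_token

# JSON payload of the general shape passed via --mm-processor-kwargs.
mm_processor_kwargs = json.dumps(
    {"max_pixels": max_pixels, "min_pixels": MIN_PIXELS}
)

print(max_pixels)
print(min_tokens)
print(mm_processor_kwargs)
```

Under these assumptions the token budget maps to 1835008 pixels, and the `min_pixels` floor corresponds to 64 visual tokens.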
Validation
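B200, vLLM 0.22.1, vs a transformers+colpali reference on Vidore3HrRetrieval (full corpus, 6 languages), canonical fp32 per-query MaxSim:

| Pipeline | Score | Δ vs reference |
| --- | --- | --- |
| transformers + colpali `ColQwen3_5Processor` (reference) | 0.6478 | |
| vLLM, stock path | 0.6231 | −0.0247 |
| vLLM, all three fixes | 0.6558 | +0.0080 |

The ~2.5pt deficit is eliminated; the residual +0.8pt is cross-stack numerical noise (transformers+SDPA vs vLLM+FlashInfer, bf16).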
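For readers unfamiliar with the scoring rule, the following is a minimal sketch of per-query MaxSim over token embeddings: for each query token, take the best dot product against any document token, then sum over query tokens. The function name `maxsim` and the toy vectors are illustrative assumptions, and Python floats are double precision, so this shows only the rule, not the fp32 evaluation harness used for the numbers above.

```python
def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best doc-token match."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )


# Toy unit-length token embeddings, chosen for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)

# Rank documents by descending MaxSim score.
ranking = sorted(
    [("doc_a", score_a), ("doc_b", score_b)],
    key=lambda pair: pair[1],
    reverse=True,
)

print(score_a, score_b)
print([name for name, _ in ranking])
```

Because every query token contributes its own maximum, a small change in token directions can change which document wins, which is why the ranking is sensitive to the projection and attention fixes.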
Notes
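- No new architecture: both checkpoints keep their published `architectures: ["ColQwen3_5"]` and load via the corrected class, consistent with how MTEB loads them.
- Adds a no-GPU test for the bidirectional config wiring and lists the 8B checkpoint in the token-embed docs and example.
- Please run `pytest tests/models/multimodal/pooling/test_colqwen3_5.py` and a Vidore3 sweep in CI before merge.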
B200/bf16 in this validation.