[Bugfix][Qwen3-VL] Fix multi-video crash with list-valued fps/num_frames #46305

Merged
ywang96 merged 2 commits into vllm-project:main from Sunt-ing:mm-2
Jun 21, 2026
@Sunt-ing

Contributor

Purpose

A Qwen3-VL request that carries more than one video and passes per-video mm_processor_kwargs as a list (one value per video, e.g. fps=[2.0, 4.0] or num_frames=[8, 16]) crashes during preprocessing, before the request reaches inference.

Qwen3VLMultiModalProcessor._call_hf_processor() processes videos in a per-item loop, but it copies the full mm_kwargs to every video without slicing the list-valued per-video kwargs. _get_video_second_idx() then receives the whole list where it expects a scalar:

  • num_frames=[8, 16] raises TypeError: '>' not supported between instances of 'int' and 'list'
  • fps=[2.0, 4.0] raises TypeError: can't multiply sequence by non-int of type 'float'
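Both failures can be reproduced in plain Python, without vLLM. The sketch below is not the body of _get_video_second_idx: the two helper functions are hypothetical stand-ins, and it only assumes (from the error messages above) that the real method compares an int against num_frames and multiplies fps by a float.

```python
# Hypothetical stand-ins for the scalar arithmetic a per-video helper performs.
def clamp_frames(total_num_frames: int, num_frames):
    # Fine for a scalar num_frames; a list makes this an int > list comparison.
    return num_frames if total_num_frames > num_frames else total_num_frames


def frames_for_duration(duration_s: float, fps):
    # Fine for a scalar fps; a list makes this a list * float multiplication.
    return int(fps * duration_s)


def error_message(fn, *args) -> str:
    """Call fn and return the TypeError text, or 'no error' if it succeeds."""
    try:
        fn(*args)
    except TypeError as exc:
        return str(exc)
    return "no error"


num_frames_error = error_message(clamp_frames, 16, [8, 16])
fps_error = error_message(frames_for_duration, 0.5, [2.0, 4.0])
scalar_result = error_message(clamp_frames, 16, 8)

print(num_frames_error)
print(fps_error)
print(scalar_result)
```

The scalar call goes through unchanged, which matches the observation that only the list-valued cases crash.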

List-valued per-video fps is an intended representation: _get_prompt_updates.get_video_replacement_qwen3vl already slices it with is_list_of(sampled_fps, float) / sampled_fps[item_idx]. The later per-video processing path simply missed the same slicing.
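The slicing pattern described above can be sketched as follows. This is an illustration only: is_list_of_floats is a hypothetical local helper standing in for vLLM's is_list_of(sampled_fps, float), and fps_for_item is a made-up name, not a function in the codebase.

```python
from typing import Any


def is_list_of_floats(value: Any) -> bool:
    # Hypothetical stand-in for is_list_of(value, float).
    return isinstance(value, list) and all(isinstance(v, float) for v in value)


def fps_for_item(sampled_fps: Any, item_idx: int) -> Any:
    """Pick the value for one video when fps is a per-video list."""
    if is_list_of_floats(sampled_fps):
        return sampled_fps[item_idx]
    # Scalar (or None): the same value applies to every video.
    return sampled_fps


per_video = [fps_for_item([2.0, 4.0], i) for i in range(2)]
shared = [fps_for_item(2.0, i) for i in range(2)]
print(per_video, shared)
```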

There is a second leak in the same method: after the video loop, the text/image HF processor call also receives the unsliced mm_kwargs. fps/num_frames are video-only kwargs already consumed by the loop, so forwarding the list there fails again with ValueError: Failed to apply Qwen3VLProcessor.

This PR slices list-valued fps/num_frames by item index inside the per-video loop (mirroring _get_prompt_updates), and drops these video-only kwargs from the final text/image processor call. The scalar path is unchanged.
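A minimal sketch of the two steps, assuming mm_kwargs is a plain dict. The function names and the do_resize key are placeholders chosen for illustration; the actual change lives inside Qwen3VLMultiModalProcessor._call_hf_processor and may be structured differently.

```python
from typing import Any

# Video-only kwargs named in this PR.
VIDEO_ONLY_KEYS = ("fps", "num_frames")


def kwargs_for_video(mm_kwargs: dict[str, Any], item_idx: int) -> dict[str, Any]:
    """Copy mm_kwargs for one video, slicing list-valued video-only kwargs."""
    video_kwargs = dict(mm_kwargs)
    for key in VIDEO_ONLY_KEYS:
        value = video_kwargs.get(key)
        if isinstance(value, list):
            video_kwargs[key] = value[item_idx]
    return video_kwargs


def kwargs_without_video_only(mm_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Drop video-only kwargs before the final text/image processor call."""
    return {k: v for k, v in mm_kwargs.items() if k not in VIDEO_ONLY_KEYS}


# "do_resize" is an arbitrary placeholder for a kwarg that is not video-only.
list_kwargs = {"num_frames": [8, 16], "do_resize": True}
scalar_kwargs = {"fps": 2.0, "do_resize": True}

per_video_list = [kwargs_for_video(list_kwargs, i) for i in range(2)]
per_video_scalar = [kwargs_for_video(scalar_kwargs, i) for i in range(2)]
non_video_mm_kwargs = kwargs_without_video_only(list_kwargs)
print(per_video_list, per_video_scalar, non_video_mm_kwargs)
```

With a scalar value every video receives the same kwargs as before, so only requests that pass lists see different behavior.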

This is not covered by #36136 (single-video scalar num_frames timestamp) or #37439 (merge_size timestamp fix).

Test Plan

Processor-level regression added to tests/models/multimodal/processing/test_qwen3_vl.py: a two-video request with list-valued num_frames=[8, 16] and fps=[2.0, 4.0], asserting two video placeholders are produced.

pytest tests/models/multimodal/processing/test_qwen3_vl.py -q

End-to-end repro through a real LLM.generate call with Qwen3-VL-4B-Instruct and two videos, covering four cases: scalar num_frames/fps as controls that already pass, and list-valued num_frames/fps as the failing cases.

Test Result

Processor tests (current main + fix), all pass; without the fix the two new list-kwargs cases fail:

PASSED test_processor_num_frames_timestamp[8-...]
PASSED test_processor_num_frames_timestamp[16-...]
PASSED test_processor_multi_video[2-...]
PASSED test_processor_multi_video[4-...]
PASSED test_processor_multi_video_list_kwargs[hf_mm_kwargs0-...]   # num_frames=[8,16]
PASSED test_processor_multi_video_list_kwargs[hf_mm_kwargs1-...]   # fps=[2.0,4.0]
6 passed

# baseline (fix reverted), same two new cases:
FAILED test_processor_multi_video_list_kwargs[hf_mm_kwargs0-...]
FAILED test_processor_multi_video_list_kwargs[hf_mm_kwargs1-...]
2 failed

End-to-end repro (hardware, environment, script, before/after output)

Hardware: 1x RTX 4090. Model: Qwen3-VL-4B-Instruct, enforce_eager=True, max_model_len=2048.

import numpy as np
from vllm import LLM, SamplingParams


def video(num_frames, fps=30.0):
    arr = np.zeros((num_frames, 128, 128, 3), dtype=np.uint8)
    metadata = {
        "fps": fps,
        "duration": num_frames / fps,
        "total_num_frames": num_frames,
        "frames_indices": list(range(num_frames)),
        "video_backend": "opencv",
        "do_sample_frames": True,
    }
    return arr, metadata


prompt = {
    "prompt": (
        "<|vision_start|><|video_pad|><|vision_end|>"
        "<|vision_start|><|video_pad|><|vision_end|>"
        "Describe the two videos briefly."
    ),
    "multi_modal_data": {"video": [video(16), video(32)]},
}

llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",
    max_model_len=2048,
    max_num_seqs=1,
    enforce_eager=True,
    limit_mm_per_prompt={"image": 0, "video": 2},
)
sp = SamplingParams(temperature=0.0, max_tokens=2)

for name, kwargs in [
    ("scalar_num_frames", {"num_frames": 8}),
    ("list_num_frames", {"num_frames": [8, 16]}),
    ("scalar_fps", {"fps": 2.0}),
    ("list_fps", {"fps": [2.0, 4.0]}),
]:
    try:
        llm.generate(prompt, sampling_params=sp, mm_processor_kwargs=kwargs)
        print(f"CASE {name} OK")
    except Exception as exc:
        print(f"CASE {name} FAIL {type(exc).__name__}: {exc}")

Before (current main):

CASE scalar_num_frames OK
CASE list_num_frames FAIL TypeError: '>' not supported between instances of 'int' and 'list'
CASE scalar_fps OK
CASE list_fps FAIL TypeError: can't multiply sequence by non-int of type 'float'

After (with fix):

CASE scalar_num_frames OK
CASE list_num_frames OK
CASE scalar_fps OK
CASE list_fps OK

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

cc @DarkLight1337

A Qwen3-VL request with more than one video that passes per-video
mm_processor_kwargs as a list (one value per video, e.g.
fps=[2.0, 4.0] or num_frames=[8, 16]) crashes during preprocessing.

Qwen3VLMultiModalProcessor._call_hf_processor processes videos in a
per-item loop but copies the full mm_kwargs to every video without
slicing the list-valued per-video kwargs, so _get_video_second_idx
receives the whole list where a scalar is expected:
- num_frames=[8, 16] -> TypeError: '>' not supported between 'int' and 'list'
- fps=[2.0, 4.0] -> TypeError: can't multiply sequence by non-int of type 'float'

List-valued per-video fps is an intended representation;
_get_prompt_updates already slices it with is_list_of(sampled_fps, float).
The per-video processing path simply missed the same slicing.

There is a second leak: after the video loop, the text/image processor
call also receives the unsliced mm_kwargs. fps/num_frames are video-only
kwargs already consumed by the loop, so forwarding the list there fails
with ValueError: Failed to apply Qwen3VLProcessor.

Slice list-valued fps/num_frames by item index in the per-video loop
(mirroring _get_prompt_updates) and drop these video-only kwargs from the
final text/image processor call. The scalar path is unchanged.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@mergify mergify Bot added the multi-modality, qwen, and bug labels Jun 21, 2026

@ywang96 ywang96 left a comment

Member

Thanks for the fix! I left a nit

Comment on lines +1372 to +1376
# fps/num_frames are video-only kwargs already consumed by the loop;
# exclude them so the text/image processor call below never gets a list.
text_mm_kwargs = {
    k: v for k, v in mm_kwargs.items() if k not in ("fps", "num_frames")
}

Member

Nit: rename this to non_video_mm_kwargs for clarity.

Contributor Author

Done, thanks~

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@ywang96 ywang96 added the ready label Jun 21, 2026
@ywang96 ywang96 merged commit 12fe2a9 into vllm-project:main Jun 21, 2026
6 of 7 checks passed
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…mes (vllm-project#46305)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>


3 participants