[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention by liulanze · Pull Request #44264 · vllm-project/vllm

liulanze · 2026-06-02T00:17:04Z

Purpose

During profile_run the multimodal grid_thw arrives on CPU, so the cu_seqlens tensor built from it inherits the CPU device. Passing this CPU tensor into the FA3 vit attention path raises RuntimeError: cu_seqlens_q must be on CUDA and crashes engine init when loading Qwen/Qwen3-Omni-30B-A3B-Thinking (and the Instruct variant).

Move cu_seqlens to self.device after construction so the FA3 wrapper receives a CUDA tensor regardless of where grid_thw lives. The sibling qwen3_vl model already routes cu_seqlens through MMEncoderAttention.maybe_recompute_cu_seqlens(..., device=self.device) for the same reason; this PR mirrors that guarantee minimally for qwen3_omni_moe_thinker.
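To make the failure mode concrete, here is a minimal sketch of how a cu_seqlens tensor derived from a CPU grid_thw stays on CPU until it is moved explicitly. The helper name build_cu_seqlens and the exact construction (one sequence of h * w patches per frame, prefixed with 0) are assumptions for illustration, following the pattern commonly used in Qwen-style vision transformers, and are not copied from the vLLM source. Only the final .to(..., non_blocking=True) line corresponds to the change in this PR, and device stands in for self.device.

```python
import torch
import torch.nn.functional as F


def build_cu_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cumulative patch offsets, one sequence per frame."""
    # Each (t, h, w) row contributes t sequences of h * w patches.
    seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0])
    cu_seqlens = seqlens.cumsum(dim=0, dtype=torch.int32)
    # Prepend 0 so that sequence i spans cu_seqlens[i]:cu_seqlens[i + 1].
    return F.pad(cu_seqlens, (1, 0), value=0)


# Stand-in for self.device; falls back to CPU so the sketch runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# As in profile_run, grid_thw starts out on CPU.
grid_thw = torch.tensor([[1, 2, 2], [2, 4, 4]])

cu_seqlens = build_cu_seqlens(grid_thw)  # inherits the CPU device from grid_thw
cu_seqlens = cu_seqlens.to(device, non_blocking=True)  # the one-line fix
```

With a CUDA device available, the last line is what keeps the attention kernel from ever seeing a CPU tensor; without one, the sketch still runs and the move changes nothing.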

Why this is not a duplicate: gh pr list --repo vllm-project/vllm --state open --search "qwen3_omni" returns 0 open PRs.

Test Plan

Reproduce and verify on a single H100 80 GB (CUDA 13, driver 580.105.08), matching the reporter's environment.

uv venv --python 3.12 && source .venv/bin/activate
uv pip install vllm==0.22.0 hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download Qwen/Qwen3-Omni-30B-A3B-Thinking

Repro script (matches the failing config in #44180):

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    limit_mm_per_prompt={"image": 1, "video": 1, "audio": 1},
)
print("LOAD_OK")

Lint:

ruff check vllm/model_executor/models/qwen3_omni_moe_thinker.py
ruff format --check vllm/model_executor/models/qwen3_omni_moe_thinker.py

Test Result

Before patch (crashes during profile_run):

File ".../vllm/model_executor/models/qwen3_omni_moe_thinker.py", line 671, in forward
    x = x + self.attn(...
File ".../vllm/v1/attention/ops/vit_attn_wrappers.py", line 51, in flash_attn_maxseqlen_wrapper
    output = flash_attn_varlen_func(...
File ".../vllm/vllm_flash_attn/flash_attn_interface.py", line 328, in flash_attn_varlen_func
    out, softmax_lse, _, _ = torch.ops._vllm_fa3_C.fwd(...
RuntimeError: cu_seqlens_q must be on CUDA
RuntimeError: Engine core initialization failed.

After patch:

INFO 06-02 00:02:29 [gpu_worker.py:466] Available KV cache memory: 9.41 GiB
INFO 06-02 00:02:29 [kv_cache_utils.py:1733] GPU KV cache size: 102,784 tokens
INFO 06-02 00:02:37 [core.py:309] init engine (profile, create kv cache, warmup model) took 873.27 s
LOAD_OK

End-to-end image VIT inference (224×224 solid-red PIL image, "Color shown? One word."):

OUT: <think>
Got it, the user is asking for the color shown in the image, and it's a single word. The image is a solid red color, ...

Lint: All checks passed! and 1 file already formatted.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

ZJY0516 · 2026-06-08T16:34:07Z

@claude review

claude

LGTM — minimal targeted device fix for FA3 VIT attention, mirrors existing qwen3_vl pattern.

Extended reasoning...

Overview

This PR adds a single line (cu_seqlens = cu_seqlens.to(self.device, non_blocking=True)) in Qwen3Omni_VisionTransformer.forward in vllm/model_executor/models/qwen3_omni_moe_thinker.py. The fix addresses a crash during profile_run where grid_thw arrives on CPU, causing the derived cu_seqlens to also be on CPU, which then fails when passed to FA3 (cu_seqlens_q must be on CUDA).

Security risks

None — this is a tensor device placement change, no auth/crypto/permission surface.

Level of scrutiny

Low. This is a straightforward bugfix that fixes an engine initialization crash for a specific model. The fix is mechanical (move a tensor to the correct device), follows an existing pattern in the sibling qwen3_vl model, and is contained to a single forward pass. The author provided clear before/after repro logs from a real H100 run.

Other factors

The change is additive and idempotent — calling .to(device) on an already-on-device tensor is a no-op.
The PR author has tested end-to-end inference (image VIT inference with a sample prompt), not just engine init.
Bug hunting system found no issues.
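The idempotence point above can be checked with a short sketch. It relies on the documented behavior of torch.Tensor.to, which returns the tensor itself when it already has the requested dtype and device. The tensor values and the device variable are illustrative assumptions, not taken from the PR.

```python
import torch

# Stand-in for self.device; falls back to CPU so the sketch runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A cu_seqlens-like tensor that already lives on the target device.
cu_seqlens = torch.tensor([0, 4, 20, 36], dtype=torch.int32, device=device)

# Moving it to the device it is already on allocates and copies nothing:
# .to() hands back the very same tensor object.
moved = cu_seqlens.to(device, non_blocking=True)
same_object = moved is cu_seqlens
print(same_object)
```

So the added line costs a real copy only in the case the PR targets, where grid_thw and the derived cu_seqlens start on CPU.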

liulanze · 2026-06-08T19:32:59Z

@ZJY0516 Thanks for the review.

The entrypoints-integration-api-server-openai-part failure is unrelated: it tests the OpenAI HTTP API endpoints, not model internals, and it showed a PASS→FAIL→PASS flake pattern within the same build.

FYI, I've rebased onto the latest main so that the pipeline reruns.

mergify · 2026-06-08T19:35:56Z

Hi @liulanze, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

During profile_run the multimodal grid_thw arrives on CPU, so the cu_seqlens tensor built from it inherits the CPU device. Passing this CPU tensor into the FA3 vit attention path raises "RuntimeError: cu_seqlens_q must be on CUDA" and crashes engine init.

Move cu_seqlens to self.device after construction so the FA3 wrapper receives a CUDA tensor regardless of where grid_thw lives. The sibling qwen3_vl model already routes cu_seqlens through MMEncoderAttention.maybe_recompute_cu_seqlens(..., device=self.device) for the same reason; this mirrors that guarantee minimally.

Fixes vllm-project#44180

Signed-off-by: Lanze Liu <lanzetech@gmail.com>

mergify · 2026-06-08T21:20:35Z

Documentation preview: https://vllm--44264.org.readthedocs.build/en/44264/

liulanze · 2026-06-08T21:33:35Z

Related docs/readthedocs.org:vllm failure fix: #44929

…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com>

…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

liulanze requested review from sighingnow and vadiklyutiy as code owners June 2, 2026 00:17

mergify Bot added the qwen (Related to Qwen models) and bug (Something isn't working) labels Jun 2, 2026

ZJY0516 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 8, 2026

claude Bot reviewed Jun 8, 2026

liulanze force-pushed the fix/qwen3-omni-cu-seqlens-device branch from 884deb6 to 14242dd June 8, 2026 19:30

liulanze force-pushed the fix/qwen3-omni-cu-seqlens-device branch from 14242dd to 6d965db June 8, 2026 21:19

mergify Bot added the documentation (Improvements or additions to documentation) label Jun 8, 2026

liulanze force-pushed the fix/qwen3-omni-cu-seqlens-device branch from 6d965db to b4685ca June 8, 2026 21:24

Isotr0py approved these changes Jun 9, 2026

Isotr0py enabled auto-merge (squash) June 9, 2026 01:22

vllm-bot merged commit 540aaf2 into vllm-project:main Jun 9, 2026
55 of 56 checks passed

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention (vllm-project#44264)

f813d24

Signed-off-by: Lanze Liu <lanzetech@gmail.com>

vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026

[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention (vllm-project#44264)

26b61c2

Signed-off-by: Lanze Liu <lanzetech@gmail.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention (vllm-project#44264)

19c04a0

Signed-off-by: Lanze Liu <lanzetech@gmail.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention (vllm-project#44264)

7043d5d

Signed-off-by: Lanze Liu <lanzetech@gmail.com>

ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026

[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention (vllm-project#44264)

89200fe

Signed-off-by: Lanze Liu <lanzetech@gmail.com>
