[Bugfix][Model] Qwen3-Omni: move cu_seqlens to GPU before VIT attention #44264
@claude review
LGTM — minimal targeted device fix for FA3 VIT attention, mirrors existing qwen3_vl pattern.
Overview
This PR adds a single line (cu_seqlens = cu_seqlens.to(self.device, non_blocking=True)) in Qwen3Omni_VisionTransformer.forward in vllm/model_executor/models/qwen3_omni_moe_thinker.py. The fix addresses a crash during profile_run where grid_thw arrives on CPU, causing the derived cu_seqlens to also be on CPU, which then fails when passed to FA3 (cu_seqlens_q must be on CUDA).
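To see why cu_seqlens follows grid_thw onto the CPU, it helps to note that it is derived entirely from grid_thw. The sketch below uses plain Python lists rather than tensors and assumes the construction commonly used in Qwen-family vision transformers (a running total of h * w patches per frame); build_cu_seqlens is a hypothetical helper for illustration, not a vLLM function, and the exact code in qwen3_omni_moe_thinker.py may differ.

```python
from itertools import accumulate


def build_cu_seqlens(grid_thw):
    """Cumulative patch counts, one boundary per frame (illustrative only)."""
    per_frame = []
    for t, h, w in grid_thw:
        # Each of the t frames contributes h * w patches to the sequence.
        per_frame.extend([h * w] * t)
    # Prepend 0 so consecutive entries delimit each frame's token range.
    return [0, *accumulate(per_frame)]


# Made-up example: one 4x4 image followed by a 2-frame 2x2 video.
grid_thw = [(1, 4, 4), (2, 2, 2)]
cu_seqlens = build_cu_seqlens(grid_thw)
print(cu_seqlens)  # [0, 16, 20, 24]
```

Because nothing in this derivation touches the model's own device, a tensor version of it stays wherever grid_thw lives unless it is moved explicitly.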
Security risks
None — this is a tensor device placement change, no auth/crypto/permission surface.
Level of scrutiny
Low. This is a straightforward bugfix for an engine initialization crash in a specific model. The fix is mechanical (move a tensor to the correct device), follows an existing pattern in the sibling qwen3_vl model, and is contained to a single forward pass. The author provided clear before/after repro logs from a real H100 run.
Other factors
- The change is additive and idempotent: calling .to(device) on an already-on-device tensor is a no-op.
- The PR author has tested end-to-end inference (image VIT inference with a sample prompt), not just engine init.
- Bug hunting system found no issues.
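The no-op claim can be checked with a small sketch. It relies on the documented PyTorch behavior that Tensor.to returns the tensor itself when dtype and device already match. The device variable stands in for self.device and falls back to CPU when no GPU is present, and the tensor values are made up for illustration.

```python
import torch

# Stand-in for self.device; falls back to CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A CPU tensor, like the one derived from a CPU grid_thw during profile_run.
cu_seqlens = torch.tensor([0, 16, 20, 24], dtype=torch.int32)

# First call moves the tensor if needed; second call has nothing left to do.
moved = cu_seqlens.to(device, non_blocking=True)
moved_again = moved.to(device, non_blocking=True)

same_object = moved_again is moved            # no copy on the second call
values_match = torch.equal(moved.cpu(), cu_seqlens)
print(same_object, values_match)
```

This is why the added line is safe on code paths where grid_thw is already on the GPU: the extra call costs nothing beyond the device check.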
884deb6 to 14242dd
@ZJY0516 Thanks for the review. FYI, I've rebased onto the latest main so the pipeline reruns.
Hi @liulanze, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
During profile_run the multimodal grid_thw arrives on CPU, so the cu_seqlens tensor built from it inherits the CPU device. Passing this CPU tensor into the FA3 vit attention path raises "RuntimeError: cu_seqlens_q must be on CUDA" and crashes engine init.

Move cu_seqlens to self.device after construction so the FA3 wrapper receives a CUDA tensor regardless of where grid_thw lives. The sibling qwen3_vl model already routes cu_seqlens through MMEncoderAttention.maybe_recompute_cu_seqlens(..., device=self.device) for the same reason; this mirrors that guarantee minimally.

Fixes vllm-project#44180

Signed-off-by: Lanze Liu <lanzetech@gmail.com>
14242dd to 6d965db
Documentation preview: https://vllm--44264.org.readthedocs.build/en/44264/
6d965db to b4685ca
Related
…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com>
…on (vllm-project#44264) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
Purpose
Fixes #44180.
During profile_run the multimodal grid_thw arrives on CPU, so the cu_seqlens tensor built from it inherits the CPU device. Passing this CPU tensor into the FA3 vit attention path raises "RuntimeError: cu_seqlens_q must be on CUDA" and crashes engine init when loading Qwen/Qwen3-Omni-30B-A3B-Thinking (and the Instruct variant).
Move cu_seqlens to self.device after construction so the FA3 wrapper receives a CUDA tensor regardless of where grid_thw lives. The sibling qwen3_vl model already routes cu_seqlens through MMEncoderAttention.maybe_recompute_cu_seqlens(..., device=self.device) for the same reason; this PR mirrors that guarantee minimally for qwen3_omni_moe_thinker.
Why this is not a duplicate: gh pr list --repo vllm-project/vllm --state open --search "qwen3_omni" returns 0 open PRs.
Test Plan
Reproduce and verify on a single H100 80 GB (CUDA 13, driver 580.105.08), matching the reporter's environment.
Repro script (matches the failing config in #44180):
Lint:
Test Result
Before patch (crashes during profile_run):
After patch:
End-to-end image VIT inference (224×224 solid-red PIL image, "Color shown? One word."):
Lint: "All checks passed!" and "1 file already formatted."
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.