[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference #40576
@claude review
Code Review
This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
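Two points from the review summary above, caching rotary position IDs with lru_cache and rejecting empty input lists, can be illustrated with a small standalone sketch. This is not the code from the PR: the names rot_pos_ids and build_pos_ids, the default merge_size of 2, and the block-contiguous ordering of patches are assumptions chosen for illustration, and plain tuples stand in for tensors.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def rot_pos_ids(h: int, w: int, merge_size: int) -> tuple:
    """Return (row, col) IDs for an h x w patch grid.

    Patches are ordered so that each merge_size x merge_size block is
    contiguous (an assumed layout, used here only for illustration).
    """
    if h <= 0 or w <= 0 or merge_size <= 0:
        raise ValueError("h, w and merge_size must be positive")
    if h % merge_size or w % merge_size:
        raise ValueError("h and w must be divisible by merge_size")
    ids = []
    for block_row in range(h // merge_size):
        for block_col in range(w // merge_size):
            for dr in range(merge_size):
                for dc in range(merge_size):
                    ids.append((block_row * merge_size + dr,
                                block_col * merge_size + dc))
    # Return an immutable tuple so callers cannot mutate the cached value.
    return tuple(ids)


def build_pos_ids(grid_thw, merge_size=2):
    """Concatenate position IDs for a list of (t, h, w) grids."""
    if not grid_thw:
        # Fail early with a clear message instead of an opaque error later.
        raise ValueError("grid_thw must contain at least one (t, h, w) entry")
    out = []
    for t, h, w in grid_thw:
        # The same spatial IDs are reused for each of the t temporal frames.
        out.extend(rot_pos_ids(h, w, merge_size) * t)
    return out
```

Because the cache key is the (h, w, merge_size) triple, repeated requests with the same image resolution skip the nested loops entirely, which is the kind of saving the summary attributes to lru_cache.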
Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/
@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!
@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated the code to support "auto-infer compilation-config" and passed the following tests: and serve with
@Isotr0py Hi, is there anything else needed for this PR to move forward? Happy to make any changes. Thanks!
Sorry for missing this! Can you provide some multimodal accuracy benchmark results, such as MMMU?
Signed-off-by: grYe99 <guorongye99@gmail.com>
This reverts commit dee6875. Signed-off-by: grYe99 <guorongye99@gmail.com>
Signed-off-by: grYe99 <guorongye99@gmail.com>
@shen-shanshan All your review comments have been resolved. Ready for re-review.
shen-shanshan left a comment
Done with my pass. Also CC @Isotr0py
@Isotr0py Gentle ping. This PR is ready and all comments are resolved. PTAL. Thanks!
Signed-off-by: grYe99 <guorongye99@gmail.com>
Head branch was pushed to by a user without write access
@Isotr0py The previous CI failures (multi-modal-models-standard-2 and 4) were caused by interface changes in #42288 and OOM issues. I've fixed them in commit cff5c90 by:
Those tests now pass. However, a new failure has appeared in cpu-language-generation-and-pooling-model-tests (build link). It seems unrelated to this PR: it is a Rust compilation error (vllm-cmd). How would you like me to proceed?
@Isotr0py CI tests passed. Could you re-enable auto-merge?
…o inference (vllm-project#40576) Signed-off-by: grYe99 <guorongye99@gmail.com> Co-authored-by: grYe99 <guorongye99@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
…o inference (vllm-project#40576) Signed-off-by: grYe99 <guorongye99@gmail.com> Co-authored-by: grYe99 <guorongye99@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
…o inference (vllm-project#40576) Signed-off-by: grYe99 <guorongye99@gmail.com> Co-authored-by: grYe99 <guorongye99@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
…o inference (vllm-project#40576) Signed-off-by: grYe99 <guorongye99@gmail.com> Co-authored-by: grYe99 <guorongye99@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: divineearthly <divineearthly@gmail.com>
Purpose
Following #38175, this PR implements ViT CUDA graph support for image and video inference with glm4_1v models. The implementation is modeled on #35963 (image) and #38061 (video).
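For context, a CUDA graph replays a recorded sequence of kernels with fixed tensor shapes, while a vision encoder sees a different number of patch tokens for each image or video. One common way to reconcile the two is to capture graphs for a few fixed sizes and pad each input up to the nearest one. The sketch below shows only that selection and padding step in plain Python. It is an illustration under assumptions, not the mechanism used in this PR: pick_capture_size, pad_to, and the example size list are hypothetical names and values.

```python
from bisect import bisect_left
from typing import List, Optional


def pick_capture_size(num_tokens: int, capture_sizes: List[int]) -> Optional[int]:
    """Return the smallest captured size that fits num_tokens.

    capture_sizes must be sorted in ascending order. None means no captured
    graph is large enough, so the caller would fall back to eager execution.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    i = bisect_left(capture_sizes, num_tokens)
    return capture_sizes[i] if i < len(capture_sizes) else None


def pad_to(tokens: List[int], size: int, pad_value: int = 0) -> List[int]:
    """Pad a token list on the right so its length equals size."""
    if len(tokens) > size:
        raise ValueError("input is longer than the target size")
    return tokens + [pad_value] * (size - len(tokens))


# Hypothetical capture sizes; a real deployment would choose its own.
sizes = [256, 512, 1024]
chosen = pick_capture_size(300, sizes)  # smallest size >= 300
padded = pad_to(list(range(300)), chosen)
```

Fewer capture sizes mean less memory spent on recorded graphs but more wasted compute on padding, so the size list is a tuning knob rather than a fixed constant.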
Test Plan
1. Functional Test
2. Benchmark
3. Bench Serve
4. Accuracy Verify
Test Result
1. Functional Test
Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1xRTX4090, random-mm, 1000 prompts):
Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2xRTX4090, random-mm, 1000 prompts):
Single GPU (zai-org/GLM-4.6V-Flash, 1xRTX4090, random-mm, 1000 prompts):
Multi GPU (zai-org/GLM-4.6V-Flash, 2xRTX4090, random-mm, 1000 prompts):
eager
cuda graph
4. Accuracy Verify
PR:
main:
Note
Glm4vVisionAttention does not support --mm-encoder-attn-backend FLASHINFER yet, so these tests cover only FLASH_ATTN. FLASHINFER support will be added in a separate PR.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.