[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference #40576

Merged
Isotr0py merged 11 commits into vllm-project:main from grYe99:support_vit_cudagraph_glm4_1v on Jun 9, 2026

grYe99 commented Apr 22, 2026

Contributor

Purpose

Following #38175, this PR implements ViT CUDA graph support for image and video inference with glm4_1v models. The implementation draws on #35963 (image) and #38061 (video).

  1. Functional Test
  2. Benchmark in some scenarios:
    • no DP ViT + eager vs. no DP ViT + CUDA graph.
    • DP ViT + eager vs. DP ViT + CUDA graph.
  3. Bench Serve
  4. Accuracy Verify

Test Plan

1. Functional Test
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"
2. Benchmark
# Image
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 2, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Video
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'
3. Bench Serve
python -m vllm.entrypoints.cli.main serve zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--trust-remote-code \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data  \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

vllm bench serve \
--backend openai-chat \
--model zai-org/GLM-4.6V-Flash \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 100 \
--seed 42 
4. Accuracy Verify
lmms-eval \
    --model vllm \
    --model_args "model=zai-org/GLM-4.1V-9B-Thinking" \
    --tasks mmstar \
    --batch_size 128
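
The `--compilation-config` value repeated in the commands above is plain JSON, so it can be generated instead of typed by hand. The sketch below builds that argument from a dict using only the keys that appear in this PR. `compilation_config_arg` and `pick_budget` are hypothetical helpers, not vLLM APIs. `pick_budget` illustrates one plausible reading of `encoder_cudagraph_token_budgets` (use the smallest budget that fits, otherwise run eagerly); that reading is an assumption and is not confirmed in this thread.

```python
from __future__ import annotations

import bisect
import json
import shlex

# Keys and values copied from the benchmark commands in this PR.
ENCODER_CG_CONFIG = {
    "cudagraph_mm_encoder": True,
    "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
    "encoder_cudagraph_max_vision_items_per_batch": 4,
    "encoder_cudagraph_max_frames_per_batch": 32,
}


def compilation_config_arg(config: dict) -> str:
    """Return a shell-quoted `--compilation-config <json>` argument."""
    return "--compilation-config " + shlex.quote(json.dumps(config))


def pick_budget(num_tokens: int, budgets: list[int]) -> int | None:
    """Hypothetical: smallest budget >= num_tokens, or None to run eagerly."""
    ordered = sorted(budgets)
    idx = bisect.bisect_left(ordered, num_tokens)
    return ordered[idx] if idx < len(ordered) else None


if __name__ == "__main__":
    print(compilation_config_arg(ENCODER_CG_CONFIG))
    budgets = ENCODER_CG_CONFIG["encoder_cudagraph_token_budgets"]
    for n in (100, 512, 513, 4096):
        print(n, "->", pick_budget(n, budgets))
```

Generating the string with `json.dumps` and `shlex.quote` avoids the quoting mistakes that are easy to make when the same JSON is pasted into several commands.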

Test Result

1. Functional Test
# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, I think. The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in full bloom, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------

# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult-sized items look comical. Then, the child is engrossed in a book
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a baby wearing oversized glasses, which are way too big for a baby's face. Babies are usually cute, but the mismatch between the big glasses and the baby's small face creates a humorous contrast. Also, the
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child with oversized glasses, which are probably not meant for them, so that's a funny contrast. The child is "reading" a book, but maybe the glasses make them look like an adult trying to
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult accessories. Then, the child is engrossed in a book, maybe pretending to
--------------------------------------------------
2. Benchmark (encoder_forward_ms)
  • Image Inference:

Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1xRTX4090, random-mm, 1000 prompts):

|Backend   |Mean                     |P99                        |
|----------|-------------------------|---------------------------|
|FLASH_ATTN|+2.88% (4.17ms -> 4.05ms)|+48.34% (9.33ms -> 4.82ms) |

Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2xRTX4090, random-mm, 1000 prompts):

|Backend   |Mean                      |P99                        |
|----------|--------------------------|---------------------------|
|FLASH_ATTN|+62.73% (6.01ms -> 2.24ms)|+62.43% (24.73ms -> 9.29ms)|

  • Video Inference:

Single GPU (zai-org/GLM-4.6V-Flash, 1xRTX4090, random-mm, 1000 prompts):

|Backend   |Mean                      |P99                        |
|----------|--------------------------|---------------------------|
|FLASH_ATTN|+37.47% (7.26ms -> 4.54ms)|+63.07% (25.78ms -> 9.52ms)|

Multi GPU (zai-org/GLM-4.6V-Flash, 2xRTX4090, random-mm, 1000 prompts):

|Backend   |Mean                      |P99                        |
|----------|--------------------------|---------------------------|
|FLASH_ATTN|+66.77% (9.51ms -> 3.16ms)|+62.44% (26.17ms -> 9.83ms)|
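
The percentages above appear to be the relative reduction in `encoder_forward_ms`, computed as (before - after) / before. The thread does not state the formula or say explicitly that "before" is eager and "after" is CUDA graph, so both are assumptions. The short check below recomputes each figure from the reported millisecond values.

```python
# (label, before_ms, after_ms, reported_percent), copied from the results above.
RESULTS = [
    ("image 1xGPU mean", 4.17, 4.05, 2.88),
    ("image 1xGPU p99", 9.33, 4.82, 48.34),
    ("image 2xGPU mean", 6.01, 2.24, 62.73),
    ("image 2xGPU p99", 24.73, 9.29, 62.43),
    ("video 1xGPU mean", 7.26, 4.54, 37.47),
    ("video 1xGPU p99", 25.78, 9.52, 63.07),
    ("video 2xGPU mean", 9.51, 3.16, 66.77),
    ("video 2xGPU p99", 26.17, 9.83, 62.44),
]


def reduction_percent(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction in percent (assumed formula)."""
    return (before_ms - after_ms) / before_ms * 100.0


if __name__ == "__main__":
    for label, before, after, reported in RESULTS:
        computed = reduction_percent(before, after)
        print(f"{label:18s} computed={computed:6.2f}%  reported={reported:6.2f}%")
```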
3. Bench Serve
eager
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  28.34     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.53      
Output token throughput (tok/s):         451.65    
Peak output token throughput (tok/s):    4100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4340.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          18770.72  
Median TTFT (ms):                        19287.97  
P99 TTFT (ms):                           25433.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.09     
Median TPOT (ms):                        67.74     
P99 TPOT (ms):                           187.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.53     
Median ITL (ms):                         24.00     
P99 ITL (ms):                            236.64    
==================================================

cuda graph

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  26.45     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         483.91    
Peak output token throughput (tok/s):    4200.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4650.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          16884.68  
Median TTFT (ms):                        17274.28  
P99 TTFT (ms):                           23563.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.04     
Median TPOT (ms):                        68.08     
P99 TPOT (ms):                           170.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.48     
Median ITL (ms):                         23.95     
P99 ITL (ms):                            239.59    
==================================================

4. Accuracy Verify
PR:

|Tasks |Filter|n-shot|        Metric         |   |Value |   |Stderr|Stderr_CLT|
|------|------|-----:|-----------------------|---|-----:|---|------|---------:|
|mmstar|none  |     0|average                |↑  |0.6922|±  |N/A   |    0.0120|
|mmstar|none  |     0|coarse perception      |↑  |0.6773|±  |N/A   |    0.0297|
|mmstar|none  |     0|fine-grained perception|↑  |0.6530|±  |N/A   |    0.0310|
|mmstar|none  |     0|instance reasoning     |↑  |0.7169|±  |N/A   |    0.0289|
|mmstar|none  |     0|logical reasoning      |↑  |0.7067|±  |N/A   |    0.0278|
|mmstar|none  |     0|math                   |↑  |0.8273|±  |N/A   |    0.0246|
|mmstar|none  |     0|science & technology   |↑  |0.5722|±  |N/A   |    0.0314|

main:

|Tasks |Filter|n-shot|        Metric         |   |Value |   |Stderr|Stderr_CLT|
|------|------|-----:|-----------------------|---|-----:|---|------|---------:|
|mmstar|none  |     0|average                |↑  |0.6886|±  |N/A   |    0.0121|
|mmstar|none  |     0|coarse perception      |↑  |0.6749|±  |N/A   |    0.0298|
|mmstar|none  |     0|fine-grained perception|↑  |0.6388|±  |N/A   |    0.0311|
|mmstar|none  |     0|instance reasoning     |↑  |0.7350|±  |N/A   |    0.0283|
|mmstar|none  |     0|logical reasoning      |↑  |0.7257|±  |N/A   |    0.0274|
|mmstar|none  |     0|math                   |↑  |0.8376|±  |N/A   |    0.0235|
|mmstar|none  |     0|science & technology   |↑  |0.5198|±  |N/A   |    0.0317|
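
One rough way to read the two tables is to compare each PR-vs-main difference with the reported standard errors. The sketch below treats the `Stderr_CLT` column as a standard error and the two runs as independent, which are simplifying assumptions (both runs score the same MMStar questions, so a paired test would be more precise). Under those assumptions, a |z| well below 2 suggests the difference is within noise.

```python
import math

# (metric, pr_value, pr_stderr, main_value, main_stderr), copied from the tables above.
MMSTAR = [
    ("average", 0.6922, 0.0120, 0.6886, 0.0121),
    ("coarse perception", 0.6773, 0.0297, 0.6749, 0.0298),
    ("fine-grained perception", 0.6530, 0.0310, 0.6388, 0.0311),
    ("instance reasoning", 0.7169, 0.0289, 0.7350, 0.0283),
    ("logical reasoning", 0.7067, 0.0278, 0.7257, 0.0274),
    ("math", 0.8273, 0.0246, 0.8376, 0.0235),
    ("science & technology", 0.5722, 0.0314, 0.5198, 0.0317),
]


def z_score(pr: float, pr_se: float, main: float, main_se: float) -> float:
    """Difference divided by the combined standard error (independence assumed)."""
    return (pr - main) / math.hypot(pr_se, main_se)


if __name__ == "__main__":
    for name, pr, pr_se, main, main_se in MMSTAR:
        z = z_score(pr, pr_se, main, main_se)
        print(f"{name:24s} delta={pr - main:+.4f}  z={z:+.2f}")
```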

Note

Glm4vVisionAttention does not support --mm-encoder-attn-backend FLASHINFER yet, so these tests cover only FLASH_ATTN. FLASHINFER support will be added in a separate PR.


claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

grYe99 commented Apr 22, 2026

Contributor Author

@claude review

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
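
The `lru_cache` optimization mentioned in this summary can be illustrated with a small, framework-free sketch. It is not the PR's code: the function names are hypothetical, the merge size of 2 is an assumed default, and the block-wise ordering follows the layout commonly used by Qwen2-VL-style vision encoders, which may differ from what glm4_1v does. The point is only that position IDs depend on the grid shape alone, so repeated shapes can reuse a cached result.

```python
from __future__ import annotations

from functools import lru_cache


@lru_cache(maxsize=1024)
def rot_pos_ids(h: int, w: int, merge_size: int) -> tuple[tuple[int, int], ...]:
    """(row, col) IDs for one frame, grouped so each merge block is contiguous."""
    if h % merge_size or w % merge_size:
        raise ValueError("h and w must be multiples of merge_size")
    ids = []
    for block_row in range(h // merge_size):
        for block_col in range(w // merge_size):
            for inner_row in range(merge_size):
                for inner_col in range(merge_size):
                    ids.append((block_row * merge_size + inner_row,
                                block_col * merge_size + inner_col))
    return tuple(ids)  # immutable, so the cached value is safe to share


def rot_pos_ids_for_grid(t: int, h: int, w: int, merge_size: int = 2) -> list[tuple[int, int]]:
    """Repeat the cached per-frame IDs for each of the t frames."""
    return list(rot_pos_ids(h, w, merge_size)) * t


if __name__ == "__main__":
    print(rot_pos_ids_for_grid(1, 2, 4))
    rot_pos_ids_for_grid(8, 16, 16)
    rot_pos_ids_for_grid(8, 16, 16)  # second call reuses the cached frame
    print(rot_pos_ids.cache_info())
```

Returning a tuple keeps the cached value immutable. A tensor-based version would need to avoid in-place edits of the cached result for the same reason.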

Comment thread vllm/model_executor/models/glm4_1v.py Outdated
Comment thread vllm/model_executor/models/glm4_1v.py
Comment thread vllm/model_executor/models/glm4_1v.py
@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from a9b7652 to 2ffaca2 Compare April 25, 2026 08:50
mergify Bot commented Apr 25, 2026

Contributor
@mergify mergify Bot added the documentation Improvements or additions to documentation label Apr 25, 2026
@mergify mergify Bot added the multi-modality Related to multi-modality (#4194) label Apr 25, 2026
@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 4 times, most recently from ae6631e to d824bba Compare April 25, 2026 14:44
grYe99 commented Apr 25, 2026

Contributor Author

@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!

@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 2 times, most recently from b05a230 to 3a67b98 Compare May 6, 2026 14:28
grYe99 commented May 6, 2026

Contributor Author

@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated code to support "auto-infer compilation-config" and passed the following tests:

python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"

and serve with --compilation-config '{"cudagraph_mm_encoder": true}'
Any suggestions are welcome!

Comment thread vllm/model_executor/models/glm4_1v.py Outdated
grYe99 commented May 20, 2026

Contributor Author

@Isotr0py Hi, is there anything else needed for this PR to move forward? Happy to make any changes. Thanks!

@Isotr0py

Member

Sorry for missing this! Can you provide some multimodal accuracy benchmark results, such as MMMU?

@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from cdbf261 to 98e3143 Compare May 21, 2026 02:09
@grYe99 grYe99 closed this May 21, 2026
@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from 98e3143 to 5774aae Compare May 21, 2026 02:13
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA May 21, 2026
@grYe99 grYe99 reopened this May 21, 2026
mergify Bot commented May 21, 2026

Contributor
grYe99 added 2 commits May 26, 2026 17:11
Signed-off-by: grYe99 <guorongye99@gmail.com>
This reverts commit dee6875.

Signed-off-by: grYe99 <guorongye99@gmail.com>
@grYe99 grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from 612619c to f1e807d Compare May 26, 2026 09:11
grYe99 added 2 commits May 26, 2026 18:09
Signed-off-by: grYe99 <guorongye99@gmail.com>
Signed-off-by: grYe99 <guorongye99@gmail.com>
grYe99 commented May 26, 2026

Contributor Author

@shen-shanshan All your review comments have been resolved. Ready for re-review.

shen-shanshan left a comment

Contributor

Done with my pass. Also CC @Isotr0py

grYe99 commented Jun 5, 2026

Contributor Author

@Isotr0py Gentle ping. This PR is ready, all comments are resolved. PTAL. Thanks!

Isotr0py left a comment

Member

LGTM, thanks!

@github-project-automation github-project-automation Bot moved this from Done to Ready in NVIDIA Jun 5, 2026
@Isotr0py Isotr0py requested a review from AndreasKaratzas as a code owner June 5, 2026 17:32
@Isotr0py Isotr0py enabled auto-merge (squash) June 5, 2026 17:32
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026
Signed-off-by: grYe99 <guorongye99@gmail.com>
auto-merge was automatically disabled June 6, 2026 02:21

Head branch was pushed to by a user without write access

grYe99 commented Jun 6, 2026

Contributor Author

@Isotr0py The previous CI failures (multi-modal-models-standard-2 and 4) were caused by interface changes in #42288 and OOM issues. I've fixed them in commit cff5c90 by:

  1. Updating the code to align with #42288 (Adjust design around encoder_cudagraph_forward).
  2. Using dummy weights and dummy_hf_overrides, as step3-vl does in #42224 ([MM][CG] Enable encoder Cudagraph for Step3VL).

Those tests now pass.

However, a new failure has appeared in cpu-language-generation-and-pooling-model-tests (build link). It seems unrelated to this PR – it's a Rust compilation error (vllm-cmd).

How would you like me to proceed with this?

grYe99 commented Jun 8, 2026

Contributor Author

@Isotr0py CI tests passed. Could you re-enable auto-merge?

@Isotr0py Isotr0py merged commit 9f153aa into vllm-project:main Jun 9, 2026
59 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 9, 2026
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
…o inference (vllm-project#40576)

Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Labels

documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) nvidia ready ONLY add when PR is ready to merge/full CI is needed

3 participants