[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference by grYe99 · Pull Request #40576 · vllm-project/vllm

grYe99 · 2026-04-22T03:25:13Z

Purpose

Following #38175, this PR implements ViT CUDA graph support for image and video inference with glm4_1v models. The implementation draws on #35963 (image) and #38061 (video).

Functional Test
Benchmark in two scenarios:

ViT without data parallelism (DP): eager vs. CUDA graph.
ViT with DP: eager vs. CUDA graph.

Bench Serve
Accuracy Verify

Test Plan

1. Functional Test

python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"

2. Benchmark

# Image
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 2, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Video
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'
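
The `--compilation-config` value repeated in the commands above is a JSON object. A minimal Python sketch of building it follows. It also illustrates one plausible reading of `encoder_cudagraph_token_budgets`: that each entry is a captured graph size and a batch runs on the smallest budget that fits. That reading is an assumption, and `pick_token_budget` is a hypothetical helper written for this example, not a vLLM function.

```python
import bisect
import json
from typing import List, Optional

# Keys and values copied from the benchmark commands above.
compilation_config = {
    "cudagraph_mm_encoder": True,
    "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
    "encoder_cudagraph_max_vision_items_per_batch": 4,
    "encoder_cudagraph_max_frames_per_batch": 32,
}

# String suitable as the value of --compilation-config.
cli_value = json.dumps(compilation_config)


def pick_token_budget(num_tokens: int, budgets: List[int]) -> Optional[int]:
    """Hypothetical helper: smallest budget >= num_tokens, or None if none fits."""
    ordered = sorted(budgets)
    i = bisect.bisect_left(ordered, num_tokens)
    return ordered[i] if i < len(ordered) else None


if __name__ == "__main__":
    print(cli_value)
    budgets = compilation_config["encoder_cudagraph_token_budgets"]
    for n in (64, 300, 2048, 4000):
        # None stands for "no captured size fits" in this sketch.
        print(n, "->", pick_token_budget(n, budgets))
```

Generating the string with `json.dumps` avoids hand-quoting mistakes such as a Python-style `True` in place of the JSON literal `true`.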

3. Bench Serve

python -m vllm.entrypoints.cli.main serve zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--trust-remote-code \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

vllm bench serve \
--backend openai-chat \
--model zai-org/GLM-4.6V-Flash \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 100 \
--seed 42

4. Accuracy Verify

lmms-eval \
    --model vllm \
    --model_args "model=zai-org/GLM-4.1V-9B-Thinking" \
    --tasks mmstar \
    --batch_size 128

Test Result

1. Functional Test

# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, I think. The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in full bloom, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------

# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult-sized items look comical. Then, the child is engrossed in a book
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a baby wearing oversized glasses, which are way too big for a baby's face. Babies are usually cute, but the mismatch between the big glasses and the baby's small face creates a humorous contrast. Also, the
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child with oversized glasses, which are probably not meant for them, so that's a funny contrast. The child is "reading" a book, but maybe the glasses make them look like an adult trying to
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult accessories. Then, the child is engrossed in a book, maybe pretending to
--------------------------------------------------

2. Benchmark (encoder_forward_ms)

Image Inference:

Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1xRTX4090, random-mm, 1000 prompts):

Backend	Mean	P99
FLASH_ATTN	+2.88% (4.17ms -> 4.05ms)	+48.34% (9.33ms -> 4.82ms)

Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2xRTX4090, random-mm, 1000 prompts):

Backend	Mean	P99
FLASH_ATTN	+62.73% (6.01ms -> 2.24ms)	+62.43% (24.73ms -> 9.29ms)

Video Inference:

Single GPU (zai-org/GLM-4.6V-Flash, 1xRTX4090, random-mm, 1000 prompts):

Backend	Mean	P99
FLASH_ATTN	+37.47% (7.26ms -> 4.54ms)	+63.07% (25.78ms -> 9.52ms)

Multi GPU (zai-org/GLM-4.6V-Flash, 2xRTX4090, random-mm, 1000 prompts):

Backend	Mean	P99
FLASH_ATTN	+66.77% (9.51ms -> 3.16ms)	+62.44% (26.17ms -> 9.83ms)
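
The percentages in these tables are consistent with the relative latency reduction, (eager - CUDA graph) / eager. A short check using only the numbers reported above (the dictionary labels are just descriptive names chosen for this sketch):

```python
def reduction_pct(eager_ms: float, graph_ms: float) -> float:
    """Relative latency reduction in percent, rounded to two decimals."""
    return round((eager_ms - graph_ms) / eager_ms * 100, 2)


# (eager, CUDA graph) encoder_forward_ms pairs copied from the tables above.
results = {
    "image, 1 GPU, mean": (4.17, 4.05),
    "image, 1 GPU, P99": (9.33, 4.82),
    "image, 2 GPU, mean": (6.01, 2.24),
    "image, 2 GPU, P99": (24.73, 9.29),
    "video, 1 GPU, mean": (7.26, 4.54),
    "video, 1 GPU, P99": (25.78, 9.52),
    "video, 2 GPU, mean": (9.51, 3.16),
    "video, 2 GPU, P99": (26.17, 9.83),
}

if __name__ == "__main__":
    for name, (eager_ms, graph_ms) in results.items():
        print(f"{name}: +{reduction_pct(eager_ms, graph_ms):.2f}%")
```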

3. Bench Serve
Eager:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  28.34     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.53      
Output token throughput (tok/s):         451.65    
Peak output token throughput (tok/s):    4100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4340.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          18770.72  
Median TTFT (ms):                        19287.97  
P99 TTFT (ms):                           25433.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.09     
Median TPOT (ms):                        67.74     
P99 TPOT (ms):                           187.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.53     
Median ITL (ms):                         24.00     
P99 ITL (ms):                            236.64    
==================================================

CUDA graph:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  26.45     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         483.91    
Peak output token throughput (tok/s):    4200.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4650.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          16884.68  
Median TTFT (ms):                        17274.28  
P99 TTFT (ms):                           23563.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.04     
Median TPOT (ms):                        68.08     
P99 TPOT (ms):                           170.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.48     
Median ITL (ms):                         23.95     
P99 ITL (ms):                            239.59    
==================================================
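
To put the two serving runs side by side, the sketch below computes the percentage change, (CUDA graph - eager) / eager, for a few of the metrics reported above. The dictionary keys are names chosen for this example. Negative values are improvements for latency and duration; positive values are improvements for throughput.

```python
# Values copied from the two "Serving Benchmark Result" blocks above.
eager = {
    "duration_s": 28.34,
    "req_per_s": 3.53,
    "mean_ttft_ms": 18770.72,
    "p99_ttft_ms": 25433.90,
    "mean_tpot_ms": 71.09,
}
cuda_graph = {
    "duration_s": 26.45,
    "req_per_s": 3.78,
    "mean_ttft_ms": 16884.68,
    "p99_ttft_ms": 23563.19,
    "mean_tpot_ms": 71.04,
}


def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100


changes = {key: pct_change(eager[key], cuda_graph[key]) for key in eager}

if __name__ == "__main__":
    for key, value in changes.items():
        print(f"{key}: {value:+.2f}%")
```

In these numbers TTFT drops while TPOT is essentially flat, which is what one would expect if the change mainly affects the vision encoder pass that runs before the first output token.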

4. Accuracy Verify
PR:

|Tasks |Filter|n-shot|        Metric         |   |Value |   |Stderr|Stderr_CLT|
|------|------|-----:|-----------------------|---|-----:|---|------|---------:|
|mmstar|none  |     0|average                |↑  |0.6922|±  |N/A   |    0.0120|
|mmstar|none  |     0|coarse perception      |↑  |0.6773|±  |N/A   |    0.0297|
|mmstar|none  |     0|fine-grained perception|↑  |0.6530|±  |N/A   |    0.0310|
|mmstar|none  |     0|instance reasoning     |↑  |0.7169|±  |N/A   |    0.0289|
|mmstar|none  |     0|logical reasoning      |↑  |0.7067|±  |N/A   |    0.0278|
|mmstar|none  |     0|math                   |↑  |0.8273|±  |N/A   |    0.0246|
|mmstar|none  |     0|science & technology   |↑  |0.5722|±  |N/A   |    0.0314|

main:

|Tasks |Filter|n-shot|        Metric         |   |Value |   |Stderr|Stderr_CLT|
|------|------|-----:|-----------------------|---|-----:|---|------|---------:|
|mmstar|none  |     0|average                |↑  |0.6886|±  |N/A   |    0.0121|
|mmstar|none  |     0|coarse perception      |↑  |0.6749|±  |N/A   |    0.0298|
|mmstar|none  |     0|fine-grained perception|↑  |0.6388|±  |N/A   |    0.0311|
|mmstar|none  |     0|instance reasoning     |↑  |0.7350|±  |N/A   |    0.0283|
|mmstar|none  |     0|logical reasoning      |↑  |0.7257|±  |N/A   |    0.0274|
|mmstar|none  |     0|math                   |↑  |0.8376|±  |N/A   |    0.0235|
|mmstar|none  |     0|science & technology   |↑  |0.5198|±  |N/A   |    0.0317|
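
One way to read the two tables is to compare each difference with its standard error. The sketch below treats the PR and main runs as independent samples and uses the Stderr_CLT column as the standard error. Independence is an assumption made for simplicity: both runs use the same MMStar questions, so a paired analysis would be more precise.

```python
import math

# (value, Stderr_CLT) per metric, copied from the tables above.
pr = {
    "average": (0.6922, 0.0120),
    "coarse perception": (0.6773, 0.0297),
    "fine-grained perception": (0.6530, 0.0310),
    "instance reasoning": (0.7169, 0.0289),
    "logical reasoning": (0.7067, 0.0278),
    "math": (0.8273, 0.0246),
    "science & technology": (0.5722, 0.0314),
}
main = {
    "average": (0.6886, 0.0121),
    "coarse perception": (0.6749, 0.0298),
    "fine-grained perception": (0.6388, 0.0311),
    "instance reasoning": (0.7350, 0.0283),
    "logical reasoning": (0.7257, 0.0274),
    "math": (0.8376, 0.0235),
    "science & technology": (0.5198, 0.0317),
}


def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Difference a - b in units of the combined standard error."""
    return (a - b) / math.sqrt(se_a**2 + se_b**2)


z_scores = {k: z_score(*pr[k], *main[k]) for k in pr}

if __name__ == "__main__":
    for metric, z in z_scores.items():
        print(f"{metric}: diff={pr[metric][0] - main[metric][0]:+.4f}, z={z:+.2f}")
```

Under that assumption, every metric's difference lies within two combined standard errors, so these tables do not show a clear accuracy change in either direction.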

Note

Glm4vVisionAttention does not yet support --mm-encoder-attn-backend FLASHINFER, so only FLASH_ATTN was tested. FLASHINFER support will be added in a separate PR.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

grYe99 · 2026-04-22T03:26:17Z

@claude review

gemini-code-assist

Code Review

This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
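
The summary above mentions caching rotary position ID generation with lru_cache. As a rough illustration of the idea only, and not the PR's code, the sketch below memoizes a pure-Python position-ID function keyed on the patch grid shape. The function name, the merge-block ordering, and the cache size are all assumptions made for this example.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)  # cache size is an arbitrary choice for this sketch
def cached_rot_pos_ids(h: int, w: int, merge: int) -> Tuple[Tuple[int, int], ...]:
    """Hypothetical: (row, col) IDs for an h x w patch grid.

    Patches are visited one merge x merge block at a time, so the IDs of
    patches that get merged together are adjacent in the output.
    """
    ids = []
    for block_row in range(h // merge):
        for block_col in range(w // merge):
            for inner_row in range(merge):
                for inner_col in range(merge):
                    ids.append((block_row * merge + inner_row,
                                block_col * merge + inner_col))
    # Return an immutable tuple so the cached value cannot be mutated by callers.
    return tuple(ids)


if __name__ == "__main__":
    first = cached_rot_pos_ids(4, 4, 2)
    second = cached_rot_pos_ids(4, 4, 2)  # served from the cache
    print(first[:4])
    print(first is second, cached_rot_pos_ids.cache_info())
```

The arguments must be hashable for lru_cache to work, which is why the sketch takes plain integers rather than a tensor or list.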

mergify · 2026-04-25T09:15:52Z

Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/

grYe99 · 2026-04-25T14:48:00Z

@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!

grYe99 · 2026-05-06T14:52:02Z

@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated the code to support "auto-infer compilation-config", and it passes the following tests:

python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"

and serve with --compilation-config '{"cudagraph_mm_encoder": true}'
Any suggestions are welcome!

grYe99 · 2026-05-20T02:25:07Z

@Isotr0py Hi, is there anything else needed for this PR to move forward? Happy to make any changes. Thanks!

Isotr0py · 2026-05-20T02:48:42Z

Sorry for missing this! Can you provide some multimodal accuracy benchmark results, e.g. MMMU?

mergify · 2026-05-21T02:33:56Z

Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/

grYe99 · 2026-05-26T13:03:49Z

@shen-shanshan All your review comments have been resolved. Ready for re-review.

shen-shanshan

Done with my pass. Also CC @Isotr0py

grYe99 · 2026-06-05T03:20:45Z

@Isotr0py Gentle ping. This PR is ready, all comments are resolved. PTAL. Thanks!

Isotr0py

LGTM, thanks!

grYe99 · 2026-06-06T06:09:08Z

@Isotr0py The previous CI failures (multi-modal-models-standard-2 and 4) were caused by interface changes in #42288 and OOM issues. I've fixed them in commit cff5c90 by:

Updating the code to align with Adjust design around encoder_cudagraph_forward #42288.
Using dummy weights and dummy_hf_overrides like step3-vl ([MM][CG] Enable encoder Cudagraph for Step3VL #42224).

Those tests now pass.

However, a new failure has appeared in cpu-language-generation-and-pooling-model-tests (build link). It seems unrelated to this PR – it's a Rust compilation error (vllm-cmd).

How would you like me to proceed with this?

grYe99 · 2026-06-08T09:11:04Z

@Isotr0py CI tests passed. Could you re-enable auto-merge?

claude Bot reviewed Apr 22, 2026

mergify Bot added the nvidia label Apr 22, 2026

github-project-automation Bot added this to NVIDIA Apr 22, 2026

gemini-code-assist Bot reviewed Apr 22, 2026

Comment thread vllm/model_executor/models/glm4_1v.py Outdated

Comment thread vllm/model_executor/models/glm4_1v.py

shen-shanshan mentioned this pull request Apr 22, 2026

[RFC]: Support ViT Full CUDA Graph (Tracker) #38175

Open

38 tasks

shen-shanshan reviewed Apr 25, 2026

Comment thread vllm/model_executor/models/glm4_1v.py

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from a9b7652 to 2ffaca2 Compare April 25, 2026 08:50

mergify Bot added the documentation Improvements or additions to documentation label Apr 25, 2026

grYe99 requested review from DarkLight1337 and ywang96 as code owners April 25, 2026 09:17

mergify Bot added the multi-modality Related to multi-modality (#4194) label Apr 25, 2026

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 4 times, most recently from ae6631e to d824bba Compare April 25, 2026 14:44

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 2 times, most recently from b05a230 to 3a67b98 Compare May 6, 2026 14:28

Isotr0py reviewed May 8, 2026

Comment thread vllm/model_executor/models/glm4_1v.py Outdated

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from cdbf261 to 98e3143 Compare May 21, 2026 02:09

grYe99 closed this May 21, 2026

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from 98e3143 to 5774aae Compare May 21, 2026 02:13

github-project-automation Bot moved this to Done in NVIDIA May 21, 2026

grYe99 reopened this May 21, 2026

grYe99 added 2 commits May 26, 2026 17:11

fit auto infer

ca193ae

Signed-off-by: grYe99 <guorongye99@gmail.com>

Revert "fit auto infer"

f1e807d

This reverts commit dee6875. Signed-off-by: grYe99 <guorongye99@gmail.com>

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from 612619c to f1e807d Compare May 26, 2026 09:11

grYe99 added 2 commits May 26, 2026 18:09

address review: simplify interfaces, add note, EVS pruning

be3040e

Signed-off-by: grYe99 <guorongye99@gmail.com>

typo

2c11cf1

Signed-off-by: grYe99 <guorongye99@gmail.com>

shen-shanshan approved these changes May 27, 2026

Isotr0py approved these changes Jun 5, 2026

github-project-automation Bot moved this from Done to Ready in NVIDIA Jun 5, 2026

Merge branch 'main' into support_vit_cudagraph_glm4_1v

b5e5c7b

Isotr0py requested a review from AndreasKaratzas as a code owner June 5, 2026 17:32

Isotr0py enabled auto-merge (squash) June 5, 2026 17:32

github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026

fix CI test

cff5c90

Signed-off-by: grYe99 <guorongye99@gmail.com>

auto-merge was automatically disabled June 6, 2026 02:21
Head branch was pushed to by a user without write access

Merge branch 'main' into support_vit_cudagraph_glm4_1v

9a25c02

Merge branch 'main' into support_vit_cudagraph_glm4_1v

c8c9cf8

Isotr0py merged commit 9f153aa into vllm-project:main Jun 9, 2026
59 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 9, 2026

Isotr0py merged 11 commits into vllm-project:main from grYe99:support_vit_cudagraph_glm4_1v
