
[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend #42235

Merged: mgoin merged 4 commits into vllm-project:main from mmangkad:add-flashinfer-cutedsl-nvfp4 on Jun 22, 2026

@mmangkad

Contributor

Summary

Adds flashinfer-cutedsl for dense NVFP4 GEMM and makes it the highest-priority CUDA backend when supported on SM10x. In serving benchmarks, cutedsl is fastest across concurrency 1-512 and improves tok/s/user by up to 27.07% over the tested FlashInfer backends.

Performance Comparison

Setup:

  • Model: nvidia/Llama-3.1-8B-Instruct-NVFP4
  • Device: SM103
  • Dataset: random
  • Input/output length: 512 input tokens, 512 output tokens
[Image: performance comparison chart]

Test Plan

CI, which now includes:

  • The FlashInfer NVFP4 GEMM kernel test, extended with cute-dsl.
  • The NVFP4 model test, extended with flashinfer-cutedsl.

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a new NVFP4 GEMM backend utilizing FlashInfer's CuteDSL, specifically targeting SM10x architectures. The changes include the implementation of the FlashInferCuteDslNvFp4LinearKernel, its registration within the kernel executor, and the addition of flashinfer-cutedsl as a valid environment variable option. Feedback highlights inconsistencies in the backend naming convention, recommending the use of "cutedsl" instead of "cute-dsl" across the codebase and tests for better alignment with existing backend identifiers.

Comment thread vllm/model_executor/kernels/linear/nvfp4/flashinfer.py
Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py Outdated
Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py

@LopezCastroRoberto LopezCastroRoberto left a comment

Contributor


I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.

Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.

One example is:

[Image: benchmark plot]

where speedup=1 means a different kernel of the existing ones was selected, but cuteDSL causes a regression

@mmangkad

Contributor Author

> I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.
>
> Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.
>
> One example is:
>
> [Image] where speedup=1 means a different kernel of the existing ones was selected, but cuteDSL causes a regression

Could you clarify when these SM100 benchmarks were collected and which FlashInfer version was used? Based on my testing, cute-dsl is almost always better than CUTLASS and cuDNN on both SM103 and SM100, and we should actually expect to see a higher relative speedup on SM100 rather than the regressions shown here.

@LopezCastroRoberto

Contributor

@mmangkad Yeah, you have a point, this was with 0.6.8 and might have changed since then. I see the latest FI release is 0.6.11.

Can you please benchmark those shapes in my plot to see how different it looks now? I recommend using triton.testing.do_bench_cudagraph for proper time measurement, e.g., https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_nvfp4_gemm.py
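
A minimal sketch of the measurement suggested above, assuming a CUDA GPU and Triton are available. The names `gemm_tflops` and `bench_tflops` are illustrative and are not taken from vLLM or the linked benchmark script. The 2·M·N·K FLOP count is the usual dense-GEMM convention and is only assumed to match how throughput is reported in this thread.

```python
def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    """Convert a GEMM runtime in milliseconds to TFLOP/s.

    Assumes the conventional 2*M*N*K floating-point operations per GEMM.
    """
    flops = 2.0 * m * n * k
    return flops / (ms * 1e-3) / 1e12


def bench_tflops(fn, m: int, n: int, k: int) -> float:
    """Time fn() under CUDA graph replay and report TFLOP/s (needs a CUDA GPU)."""
    # Imported lazily so gemm_tflops stays usable on machines without Triton.
    import triton.testing

    # do_bench_cudagraph captures fn in a CUDA graph and returns a time in ms,
    # which keeps per-launch CPU overhead out of the measurement.
    ms = triton.testing.do_bench_cudagraph(fn)
    return gemm_tflops(m, n, k, ms)
```

Here `fn` would be a zero-argument callable that launches the GEMM under test with preallocated inputs, so that only the kernel itself is captured in the graph.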

@mmangkad

Contributor Author

@LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.

[Image: nvfp4_gemm_cutedsl_speedup_autotune]
@LopezCastroRoberto

LopezCastroRoberto commented May 14, 2026

Contributor

> @LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.
>
> [Image: nvfp4_gemm_cutedsl_speedup_autotune]

Thanks for the results, @mmangkad! Yeah, it seems like it might have improved since the last time I checked. Just to make sure, would you mind adding the flashinfer-trtllm backend to the comparison, making sure use_8x4_sf_layout is True when calling flashinfer mm_fp4? We found that for bs<32 that backend was significantly faster than CUTLASS in most cases. See #30885

That way, we would have the full picture and it would be easier to define a heuristic, instead of just adding one more backend to the list.

@mmangkad

Contributor Author

@LopezCastroRoberto TRTLLM is still strongest at the very smallest M values, especially M=1-4, but CuTeDSL already matches or beats it in many small-M cases and takes over by M=8+. The clearer result is that CuTeDSL is almost always better than the current CUTLASS default across these shapes.

FlashInfer NVFP4 GEMM Results

Each backend cell is TFLOP/s (gap vs best). best is computed within the same device, shape, and M.
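
As a rough illustration of how the "gap vs best" figures can be read, the sketch below recomputes them for one row (SM100, N=7168, K=2048, M=8) from the rounded TFLOP/s values in the tables. The function name `gaps_vs_best` is hypothetical. Because the published cells are rounded, recomputed gaps for other rows may differ slightly in the last digit.

```python
def gaps_vs_best(tflops: dict[str, float]) -> dict[str, float]:
    """Return each backend's percentage gap relative to the fastest backend."""
    best = max(tflops.values())
    return {name: (value - best) / best * 100.0 for name, value in tflops.items()}


# One row of the results: SM100, N=7168, K=2048, M=8 (TFLOP/s per backend).
row = {"CuTeDSL": 72.41, "CUTLASS": 60.67, "TRTLLM": 70.36}

gaps = gaps_vs_best(row)
winner = max(row, key=row.get)

for name, gap in gaps.items():
    label = "best" if name == winner else f"{gap:.2f}%"
    print(f"{name}: {row[name]:.2f} ({label})")
```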

Overall Winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 73 | 65.2% |
| CUTLASS | 18 | 16.1% |
| TRTLLM | 21 | 18.8% |

SM100 winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 37 | 66.1% |
| CUTLASS | 8 | 14.3% |
| TRTLLM | 11 | 19.6% |

SM103 winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 36 | 64.3% |
| CUTLASS | 10 | 17.9% |
| TRTLLM | 10 | 17.9% |
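
The Wins and Share columns above can be reproduced by tallying the Winner column of the per-shape tables. A small sketch, using the SM100 win counts as input; `win_shares` is an illustrative helper name, not something from the PR.

```python
from collections import Counter


def win_shares(winners: list[str]) -> dict[str, tuple[int, float]]:
    """Map each backend to (number of wins, share of all rows in percent)."""
    counts = Counter(winners)
    total = len(winners)
    return {name: (n, round(100.0 * n / total, 1)) for name, n in counts.items()}


# SM100: 4 shapes x 14 values of M = 56 rows, with the win counts reported above.
sm100_winners = ["CuTeDSL"] * 37 + ["CUTLASS"] * 8 + ["TRTLLM"] * 11
sm100_shares = win_shares(sm100_winners)

for name, (wins, share) in sm100_shares.items():
    print(f"{name}: {wins} wins, {share}%")
```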

SM100

N=7168, K=2048

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 7.75 (-12.61%) | 7.59 (-14.39%) | 8.87 (best) | TRTLLM |
| 2 | 15.96 (-4.57%) | 13.99 (-16.35%) | 16.72 (best) | TRTLLM |
| 4 | 30.94 (-6.65%) | 28.99 (-12.52%) | 33.14 (best) | TRTLLM |
| 8 | 72.41 (best) | 60.67 (-16.21%) | 70.36 (-2.83%) | CuTeDSL |
| 16 | 144.98 (best) | 120.89 (-16.62%) | 130.30 (-10.13%) | CuTeDSL |
| 32 | 290.41 (best) | 229.20 (-21.08%) | 254.27 (-12.44%) | CuTeDSL |
| 64 | 586.55 (best) | 460.02 (-21.57%) | 372.67 (-36.46%) | CuTeDSL |
| 128 | 1123.69 (best) | 957.68 (-14.77%) | 724.85 (-35.49%) | CuTeDSL |
| 256 | 1889.30 (best) | 1754.17 (-7.15%) | 1473.48 (-22.01%) | CuTeDSL |
| 512 | 2757.34 (best) | 2718.47 (-1.41%) | 1908.06 (-30.80%) | CuTeDSL |
| 1024 | 3355.52 (-2.28%) | 3433.64 (best) | 2471.99 (-28.01%) | CUTLASS |
| 2048 | 4351.24 (best) | 4296.43 (-1.26%) | 3198.24 (-26.50%) | CuTeDSL |
| 4096 | 4637.72 (-2.92%) | 4777.00 (best) | 3491.12 (-26.92%) | CUTLASS |
| 8192 | 4919.04 (best) | 4916.80 (-0.05%) | 3556.33 (-27.70%) | CuTeDSL |

N=4096, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 9.31 (best) | 8.70 (-6.51%) | 9.22 (-1.03%) | CuTeDSL |
| 2 | 18.52 (best) | 17.75 (-4.13%) | 17.86 (-3.55%) | CuTeDSL |
| 4 | 37.14 (best) | 36.11 (-2.78%) | 35.13 (-5.41%) | CuTeDSL |
| 8 | 77.10 (best) | 69.76 (-9.52%) | 70.75 (-8.24%) | CuTeDSL |
| 16 | 154.35 (best) | 140.15 (-9.20%) | 141.89 (-8.07%) | CuTeDSL |
| 32 | 308.70 (best) | 280.22 (-9.23%) | 280.46 (-9.15%) | CuTeDSL |
| 64 | 619.04 (best) | 556.22 (-10.15%) | 398.77 (-35.58%) | CuTeDSL |
| 128 | 1221.97 (best) | 1097.78 (-10.16%) | 779.89 (-36.18%) | CuTeDSL |
| 256 | 2345.46 (best) | 2172.83 (-7.36%) | 1536.17 (-34.50%) | CuTeDSL |
| 512 | 3953.75 (best) | 3819.20 (-3.40%) | 2971.94 (-24.83%) | CuTeDSL |
| 1024 | 4827.23 (-0.60%) | 4856.49 (best) | 3265.48 (-32.76%) | CUTLASS |
| 2048 | 5664.96 (best) | 5582.28 (-1.46%) | 3647.17 (-35.62%) | CuTeDSL |
| 4096 | 5707.21 (-0.10%) | 5713.02 (best) | 4242.58 (-25.74%) | CUTLASS |
| 8192 | 6151.10 (best) | 5991.68 (-2.59%) | 3931.82 (-36.08%) | CuTeDSL |

N=18432, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 23.81 (-31.82%) | 19.16 (-45.12%) | 34.91 (best) | TRTLLM |
| 2 | 48.89 (-30.44%) | 47.55 (-32.35%) | 70.29 (best) | TRTLLM |
| 4 | 99.59 (-28.17%) | 102.02 (-26.42%) | 138.66 (best) | TRTLLM |
| 8 | 232.06 (-19.13%) | 206.99 (-27.86%) | 286.94 (best) | TRTLLM |
| 16 | 462.80 (-16.96%) | 408.13 (-26.77%) | 557.33 (best) | TRTLLM |
| 32 | 929.33 (best) | 825.79 (-11.14%) | 553.18 (-40.48%) | CuTeDSL |
| 64 | 2254.95 (best) | 1712.88 (-24.04%) | 898.07 (-60.17%) | CuTeDSL |
| 128 | 3832.20 (best) | 3373.04 (-11.98%) | 3054.55 (-20.29%) | CuTeDSL |
| 256 | 5070.73 (best) | 4942.17 (-2.54%) | 3256.12 (-35.79%) | CuTeDSL |
| 512 | 5359.51 (best) | 5307.24 (-0.98%) | 3621.65 (-32.43%) | CuTeDSL |
| 1024 | 5506.23 (best) | 5430.92 (-1.37%) | 3673.10 (-33.29%) | CuTeDSL |
| 2048 | 5779.01 (best) | 5636.57 (-2.46%) | 3384.32 (-41.44%) | CuTeDSL |
| 4096 | 5466.57 (-4.88%) | 5746.94 (best) | 3473.93 (-39.55%) | CUTLASS |
| 8192 | 5681.96 (best) | 5483.52 (-3.49%) | 3490.44 (-38.57%) | CuTeDSL |

N=7168, K=18432

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 17.29 (-12.04%) | 15.99 (-18.66%) | 19.66 (best) | TRTLLM |
| 2 | 37.68 (-2.86%) | 31.97 (-17.58%) | 38.79 (best) | TRTLLM |
| 4 | 73.76 (-6.20%) | 63.85 (-18.79%) | 78.63 (best) | TRTLLM |
| 8 | 172.87 (best) | 128.22 (-25.83%) | 156.05 (-9.73%) | CuTeDSL |
| 16 | 345.76 (best) | 255.74 (-26.04%) | 310.93 (-10.07%) | CuTeDSL |
| 32 | 660.78 (best) | 477.45 (-27.74%) | 591.83 (-10.44%) | CuTeDSL |
| 64 | 1377.38 (best) | 1033.90 (-24.94%) | 862.33 (-37.39%) | CuTeDSL |
| 128 | 2582.43 (best) | 1897.42 (-26.53%) | 1702.98 (-34.05%) | CuTeDSL |
| 256 | 4570.02 (best) | 4432.55 (-3.01%) | 2802.01 (-38.69%) | CuTeDSL |
| 512 | 4834.91 (-0.37%) | 4852.64 (best) | 2930.76 (-39.60%) | CUTLASS |
| 1024 | 5372.21 (best) | 5212.10 (-2.98%) | 3034.59 (-43.51%) | CuTeDSL |
| 2048 | 5516.94 (best) | 5284.15 (-4.22%) | 3288.94 (-40.38%) | CuTeDSL |
| 4096 | 5572.54 (-3.29%) | 5762.37 (best) | 3350.80 (-41.85%) | CUTLASS |
| 8192 | 5002.21 (-8.12%) | 5444.40 (best) | 3707.88 (-31.90%) | CUTLASS |

SM103

N=7168, K=2048

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 8.26 (-11.02%) | 7.72 (-16.91%) | 9.29 (best) | TRTLLM |
| 2 | 16.80 (-2.49%) | 14.93 (-13.33%) | 17.23 (best) | TRTLLM |
| 4 | 32.54 (-7.83%) | 29.90 (-15.30%) | 35.30 (best) | TRTLLM |
| 8 | 75.32 (best) | 59.74 (-20.68%) | 69.53 (-7.69%) | CuTeDSL |
| 16 | 146.08 (best) | 120.41 (-17.58%) | 139.20 (-4.71%) | CuTeDSL |
| 32 | 304.86 (best) | 239.83 (-21.33%) | 257.04 (-15.68%) | CuTeDSL |
| 64 | 616.77 (best) | 479.07 (-22.33%) | 394.67 (-36.01%) | CuTeDSL |
| 128 | 1141.56 (best) | 963.42 (-15.60%) | 763.52 (-33.12%) | CuTeDSL |
| 256 | 2039.46 (best) | 1861.99 (-8.70%) | 1479.90 (-27.44%) | CuTeDSL |
| 512 | 2988.48 (best) | 2736.61 (-8.43%) | 1983.47 (-33.63%) | CuTeDSL |
| 1024 | 3553.24 (-2.43%) | 3641.57 (best) | 2563.44 (-29.61%) | CUTLASS |
| 2048 | 4550.94 (best) | 4392.35 (-3.48%) | 3245.54 (-28.68%) | CuTeDSL |
| 4096 | 4920.28 (best) | 4894.33 (-0.53%) | 3558.20 (-27.68%) | CuTeDSL |
| 8192 | 5148.52 (-1.64%) | 5234.12 (best) | 3413.34 (-34.79%) | CUTLASS |

N=4096, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 9.03 (-3.06%) | 8.26 (-11.37%) | 9.31 (best) | TRTLLM |
| 2 | 19.18 (best) | 17.22 (-10.23%) | 18.43 (-3.89%) | CuTeDSL |
| 4 | 38.59 (best) | 37.81 (-2.03%) | 36.83 (-4.56%) | CuTeDSL |
| 8 | 81.02 (best) | 73.13 (-9.74%) | 73.80 (-8.91%) | CuTeDSL |
| 16 | 161.90 (best) | 146.44 (-9.55%) | 147.29 (-9.03%) | CuTeDSL |
| 32 | 324.05 (best) | 293.18 (-9.53%) | 292.25 (-9.81%) | CuTeDSL |
| 64 | 653.75 (best) | 584.33 (-10.62%) | 412.80 (-36.86%) | CuTeDSL |
| 128 | 1274.86 (best) | 1151.05 (-9.71%) | 807.53 (-36.66%) | CuTeDSL |
| 256 | 2487.46 (best) | 2320.52 (-6.71%) | 1598.98 (-35.72%) | CuTeDSL |
| 512 | 4331.05 (best) | 4087.08 (-5.63%) | 3075.78 (-28.98%) | CuTeDSL |
| 1024 | 5160.79 (best) | 5035.26 (-2.43%) | 3388.57 (-34.34%) | CuTeDSL |
| 2048 | 5865.40 (best) | 5856.70 (-0.15%) | 3677.14 (-37.31%) | CuTeDSL |
| 4096 | 6209.12 (-3.70%) | 6447.84 (best) | 4181.73 (-35.15%) | CUTLASS |
| 8192 | 5629.43 (best) | 5615.45 (-0.25%) | 3947.02 (-29.89%) | CuTeDSL |

N=18432, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 25.02 (-33.65%) | 21.57 (-42.79%) | 37.71 (best) | TRTLLM |
| 2 | 50.75 (-31.76%) | 47.81 (-35.72%) | 74.37 (best) | TRTLLM |
| 4 | 102.43 (-31.62%) | 100.24 (-33.08%) | 149.80 (best) | TRTLLM |
| 8 | 246.51 (-19.50%) | 208.99 (-31.75%) | 306.21 (best) | TRTLLM |
| 16 | 483.16 (-18.65%) | 412.14 (-30.60%) | 593.90 (best) | TRTLLM |
| 32 | 963.08 (best) | 832.79 (-13.53%) | 585.98 (-39.16%) | CuTeDSL |
| 64 | 2388.24 (best) | 1765.84 (-26.06%) | 938.92 (-60.69%) | CuTeDSL |
| 128 | 4138.86 (best) | 3577.81 (-13.56%) | 3203.88 (-22.59%) | CuTeDSL |
| 256 | 5331.70 (best) | 5207.50 (-2.33%) | 3343.82 (-37.28%) | CuTeDSL |
| 512 | 5786.16 (best) | 5264.24 (-9.02%) | 3680.79 (-36.39%) | CuTeDSL |
| 1024 | 6007.93 (best) | 5940.19 (-1.13%) | 3569.01 (-40.59%) | CuTeDSL |
| 2048 | 5535.25 (best) | 5356.93 (-3.22%) | 3455.75 (-37.57%) | CuTeDSL |
| 4096 | 5276.37 (-1.82%) | 5374.20 (best) | 3551.15 (-33.92%) | CUTLASS |
| 8192 | 5346.58 (-4.00%) | 5569.13 (best) | 3568.93 (-35.92%) | CUTLASS |

N=7168, K=18432

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 17.97 (-11.04%) | 16.87 (-16.48%) | 20.20 (best) | TRTLLM |
| 2 | 41.76 (best) | 33.07 (-20.81%) | 40.28 (-3.53%) | CuTeDSL |
| 4 | 82.51 (best) | 65.98 (-20.02%) | 79.96 (-3.08%) | CuTeDSL |
| 8 | 177.90 (best) | 132.71 (-25.40%) | 161.40 (-9.28%) | CuTeDSL |
| 16 | 353.05 (best) | 280.15 (-20.65%) | 317.09 (-10.19%) | CuTeDSL |
| 32 | 704.78 (best) | 551.51 (-21.75%) | 621.58 (-11.80%) | CuTeDSL |
| 64 | 1368.03 (best) | 1074.71 (-21.44%) | 779.15 (-43.05%) | CuTeDSL |
| 128 | 2654.48 (best) | 2082.84 (-21.53%) | 1739.80 (-34.46%) | CuTeDSL |
| 256 | 4591.57 (best) | 4442.42 (-3.25%) | 2904.12 (-36.75%) | CuTeDSL |
| 512 | 5248.23 (-0.15%) | 5255.86 (best) | 3039.43 (-42.17%) | CUTLASS |
| 1024 | 5598.81 (-6.27%) | 5973.28 (best) | 2949.47 (-50.62%) | CUTLASS |
| 2048 | 5114.07 (-8.83%) | 5609.35 (best) | 3206.56 (-42.84%) | CUTLASS |
| 4096 | 5386.57 (-8.61%) | 5893.87 (best) | 3284.85 (-44.27%) | CUTLASS |
| 8192 | 4975.49 (-14.80%) | 5839.75 (best) | 3264.02 (-44.11%) | CUTLASS |
@LopezCastroRoberto

LopezCastroRoberto commented May 18, 2026

Contributor

Thanks for the results, @mmangkad. Yeah, I think this makes sense.

We should also update the FI version to the latest, i.e. 0.6.11.post3?
https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

@LopezCastroRoberto

Contributor

cc: @mgoin

@mmangkad

mmangkad commented May 18, 2026

Contributor Author

> Thanks for the results, @mmangkad. Yeah, I think this makes sense.
>
> We should also update the FI version to the latest, i.e. 0.6.11.post3? https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

@LopezCastroRoberto we are already at 0.6.11.post2 now, but I think we can include the upgrade to 0.6.11.post3 here if you prefer.

@LopezCastroRoberto

LopezCastroRoberto commented May 18, 2026

Contributor

Never mind, my bad. I accidentally checked my own fork instead of upstream. Waiting for @mgoin's approval.

@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch 2 times, most recently from 3b67dbc to d2c176d Compare May 18, 2026 11:41
@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from d2c176d to 3700b17 Compare May 18, 2026 11:44
@LopezCastroRoberto

LopezCastroRoberto commented May 28, 2026

Contributor

@mmangkad -- following up on the FlashInfer autotuning issue I flagged earlier (flashinfer-ai/flashinfer#3295). The discussion has progressed and there's now a concrete fix, so wanted to share the conclusions since they directly affect this PR.

Right now vLLM defaults to O2, which has enable_flashinfer_autotune=True (re-enabled in #42857). Once this PR lands making cute-dsl the highest-priority NVFP4 backend, every default NVFP4 deployment will autotune mm_fp4 cuteDSL kernels at startup.

Interestingly, it seems that autotuning mm_fp4 cuteDSL is unnecessary. PR flashinfer-ai/flashinfer#2940 added a heuristic that closes the perf gap between autotuned and non-autotuned runs. The heuristic predicts the best config for each (N, K) combination in <100us on the first call, and subsequent lookups take <0.2us.

To fix this, flashinfer-ai/flashinfer#3396 adds a skip_ops mechanism:

with flashinfer.autotune(skip_ops="fp4_gemm"):
      ...

This brought warmup from 587s → 8s on DSV3.2-NVFP4 TP=4.
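
The "slow first call, fast later lookups" behavior described for the heuristic is what a per-shape cache gives. The sketch below only illustrates that pattern and is not FlashInfer's implementation: `pick_config`, its tile rule, and the returned tuple are all hypothetical placeholders.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def pick_config(n: int, k: int) -> tuple[int, int]:
    """Hypothetical stand-in for a heuristic that picks a kernel config per (N, K).

    The rule below is a placeholder; a real heuristic would be derived from
    hardware and shape properties. Only the caching pattern is the point here.
    """
    tile_n = 256 if n >= 8192 else 128
    tile_m = 128
    return (tile_m, tile_n)


# First call for a shape runs the heuristic; repeated calls hit the cache.
first = pick_config(7168, 2048)
second = pick_config(7168, 2048)
info = pick_config.cache_info()
print(first, info.hits, info.misses)
```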

I think we should track a follow-up to integrate skip_ops when the FI version is bumped to include #3396, and then merge this PR too.

cc: @mgoin

@mergify

mergify Bot commented Jun 11, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mmangkad.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 11, 2026
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from c34ce76 to fde0c77 Compare June 17, 2026 14:23
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 17, 2026
@mergify

mergify Bot commented Jun 17, 2026

Contributor

Hi @mmangkad, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mgoin

mgoin commented Jun 17, 2026

Member

@mmangkad can you please fix the conflict

@mergify mergify Bot removed the needs-rebase label Jun 17, 2026
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
@mmangkad

Contributor Author

@mgoin looks green now

Comment on lines +76 to +78
x_fp4 = pad_nvfp4_activation_for_cutlass(
x_fp4, getattr(layer, "weights_padding_cols", 0)
)

Member


Is this padding needed for cutedsl as well?

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Jun 22, 2026
@mgoin

mgoin commented Jun 22, 2026

Member

Also is there a plan to close the perf gap for small M? I'll merge for now since it seems good overall

@mgoin mgoin merged commit d1a38c2 into vllm-project:main Jun 22, 2026
92 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 22, 2026
@mmangkad mmangkad deleted the add-flashinfer-cutedsl-nvfp4 branch June 24, 2026 00:46
@wzhao18 wzhao18 mentioned this pull request Jun 26, 2026

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

3 participants