[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend by mmangkad · Pull Request #42235 · vllm-project/vllm

mmangkad · 2026-05-10T15:27:17Z

Summary

Adds flashinfer-cutedsl for dense NVFP4 GEMM and makes it the highest-priority CUDA backend when supported on SM10x. In serving benchmarks, cutedsl is fastest across concurrency 1-512 and improves tok/s/user by up to 27.07% over the tested FlashInfer backends.

Performance Comparison

Setup:

Model: nvidia/Llama-3.1-8B-Instruct-NVFP4
Device: SM103
Dataset: random
Input/output length: 512 input tokens, 512 output tokens

Test Plan

CI, which now includes:

An extended FlashInfer NVFP4 GEMM kernel test that covers cute-dsl.
An extended NVFP4 model test that covers flashinfer-cutedsl.

gemini-code-assist

Code Review

This pull request introduces a new NVFP4 GEMM backend utilizing FlashInfer's CuteDSL, specifically targeting SM10x architectures. The changes include the implementation of the FlashInferCuteDslNvFp4LinearKernel, its registration within the kernel executor, and the addition of flashinfer-cutedsl as a valid environment variable option. Feedback highlights inconsistencies in the backend naming convention, recommending the use of "cutedsl" instead of "cute-dsl" across the codebase and tests for better alignment with existing backend identifiers.

LopezCastroRoberto

I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.

Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.

One example is:

where speedup=1 means a different kernel of the existing ones was selected, but cuteDSL causes a regression

mmangkad · 2026-05-13T16:37:24Z

Could you clarify when these SM100 benchmarks were collected and which FlashInfer version was used? Based on my testing, cute-dsl is almost always better than CUTLASS and cuDNN on both SM103 and SM100, and we should actually expect to see a higher relative speedup on SM100 rather than the regressions shown here.

LopezCastroRoberto · 2026-05-13T16:48:22Z

@mmangkad Yeah, you have a point, this was with 0.6.8 and might have changed since then. I see the latest FI release is 0.6.11.

Can you please benchmark those shapes in my plot to see how different it looks now? I recommend using triton.testing.do_bench_cudagraph for proper time measurement, e.g., https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_nvfp4_gemm.py
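For reference, a latency measured this way can be converted into the TFLOP/s figure commonly reported for GEMM benchmarks. The sketch below assumes the conventional 2*M*N*K FLOP count and a latency in milliseconds (the unit that triton.testing.do_bench_cudagraph returns). The helper name gemm_tflops is hypothetical and is not taken from the linked benchmark script.

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Convert a GEMM latency in milliseconds to TFLOP/s.

    Assumes the usual 2*M*N*K FLOP count (one multiply and one add per
    inner-product term) for an (M x K) by (K x N) product.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    flops = 2.0 * m * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


# Illustrative (not measured) latency of 0.05 ms for M=8192, N=7168, K=2048.
print(f"{gemm_tflops(8192, 7168, 2048, 0.05):.1f} TFLOP/s")
```

Reporting throughput rather than raw latency makes results comparable across different M values, since the work per call grows linearly with M.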

mmangkad · 2026-05-13T18:35:13Z

@LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.

LopezCastroRoberto · 2026-05-14T13:34:30Z

Thanks for the results, @mmangkad! Yeah, seems like it might have improved since last time I checked. Just to make sure, would you mind adding flashinfer-trtllm backend to the comparison making sure use_8x4_sf_layout is True when calling flashinfer mm_fp4? We found that for bs<32 that backend was significantly faster than CUTLASS in most cases. See #30885

That way, we would have the full picture and it would be easier to define a heuristic, instead of just adding one more backend to the list.

mmangkad · 2026-05-14T14:07:16Z

@LopezCastroRoberto TRTLLM is still strongest at the very smallest M values, especially M=1-4, but CuTeDSL already matches or beats it in many small-M cases and takes over by M=8+. The clearer result is that CuTeDSL is almost always better than the current CUTLASS default across these shapes.

FlashInfer NVFP4 GEMM Results

Each backend cell shows throughput in TFLOP/s, with the gap to the best backend in parentheses. "best" is computed within the same device, shape, and M.
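As a minimal sketch of how such cells can be derived, the snippet below labels each backend in one row with either "best" or its relative gap, assuming the gap is (value / best - 1) * 100. The function name gaps_vs_best is hypothetical. The published percentages were presumably computed from unrounded measurements, so recomputing them from the rounded cells can differ slightly for some rows.

```python
def gaps_vs_best(tflops_by_backend: dict[str, float]) -> dict[str, str]:
    """Label each backend with "best" or its percentage gap to the fastest one."""
    best = max(tflops_by_backend.values())
    labels = {}
    for backend, tflops in tflops_by_backend.items():
        if tflops == best:
            labels[backend] = "best"
        else:
            # Negative value: how far this backend falls short of the best one.
            labels[backend] = f"{(tflops / best - 1.0) * 100:.2f}%"
    return labels


# Row copied from the SM100 results for N=7168, K=2048, M=64.
row = {"CuTeDSL": 586.55, "CUTLASS": 460.02, "TRTLLM": 372.67}
print(gaps_vs_best(row))
```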

Overall Winners

Backend	Wins	Share
CuTeDSL	73	65.2%
CUTLASS	18	16.1%
TRTLLM	21	18.8%

SM100 winners

Backend	Wins	Share
CuTeDSL	37	66.1%
CUTLASS	8	14.3%
TRTLLM	11	19.6%

SM103 winners

Backend	Wins	Share
CuTeDSL	36	64.3%
CUTLASS	10	17.9%
TRTLLM	10	17.9%
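The overall winner table can be cross-checked against the two per-device tables. The sketch below sums the per-device win counts copied from above and recomputes each backend's share. The function name overall_shares is illustrative only.

```python
# Win counts copied from the SM100 and SM103 winner tables.
wins = {
    "SM100": {"CuTeDSL": 37, "CUTLASS": 8, "TRTLLM": 11},
    "SM103": {"CuTeDSL": 36, "CUTLASS": 10, "TRTLLM": 10},
}


def overall_shares(per_device: dict[str, dict[str, int]]) -> dict[str, tuple[int, float]]:
    """Sum wins across devices and return (wins, share in percent) per backend."""
    totals: dict[str, int] = {}
    for counts in per_device.values():
        for backend, n in counts.items():
            totals[backend] = totals.get(backend, 0) + n
    grand_total = sum(totals.values())
    return {b: (n, 100.0 * n / grand_total) for b, n in totals.items()}


for backend, (n, share) in overall_shares(wins).items():
    print(f"{backend}: {n} wins, {share:.1f}%")
```

The totals come to 112 comparisons (2 devices, 4 shapes, 14 values of M), which matches the overall table.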

SM100

N=7168, K=2048

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	7.75 (-12.61%)	7.59 (-14.39%)	8.87 (best)	TRTLLM
2	15.96 (-4.57%)	13.99 (-16.35%)	16.72 (best)	TRTLLM
4	30.94 (-6.65%)	28.99 (-12.52%)	33.14 (best)	TRTLLM
8	72.41 (best)	60.67 (-16.21%)	70.36 (-2.83%)	CuTeDSL
16	144.98 (best)	120.89 (-16.62%)	130.30 (-10.13%)	CuTeDSL
32	290.41 (best)	229.20 (-21.08%)	254.27 (-12.44%)	CuTeDSL
64	586.55 (best)	460.02 (-21.57%)	372.67 (-36.46%)	CuTeDSL
128	1123.69 (best)	957.68 (-14.77%)	724.85 (-35.49%)	CuTeDSL
256	1889.30 (best)	1754.17 (-7.15%)	1473.48 (-22.01%)	CuTeDSL
512	2757.34 (best)	2718.47 (-1.41%)	1908.06 (-30.80%)	CuTeDSL
1024	3355.52 (-2.28%)	3433.64 (best)	2471.99 (-28.01%)	CUTLASS
2048	4351.24 (best)	4296.43 (-1.26%)	3198.24 (-26.50%)	CuTeDSL
4096	4637.72 (-2.92%)	4777.00 (best)	3491.12 (-26.92%)	CUTLASS
8192	4919.04 (best)	4916.80 (-0.05%)	3556.33 (-27.70%)	CuTeDSL

N=4096, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	9.31 (best)	8.70 (-6.51%)	9.22 (-1.03%)	CuTeDSL
2	18.52 (best)	17.75 (-4.13%)	17.86 (-3.55%)	CuTeDSL
4	37.14 (best)	36.11 (-2.78%)	35.13 (-5.41%)	CuTeDSL
8	77.10 (best)	69.76 (-9.52%)	70.75 (-8.24%)	CuTeDSL
16	154.35 (best)	140.15 (-9.20%)	141.89 (-8.07%)	CuTeDSL
32	308.70 (best)	280.22 (-9.23%)	280.46 (-9.15%)	CuTeDSL
64	619.04 (best)	556.22 (-10.15%)	398.77 (-35.58%)	CuTeDSL
128	1221.97 (best)	1097.78 (-10.16%)	779.89 (-36.18%)	CuTeDSL
256	2345.46 (best)	2172.83 (-7.36%)	1536.17 (-34.50%)	CuTeDSL
512	3953.75 (best)	3819.20 (-3.40%)	2971.94 (-24.83%)	CuTeDSL
1024	4827.23 (-0.60%)	4856.49 (best)	3265.48 (-32.76%)	CUTLASS
2048	5664.96 (best)	5582.28 (-1.46%)	3647.17 (-35.62%)	CuTeDSL
4096	5707.21 (-0.10%)	5713.02 (best)	4242.58 (-25.74%)	CUTLASS
8192	6151.10 (best)	5991.68 (-2.59%)	3931.82 (-36.08%)	CuTeDSL

N=18432, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	23.81 (-31.82%)	19.16 (-45.12%)	34.91 (best)	TRTLLM
2	48.89 (-30.44%)	47.55 (-32.35%)	70.29 (best)	TRTLLM
4	99.59 (-28.17%)	102.02 (-26.42%)	138.66 (best)	TRTLLM
8	232.06 (-19.13%)	206.99 (-27.86%)	286.94 (best)	TRTLLM
16	462.80 (-16.96%)	408.13 (-26.77%)	557.33 (best)	TRTLLM
32	929.33 (best)	825.79 (-11.14%)	553.18 (-40.48%)	CuTeDSL
64	2254.95 (best)	1712.88 (-24.04%)	898.07 (-60.17%)	CuTeDSL
128	3832.20 (best)	3373.04 (-11.98%)	3054.55 (-20.29%)	CuTeDSL
256	5070.73 (best)	4942.17 (-2.54%)	3256.12 (-35.79%)	CuTeDSL
512	5359.51 (best)	5307.24 (-0.98%)	3621.65 (-32.43%)	CuTeDSL
1024	5506.23 (best)	5430.92 (-1.37%)	3673.10 (-33.29%)	CuTeDSL
2048	5779.01 (best)	5636.57 (-2.46%)	3384.32 (-41.44%)	CuTeDSL
4096	5466.57 (-4.88%)	5746.94 (best)	3473.93 (-39.55%)	CUTLASS
8192	5681.96 (best)	5483.52 (-3.49%)	3490.44 (-38.57%)	CuTeDSL

N=7168, K=18432

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	17.29 (-12.04%)	15.99 (-18.66%)	19.66 (best)	TRTLLM
2	37.68 (-2.86%)	31.97 (-17.58%)	38.79 (best)	TRTLLM
4	73.76 (-6.20%)	63.85 (-18.79%)	78.63 (best)	TRTLLM
8	172.87 (best)	128.22 (-25.83%)	156.05 (-9.73%)	CuTeDSL
16	345.76 (best)	255.74 (-26.04%)	310.93 (-10.07%)	CuTeDSL
32	660.78 (best)	477.45 (-27.74%)	591.83 (-10.44%)	CuTeDSL
64	1377.38 (best)	1033.90 (-24.94%)	862.33 (-37.39%)	CuTeDSL
128	2582.43 (best)	1897.42 (-26.53%)	1702.98 (-34.05%)	CuTeDSL
256	4570.02 (best)	4432.55 (-3.01%)	2802.01 (-38.69%)	CuTeDSL
512	4834.91 (-0.37%)	4852.64 (best)	2930.76 (-39.60%)	CUTLASS
1024	5372.21 (best)	5212.10 (-2.98%)	3034.59 (-43.51%)	CuTeDSL
2048	5516.94 (best)	5284.15 (-4.22%)	3288.94 (-40.38%)	CuTeDSL
4096	5572.54 (-3.29%)	5762.37 (best)	3350.80 (-41.85%)	CUTLASS
8192	5002.21 (-8.12%)	5444.40 (best)	3707.88 (-31.90%)	CUTLASS

SM103

N=7168, K=2048

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	8.26 (-11.02%)	7.72 (-16.91%)	9.29 (best)	TRTLLM
2	16.80 (-2.49%)	14.93 (-13.33%)	17.23 (best)	TRTLLM
4	32.54 (-7.83%)	29.90 (-15.30%)	35.30 (best)	TRTLLM
8	75.32 (best)	59.74 (-20.68%)	69.53 (-7.69%)	CuTeDSL
16	146.08 (best)	120.41 (-17.58%)	139.20 (-4.71%)	CuTeDSL
32	304.86 (best)	239.83 (-21.33%)	257.04 (-15.68%)	CuTeDSL
64	616.77 (best)	479.07 (-22.33%)	394.67 (-36.01%)	CuTeDSL
128	1141.56 (best)	963.42 (-15.60%)	763.52 (-33.12%)	CuTeDSL
256	2039.46 (best)	1861.99 (-8.70%)	1479.90 (-27.44%)	CuTeDSL
512	2988.48 (best)	2736.61 (-8.43%)	1983.47 (-33.63%)	CuTeDSL
1024	3553.24 (-2.43%)	3641.57 (best)	2563.44 (-29.61%)	CUTLASS
2048	4550.94 (best)	4392.35 (-3.48%)	3245.54 (-28.68%)	CuTeDSL
4096	4920.28 (best)	4894.33 (-0.53%)	3558.20 (-27.68%)	CuTeDSL
8192	5148.52 (-1.64%)	5234.12 (best)	3413.34 (-34.79%)	CUTLASS

N=4096, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	9.03 (-3.06%)	8.26 (-11.37%)	9.31 (best)	TRTLLM
2	19.18 (best)	17.22 (-10.23%)	18.43 (-3.89%)	CuTeDSL
4	38.59 (best)	37.81 (-2.03%)	36.83 (-4.56%)	CuTeDSL
8	81.02 (best)	73.13 (-9.74%)	73.80 (-8.91%)	CuTeDSL
16	161.90 (best)	146.44 (-9.55%)	147.29 (-9.03%)	CuTeDSL
32	324.05 (best)	293.18 (-9.53%)	292.25 (-9.81%)	CuTeDSL
64	653.75 (best)	584.33 (-10.62%)	412.80 (-36.86%)	CuTeDSL
128	1274.86 (best)	1151.05 (-9.71%)	807.53 (-36.66%)	CuTeDSL
256	2487.46 (best)	2320.52 (-6.71%)	1598.98 (-35.72%)	CuTeDSL
512	4331.05 (best)	4087.08 (-5.63%)	3075.78 (-28.98%)	CuTeDSL
1024	5160.79 (best)	5035.26 (-2.43%)	3388.57 (-34.34%)	CuTeDSL
2048	5865.40 (best)	5856.70 (-0.15%)	3677.14 (-37.31%)	CuTeDSL
4096	6209.12 (-3.70%)	6447.84 (best)	4181.73 (-35.15%)	CUTLASS
8192	5629.43 (best)	5615.45 (-0.25%)	3947.02 (-29.89%)	CuTeDSL

N=18432, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	25.02 (-33.65%)	21.57 (-42.79%)	37.71 (best)	TRTLLM
2	50.75 (-31.76%)	47.81 (-35.72%)	74.37 (best)	TRTLLM
4	102.43 (-31.62%)	100.24 (-33.08%)	149.80 (best)	TRTLLM
8	246.51 (-19.50%)	208.99 (-31.75%)	306.21 (best)	TRTLLM
16	483.16 (-18.65%)	412.14 (-30.60%)	593.90 (best)	TRTLLM
32	963.08 (best)	832.79 (-13.53%)	585.98 (-39.16%)	CuTeDSL
64	2388.24 (best)	1765.84 (-26.06%)	938.92 (-60.69%)	CuTeDSL
128	4138.86 (best)	3577.81 (-13.56%)	3203.88 (-22.59%)	CuTeDSL
256	5331.70 (best)	5207.50 (-2.33%)	3343.82 (-37.28%)	CuTeDSL
512	5786.16 (best)	5264.24 (-9.02%)	3680.79 (-36.39%)	CuTeDSL
1024	6007.93 (best)	5940.19 (-1.13%)	3569.01 (-40.59%)	CuTeDSL
2048	5535.25 (best)	5356.93 (-3.22%)	3455.75 (-37.57%)	CuTeDSL
4096	5276.37 (-1.82%)	5374.20 (best)	3551.15 (-33.92%)	CUTLASS
8192	5346.58 (-4.00%)	5569.13 (best)	3568.93 (-35.92%)	CUTLASS

N=7168, K=18432

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	17.97 (-11.04%)	16.87 (-16.48%)	20.20 (best)	TRTLLM
2	41.76 (best)	33.07 (-20.81%)	40.28 (-3.53%)	CuTeDSL
4	82.51 (best)	65.98 (-20.02%)	79.96 (-3.08%)	CuTeDSL
8	177.90 (best)	132.71 (-25.40%)	161.40 (-9.28%)	CuTeDSL
16	353.05 (best)	280.15 (-20.65%)	317.09 (-10.19%)	CuTeDSL
32	704.78 (best)	551.51 (-21.75%)	621.58 (-11.80%)	CuTeDSL
64	1368.03 (best)	1074.71 (-21.44%)	779.15 (-43.05%)	CuTeDSL
128	2654.48 (best)	2082.84 (-21.53%)	1739.80 (-34.46%)	CuTeDSL
256	4591.57 (best)	4442.42 (-3.25%)	2904.12 (-36.75%)	CuTeDSL
512	5248.23 (-0.15%)	5255.86 (best)	3039.43 (-42.17%)	CUTLASS
1024	5598.81 (-6.27%)	5973.28 (best)	2949.47 (-50.62%)	CUTLASS
2048	5114.07 (-8.83%)	5609.35 (best)	3206.56 (-42.84%)	CUTLASS
4096	5386.57 (-8.61%)	5893.87 (best)	3284.85 (-44.27%)	CUTLASS
8192	4975.49 (-14.80%)	5839.75 (best)	3264.02 (-44.11%)	CUTLASS
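One way to read these tables when looking for a selection heuristic is to find, per shape, the smallest M beyond which TRTLLM no longer wins. The sketch below does this for two of the SM100 tables, with the Winner columns copied from above. The function name and data layout are hypothetical, and nothing here reflects how vLLM selects a backend.

```python
from typing import Optional

M_VALUES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

# Winner columns copied from two SM100 tables, keyed by (N, K).
WINNERS = {
    (7168, 2048): ["TRTLLM"] * 3 + ["CuTeDSL"] * 7
    + ["CUTLASS", "CuTeDSL", "CUTLASS", "CuTeDSL"],
    (18432, 7168): ["TRTLLM"] * 5 + ["CuTeDSL"] * 7 + ["CUTLASS", "CuTeDSL"],
}


def trtllm_crossover(winners: list[str]) -> Optional[int]:
    """Return the smallest M from which TRTLLM never wins again, or None."""
    last_trtllm = max(
        (i for i, w in enumerate(winners) if w == "TRTLLM"), default=-1
    )
    if last_trtllm + 1 >= len(M_VALUES):
        return None  # TRTLLM still wins at the largest measured M.
    return M_VALUES[last_trtllm + 1]


for shape, winners in WINNERS.items():
    print(shape, trtllm_crossover(winners))
```

For these two shapes the crossover differs (M=8 versus M=32), which suggests that any M-based rule would also need to account for N and K.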

LopezCastroRoberto · 2026-05-18T10:23:53Z

Thanks for the results, @mmangkad. Yeah, I think this makes sense.

We should also update the FI version to the latest, i.e. 0.6.11.post3?
https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

LopezCastroRoberto · 2026-05-18T10:29:16Z

cc: @mgoin

mmangkad · 2026-05-18T11:01:22Z

@LopezCastroRoberto we are already at 0.6.11.post2 now, but I think we can include the upgrade to 0.6.11.post3 here if you prefer that.

LopezCastroRoberto · 2026-05-18T11:04:26Z

Never mind, my bad. I accidentally checked my own fork instead of upstream. Waiting for @mgoin's approval.

LopezCastroRoberto · 2026-05-28T11:37:42Z

@mmangkad -- following up on the FlashInfer autotuning issue I flagged earlier (flashinfer-ai/flashinfer#3295). The discussion has progressed and there's now a concrete fix, so wanted to share the conclusions since they directly affect this PR.

Right now vLLM defaults to O2, which has enable_flashinfer_autotune=True (re-enabled in #42857). Once this PR lands making cute-dsl the highest-priority NVFP4 backend, every default NVFP4 deployment will autotune mm_fp4 cuteDSL kernels at startup.

Interestingly, it seems that autotuning mm_fp4 cuteDSL is unnecessary. PR flashinfer-ai/flashinfer#2940 added a heuristic that closes the autotuned vs non-autotuned perf gap. The heuristic predicts the best config for each (N, K) combination in <100us on first call, and subsequent lookups are <0.2us.
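The "slow first call, fast subsequent lookups" behavior described here is the usual shape of a memoized heuristic. The sketch below illustrates only that caching pattern. It is not FlashInfer's implementation, and the function predict_config, its tile values, and the call counter are all hypothetical placeholders.

```python
from functools import lru_cache

heuristic_calls = 0  # Counts how often the (hypothetical) heuristic really runs.


@lru_cache(maxsize=None)
def predict_config(n: int, k: int) -> tuple[int, int]:
    """Hypothetical heuristic returning a (tile_m, tile_n) pair for a weight shape.

    The tile sizes are placeholders chosen for illustration only.
    """
    global heuristic_calls
    heuristic_calls += 1
    tile_n = 256 if n >= 8192 else 128
    tile_m = 128
    return tile_m, tile_n


# First call for a shape runs the heuristic; repeats are served from the cache.
predict_config(7168, 2048)
predict_config(7168, 2048)
predict_config(18432, 7168)
print(heuristic_calls)
```

Because the set of (N, K) weight shapes in a deployed model is small and fixed, a cache keyed on those two values is hit on almost every call after warmup.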

To fix this, flashinfer-ai/flashinfer#3396 adds a skip_ops mechanism:

with flashinfer.autotune(skip_ops="fp4_gemm"):
    ...

This brought warmup from 587s → 8s on DSV3.2-NVFP4 TP=4.

I think we should track a follow-up to integrate skip_ops when the FI version is bumped to include #3396, and then merge this PR too.

cc: @mgoin

mergify · 2026-06-11T16:28:55Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mmangkad.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-17T14:26:44Z

Hi @mmangkad, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mgoin · 2026-06-17T14:27:04Z

@mmangkad can you please fix the conflict

mmangkad · 2026-06-18T14:38:09Z

@mgoin looks green now

mgoin · 2026-06-22T20:16:06Z

+        x_fp4 = pad_nvfp4_activation_for_cutlass(
+            x_fp4, getattr(layer, "weights_padding_cols", 0)
+        )


Is this padding needed for cutedsl as well?

mgoin · 2026-06-22T20:17:13Z

Also is there a plan to close the perf gap for small M? I'll merge for now since it seems good overall

mmangkad requested review from DarkLight1337, WoosukKwon, mgoin, tlrmchlsmth, yewentao256, ywang96 and zyongye as code owners May 10, 2026 15:27

claude Bot reviewed May 10, 2026

mergify Bot added the nvidia label May 10, 2026

github-project-automation Bot added this to NVIDIA May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

Comment thread vllm/model_executor/kernels/linear/nvfp4/flashinfer.py

Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py Outdated

Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py

LopezCastroRoberto reviewed May 13, 2026

LopezCastroRoberto approved these changes May 18, 2026

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch 2 times, most recently from 3b67dbc to d2c176d Compare May 18, 2026 11:41

mmangkad requested review from ProExpertProg, hmellor, houseroad, robertgshaw2-redhat and youkaichao as code owners May 18, 2026 11:41

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from d2c176d to 3700b17 Compare May 18, 2026 11:44

meena-at-work mentioned this pull request May 20, 2026

[CI/Build] Bump flashinfer to v0.6.11.post3 #43251

Open

qiching mentioned this pull request May 22, 2026

The autotune speed of mm_fp4 with backend=cute-dsl is slow flashinfer-ai/flashinfer#3295

Closed

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from 3700b17 to c34ce76 Compare May 23, 2026 04:31

mergify Bot added the needs-rebase label Jun 11, 2026

Add FlashInfer cutedsl NVFP4 GEMM backend

fde0c77

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from c34ce76 to fde0c77 Compare June 17, 2026 14:23

mmangkad requested a review from AndreasKaratzas as a code owner June 17, 2026 14:23

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 17, 2026

mergify Bot removed the needs-rebase label Jun 17, 2026

Add async TP fusion pattern for FlashInfer cute-dsl FP4

8b2c021

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

mmangkad requested review from BoyuanFeng, vadiklyutiy and zou3519 as code owners June 18, 2026 05:17

mmangkad added 2 commits June 18, 2026 13:17

Merge branch 'main' into add-flashinfer-cutedsl-nvfp4

73e8ef9

Merge branch 'main' into add-flashinfer-cutedsl-nvfp4

213f91a

mgoin approved these changes Jun 22, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 22, 2026

mgoin merged commit d1a38c2 into vllm-project:main Jun 22, 2026
92 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 22, 2026

mgoin mentioned this pull request Jun 22, 2026

[WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell #39933

Closed

mmangkad deleted the add-flashinfer-cutedsl-nvfp4 branch June 24, 2026 00:46

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend (vllm…

2086681

…-project#42235) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

wzhao18 mentioned this pull request Jun 26, 2026

Bump flashinfer version to 0.6.13 #46683

Merged

4 tasks

mgoin merged 4 commits into vllm-project:main from mmangkad:add-flashinfer-cutedsl-nvfp4

3 participants
