[Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend by lucifer1004 · Pull Request #46492 · vllm-project/vllm

lucifer1004 · 2026-06-23T12:31:12Z

Models that set a SwiGLU clamp (swiglu_limit) -- e.g. DeepSeek-V4-Flash NVFP4 -- restrict NVFP4 MoE backend selection to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM. The TRTLLM NVFP4 fused MoE only supports Blackwell datacenter GPUs (capability family 100 / SM100), so on SM120 (e.g. RTX PRO 6000) selection fails with "No NvFp4 MoE backend supports the deployment configuration", with no eligible fallback.

FlashInferExperts (flashinfer_cutlass) already applies the clamp -- it builds gemm1_clamp_limit and passes swiglu_limit into flashinfer_cutlass_fused_moe -- and supports SM120. Add it to NVFP4_BACKENDS_WITH_CLAMP so clamped NVFP4 MoE models can run on workstation Blackwell. FLASHINFER_TRTLLM stays first in AVAILABLE_BACKENDS, so SM100 selection is unchanged. This also matches the explicit-backend error message, which already lists flashinfer_cutlass as a clamp-capable option.
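
The selection behavior described above can be illustrated with a small, self-contained sketch. This is not the vLLM implementation: the `Backend` enum, the `SUPPORTED_FAMILIES` table and the `select_backend` helper are hypothetical names invented for this example, and the per-backend capability families are assumptions drawn only from the description (TRTLLM on family 100, CUTLASS also on family 120). Only `AVAILABLE_BACKENDS`, `NVFP4_BACKENDS_WITH_CLAMP` and the error string come from the PR text.

```python
from enum import Enum, auto


class Backend(Enum):
    FLASHINFER_TRTLLM = auto()
    FLASHINFER_CUTLASS = auto()


# Priority order: TRTLLM stays first, so SM100 keeps selecting it.
AVAILABLE_BACKENDS = [Backend.FLASHINFER_TRTLLM, Backend.FLASHINFER_CUTLASS]

# Clamp-capable set after this PR (previously only FLASHINFER_TRTLLM).
NVFP4_BACKENDS_WITH_CLAMP = {
    Backend.FLASHINFER_TRTLLM,
    Backend.FLASHINFER_CUTLASS,
}

# Hypothetical support table keyed by capability family (assumption).
SUPPORTED_FAMILIES = {
    Backend.FLASHINFER_TRTLLM: {100},
    Backend.FLASHINFER_CUTLASS: {100, 120},
}


def select_backend(capability_family, swiglu_limit,
                   clamp_backends=NVFP4_BACKENDS_WITH_CLAMP):
    """Return the first backend that supports both the clamp and the device."""
    for backend in AVAILABLE_BACKENDS:
        # Models with a SwiGLU clamp may only use clamp-capable backends.
        if swiglu_limit is not None and backend not in clamp_backends:
            continue
        if capability_family in SUPPORTED_FAMILIES[backend]:
            return backend
    raise ValueError(
        "No NvFp4 MoE backend supports the deployment configuration")
```

With the old single-entry clamp set, `select_backend(120, 7.0, clamp_backends={Backend.FLASHINFER_TRTLLM})` raises the error quoted above; with the widened set, the same call on family 120 returns `Backend.FLASHINFER_CUTLASS`, while family 100 still returns `Backend.FLASHINFER_TRTLLM`.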

Validated on RTX PRO 6000 (SM120) with DeepSeek-V4-Flash NVFP4 (TP2/EP2): the backend selects flashinfer_cutlass, the server reaches ready, and GSM8K exact_match = 0.965, confirming the clamp is numerically honored.
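
For readers unfamiliar with what a SwiGLU clamp does numerically, here is a scalar sketch of one common formulation. The exact clamp convention used by the FlashInfer kernels is not described in this PR, so the choice below (gate clamped from above, up-projection clamped symmetrically) is an assumption, and `clamped_swiglu` is a hypothetical helper, not a vLLM or FlashInfer function.

```python
import math


def clamped_swiglu(gate, up, swiglu_limit=None):
    """Scalar SwiGLU with an optional clamp (assumed convention)."""
    if swiglu_limit is not None:
        # Assumption: gate is bounded from above only.
        gate = min(gate, swiglu_limit)
        # Assumption: up-projection is bounded to [-limit, limit].
        up = max(-swiglu_limit, min(up, swiglu_limit))
    silu = gate / (1.0 + math.exp(-gate))  # SiLU(gate) = gate * sigmoid(gate)
    return silu * up
```

A backend that ignored `swiglu_limit` would let large activations pass through unbounded, which is the kind of divergence an end-to-end accuracy check such as GSM8K is meant to catch.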

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mergify · 2026-06-23T19:24:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lucifer1004.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> (cherry picked from commit 0d4d164) Signed-off-by: khluu <khluu000@gmail.com>

…m-project#46492) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>

…m-project#46492) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since the vllm-project#43477 merge).

6 conflicts resolved:
- oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by vllm-project#45836. No fork patch needed anymore.
- engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content alias validator + upstream's empty-tool_calls serializer).
- test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config provides static_forward_context; consistent with the other test).
- routed_experts.py: combine the two per-tensor-scale loaders -> one helper that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization.
- serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template-kwargs threading: Site A (sampling params) into ServingRender.render_chat_request via self.online_renderer attrs; Site B (prompt render) into OnlineRenderer.render_chat.
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability family 120, so SM12x stays byte-identical to the validated path. Enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment.

Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness (vllm-project#45956, vllm-project#46533).

Not yet built/validated on GPU.

Signed-off-by: jasl <jasl9187@hotmail.com>

lucifer1004 requested review from mgoin, pavanimajety and zyongye as code owners June 23, 2026 12:31

mergify Bot added the nvidia and bug labels Jun 23, 2026

github-project-automation Bot added this to NVIDIA Jun 23, 2026

zyongye approved these changes Jun 23, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 23, 2026

zyongye added the ready label Jun 23, 2026

mgoin approved these changes Jun 23, 2026

mgoin enabled auto-merge (squash) June 23, 2026 17:10

zyongye added this to the v0.24.0 cherrypick milestone Jun 23, 2026

mergify Bot added the needs-rebase label Jun 23, 2026

Merge branch 'main' into fix/dsv4-nvfp4

b4e2734

Signed-off-by: Michael Goin <mgoin64@gmail.com>

vllm-bot merged commit 0d4d164 into vllm-project:main Jun 23, 2026
4 of 6 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 23, 2026

jasl mentioned this pull request Jun 26, 2026

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834

Open

vllm-bot merged 2 commits into vllm-project:main from lucifer1004:fix/dsv4-nvfp4
