
[Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend #46492

Merged
vllm-bot merged 2 commits into vllm-project:main from lucifer1004:fix/dsv4-nvfp4
Jun 23, 2026

@lucifer1004 lucifer1004 commented Jun 23, 2026

Contributor

Models that set a SwiGLU clamp (swiglu_limit) -- e.g. DeepSeek-V4-Flash NVFP4 -- restrict NVFP4 MoE backend selection to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM. The TRTLLM NVFP4 fused MoE only supports Blackwell datacenter GPUs (capability family 100 / SM100), so on SM120 (e.g. RTX PRO 6000) selection fails with "No NvFp4 MoE backend supports the deployment configuration", with no eligible fallback.
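
To make the term concrete, here is a minimal scalar sketch of what a SwiGLU clamp can look like. The exact formula used by the model and by the fused kernels is not given in this PR, so the clamping rule below (cap the gate from above, bound the up projection symmetrically) is an assumption, and `clamped_swiglu` is a hypothetical helper rather than a vLLM or FlashInfer function.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamped_swiglu(gate: float, up: float, swiglu_limit=None) -> float:
    """Scalar SwiGLU with an optional clamp (assumed formulation).

    With no limit this is plain SwiGLU: silu(gate) * up.
    With a limit, the inputs are bounded before the activation is applied.
    """
    if swiglu_limit is not None:
        gate = min(gate, swiglu_limit)                    # cap the gate from above
        up = max(-swiglu_limit, min(up, swiglu_limit))    # bound up symmetrically
    return silu(gate) * up
```

A backend that ignores the limit would compute the unclamped value for large activations, which is why selection is restricted to backends known to apply it.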

FlashInferExperts (flashinfer_cutlass) already applies the clamp -- it builds gemm1_clamp_limit and passes swiglu_limit into flashinfer_cutlass_fused_moe -- and supports SM120. Add it to NVFP4_BACKENDS_WITH_CLAMP so clamped NVFP4 MoE models can run on workstation Blackwell. FLASHINFER_TRTLLM stays first in AVAILABLE_BACKENDS, so SM100 selection is unchanged. This also matches the explicit-backend error message, which already lists flashinfer_cutlass as a clamp-capable option.
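
The selection behavior described above can be sketched with a toy model. This is not the vLLM implementation: the `Backend` enum, the `SUPPORTED_FAMILIES` table, and `select_backend` are hypothetical stand-ins, and the assumption that flashinfer_cutlass also covers SM100 is made only for illustration. Only the names `AVAILABLE_BACKENDS`, `FLASHINFER_TRTLLM`, the clamp set, and the error text come from the description.

```python
from enum import Enum


class Backend(Enum):
    FLASHINFER_TRTLLM = "flashinfer_trtllm"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


# Priority order: FLASHINFER_TRTLLM stays first, so SM100 still selects it.
AVAILABLE_BACKENDS = [Backend.FLASHINFER_TRTLLM, Backend.FLASHINFER_CUTLASS]

# Hypothetical capability table (capability family -> supported).
SUPPORTED_FAMILIES = {
    Backend.FLASHINFER_TRTLLM: {100},
    Backend.FLASHINFER_CUTLASS: {100, 120},  # SM100 entry is an assumption
}

# The clamp-capable set before and after this change.
CLAMP_BACKENDS_BEFORE = {Backend.FLASHINFER_TRTLLM}
CLAMP_BACKENDS_AFTER = {Backend.FLASHINFER_TRTLLM, Backend.FLASHINFER_CUTLASS}


def select_backend(family: int, has_clamp: bool, clamp_backends: set) -> Backend:
    """Return the first backend, in priority order, that fits the deployment."""
    for backend in AVAILABLE_BACKENDS:
        if has_clamp and backend not in clamp_backends:
            continue  # model sets swiglu_limit, backend cannot apply it
        if family in SUPPORTED_FAMILIES[backend]:
            return backend
    raise ValueError("No NvFp4 MoE backend supports the deployment configuration")
```

With the old set, a clamped model on family 120 falls through every candidate and raises; with the new set it lands on `FLASHINFER_CUTLASS`, while family 100 still resolves to `FLASHINFER_TRTLLM` in both cases.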

Validated on RTX PRO 6000 (SM120) with DeepSeek-V4-Flash NVFP4 (TP2/EP2): the backend selects flashinfer_cutlass, the server reaches ready, and GSM8K exact_match = 0.965, confirming the clamp is numerically honored.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mergify mergify Bot added the nvidia and bug labels Jun 23, 2026
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Jun 23, 2026
@zyongye zyongye added the ready label Jun 23, 2026
@mgoin mgoin enabled auto-merge (squash) June 23, 2026 17:10
@zyongye zyongye added this to the v0.24.0 cherrypick milestone Jun 23, 2026

mergify Bot commented Jun 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lucifer1004.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 23, 2026
Signed-off-by: Michael Goin <mgoin64@gmail.com>
@vllm-bot vllm-bot merged commit 0d4d164 into vllm-project:main Jun 23, 2026
4 of 6 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 23, 2026
khluu pushed a commit that referenced this pull request Jun 24, 2026
)

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
(cherry picked from commit 0d4d164)

Signed-off-by: khluu <khluu000@gmail.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…m-project#46492)

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…m-project#46492)

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 27, 2026
Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since
the vllm-project#43477 merge). 6 conflicts resolved:

- oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our
  FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by
  vllm-project#45836. No fork patch needed anymore.
- engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content
  alias validator + upstream's empty-tool_calls serializer).
- test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config
  provides static_forward_context; consistent with the other test).
- routed_experts.py: combine the two per-tensor-scale loaders -> one helper
  that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization.
- serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split
  ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template-kwargs
  threading: Site A (sampling params) into ServingRender.render_chat_request via
  self.online_renderer attrs; Site B (prompt render) into OnlineRenderer.render_chat.
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path
  (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability
  family 120, so SM12x stays byte-identical to the validated path. Enabling
  cooperative_topk on SM12x is a separate, to-be-validated perf experiment.
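
The first resolution in the list above (the clamp-set union) amounts to a plain set union. A minimal sketch, assuming the fork side carried only the CUTLASS addition and the upstream side carried both CUTLASS and MARLIN; the variable names and the split between the two sides are hypothetical, and only the resulting set is stated in the commit message:

```python
# Backend names follow the commit message; which side of the conflict held
# which members is an assumption made for illustration.
FORK_CLAMP_SET = {"TRTLLM", "CUTLASS"}
UPSTREAM_CLAMP_SET = {"TRTLLM", "CUTLASS", "MARLIN"}


def resolve_clamp_set(ours: set, theirs: set) -> set:
    """Resolve the conflict by keeping every clamp-capable backend from both sides."""
    return ours | theirs


merged = resolve_clamp_set(FORK_CLAMP_SET, UPSTREAM_CLAMP_SET)

# The fork patch is redundant when upstream already contains everything it added.
fork_patch_needed = not FORK_CLAMP_SET <= UPSTREAM_CLAMP_SET
```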

Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU
clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness
(vllm-project#45956, vllm-project#46533). Not yet built/validated on GPU.

Signed-off-by: jasl <jasl9187@hotmail.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026
…m-project#46492)

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Labels

bug, needs-rebase, nvidia, ready

4 participants