[Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend #46492
Merged
Conversation
Models that set a SwiGLU clamp (swiglu_limit) -- e.g. DeepSeek-V4-Flash NVFP4 -- restrict NVFP4 MoE backend selection to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM. The TRTLLM NVFP4 fused MoE only supports Blackwell datacenter GPUs (capability family 100 / SM100), so on SM120 (e.g. RTX PRO 6000) selection fails with "No NvFp4 MoE backend supports the deployment configuration", with no eligible fallback.

FlashInferExperts (flashinfer_cutlass) already applies the clamp -- it builds gemm1_clamp_limit and passes swiglu_limit into flashinfer_cutlass_fused_moe -- and supports SM120. Add it to NVFP4_BACKENDS_WITH_CLAMP so clamped NVFP4 MoE models can run on workstation Blackwell. FLASHINFER_TRTLLM stays first in AVAILABLE_BACKENDS, so SM100 selection is unchanged. This also matches the explicit-backend error message, which already lists flashinfer_cutlass as a clamp-capable option.

Validated on RTX PRO 6000 (SM120) with DeepSeek-V4-Flash NVFP4 (TP2/EP2): the backend selects flashinfer_cutlass, the server reaches ready, and GSM8K exact_match = 0.965, confirming the clamp is numerically honored.

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
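The selection behavior described above can be illustrated with a small, self-contained sketch. This is not the vLLM implementation: the `Backend` enum, the `SUPPORTED_FAMILIES` table, and the `select_backend` helper are hypothetical names introduced only for illustration, and the assumption that flashinfer_cutlass also covers family 100 is mine, not stated in the description. Only the constant names `AVAILABLE_BACKENDS` and `NVFP4_BACKENDS_WITH_CLAMP`, the TRTLLM-first ordering, and the error string come from the description.

```python
from enum import Enum


class Backend(Enum):
    # Hypothetical enum; only the two backends discussed in this PR.
    FLASHINFER_TRTLLM = "flashinfer_trtllm"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


# Priority order: TRTLLM is tried first, as the description states.
AVAILABLE_BACKENDS = [Backend.FLASHINFER_TRTLLM, Backend.FLASHINFER_CUTLASS]

# Illustrative support table keyed by capability family.
# TRTLLM -> {100} and CUTLASS supporting 120 follow the description;
# CUTLASS supporting 100 is an assumption made for this sketch.
SUPPORTED_FAMILIES = {
    Backend.FLASHINFER_TRTLLM: {100},
    Backend.FLASHINFER_CUTLASS: {100, 120},
}

# Clamp-capable set before and after this change.
CLAMP_SET_BEFORE = frozenset({Backend.FLASHINFER_TRTLLM})
NVFP4_BACKENDS_WITH_CLAMP = frozenset(
    {Backend.FLASHINFER_TRTLLM, Backend.FLASHINFER_CUTLASS}
)


def select_backend(capability_family, has_swiglu_limit, clamp_backends):
    """Return the first backend in priority order that passes both filters."""
    for backend in AVAILABLE_BACKENDS:
        # Models with a SwiGLU clamp may only use clamp-capable backends.
        if has_swiglu_limit and backend not in clamp_backends:
            continue
        # The backend must support the GPU's capability family.
        if capability_family not in SUPPORTED_FAMILIES[backend]:
            continue
        return backend
    raise ValueError(
        "No NvFp4 MoE backend supports the deployment configuration"
    )


if __name__ == "__main__":
    # SM120 with a clamped model now resolves to flashinfer_cutlass.
    print(select_backend(120, True, NVFP4_BACKENDS_WITH_CLAMP).value)
    # SM100 still resolves to flashinfer_trtllm because it comes first.
    print(select_backend(100, True, NVFP4_BACKENDS_WITH_CLAMP).value)
```

In this sketch, passing `CLAMP_SET_BEFORE` with family 120 raises the error, which mirrors the failure the PR reports on RTX PRO 6000.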
zyongye approved these changes Jun 23, 2026
mgoin approved these changes Jun 23, 2026
Contributor
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Michael Goin <mgoin64@gmail.com>
khluu pushed a commit that referenced this pull request Jun 24, 2026
) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> (cherry picked from commit 0d4d164) Signed-off-by: khluu <khluu000@gmail.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…m-project#46492) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…m-project#46492) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 27, 2026
Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since the vllm-project#43477 merge).

6 conflicts resolved:
- oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by vllm-project#45836. No fork patch needed anymore.
- engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content alias validator + upstream's empty-tool_calls serializer).
- test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config provides static_forward_context; consistent with the other test).
- routed_experts.py: combine the two per-tensor-scale loaders -> one helper that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization.
- serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template-kwargs threading: Site A (sampling params) into ServingRender.render_chat_request via self.online_renderer attrs; Site B (prompt render) into OnlineRenderer.render_chat.
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability family 120, so SM12x stays byte-identical to the validated path. Enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment.

Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness (vllm-project#45956, vllm-project#46533).

Not yet built/validated on GPU.

Signed-off-by: jasl <jasl9187@hotmail.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026
…m-project#46492) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>