[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell #45836
Merged
mgoin
reviewed
Jun 23, 2026
…n non-Blackwell
NVFP4 MoE models that set a SwiGLU clamp (config.swiglu_limit -- e.g.
DeepSeek-V4, which swaps its activation to SiluAndMulWithClamp) fail to load on
every non-Blackwell GPU (SM80, SM89, SM90) with:
NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
raised by select_nvfp4_moe_backend() in
vllm/model_executor/layers/fused_moe/oracle/nvfp4.py.
Root cause: when config.swiglu_limit is not None the selector keeps only the
backends in NVFP4_BACKENDS_WITH_CLAMP, which contained ONLY FLASHINFER_TRTLLM
(a Blackwell/SM100-only kernel). On SM80/89/90 that leaves an empty candidate
list -> NotImplementedError. Marlin -- the one NVFP4 MoE backend that runs on
those architectures -- was excluded, even though the Marlin kernel already
implements the SwiGLU clamp. Models without a clamp are unaffected:
swiglu_limit is None, the filter is skipped, and Marlin is selected normally --
which is why unclamped NVFP4 MoE models already run on SM80/SM90 while clamped
ones (DeepSeek-V4) did not.
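
The filtering behaviour described above can be modelled in a few lines. This is a toy sketch, not the real select_nvfp4_moe_backend(): the Backend enum, the SUPPORTED_SMS table, the select_backend() helper, and the limit value 7.0 are all hypothetical and exist only to show how an empty candidate list arises.

```python
from enum import Enum, auto
from typing import Optional


class Backend(Enum):
    # Toy stand-ins for the two backend names discussed above.
    FLASHINFER_TRTLLM = auto()
    MARLIN = auto()


# Hypothetical capability table (assumption for illustration only).
SUPPORTED_SMS = {
    Backend.FLASHINFER_TRTLLM: {100},
    Backend.MARLIN: {80, 89, 90},
}


def select_backend(sm: int, swiglu_limit: Optional[float],
                   backends_with_clamp: set) -> Backend:
    # Step 1: keep the backends that run on this architecture.
    candidates = [b for b in Backend if sm in SUPPORTED_SMS[b]]
    # Step 2: the clamp filter only runs when the model sets a limit.
    if swiglu_limit is not None:
        candidates = [b for b in candidates if b in backends_with_clamp]
    if not candidates:
        raise NotImplementedError(
            "No NvFp4 MoE backend supports the deployment configuration.")
    return candidates[0]


old_set = {Backend.FLASHINFER_TRTLLM}
new_set = old_set | {Backend.MARLIN}

# Unclamped model on SM90: the filter is skipped, Marlin is chosen.
print(select_backend(90, None, old_set))
# Clamped model on SM90 with the widened set: Marlin survives the filter.
print(select_backend(90, 7.0, new_set))
```

With old_set and a non-None limit on SM90, the same call raises NotImplementedError, which is the asymmetry between clamped and unclamped models.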
The changes (3 wiring points; all required together):
1) config.py -- nvfp4_w4a16_moe_quant_config() now accepts gemm1_clamp_limit
and forwards it to FusedMoEQuantConfig.make(). This is the quant-config
constructor used by the Marlin NVFP4 path; it previously dropped the clamp,
so quant_config.gemm1_clamp_limit was always None for Marlin.
2) oracle/nvfp4.py -- make_nvfp4_moe_quant_config(), MARLIN branch: pass
gemm1_clamp_limit=swiglu_limit into nvfp4_w4a16_moe_quant_config, so the
model's clamp actually reaches the kernel.
3) oracle/nvfp4.py -- add MARLIN to NVFP4_BACKENDS_WITH_CLAMP so the selector
no longer discards Marlin when swiglu_limit is set.
Why this is correct:
The Marlin MoE kernel already implements the clamp; it was just never wired to
receive it. In experts/marlin_moe.py, MarlinExperts reads
self.gemm1_clamp_limit = quant_config.gemm1_clamp_limit and passes
clamp_limit=self.gemm1_clamp_limit into the kernel, which does
    if clamp_limit is not None and activation == MoEActivation.SILU:
        swiglu_limit_func(intermediate_cache2,
                          intermediate_cache1.view(-1, w13_num_shards * N),
                          clamp_limit)
swiglu_limit_func is the same routine backing SiluAndMulWithClamp, with
identical math: gate_clamped = min(SiLU(gate), limit);
up_clamped = min(max(up, -limit), limit); out = gate_clamped * up_clamped.
So once gemm1_clamp_limit is delivered, Marlin reproduces the model's intended
clamped SwiGLU exactly, preserving the numerical-stability guard the clamp
exists for (bounding FP4 activation outliers). Without changes (1)+(2) the
clamp would arrive as None and Marlin would silently skip it, producing
unclamped (numerically wrong) output; that is why merely adding MARLIN to the
set would be unsafe on its own, and why all three changes go together.
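
To make the clamp math concrete, here is a scalar reference in plain Python. It follows the three formulas quoted above; the function names and the limit value 7.0 are illustrative assumptions, and this is not the swiglu_limit_func kernel itself.

```python
import math


def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up, limit=None):
    """Element-wise SwiGLU; applies the clamp only when limit is given."""
    out = []
    for g, u in zip(gate, up):
        if limit is None:
            # Unclamped path: what runs if the clamp arrives as None.
            out.append(silu(g) * u)
        else:
            gate_clamped = min(silu(g), limit)
            up_clamped = min(max(u, -limit), limit)
            out.append(gate_clamped * up_clamped)
    return out


gate = [0.5, 20.0]   # second element is an outlier
up = [1.0, -50.0]

clamped = swiglu(gate, up, limit=7.0)
unclamped = swiglu(gate, up)
print(clamped, unclamped)
```

For in-range inputs the two paths agree; they differ only on outliers, which is why a silently dropped clamp would not fail loudly.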
End-to-end clamp path:
config.swiglu_limit
-> select_nvfp4_moe_backend keeps Marlin [change 3]
-> make_nvfp4_moe_quant_config(MARLIN, swiglu_limit)
-> nvfp4_w4a16_moe_quant_config(gemm1_clamp_limit=swiglu_limit) [change 2]
-> FusedMoEQuantConfig.gemm1_clamp_limit = swiglu_limit [change 1]
-> MarlinExperts.gemm1_clamp_limit
-> kernel clamp_limit -> swiglu_limit_func (clamped SwiGLU)
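
The propagation chain can be sketched with toy classes. ToyQuantConfig, ToyMarlinExperts, and the two make_config_* helpers are hypothetical stand-ins for the real vLLM types; they only illustrate why adding Marlin to the clamp set is safe once the constructor forwards the value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyQuantConfig:
    # Mirrors the field name used in the text; the class itself is a toy.
    gemm1_clamp_limit: Optional[float] = None


def make_config_before(swiglu_limit: Optional[float]) -> ToyQuantConfig:
    # Models the old behaviour: the constructor dropped the clamp.
    return ToyQuantConfig()


def make_config_after(swiglu_limit: Optional[float]) -> ToyQuantConfig:
    # Models changes (1)+(2): the clamp is forwarded into the config.
    return ToyQuantConfig(gemm1_clamp_limit=swiglu_limit)


class ToyMarlinExperts:
    def __init__(self, quant_config: ToyQuantConfig):
        # The experts object reads the clamp from the quant config.
        self.gemm1_clamp_limit = quant_config.gemm1_clamp_limit

    def clamp_applied(self) -> bool:
        # The kernel clamps only when the limit is not None.
        return self.gemm1_clamp_limit is not None


before = ToyMarlinExperts(make_config_before(7.0))
after = ToyMarlinExperts(make_config_after(7.0))
print(before.clamp_applied(), after.clamp_applied())
```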
Testing:
- Confirmed the Marlin kernel applies the clamp via swiglu_limit_func, gated on
MoEActivation.SILU; math identical to SiluAndMulWithClamp.
- Confirmed a clamped NVFP4 MoE model (DeepSeek-V4) sets config.swiglu_limit
whereas an unclamped one does not, explaining the prior non-Blackwell
asymmetry.
- Served DeepSeek-V4-Flash (NVFP4) on H100 (SM90) at TP=4 and TP=8: the model
now selects the Marlin NVFP4 MoE backend and /v1/chat/completions returns
correct results, where it previously raised the NotImplementedError above.
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Contributor
Author
Thanks @mgoin @pavanimajety — addressed:
For the record, I validated the clamp wiring out-of-tree rather than as a |
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…ped models on non-Blackwell (vllm-project#45836)
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…ped models on non-Blackwell (vllm-project#45836)
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 27, 2026
Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since the vllm-project#43477 merge).
6 conflicts resolved:
- oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by vllm-project#45836. No fork patch needed anymore.
- engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content alias validator + upstream's empty-tool_calls serializer).
- test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config provides static_forward_context; consistent with the other test).
- routed_experts.py: combine the two per-tensor-scale loaders -> one helper that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization.
- serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template-kwargs threading: Site A (sampling params) into ServingRender.render_chat_request via self.online_renderer attrs; Site B (prompt render) into OnlineRenderer.render_chat.
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability family 120, so SM12x stays byte-identical to the validated path. Enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment.
Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness (vllm-project#45956, vllm-project#46533).
Not yet built/validated on GPU.
Signed-off-by: jasl <jasl9187@hotmail.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026
…ped models on non-Blackwell (vllm-project#45836)
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Purpose
Address #45859.
NVFP4 MoE models that declare a SwiGLU clamp (config.swiglu_limit, e.g. DeepSeek-V4) cannot be served on any non-Blackwell GPU (SM80/SM89/SM90). select_nvfp4_moe_backend() raises:
NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
When swiglu_limit is set, the selector restricts candidates to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM (Blackwell/SM100-only). On SM80/89/90 that leaves no candidate. The Marlin NVFP4 MoE backend runs on those architectures and its kernel already applies the SwiGLU clamp (swiglu_limit_func), but it was excluded from the clamp-capable set and was never handed the clamp value.
This PR wires the clamp through to Marlin and allows it for clamped models on non-Blackwell:
1. nvfp4_w4a16_moe_quant_config() accepts gemm1_clamp_limit and forwards it to FusedMoEQuantConfig.make().
2. The MARLIN branch of make_nvfp4_moe_quant_config() passes gemm1_clamp_limit=swiglu_limit.
3. MARLIN is added to NVFP4_BACKENDS_WITH_CLAMP.
All three are required together. Without (1)+(2) the clamp would arrive as None and Marlin would silently skip it, producing unclamped (numerically incorrect) output, so adding MARLIN to the set alone would be unsafe. With all three, MarlinExperts receives gemm1_clamp_limit and the kernel applies swiglu_limit_func (identical math to SiluAndMulWithClamp), reproducing the intended clamped SwiGLU. Unclamped models (swiglu_limit is None) are unaffected: the filter is skipped and Marlin is selected as before.
Test Plan
- Serve an NVFP4 MoE model that sets config.swiglu_limit (DeepSeek-V4-Flash) on a non-Blackwell GPU (SM90 / H100) at TP=4 and TP=8.
- Before this change: NotImplementedError: No NvFp4 MoE backend ...
- After this change: send a /v1/chat/completions request and verify a correct response.
Sample H100 invocation:
Test Result
- Before: No NvFp4 MoE backend supports the deployment configuration.
- After: /v1/chat/completions returns 200 OK with a correct answer (e.g. "What is the capital of Japan?" → "Tokyo").
This change is self-contained (only fused_moe/config.py and fused_moe/oracle/nvfp4.py). Serving DeepSeek-V4 end-to-end additionally requires, separately from this MoE-backend fix: sharing tied weights for quantized embeddings or models that use tied embeddings, and an fp8 KV-cache path for the model's MLA attention (available on Hopper but not Ampere today).