[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell by mikekg · Pull Request #45836 · vllm-project/vllm

mikekg · 2026-06-16T15:13:23Z

Purpose

Address #45859.

NVFP4 MoE models that declare a SwiGLU clamp (config.swiglu_limit, e.g. DeepSeek-V4) cannot be served on any non-Blackwell GPU (SM80/SM89/SM90). select_nvfp4_moe_backend() raises:

NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.

When swiglu_limit is set, the selector restricts candidates to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM (Blackwell/SM100-only). On SM80/89/90 that leaves no candidate. The Marlin NVFP4 MoE backend runs on those architectures and its kernel already applies the SwiGLU clamp (swiglu_limit_func), but it was excluded from the clamp-capable set and was never handed the clamp value.

This PR wires the clamp through to Marlin and allows it for clamped models on non-Blackwell:

nvfp4_w4a16_moe_quant_config() accepts gemm1_clamp_limit and forwards it to FusedMoEQuantConfig.make().
The MARLIN branch of make_nvfp4_moe_quant_config() passes gemm1_clamp_limit=swiglu_limit.
MARLIN is added to NVFP4_BACKENDS_WITH_CLAMP.

All three are required together. Without (1)+(2) the clamp would arrive as None and Marlin would silently skip it, producing unclamped (numerically incorrect) output — so adding MARLIN to the set alone would be unsafe. With all three, MarlinExperts receives gemm1_clamp_limit and the kernel applies swiglu_limit_func (identical math to SiluAndMulWithClamp), reproducing the intended clamped SwiGLU. Unclamped models (swiglu_limit is None) are unaffected — the filter is skipped and Marlin is selected as before.

Test Plan

Serve an NVFP4 MoE checkpoint that sets config.swiglu_limit (DeepSeek-V4-Flash) on a non-Blackwell GPU (SM90 / H100) at TP=4 and TP=8.
Confirm the model now selects the Marlin NVFP4 MoE backend instead of raising NotImplementedError: No NvFp4 MoE backend ....
Issue a /v1/chat/completions request and verify a correct response.

Sample H100 invocation:

  vllm serve nvidia/DeepSeek-V4-Flash-NVFP4 \
    --tokenizer nvidia/DeepSeek-V4-Flash-NVFP4 \
    --served-model-name dsv4flash-nvfp4-sanity \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --max-model-len 16384 \
    --max-num-batched-tokens 2560 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.85 \
    --port 8652 \
    --tokenizer-mode deepseek_v4 \
    --reasoning-parser deepseek_v4 \
    --dtype auto

Test Result

Before: serving the clamped NVFP4 MoE model on non-Blackwell fails at load with No NvFp4 MoE backend supports the deployment configuration.
After: the model loads and serves on H100 (SM90) at TP=4 and TP=8; /v1/chat/completions returns 200 OK with a correct answer (e.g. "What is the capital of Japan?" → "Tokyo").

This change is self-contained (only fused_moe/config.py and fused_moe/oracle/nvfp4.py). Serving DeepSeek-V4 end-to-end additionally requires, separately from this MoE-backend fix: sharing tied weights for quantized embeddings or model that use tied embeddings, and an fp8 KV-cache path for the model's MLA attention. (Available on Hopper but not Ampere today)

…n non-Blackwell NVFP4 MoE models that set a SwiGLU clamp (config.swiglu_limit -- e.g. DeepSeek-V4, which swaps its activation to SiluAndMulWithClamp) fail to load on every non-Blackwell GPU (SM80, SM89, SM90) with: NotImplementedError: No NvFp4 MoE backend supports the deployment configuration. raised by select_nvfp4_moe_backend() in vllm/model_executor/layers/fused_moe/oracle/nvfp4.py. Root cause: when config.swiglu_limit is not None the selector keeps only the backends in NVFP4_BACKENDS_WITH_CLAMP, which contained ONLY FLASHINFER_TRTLLM (a Blackwell/SM100-only kernel). On SM80/89/90 that leaves an empty candidate list -> NotImplementedError. Marlin -- the one NVFP4 MoE backend that runs on those architectures -- was excluded, even though the Marlin kernel already implements the SwiGLU clamp. Models without a clamp are unaffected: swiglu_limit is None, the filter is skipped, and Marlin is selected normally -- which is why unclamped NVFP4 MoE models already run on SM80/SM90 while clamped ones (DeepSeek-V4) did not. The changes (3 wiring points; all required together): 1) config.py -- nvfp4_w4a16_moe_quant_config() now accepts gemm1_clamp_limit and forwards it to FusedMoEQuantConfig.make(). This is the quant-config constructor used by the Marlin NVFP4 path; it previously dropped the clamp, so quant_config.gemm1_clamp_limit was always None for Marlin. 2) oracle/nvfp4.py -- make_nvfp4_moe_quant_config(), MARLIN branch: pass gemm1_clamp_limit=swiglu_limit into nvfp4_w4a16_moe_quant_config, so the model's clamp actually reaches the kernel. 3) oracle/nvfp4.py -- add MARLIN to NVFP4_BACKENDS_WITH_CLAMP so the selector no longer discards Marlin when swiglu_limit is set. Why this is correct: The Marlin MoE kernel already implements the clamp; it was just never wired to receive it. In experts/marlin_moe.py, MarlinExperts reads self.gemm1_clamp_limit = quant_config.gemm1_clamp_limit and passes clamp_limit=self.gemm1_clamp_limit into the kernel, which does if clamp_limit is not None and activation == MoEActivation.SILU: swiglu_limit_func(intermediate_cache2, intermediate_cache1.view(-1, w13_num_shards * N), clamp_limit) swiglu_limit_func is the same routine backing SiluAndMulWithClamp, with identical math: gate_clamped = min(SiLU(gate), limit); up_clamped = min(max(up, -limit), limit); out = gate_clamped * up_clamped. So once gemm1_clamp_limit is delivered, Marlin reproduces the model's intended clamped SwiGLU exactly, preserving the numerical-stability guard the clamp exists for (bounding FP4 activation outliers). Without changes (1)+(2) the clamp would arrive as None and Marlin would silently skip it, producing unclamped (numerically wrong) output; that is why merely adding MARLIN to the set would be unsafe on its own, and why all three changes go together. End-to-end clamp path: config.swiglu_limit -> select_nvfp4_moe_backend keeps Marlin [change 3] -> make_nvfp4_moe_quant_config(MARLIN, swiglu_limit) -> nvfp4_w4a16_moe_quant_config(gemm1_clamp_limit=swiglu_limit) [change 2] -> FusedMoEQuantConfig.gemm1_clamp_limit = swiglu_limit [change 1] -> MarlinExperts.gemm1_clamp_limit -> kernel clamp_limit -> swiglu_limit_func (clamped SwiGLU) Testing: - Confirmed the Marlin kernel applies the clamp via swiglu_limit_func, gated on MoEActivation.SILU; math identical to SiluAndMulWithClamp. - Confirmed a clamped NVFP4 MoE model (DeepSeek-V4) sets config.swiglu_limit whereas an unclamped one does not, explaining the prior non-Blackwell asymmetry. - Served DeepSeek-V4-Flash (NVFP4) on H100 (SM90) at TP=4 and TP=8: the model now selects the Marlin NVFP4 MoE backend and /v1/chat/completions returns correct results, where it previously raised the NotImplementedError above. Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

mikekg · 2026-06-23T17:22:14Z

Thanks @mgoin @pavanimajety — addressed:

Dropped the unit test entirely (per @mgoin). The PR is now just the 4-line
wiring — gemm1_clamp_limit plumbing in config.py and adding MARLIN to
NVFP4_BACKENDS_WITH_CLAMP in oracle/nvfp4.py. That also moots the nvfp4_utils
note (@pavanimajety) since there's no quantizer in the diff anymore.
Removed the explanatory comment in oracle/nvfp4.py.
Rebased on main.

For the record, I validated the clamp wiring out-of-tree rather than as a
committed test: the Marlin clamp path (fused_marlin_moe(..., clamp_limit=L) →
in-kernel swiglu_limit_func, identical math to SiluAndMulWithClamp(L, alpha=1,
beta=0), matching DeepSeek-V4-Flash Expert.forward) was cross-checked at dsv4
operating points (k=4096, n=2048, topk=6; clamp ∈ {5,7,10,15}; weight_scale_2
∈ {2⁻¹³…2⁻⁹}) against a torch SiluAndMulWithClamp reference and the
FlashInfer TRT-LLM NVFP4 kernel — Marlin matched the torch reference to rel-L2
≈ 0.003–0.007 on A100/H100/B200, and ≈ 0.004 vs real dsv4 expert weights from
the checkpoint. The e2e H100 serving result is in the description.

pavanimajety

thanks, lgtm now.

mgoin

Thanks, great

…ped models on non-Blackwell (vllm-project#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

…ped models on non-Blackwell (#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com> (cherry picked from commit 0775b88) Signed-off-by: khluu <khluu000@gmail.com>

…ped models on non-Blackwell (vllm-project#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since the vllm-project#43477 merge). 6 conflicts resolved: - oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by vllm-project#45836. No fork patch needed anymore. - engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content alias validator + upstream's empty-tool_calls serializer). - test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config provides static_forward_context; consistent with the other test). - routed_experts.py: combine the two per-tensor-scale loaders -> one helper that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization. - serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template -kwargs threading: Site A (sampling params) into ServingRender.render_chat_ request via self.online_renderer attrs; Site B (prompt render) into OnlineRenderer.render_chat. - sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability family 120, so SM12x stays byte-identical to the validated path. Enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment. Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness (vllm-project#45956, vllm-project#46533). Not yet built/validated on GPU. Signed-off-by: jasl <jasl9187@hotmail.com>

…ped models on non-Blackwell (vllm-project#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

mikekg requested review from mgoin, pavanimajety and zyongye as code owners June 16, 2026 15:13

mikekg changed the title ~~[NVFP4 MoE] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell~~ Jun 16, 2026

mikekg mentioned this pull request Jun 16, 2026

[Feature]: Wire swiglu_limit from model config into MoE kernel dispatch for DeepSeek V4 and other models #45859

Open

mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch from 1b2aab7 to 96e98a7 Compare June 16, 2026 23:56

mikekg requested review from AndreasKaratzas, WoosukKwon, tlrmchlsmth and yewentao256 as code owners June 17, 2026 22:29

mikekg mentioned this pull request Jun 17, 2026

[NVFP4 MoE/DSV4] Test: Marlin NVFP4 MoE init + clamp on real DeepSeek-V4-Flash weights #45968

Closed

mikekg changed the title ~~[NVFP4 MoE/DSV4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell~~ Jun 18, 2026

mergify Bot added the deepseek Related to DeepSeek models label Jun 18, 2026

mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch 4 times, most recently from 15b88c3 to 68fdc08 Compare June 23, 2026 03:50

pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 23, 2026

pavanimajety reviewed Jun 23, 2026

View reviewed changes

Comment thread tests/kernels/moe/test_marlin_nvfp4_swiglu_clamp.py Outdated

mgoin reviewed Jun 23, 2026

View reviewed changes

Comment thread tests/kernels/moe/test_marlin_nvfp4_swiglu_clamp.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/oracle/nvfp4.py Outdated

mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch from 68fdc08 to efa1d03 Compare June 23, 2026 17:18

pavanimajety approved these changes Jun 23, 2026

View reviewed changes

mgoin approved these changes Jun 23, 2026

View reviewed changes

vllm-bot merged commit 0775b88 into vllm-project:main Jun 23, 2026
92 of 95 checks passed

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clam…

69c6625

…ped models on non-Blackwell (vllm-project#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

zyongye added this to the v0.24.0 cherrypick milestone Jun 24, 2026

wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clam…

12b115b

…ped models on non-Blackwell (vllm-project#45836) Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

This was referenced Jun 30, 2026

[Backport][NVFP4] ds4-sm120-* hardcodes Mxfp4MoEMethod — DeepSeek-V4-Flash-NVFP4 fails to load on SM120 (works via ModelOpt→Marlin) jasl/vllm#24

Open

[Bug]: NotImplementedError: No NvFp4 MoE backend supports the deployment configuration. #45661

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell#45836

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell#45836
vllm-bot merged 1 commit into
vllm-project:mainfrom
mikekg:nvfp4-moe-marlin-swiglu-clamp

mikekg commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

mikekg commented Jun 23, 2026

pavanimajety left a comment

mgoin left a comment

Uh oh!

Labels

5 participants

Uh oh!

Uh oh!

Conversation

mikekg commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

Uh oh!

Uh oh!

mikekg commented Jun 23, 2026

pavanimajety left a comment

Choose a reason for hiding this comment

mgoin left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

5 participants

mikekg commented Jun 16, 2026 •

edited

Loading