Skip to content

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell#45836

Merged
vllm-bot merged 1 commit into
vllm-project:mainfrom
mikekg:nvfp4-moe-marlin-swiglu-clamp
Jun 23, 2026
Merged

[NVFP4 MoE/Deepseek V4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell#45836
vllm-bot merged 1 commit into
vllm-project:mainfrom
mikekg:nvfp4-moe-marlin-swiglu-clamp

Conversation

@mikekg

@mikekg mikekg commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Purpose

Address #45859.

NVFP4 MoE models that declare a SwiGLU clamp (config.swiglu_limit, e.g. DeepSeek-V4) cannot be served on any non-Blackwell GPU (SM80/SM89/SM90). select_nvfp4_moe_backend() raises:

NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.

When swiglu_limit is set, the selector restricts candidates to NVFP4_BACKENDS_WITH_CLAMP, which previously contained only FLASHINFER_TRTLLM (Blackwell/SM100-only). On SM80/89/90 that leaves no candidate. The Marlin NVFP4 MoE backend runs on those architectures and its kernel already applies the SwiGLU clamp (swiglu_limit_func), but it was excluded from the clamp-capable set and was never handed the clamp value.

This PR wires the clamp through to Marlin and allows it for clamped models on non-Blackwell:

  1. nvfp4_w4a16_moe_quant_config() accepts gemm1_clamp_limit and forwards it to FusedMoEQuantConfig.make().
  2. The MARLIN branch of make_nvfp4_moe_quant_config() passes gemm1_clamp_limit=swiglu_limit.
  3. MARLIN is added to NVFP4_BACKENDS_WITH_CLAMP.

All three are required together. Without (1)+(2) the clamp would arrive as None and Marlin would silently skip it, producing unclamped (numerically incorrect) output — so adding MARLIN to the set alone would be unsafe. With all three, MarlinExperts receives gemm1_clamp_limit and the kernel applies swiglu_limit_func (identical math to SiluAndMulWithClamp), reproducing the intended clamped SwiGLU. Unclamped models (swiglu_limit is None) are unaffected — the filter is skipped and Marlin is selected as before.

Test Plan

  • Serve an NVFP4 MoE checkpoint that sets config.swiglu_limit (DeepSeek-V4-Flash) on a non-Blackwell GPU (SM90 / H100) at TP=4 and TP=8.
  • Confirm the model now selects the Marlin NVFP4 MoE backend instead of raising NotImplementedError: No NvFp4 MoE backend ....
  • Issue a /v1/chat/completions request and verify a correct response.

Sample H100 invocation:

  vllm serve nvidia/DeepSeek-V4-Flash-NVFP4 \
    --tokenizer nvidia/DeepSeek-V4-Flash-NVFP4 \
    --served-model-name dsv4flash-nvfp4-sanity \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --max-model-len 16384 \
    --max-num-batched-tokens 2560 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.85 \
    --port 8652 \
    --tokenizer-mode deepseek_v4 \
    --reasoning-parser deepseek_v4 \
    --dtype auto

Test Result

  • Before: serving the clamped NVFP4 MoE model on non-Blackwell fails at load with No NvFp4 MoE backend supports the deployment configuration.
  • After: the model loads and serves on H100 (SM90) at TP=4 and TP=8; /v1/chat/completions returns 200 OK with a correct answer (e.g. "What is the capital of Japan?""Tokyo").

This change is self-contained (only fused_moe/config.py and fused_moe/oracle/nvfp4.py). Serving DeepSeek-V4 end-to-end additionally requires, separately from this MoE-backend fix: sharing tied weights for quantized embeddings or model that use tied embeddings, and an fp8 KV-cache path for the model's MLA attention. (Available on Hopper but not Ampere today)

@mikekg mikekg changed the title [NVFP4 MoE] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell Jun 16, 2026
@mikekg mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch from 1b2aab7 to 96e98a7 Compare June 16, 2026 23:56
@mikekg mikekg changed the title [NVFP4 MoE/DSV4] Marlin: wire SwiGLU clamp + allow it for clamped models on non-Blackwell Jun 18, 2026
@mergify mergify Bot added the deepseek Related to DeepSeek models label Jun 18, 2026
@mikekg mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch 4 times, most recently from 15b88c3 to 68fdc08 Compare June 23, 2026 03:50
@pavanimajety pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 23, 2026
Comment thread tests/kernels/moe/test_marlin_nvfp4_swiglu_clamp.py Outdated
Comment thread tests/kernels/moe/test_marlin_nvfp4_swiglu_clamp.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/oracle/nvfp4.py Outdated
…n non-Blackwell

NVFP4 MoE models that set a SwiGLU clamp (config.swiglu_limit -- e.g.
DeepSeek-V4, which swaps its activation to SiluAndMulWithClamp) fail to load on
every non-Blackwell GPU (SM80, SM89, SM90) with:

    NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.

raised by select_nvfp4_moe_backend() in
vllm/model_executor/layers/fused_moe/oracle/nvfp4.py.

Root cause: when config.swiglu_limit is not None the selector keeps only the
backends in NVFP4_BACKENDS_WITH_CLAMP, which contained ONLY FLASHINFER_TRTLLM
(a Blackwell/SM100-only kernel). On SM80/89/90 that leaves an empty candidate
list -> NotImplementedError. Marlin -- the one NVFP4 MoE backend that runs on
those architectures -- was excluded, even though the Marlin kernel already
implements the SwiGLU clamp. Models without a clamp are unaffected:
swiglu_limit is None, the filter is skipped, and Marlin is selected normally --
which is why unclamped NVFP4 MoE models already run on SM80/SM90 while clamped
ones (DeepSeek-V4) did not.

The changes (3 wiring points; all required together):

1) config.py -- nvfp4_w4a16_moe_quant_config() now accepts gemm1_clamp_limit
   and forwards it to FusedMoEQuantConfig.make(). This is the quant-config
   constructor used by the Marlin NVFP4 path; it previously dropped the clamp,
   so quant_config.gemm1_clamp_limit was always None for Marlin.

2) oracle/nvfp4.py -- make_nvfp4_moe_quant_config(), MARLIN branch: pass
   gemm1_clamp_limit=swiglu_limit into nvfp4_w4a16_moe_quant_config, so the
   model's clamp actually reaches the kernel.

3) oracle/nvfp4.py -- add MARLIN to NVFP4_BACKENDS_WITH_CLAMP so the selector
   no longer discards Marlin when swiglu_limit is set.

Why this is correct:

The Marlin MoE kernel already implements the clamp; it was just never wired to
receive it. In experts/marlin_moe.py, MarlinExperts reads
self.gemm1_clamp_limit = quant_config.gemm1_clamp_limit and passes
clamp_limit=self.gemm1_clamp_limit into the kernel, which does

    if clamp_limit is not None and activation == MoEActivation.SILU:
        swiglu_limit_func(intermediate_cache2,
                          intermediate_cache1.view(-1, w13_num_shards * N),
                          clamp_limit)

swiglu_limit_func is the same routine backing SiluAndMulWithClamp, with
identical math: gate_clamped = min(SiLU(gate), limit);
up_clamped = min(max(up, -limit), limit); out = gate_clamped * up_clamped.
So once gemm1_clamp_limit is delivered, Marlin reproduces the model's intended
clamped SwiGLU exactly, preserving the numerical-stability guard the clamp
exists for (bounding FP4 activation outliers). Without changes (1)+(2) the
clamp would arrive as None and Marlin would silently skip it, producing
unclamped (numerically wrong) output; that is why merely adding MARLIN to the
set would be unsafe on its own, and why all three changes go together.

End-to-end clamp path:

  config.swiglu_limit
   -> select_nvfp4_moe_backend keeps Marlin                          [change 3]
   -> make_nvfp4_moe_quant_config(MARLIN, swiglu_limit)
        -> nvfp4_w4a16_moe_quant_config(gemm1_clamp_limit=swiglu_limit)  [change 2]
             -> FusedMoEQuantConfig.gemm1_clamp_limit = swiglu_limit      [change 1]
   -> MarlinExperts.gemm1_clamp_limit
   -> kernel clamp_limit -> swiglu_limit_func (clamped SwiGLU)

Testing:
- Confirmed the Marlin kernel applies the clamp via swiglu_limit_func, gated on
  MoEActivation.SILU; math identical to SiluAndMulWithClamp.
- Confirmed a clamped NVFP4 MoE model (DeepSeek-V4) sets config.swiglu_limit
  whereas an unclamped one does not, explaining the prior non-Blackwell
  asymmetry.
- Served DeepSeek-V4-Flash (NVFP4) on H100 (SM90) at TP=4 and TP=8: the model
  now selects the Marlin NVFP4 MoE backend and /v1/chat/completions returns
  correct results, where it previously raised the NotImplementedError above.

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
@mikekg mikekg force-pushed the nvfp4-moe-marlin-swiglu-clamp branch from 68fdc08 to efa1d03 Compare June 23, 2026 17:18
@mikekg

mikekg commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @mgoin @pavanimajety — addressed:

  • Dropped the unit test entirely (per @mgoin). The PR is now just the 4-line
    wiring — gemm1_clamp_limit plumbing in config.py and adding MARLIN to
    NVFP4_BACKENDS_WITH_CLAMP in oracle/nvfp4.py. That also moots the nvfp4_utils
    note (@pavanimajety) since there's no quantizer in the diff anymore.
  • Removed the explanatory comment in oracle/nvfp4.py.
  • Rebased on main.

For the record, I validated the clamp wiring out-of-tree rather than as a
committed test: the Marlin clamp path (fused_marlin_moe(..., clamp_limit=L) →
in-kernel swiglu_limit_func, identical math to SiluAndMulWithClamp(L, alpha=1,
beta=0), matching DeepSeek-V4-Flash Expert.forward) was cross-checked at dsv4
operating points (k=4096, n=2048, topk=6; clamp ∈ {5,7,10,15}; weight_scale_2
∈ {2⁻¹³…2⁻⁹}) against a torch SiluAndMulWithClamp reference and the
FlashInfer TRT-LLM NVFP4 kernel — Marlin matched the torch reference to rel-L2
≈ 0.003–0.007 on A100/H100/B200, and ≈ 0.004 vs real dsv4 expert weights from
the checkpoint. The e2e H100 serving result is in the description.

@pavanimajety pavanimajety left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, lgtm now.

@mgoin mgoin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great

@vllm-bot vllm-bot merged commit 0775b88 into vllm-project:main Jun 23, 2026
92 of 95 checks passed
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…ped models on non-Blackwell (vllm-project#45836)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
@zyongye zyongye added this to the v0.24.0 cherrypick milestone Jun 24, 2026
khluu pushed a commit that referenced this pull request Jun 25, 2026
…ped models on non-Blackwell (#45836)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
(cherry picked from commit 0775b88)

Signed-off-by: khluu <khluu000@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…ped models on non-Blackwell (vllm-project#45836)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 27, 2026
Sync the DeepSeek-V4 SM12x PR branch onto upstream/main (198 commits since
the vllm-project#43477 merge). 6 conflicts resolved:

- oracle/nvfp4.py: union the clamp set -> {TRTLLM, CUTLASS, MARLIN}. Our
  FLASHINFER_CUTLASS clamp fix landed upstream as vllm-project#46492; MARLIN added by
  vllm-project#45836. No fork patch needed anymore.
- engine/protocol.py: keep both DeltaMessage hooks (our reasoning_content
  alias validator + upstream's empty-tool_calls serializer).
- test_deepseek_v4_mega_moe.py: keep CompilationConfig() fixture (real config
  provides static_forward_context; consistent with the other test).
- routed_experts.py: combine the two per-tensor-scale loaders -> one helper
  that does our e8m0 bitwise view AND upstream's 0-D/shape-(1,) normalization.
- serve/render/serving.py + renderers/online_renderer.py: upstream vllm-project#44285 split
  ServingRender into renderer+entrypoint. Re-home our DSv4 thinking->template
  -kwargs threading: Site A (sampling params) into ServingRender.render_chat_
  request via self.online_renderer attrs; Site B (prompt render) into
  OnlineRenderer.render_chat.
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent_topk path
  (logits_width) and add upstream's cooperative_topk gated to EXCLUDE capability
  family 120, so SM12x stays byte-identical to the validated path. Enabling
  cooperative_topk on SM12x is a separate, to-be-validated perf experiment.

Inherited for free: deepseek_v2 clone removal (vllm-project#46651), NVFP4 Marlin SwiGLU
clamp (vllm-project#45836), sampler int32-overflow fix (vllm-project#46560), spec-decode correctness
(vllm-project#45956, vllm-project#46533). Not yet built/validated on GPU.

Signed-off-by: jasl <jasl9187@hotmail.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026
…ped models on non-Blackwell (vllm-project#45836)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed

5 participants