[Bugfix] Fix NVFP4/OCP MX MoE emulation by mawong-amd · Pull Request #46254 · vllm-project/vllm

mawong-amd · 2026-06-20T23:28:54Z

Purpose

This PR fixes an issue in the NVFP4/OCP MX MoE emulation code paths caused by #42120.
In that PR, TritonExperts.apply in vllm/model_executor/layers/fused_moe/experts/triton_moe.py was modified to call moe_kernel_quantize_input on the activations when TritonExperts.expects_unquantized_inputs is True. However, this flag is preset to True for the NVFP4 and OCP MX emulation code paths, which already call moe_kernel_quantize_input on the activations before entering TritonExperts.apply. The end result is that moe_kernel_quantize_input is erroneously called twice on the activations when NVFP4/OCP MX emulation is active.
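
To make the failure mode concrete, here is a minimal sketch of the double-quantization pattern in plain Python. It does not reproduce vLLM code: toy_quantize, ToyExperts, and the 4-bit integer grid are hypothetical stand-ins for moe_kernel_quantize_input and TritonExperts, and only the expects_unquantized_inputs flag name is taken from the description above.

```python
from dataclasses import dataclass


def toy_quantize(values, scale):
    """Hypothetical quantizer: divide by scale, round, clamp to a signed 4-bit grid."""
    return [max(-8, min(7, round(v / scale))) for v in values]


@dataclass
class ToyExperts:
    """Hypothetical stand-in for an experts class that may quantize its inputs."""

    expects_unquantized_inputs: bool
    scale: float = 0.5

    def apply(self, activations):
        # Quantize inside apply only when the flag says the inputs are still raw.
        if self.expects_unquantized_inputs:
            activations = toy_quantize(activations, self.scale)
        return activations


raw = [0.3, 1.2, -2.6, 3.4]

# Intended behavior: the activations are quantized exactly once.
quantized_once = ToyExperts(expects_unquantized_inputs=True).apply(raw)

# Bug pattern: the caller has already quantized, but the flag is still True,
# so apply quantizes the already-quantized values a second time.
pre_quantized = toy_quantize(raw, 0.5)
quantized_twice = ToyExperts(expects_unquantized_inputs=True).apply(pre_quantized)

print(quantized_once)   # [1, 2, -5, 7]
print(quantized_twice)  # [2, 4, -8, 7]
```

In this toy setup the second pass rescales and clamps values that were already on the quantized grid, so the two results differ. The sketch only illustrates why exactly one of the two call sites should quantize; it does not show which one this PR changes.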

Test Plan

NVFP4 emulation is tested by the following command:
pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-mi3xx.txt
which loads an NVFP4-quantized model (nvidia/Qwen3-30B-A3B-FP4) and runs it in emulation mode on an AMD gfx942-based machine. This is run as part of AMD CI in the AMD: LM Eval Large Models (H200) (mi300_8) test group.

Test Result

The above test group passes.

cc: @AndreasKaratzas. Also @fxmarty-amd, who noticed the same errors and has pending fixes for them in #46142 (for OCP MX) and #44667 (for NVFP4).


Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>


AndreasKaratzas

LGTM

Thank you for fixing that issue.


fxmarty-amd · 2026-06-22T14:11:17Z

@AndreasKaratzas @mawong-amd I think this PR is not sufficient, especially regarding:

vllm/vllm/model_executor/layers/fused_moe/experts/triton_moe.py

Line 340 in 6871738

a1q_scale if a1q_scale is not None else self.a1_scale,

See the comment about it at #44667 (comment)

fxmarty-amd · 2026-06-22T14:11:48Z

The PR at #46142 should fix that (self.a1_scale being wrongly used in the emulation case).


fxmarty-amd · 2026-06-25T15:04:15Z

            hidden_states, a1q_scale = moe_kernel_quantize_input(
                hidden_states,
-                self.a1_scale,
+                self.a1_scale or self.a1_gscale,


Results in:

(EngineCore pid=3325911)   File "/felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/experts/triton_moe.py", line 248, in apply
(EngineCore pid=3325911)     self.a1_scale or self.a1_gscale,
(EngineCore pid=3325911)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3325911) torch.AcceleratorError: HIP error: operation not permitted when stream is capturing
(EngineCore pid=3325911) Search for `hipErrorStreamCaptureUnsupported' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
(EngineCore pid=3325911) HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=3325911) For debugging consider passing AMD_SERIALIZE_KERNEL=3
(EngineCore pid=3325911) Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
(EngineCore pid=3325911) [rank0]:[W625 14:59:32.481659036 ProcessGroupNCCL.cpp:1554] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
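
A plausible reading of this traceback, not confirmed in this thread, is that Python's `or` evaluates the truthiness of its left operand. For a tensor living on the device, that check needs the value on the host, which is a synchronizing operation that is not permitted while a stream is being captured. The sketch below shows only the Python-level behavior, using a hypothetical FakeDeviceScalar class in place of a real tensor. The explicit `is not None` form mirrors the expression quoted earlier from triton_moe.py and never triggers a truthiness check.

```python
class FakeDeviceScalar:
    """Hypothetical stand-in for a device tensor; counts truthiness checks."""

    def __init__(self, value):
        self.value = value
        self.bool_calls = 0

    def __bool__(self):
        # With a real device tensor, this is where the value would have to be
        # read back to the host (an assumption about the failure above).
        self.bool_calls += 1
        return self.value != 0


def pick_with_or(scale, fallback):
    # `or` calls bool(scale) to decide which operand to return.
    return scale or fallback


def pick_with_none_check(scale, fallback):
    # An identity comparison against None never calls __bool__.
    return scale if scale is not None else fallback


a1_scale = FakeDeviceScalar(0.0)
a1_gscale = FakeDeviceScalar(2.0)

chosen_or = pick_with_or(a1_scale, a1_gscale)
calls_after_or = a1_scale.bool_calls

chosen_none = pick_with_none_check(a1_scale, a1_gscale)
calls_after_none_check = a1_scale.bool_calls

print(calls_after_or)            # 1
print(chosen_or is a1_gscale)    # True
print(calls_after_none_check)    # 1
print(chosen_none is a1_scale)   # True
```

Besides the extra truthiness check, the `or` form in this toy also discards a scale whose value happens to be zero, whereas the `is not None` form falls back only when the scale is actually missing.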


Fix NVFP4 MoE emulation's A1 quantization

f35ed3d

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

mawong-amd requested review from mgoin, pavanimajety and zyongye as code owners June 20, 2026 23:28

mawong-amd changed the title ~~[] Fix NVFP4 MoE emulation's A1 quantization~~ Jun 20, 2026

Fix for OCP MX MoE emulation as well

05064c1

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

mergify Bot added the bug Something isn't working label Jun 20, 2026

mawong-amd changed the title ~~[Bugfix] Fix NVFP4/OCP MX MoE emulation activation quantization~~ Jun 20, 2026

mawong-amd force-pushed the mawong/fix-nvfp4-lm-eval-large-models branch from 750426e to 05064c1 Compare June 20, 2026 23:31

AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 20, 2026

Limit Quark compilation to gfx942 in LM Eval Large Models mirror to s…

afd387a

…ave time Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

mawong-amd requested review from Harry-Chen and khluu as code owners June 21, 2026 00:23

mergify Bot added the ci/build label Jun 21, 2026

AndreasKaratzas approved these changes Jun 21, 2026


AndreasKaratzas merged commit a346d58 into vllm-project:main Jun 21, 2026
90 checks passed

mawong-amd deleted the mawong/fix-nvfp4-lm-eval-large-models branch June 21, 2026 05:34

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix] Fix NVFP4/OCP MX MoE emulation (vllm-project#46254)

52af46d

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix] Fix NVFP4/OCP MX MoE emulation (vllm-project#46254)

46e992e

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

fxmarty-amd reviewed Jun 25, 2026


fxmarty-amd mentioned this pull request Jun 25, 2026

[CI] Fix failing CUDA graph capture in Triton MOE #46735

Merged

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026

[Bugfix] Fix NVFP4/OCP MX MoE emulation (vllm-project#46254)

c9703ca

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

AndreasKaratzas merged 3 commits into vllm-project:main from mawong-amd:mawong/fix-nvfp4-lm-eval-large-models
