[Bugfix] Fix NVFP4/OCP MX MoE emulation#46254

Merged
AndreasKaratzas merged 3 commits into
vllm-project:mainfrom
mawong-amd:mawong/fix-nvfp4-lm-eval-large-models
Jun 21, 2026
@mawong-amd mawong-amd commented Jun 20, 2026

Contributor

Purpose

This PR fixes a bug in the NVFP4/OCP MX MoE emulation code paths introduced by #42120.
That PR modified TritonExperts.apply in vllm/model_executor/layers/fused_moe/experts/triton_moe.py to call moe_kernel_quantize_input on the activations when TritonExperts.expects_unquantized_inputs is True. However, this flag is preset to True for the NVFP4 and OCP MX emulation code paths, which already call moe_kernel_quantize_input on the activations before entering TritonExperts.apply. As a result, moe_kernel_quantize_input is erroneously called twice on the activations whenever NVFP4/OCP MX emulation is active.
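
The failure mode described above can be illustrated with a toy sketch. Everything here is hypothetical: `toy_quantize`, `toy_dequantize`, and `experts_apply` are simplified stand-ins, not vLLM's `moe_kernel_quantize_input` or `TritonExperts.apply`, and the sketch does not claim to show how this PR resolves the problem. It only shows why quantizing already-quantized activations yields wrong values.

```python
def toy_quantize(values, scale):
    """Toy quantizer (hypothetical): divide by the scale and round to an integer."""
    return [round(v / scale) for v in values]


def toy_dequantize(quantized, scale):
    """Inverse of toy_quantize, up to rounding error."""
    return [q * scale for q in quantized]


def experts_apply(hidden, scale, expects_unquantized_inputs):
    """Toy stand-in for an experts kernel that may quantize its own input."""
    if expects_unquantized_inputs:
        hidden = toy_quantize(hidden, scale)
    return toy_dequantize(hidden, scale)


activations = [0.5, 1.0, 2.0]
scale = 0.25

# Emulation-style caller: quantizes first, then enters the experts kernel.
pre_quantized = toy_quantize(activations, scale)

# Flag left at True: the already-quantized input is quantized a second time.
double_quantized = experts_apply(pre_quantized, scale, expects_unquantized_inputs=True)

# Flag matches the actual input: the kernel skips its own quantization.
single_quantized = experts_apply(pre_quantized, scale, expects_unquantized_inputs=False)
```

In this toy, `single_quantized` recovers the original activations, while `double_quantized` is off by a factor of `1 / scale`.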

Test Plan

NVFP4 emulation is tested with the following command:
pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-mi3xx.txt
This loads an NVFP4-quantized model (nvidia/Qwen3-30B-A3B-FP4) and runs it in emulation mode on an AMD gfx942-based machine. It runs as part of AMD CI in the AMD: LM Eval Large Models (H200) (mi300_8) test group.

Test Result

The above test group passes.

cc: @AndreasKaratzas. Also @fxmarty-amd, who noticed the same errors and has pending fixes for them in #46142 (for OCP MX) and #44667 (for NVFP4).


Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
@mawong-amd mawong-amd changed the title [] Fix NVFP4 MoE emulation's A1 quantization Jun 20, 2026
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
@mergify mergify Bot added the bug Something isn't working label Jun 20, 2026
@mawong-amd mawong-amd changed the title [Bugfix] Fix NVFP4/OCP MX MoE emulation activation quantization Jun 20, 2026
@mawong-amd mawong-amd force-pushed the mawong/fix-nvfp4-lm-eval-large-models branch from 750426e to 05064c1 Compare June 20, 2026 23:31
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 20, 2026
…ave time

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
@mergify mergify Bot added the ci/build label Jun 21, 2026

@AndreasKaratzas AndreasKaratzas left a comment

Member

LGTM

Thank you for fixing that issue.

@AndreasKaratzas AndreasKaratzas merged commit a346d58 into vllm-project:main Jun 21, 2026
90 checks passed
@mawong-amd mawong-amd deleted the mawong/fix-nvfp4-lm-eval-large-models branch June 21, 2026 05:34
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
@fxmarty-amd

Contributor

@AndreasKaratzas @mawong-amd I think this PR is not sufficient, especially regarding:

a1q_scale if a1q_scale is not None else self.a1_scale,

See the comment about it at #44667 (comment)

@fxmarty-amd

Contributor

The PR at #46142 should fix that (self.a1_scale being wrongly used in the emulation case).

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
hidden_states, a1q_scale = moe_kernel_quantize_input(
hidden_states,
self.a1_scale,
self.a1_scale or self.a1_gscale,

Contributor

Results in:

(EngineCore pid=3325911)   File "/felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/experts/triton_moe.py", line 248, in apply
(EngineCore pid=3325911)     self.a1_scale or self.a1_gscale,
(EngineCore pid=3325911)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3325911) torch.AcceleratorError: HIP error: operation not permitted when stream is capturing
(EngineCore pid=3325911) Search for `hipErrorStreamCaptureUnsupported' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
(EngineCore pid=3325911) HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=3325911) For debugging consider passing AMD_SERIALIZE_KERNEL=3
(EngineCore pid=3325911) Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
(EngineCore pid=3325911)
[rank0]:[W625 14:59:32.481659036 ProcessGroupNCCL.cpp:1554] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
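
A plausible reading of the traceback above is that `self.a1_scale or self.a1_gscale` forces Python to evaluate the truthiness of the first operand. For a GPU tensor that requires copying a value to the host, which is a synchronizing operation and is not allowed while a stream is being captured. The sketch below reproduces the pattern without a GPU. `FakeDeviceTensor` and both `pick_*` helpers are hypothetical stand-ins, not vLLM or PyTorch code. The `is not None` form mirrors the `a1q_scale if a1q_scale is not None else self.a1_scale` expression quoted earlier in this thread.

```python
class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor while a graph is being captured."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __bool__(self) -> bool:
        # A real GPU tensor would have to copy its value to the host here,
        # which synchronizes the stream; we simulate the resulting failure.
        raise RuntimeError("operation not permitted when stream is capturing")


def pick_with_or(a1_scale, a1_gscale):
    # `or` calls bool(a1_scale) before deciding which operand to return.
    return a1_scale or a1_gscale


def pick_with_none_check(a1_scale, a1_gscale):
    # An identity comparison against None never inspects tensor contents.
    return a1_scale if a1_scale is not None else a1_gscale


scale = FakeDeviceTensor("a1_scale")
gscale = FakeDeviceTensor("a1_gscale")

try:
    pick_with_or(scale, gscale)
    or_failed = False
except RuntimeError:
    or_failed = True

chosen = pick_with_none_check(scale, gscale)
fallback = pick_with_none_check(None, gscale)
```

Beyond the capture problem, `or` also changes semantics: a present tensor whose value happens to be falsy would be silently replaced by the fallback, whereas the explicit `None` check only falls back when the scale is actually absent.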
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>