[Bugfix] Fix NVFP4/OCP MX MoE emulation #46254
Merged
AndreasKaratzas merged 3 commits on Jun 21, 2026
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Force-pushed from 750426e to 05064c1
…ave time Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
AndreasKaratzas (Member) approved these changes on Jun 21, 2026:
LGTM
Thank you for fixing that issue.
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Jun 22, 2026 (Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>)
Contributor
@AndreasKaratzas @mawong-amd I think this PR is not sufficient, especially regarding: See the comment about it at #44667 (comment)
Contributor
The PR at #46142 should fix that (
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request on Jun 24, 2026 (Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>)
fxmarty-amd (Contributor) reviewed on Jun 25, 2026:
    hidden_states, a1q_scale = moe_kernel_quantize_input(
        hidden_states,
        self.a1_scale,
        self.a1_scale or self.a1_gscale,
Results in:
(EngineCore pid=3325911) File "/felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/experts/triton_moe.py", line 248, in apply
(EngineCore pid=3325911) self.a1_scale or self.a1_gscale,
(EngineCore pid=3325911) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3325911) torch.AcceleratorError: HIP error: operation not permitted when stream is capturing
(EngineCore pid=3325911) Search for `hipErrorStreamCaptureUnsupported' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
(EngineCore pid=3325911) HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=3325911) For debugging consider passing AMD_SERIALIZE_KERNEL=3
(EngineCore pid=3325911) Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
(EngineCore pid=3325911)
[rank0]:[W625 14:59:32.481659036 ProcessGroupNCCL.cpp:1554] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
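A possible explanation for this traceback, offered as an assumption since the thread does not state the cause: in Python, `a or b` first evaluates the truthiness of `a`. For a tensor that lives on a GPU, answering that question means reading its value back to the host, which is a synchronizing operation of the kind that stream capture rejects. The sketch below uses a hypothetical `FakeDeviceTensor` class (not a torch type) whose `__bool__` raises while "capturing", and compares an `or` fallback with an `is not None` fallback. Both helper functions are illustrative names, not vLLM code.

```python
class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor; not a torch class."""

    def __init__(self, value, capturing=False):
        self.value = value
        self.capturing = capturing

    def __bool__(self):
        # A real GPU tensor has to copy its value to the host to answer this,
        # which synchronizes the stream. We model that as an error during capture.
        if self.capturing:
            raise RuntimeError("operation not permitted when stream is capturing")
        return bool(self.value)


def pick_scale_with_or(a1_scale, a1_gscale):
    # Truthiness-based fallback: evaluates bool(a1_scale) when it is not None.
    return a1_scale or a1_gscale


def pick_scale_with_none_check(a1_scale, a1_gscale):
    # Identity-based fallback: never inspects the tensor's contents.
    return a1_scale if a1_scale is not None else a1_gscale


scale = FakeDeviceTensor(0.5, capturing=True)
gscale = FakeDeviceTensor(2.0, capturing=True)

# The None check works during capture and returns the first scale unchanged.
chosen = pick_scale_with_none_check(scale, gscale)

# The `or` form triggers __bool__ on the first scale and fails during capture.
try:
    pick_scale_with_or(scale, gscale)
    or_failed = False
except RuntimeError:
    or_failed = True
```

Beyond the capture problem, the `or` form would also silently fall through to the second scale whenever the first one happens to evaluate as false (for example, a scalar zero), which is a separate reason to prefer an explicit `None` check in this kind of fallback.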
qli88 pushed a commit to qli88/vllm that referenced this pull request on Jun 26, 2026 (Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>, Qiang Li <qiang.li2@amd.com>)
Purpose
This PR fixes an issue in the NVFP4/OCP MX MoE emulation code paths caused by #42120.
In that PR, `TritonExperts.apply` in `vllm/model_executor/layers/fused_moe/experts/triton_moe.py` was modified to call `moe_kernel_quantize_input` on the activations if `TritonExperts.expects_unquantized_inputs == True`. However, this flag is pre-set to `True` for the NVFP4 and OCP MX emulation code paths, which also call `moe_kernel_quantize_input` on activations before entering `TritonExperts.apply`. The end result is that `moe_kernel_quantize_input` is erroneously called twice on activations when NVFP4/OCP MX emulation is active.
Test Plan
NVFP4 emulation is tested by the following command:
    pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-mi3xx.txt
which loads an NVFP4-quantized model (`nvidia/Qwen3-30B-A3B-FP4`) and runs it in emulation mode on an AMD `gfx942`-based machine. This is run as part of AMD CI in the `AMD: LM Eval Large Models (H200) (mi300_8)` test group.
Test Result
The above test group passes.
cc: @AndreasKaratzas. Also @fxmarty-amd, who noticed the same errors and has pending fixes for them in #46142 (for OCP MX) and #44667 (for NVFP4).
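To make the double-quantization problem from the Purpose section concrete, here is a toy sketch. It is not vLLM code: `toy_quantize`, `ToyExperts`, and `emulation_path` are hypothetical names, and the "quantization" is a plain divide-and-round chosen only so that applying it twice is visibly different from applying it once. It does not reproduce the actual fix in this PR.

```python
def toy_quantize(values, scale):
    # Toy stand-in for a quantization step: divide by the scale and round.
    return [round(v / scale) for v in values]


class ToyExperts:
    """Hypothetical experts object that may quantize its own inputs."""

    def __init__(self, expects_unquantized_inputs):
        self.expects_unquantized_inputs = expects_unquantized_inputs

    def apply(self, values, scale):
        # Mirrors the behavior described above: quantize inside apply
        # whenever the flag says the inputs arrive unquantized.
        if self.expects_unquantized_inputs:
            values = toy_quantize(values, scale)
        return values


def emulation_path(values, scale, experts):
    # The emulation path quantizes before calling apply, so a True flag
    # on the experts object leads to a second quantization.
    quantized = toy_quantize(values, scale)
    return experts.apply(quantized, scale)


activations = [0.5, 1.0, 1.5]
scale = 0.5

once = emulation_path(activations, scale, ToyExperts(expects_unquantized_inputs=False))
twice = emulation_path(activations, scale, ToyExperts(expects_unquantized_inputs=True))
```

With these inputs, `once` holds the activations quantized a single time, while `twice` has had the scale applied a second time and no longer matches. The point is only that the caller and the callee must agree on who performs the quantization.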