[Bugfix] FusedMoE: coerce shape-(1,) per-tensor scales to 0-D scalar …#43362
Code Review
This pull request updates the _load_per_tensor_weight_scale method in vllm/model_executor/layers/fused_moe/layer.py to coerce weight scales into 0-D scalars using .view([]), preventing broadcast errors when quantization tools provide shape-(1,) tensors. The reviewer suggested extending this fix to other methods like _load_single_value and _load_combined_w13_weight_scale where similar issues with per-tensor scales, such as input_scale, are likely to occur.
Added `TestPerTensorScaleLoading` in tests/kernels/moe/test_moe_weight_loading_padded.py covering both `_load_per_tensor_weight_scale` (w1/w3 and w2) and `_load_single_value` with shape-(1,) inputs, plus a 0-D regression case and a numel>1 case that confirms `.view([])` still fails loudly. 5/5 pass locally.
This pull request has merge conflicts that must be resolved before it can be |
…in _load_per_tensor_weight_scale
`FusedMoE._load_per_tensor_weight_scale` assumed that per-tensor weight
scales arrive as 0-D scalar tensors. llm-compressor's NVFP4 presets
emit them as shape-(1,) tensors by default (`torch.tensor([x])` rather
than `torch.tensor(x)`). PyTorch's `copy_()` path (used internally by
the chained `__setitem__` in older torch versions) rejects the
`(1,) -> ()` broadcast and raises:
RuntimeError: output with shape [] doesn't match the broadcast shape [1]
This crashes server initialisation for any compressed-tensors NVFP4 MoE
artifact produced by llm-compressor with the standard NVFP4 preset.
Fix: call `.view([])` on `loaded_weight` before the scalar-slot assignment
in both the w1/w3 and w2 paths. `.view([])` requires the tensor to have
exactly one element (fails loudly otherwise, preventing silent corruption)
and returns a 0-D view without copying data.
Fixes vllm-project#43297
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Varshith <kvarshithgowda@gmail.com>
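The `.view([])` behaviour the commit message relies on can be checked in isolation. The sketch below is not vLLM code: `to_scalar_view` is a hypothetical helper name and the values are arbitrary. It only illustrates that `.view([])` turns a shape-(1,) tensor into a 0-D tensor that shares the same storage, and that it raises for a tensor with more than one element.

```python
import torch


def to_scalar_view(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a 0-D view of a single-element tensor."""
    # .view([]) only succeeds when the tensor holds exactly one element.
    return t.view([])


one = torch.tensor([0.25])           # shape (1,), as a quantization tool might emit
scalar = to_scalar_view(one)         # shape (), no data copied
print(scalar.shape)

# A view shares storage with its source, so the data pointers match.
shares_storage = scalar.data_ptr() == one.data_ptr()

# More than one element: the call raises instead of silently truncating.
try:
    to_scalar_view(torch.tensor([1.0, 2.0]))
    multi_element_error = None
except RuntimeError as exc:
    multi_element_error = str(exc)
print(multi_element_error)
```

Failing on multi-element input is the property that guards against silently loading a wrong scale.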
@mgoin thanks. Both red checks trace to one failure: buildkite/ci/pr is the build-level rollup for #72490, and the only failing job under it is distributed-compile-unit-tests-2xh100, which hit a CUDA OOM in test_sequence_parallelism_pass (NCCL Cuda failure 2 'out of memory', 1 failed / 93 passed). This is unrelated to this change; a retry of that job should clear both.
#43362 applied `_to_scalar()` (`reshape(())`) to `_load_single_value`, which
is shared by the scalar `input_scale` loader and the size-2 `weight_shape`
loader (compressed-tensors). `reshape(())` rejects the size-2 tensor,
crashing weight load for every compressed-tensors WNA16 MoE model:
RuntimeError: shape '[]' is invalid for input of size 2
Revert that one assignment to a direct copy; the intended per-tensor
weight-scale coercion (#43297) is untouched.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
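The regression described above can be illustrated with a small sketch. It does not reproduce the vLLM loader: the parameter layout (one size-2 row per expert) and the values are assumptions chosen for illustration. It shows why scalar coercion is wrong for a loader that also handles size-2 tensors, while a direct copy works.

```python
import torch

# Assumed layout for illustration only: one size-2 entry per expert.
weight_shape_param = torch.zeros(3, 2, dtype=torch.int64)
loaded = torch.tensor([128, 256])    # arbitrary size-2 value

# Coercing to 0-D cannot work for two elements, so reshape(()) raises.
try:
    loaded.reshape(())
    reshape_error = None
except RuntimeError as exc:
    reshape_error = str(exc)
print(reshape_error)

# A direct assignment copies the size-2 tensor into the size-2 slot.
weight_shape_param[1] = loaded
print(weight_shape_param[1].tolist())
```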
vllm-project#43362) Signed-off-by: Varshith <kvarshithgowda@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
vllm-project#43362) Signed-off-by: Varshith <kvarshithgowda@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Fixes #43297
Purpose
I checked open and merged PRs for `_load_per_tensor_weight_scale` and issue #43297; no existing PR addresses this issue.

`FusedMoE._load_per_tensor_weight_scale` assumed per-tensor weight scales arrive as 0-D scalars. llm-compressor NVFP4 presets emit them as shape-(1,) tensors by default (`torch.tensor([x])` rather than `torch.tensor(x)`).

PyTorch's `copy_()` path rejects the `(1,) -> ()` broadcast:

`RuntimeError: output with shape [] doesn't match the broadcast shape [1]`

This crashes server init, before any inference runs, for any compressed-tensors NVFP4 MoE artifact produced by llm-compressor.

Fix: call `.view([])` on `loaded_weight` in both the w1/w3 and w2 assignment paths. `.view([])` returns a 0-D view of a single-element tensor without copying data, and raises an error if the tensor has more than one element.

Test Plan
Standalone reproducer (torch only, no GPU or model weights needed):
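The original reproducer is not preserved on this page. As a stand-in, here is a minimal sketch of what such a check might look like, under stated assumptions: `copy_into_slot` is a hypothetical helper, and the 1-D `param` with one scalar slot per expert is an assumed layout, not the actual `FusedMoE` parameter. It attempts the copy with a shape-(1,) scale, then repeats it after `.view([])`.

```python
import torch


def copy_into_slot(param: torch.Tensor, expert_id: int, loaded_weight: torch.Tensor):
    """Hypothetical helper: copy loaded_weight into the 0-D slot param[expert_id].

    Returns the error message if the copy raises a RuntimeError, else None.
    """
    try:
        # param[expert_id] is a 0-D view, so copy_ writes into param itself.
        param[expert_id].copy_(loaded_weight)
    except RuntimeError as exc:
        return str(exc)
    return None


param = torch.zeros(4)               # assumed: one per-tensor scale per expert
scale = torch.tensor([0.5])          # shape (1,), as described in the PR

# Per the PR description, this is the copy that fails with a broadcast error.
err_raw = copy_into_slot(param, 0, scale)
print("shape-(1,) copy:", err_raw or "ok")

# After coercion to 0-D, the shapes match and the copy goes through.
err_fixed = copy_into_slot(param, 0, scale.view([]))
print("0-D copy:", err_fixed or "ok")
print(param)
```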
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.