[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source by mgoin · Pull Request #44735 · vllm-project/vllm

mgoin · 2026-06-06T14:04:45Z

Summary

Fixes #44110 at the root cause, as an alternative to #44113.

MarlinFP8ScaledMMLinearKernel.process_weights_after_loading previously
tried to detect whether its incoming weight was (N, K) or (K, N) and
transpose accordingly. The shape-based heuristic was a no-op when
N == K, silently corrupting square layers. #44113 swaps the heuristic
from shape to is_contiguous(), but that just trades one fragile
implicit contract in the kernel for another.
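
The square-layer failure can be reproduced without any GPU code. The sketch below is a toy model on nested lists, not vLLM's actual implementation: `to_kn_by_shape` is a hypothetical stand-in for a shape-based heuristic that transposes only when the weight does not already look like (K, N).

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def shape(m):
    return (len(m), len(m[0]))


def to_kn_by_shape(weight, k, n):
    """Hypothetical shape-based heuristic (illustration only, not vLLM code).

    Transposes only when the shape does not already look like (K, N).
    """
    if shape(weight) != (k, n):
        return transpose(weight)
    return weight


# Non-square layer (N=2, K=3) in checkpoint (N, K) layout: detected and fixed.
rect_nk = [[1, 2, 3],
           [4, 5, 6]]
rect_kn = to_kn_by_shape(rect_nk, k=3, n=2)

# Square layer (N=K=2) in the same (N, K) layout: the shape already equals
# (K, N), so the heuristic does nothing and the weight stays untransposed.
square_nk = [[1, 2],
             [3, 4]]
square_out = to_kn_by_shape(square_nk, k=2, n=2)
```

For the square layer no error is raised and the shapes still line up, which is why the corruption is silent.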

The real issue is that the kernel boundary has no agreed-on layout:

CutlassFP8ScaledMMLinearKernel expects (K, N) and does not transpose.
ModelOptFp8{,PcPt}LinearMethod already pre-transposes to (K, N).
Fp8LinearMethod (non-marlin) and Fp8OnlineLinearMethod (non-marlin) pre-transpose to (K, N).
Fp8LinearMethod (use_marlin) and CompressedTensorsW8A16Fp8 skip the transpose and let Marlin guess.

This PR makes every FP8 linear caller canonicalize to (K, N) before
delegating, and removes the detection heuristic from Marlin entirely.
This is the canonicalization step that the TODO referenced in #33314
was asking for, scoped to the FP8 paths that share this kernel.

Changes

Fp8LinearMethod (use_marlin, non-block): transpose before delegating.
Fp8OnlineLinearMethod: collapse the marlin/non-marlin branches into a single transpose + delegate.
CompressedTensorsW8A16Fp8 (non-block): transpose before delegating.
MarlinFP8ScaledMMLinearKernel: drop the conditional transpose from the non-block branch.

Net diff: 3 files, +11 / -29.
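
The resulting contract can be sketched as follows. This is a minimal toy model under stated assumptions: `canonicalize_to_kn` and `ToyKernel` are hypothetical names, the shape assertion is added for illustration, and nested lists stand in for tensors. The point is only that the caller always transposes and the kernel never inspects the layout.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def canonicalize_to_kn(weight_nk):
    """Caller side (hypothetical helper): checkpoint (N, K) -> canonical (K, N)."""
    return transpose(weight_nk)


class ToyKernel:
    """Stand-in for a scaled-MM kernel that trusts the (K, N) contract."""

    def __init__(self, k, n):
        self.k, self.n = k, n
        self.weight_kn = None

    def process_weights_after_loading(self, weight_kn):
        # No layout detection here. The shape check is illustrative only and
        # cannot tell layouts apart when N == K; correctness relies on callers.
        assert (len(weight_kn), len(weight_kn[0])) == (self.k, self.n)
        self.weight_kn = weight_kn

    def apply(self, x):
        # y = x @ W_kn, with x of shape (M, K).
        cols = list(zip(*self.weight_kn))
        return [[sum(a * b for a, b in zip(row, col)) for col in cols]
                for row in x]


# Square layer stored in checkpoint (N, K) layout.
w_nk = [[1, 2],
        [3, 4]]
kernel = ToyKernel(k=2, n=2)
kernel.process_weights_after_loading(canonicalize_to_kn(w_nk))
y = kernel.apply([[1, 0]])  # selects column k=0 of W, i.e. W[n][0] for each n
```

Because every caller performs the same unconditional transpose, the square case needs no special handling.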

Why this is preferable to #44113

Removes the implicit "kernel detects layout from weight metadata" contract entirely.
Makes the FP8 callers consistent with each other and with the Cutlass kernel's existing expectation.
The is_contiguous() switch in #44113 still breaks if any future caller pre-transposes and then calls .contiguous(), or loads a checkpoint that happens to land non-contiguous.
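
The contiguity failure mode in the last point can be illustrated with a toy strided matrix. Everything here is an assumption for illustration: `Strided2D` is a minimal stand-in for a 2-D tensor, and `to_kn_by_contiguity` is a hypothetical heuristic that treats "contiguous" as "still in checkpoint (N, K) layout"; the exact logic in #44113 may differ.

```python
from dataclasses import dataclass


@dataclass
class Strided2D:
    """Toy stand-in for a 2-D tensor: flat storage plus shape and strides."""
    data: list
    shape: tuple
    strides: tuple

    @classmethod
    def from_rows(cls, rows):
        r, c = len(rows), len(rows[0])
        return cls([x for row in rows for x in row], (r, c), (c, 1))

    def t(self):
        # Transpose is a view: swap shape and strides, share the storage.
        return Strided2D(self.data, self.shape[::-1], self.strides[::-1])

    def is_contiguous(self):
        # Row-major contiguous means strides == (cols, 1).
        return self.strides == (self.shape[1], 1)

    def rows(self):
        s0, s1 = self.strides
        return [[self.data[i * s0 + j * s1] for j in range(self.shape[1])]
                for i in range(self.shape[0])]

    def contiguous(self):
        # Materialize a row-major copy if the view is not already contiguous.
        return self if self.is_contiguous() else Strided2D.from_rows(self.rows())


def to_kn_by_contiguity(w):
    """Hypothetical heuristic: contiguous is taken to mean (N, K) layout."""
    return w.t() if w.is_contiguous() else w


checkpoint_nk = Strided2D.from_rows([[1, 2], [3, 4]])  # square, (N, K) layout
expected_kn = [[1, 3], [2, 4]]

# Checkpoint-layout caller: contiguous, so the heuristic transposes. Correct.
ok = to_kn_by_contiguity(checkpoint_nk).rows()

# Caller that already transposed and then materialized a contiguous copy:
# the heuristic transposes again and hands the kernel (N, K) data.
pre_transposed = checkpoint_nk.t().contiguous()
bad = to_kn_by_contiguity(pre_transposed).rows()
```

In this sketch the second caller did everything right, yet the kernel-side guess undoes its work, which is the kind of implicit contract this PR removes.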

Test plan

Square (N==K, 4096×4096) regression case verified end-to-end on A40 (sm_86) with VLLM_TEST_FORCE_FP8_MARLIN=1, both the checkpoint-layout path (CompressedTensors-style) and the pre-transposed path (ModelOpt-style). Relative error < 0.005 in all cases.
Non-square shapes (1024×4096, 4096×12800) verified for both paths.
pre-commit run --files on the changed files: all hooks pass (ruff, mypy, etc.).
Existing tests/quantization/test_fp8.py and tests/evals/gsm8k/test_gsm8k_correctness.py run in CI.
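
A CPU-only sketch of the shape sweep above, under several assumptions: the relative-error metric is taken to be a Frobenius-norm ratio (the PR does not define it), the shapes are small stand-ins for the real ones, and `run_case` is a hypothetical helper that uses plain float matmul rather than FP8 Marlin.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """(M, K) @ (K, N) on nested lists; raises on an inner-dimension mismatch."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rel_error(out, ref):
    """Frobenius-norm relative error (an assumed metric, not from the PR)."""
    num = math.sqrt(sum((o - r) ** 2
                        for ro, rr in zip(out, ref) for o, r in zip(ro, rr)))
    den = math.sqrt(sum(r * r for rr in ref for r in rr))
    return num / den


def run_case(n, k, canonicalize, m=4, seed=0):
    rng = random.Random(seed)
    w_nk = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    x = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
    # Reference: y[i][j] = sum_k x[i][k] * W[j][k], written without a transpose.
    ref = [[sum(xi * wj for xi, wj in zip(row, w_row)) for w_row in w_nk]
           for row in x]
    w_for_kernel = transpose(w_nk) if canonicalize else w_nk
    return rel_error(matmul(x, w_for_kernel), ref)


SHAPES = [(8, 8), (4, 16), (16, 50)]  # small (N, K) stand-ins, one square
errors_fixed = {s: run_case(*s, canonicalize=True) for s in SHAPES}

# Skipping the transpose on a square layer raises nothing; only the
# numerical comparison exposes the wrong layout.
error_square_bug = run_case(8, 8, canonicalize=False)
```

Skipping the transpose on a non-square shape fails loudly with a dimension mismatch in this sketch, so the square case is the one that needs an explicit numerical regression test.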

AI assistance (Claude) was used to draft the change; I (mgoin) reviewed and tested every line.

Co-authored-by: Claude <noreply@anthropic.com>

MarlinFP8ScaledMMLinearKernel previously tried to detect whether its incoming weight was (N, K) or (K, N) and transpose accordingly. The shape-based heuristic was a no-op when N == K, silently corrupting square layers (vllm-project#44110). PR vllm-project#44113 swapped the heuristic from shape to is_contiguous(), which still encodes a fragile implicit contract in the kernel.

Fix it at the source instead, matching what cutlass already requires and what modelopt already does: each LinearMethod canonicalizes the weight to (K, N) before delegating to the kernel.

- Fp8LinearMethod (use_marlin, non-block): transpose before delegating.
- Fp8OnlineLinearMethod: collapse the marlin/non-marlin branches into one transpose + delegate.
- CompressedTensorsW8A16Fp8 (non-block): transpose before delegating.
- MarlinFP8ScaledMMLinearKernel: drop the detect-and-transpose conditional from the non-block branch.

This addresses the canonicalization TODO referenced in vllm-project#33314 for the FP8 W8A16 / W8A8 paths, and removes the square-N==K regression at its real root cause.

Signed-off-by: mgoin <mike.goin12@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify · 2026-06-06T15:58:29Z

Hi @mgoin, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mgoin requested review from pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and zyongye as code owners June 6, 2026 14:04

claude Bot reviewed Jun 6, 2026

mergify Bot added the bug ("Something isn't working") label Jun 6, 2026

mgoin added the ready ("ONLY add when PR is ready to merge/full CI is needed") and quantization labels Jun 6, 2026

tjtanaa added the rocm ("Related to AMD ROCm") label Jun 6, 2026

github-project-automation Bot added this to AMD Jun 6, 2026

github-project-automation Bot moved this to Todo in AMD Jun 6, 2026

robertgshaw2-redhat approved these changes Jun 8, 2026

mgoin merged commit 6afa250 into vllm-project:main Jun 8, 2026
73 of 75 checks passed

github-project-automation Bot moved this from Todo to Done in AMD Jun 8, 2026

waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

0b5b6b0

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

54bf2c7

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com>

vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

81bde40

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

cd8f261

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

f861fef

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com>

Coisinixixi pushed a commit to Coisinixixi/vllm that referenced this pull request Jul 2, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

091e9c3

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com> (cherry picked from commit 6afa250)

Coisinixixi mentioned this pull request Jul 2, 2026

sync(VLLM-QUANT): cherry-pick initial quantization bugfixes vLLM-HUST/vllm-hust#87

Draft

ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source (vllm…

a09561c

…-project#44735) Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin merged 1 commit into vllm-project:main from mgoin:fp8-marlin-canonicalize-layout
