Skip to content

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source#44735

Merged
mgoin merged 1 commit into
vllm-project:mainfrom
mgoin:fp8-marlin-canonicalize-layout
Jun 8, 2026
Merged

[Bugfix] Canonicalize FP8 weight layout to (K, N) at the source#44735
mgoin merged 1 commit into
vllm-project:mainfrom
mgoin:fp8-marlin-canonicalize-layout

Conversation

@mgoin

@mgoin mgoin commented Jun 6, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #44110 at the root cause, as an alternative to #44113.

MarlinFP8ScaledMMLinearKernel.process_weights_after_loading previously
tried to detect whether its incoming weight was (N, K) or (K, N) and
transpose accordingly. The shape-based heuristic was a no-op when
N == K, silently corrupting square layers. #44113 swaps the heuristic
from shape to is_contiguous(), but that just trades one fragile
implicit contract in the kernel for another.

The real issue is that the kernel boundary has no agreed-on layout:

  • CutlassFP8ScaledMMLinearKernel expects (K, N) and does not transpose.
  • ModelOptFp8{,PcPt}LinearMethod already pre-transposes to (K, N).
  • Fp8LinearMethod (non-marlin), Fp8OnlineLinearMethod (non-marlin) pre-transpose to (K, N).
  • Fp8LinearMethod (use_marlin) and CompressedTensorsW8A16Fp8 skip the transpose and let Marlin guess.

This PR makes every FP8 linear caller canonicalize to (K, N) before
delegating, and removes the detection heuristic from Marlin entirely.
This is the canonicalization step that the TODO referenced in #33314
was asking for, scoped to the FP8 paths that share this kernel.

Changes

  • Fp8LinearMethod (use_marlin, non-block): transpose before delegating.
  • Fp8OnlineLinearMethod: collapse the marlin/non-marlin branches into a single transpose + delegate.
  • CompressedTensorsW8A16Fp8 (non-block): transpose before delegating.
  • MarlinFP8ScaledMMLinearKernel: drop the conditional transpose from the non-block branch.

Net diff: 3 files, +11 / -29.

Why this is preferable to #44113

Test plan

  • Square (N==K, 4096×4096) regression case verified end-to-end on A40 (sm_86) with VLLM_TEST_FORCE_FP8_MARLIN=1, both the checkpoint-layout path (CompressedTensors-style) and the pre-transposed path (ModelOpt-style). Relative error < 0.005 in all cases.
  • Non-square shapes (1024×4096, 4096×12800) verified for both paths.
  • pre-commit run --files on the changed files — all hooks pass (ruff, mypy, etc.).
  • Existing tests/quantization/test_fp8.py and tests/evals/gsm8k/test_gsm8k_correctness.py runs in CI.

AI assistance (Claude) was used to draft the change; I (mgoin) reviewed and tested every line.

Co-authored-by: Claude noreply@anthropic.com

MarlinFP8ScaledMMLinearKernel previously tried to detect whether its
incoming weight was (N, K) or (K, N) and transpose accordingly. The
shape-based heuristic was a no-op when N == K, silently corrupting
square layers (vllm-project#44110). PR vllm-project#44113 swapped the heuristic from shape to
is_contiguous(), which still encodes a fragile implicit contract in the
kernel.

Fix it at the source instead, matching what cutlass already requires
and what modelopt already does: each LinearMethod canonicalizes the
weight to (K, N) before delegating to the kernel.

- Fp8LinearMethod (use_marlin, non-block): transpose before delegating.
- Fp8OnlineLinearMethod: collapse the marlin/non-marlin branches into
  one transpose + delegate.
- CompressedTensorsW8A16Fp8 (non-block): transpose before delegating.
- MarlinFP8ScaledMMLinearKernel: drop the detect-and-transpose
  conditional from the non-block branch.

This addresses the canonicalization TODO referenced in vllm-project#33314 for the
FP8 W8A16 / W8A8 paths, and removes the square-N==K regression at its
real root cause.

Signed-off-by: mgoin <mike.goin12@gmail.com>

Signed-off-by: mgoin <mgoin64@gmail.com>

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the bug Something isn't working label Jun 6, 2026
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed quantization labels Jun 6, 2026
@mergify

mergify Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Hi @mgoin, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@tjtanaa tjtanaa added the rocm Related to AMD ROCm label Jun 6, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 6, 2026
@mgoin mgoin merged commit 6afa250 into vllm-project:main Jun 8, 2026
73 of 75 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 8, 2026
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…-project#44735)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…-project#44735)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…-project#44735)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Coisinixixi pushed a commit to Coisinixixi/vllm that referenced this pull request Jul 2, 2026
…-project#44735)

Signed-off-by: mgoin <mgoin64@gmail.com>
(cherry picked from commit 6afa250)
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working quantization ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

3 participants