[Kernel] Add PDL support for DeepGEMM kernel by jeejeelee · Pull Request #46006 · vllm-project/vllm

jeejeelee · 2026-06-18T06:44:24Z

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing a test command.
The test results, such as pasting a before/after results comparison or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>


…rize_with_alignment output-alignment crash guard)

Adds one more adversarially-verified pick on top of 2763c4d's vllm-project#44173 + vllm-project#43014. The from-source build (real nvcc compile) recompiles this csrc header, so the fix is runtime-effective (unlike the tiera2/tiera3 prebuilt-binary overlay lineage that could not pick up csrc changes).

Applied (cherry-pick 4583630, --no-commit clean, exit 0; 2 files):

* vllm#45466 ([Bugfix][Kernel], merged 2026-06-18): Check output alignment in vectorize_with_alignment. The vector load/store path goes through vec_n_t<T,VEC_SIZE> (declared __align__(VEC_SIZE*sizeof(T))), so BOTH in and out must be aligned to their own vector width. Previously only `in` was checked ("output guaranteed same as input" assumption). reshape_and_cache_flash writes KV-cache rows at byte offsets that are a multiple of head_size; for head sizes not a multiple of VEC_SIZE this puts some `out` rows off the vector-width boundary -> vectorized store -> CUDA misaligned-address crash (issue vllm-project#41257). The fix adds an OUT_WIDTH alignment check to the fast-path predicate + a post-prefix co-alignment check that falls back to a fully scalar copy when in/out cannot be co-aligned. Bit-identical output (only chooses scalar vs vector path), strictly a hardening: never wrong, only slower on the rare unaligned row.

HOT PATH confirmed in this tree: csrc/cache_kernels.cu (reshape_and_cache_flash, the KV-cache decode write path) includes vectorization_utils.cuh and calls vectorize_with_alignment; also used by w8a8/fp8 common.cu, int8 scaled_quant.cu, layernorm_kernels.cu, layernorm_quant_kernels.cu, libtorch_stable per_token_group_quant.cu. Arch-portable header (compiles on sm_121a like every other arch). Zero downside even if DSV4's current head dims don't trip it today.
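The path selection described for vllm#45466 can be modeled on the host in a few lines. This is a minimal sketch and not the code in vectorization_utils.cuh: the function name `choose_copy_path`, its parameters, and the returned path labels are hypothetical, and it assumes both pointers are at least element-aligned. It only illustrates why checking `in` alone is insufficient.

```python
def choose_copy_path(in_addr: int, out_addr: int, n: int, vec_size: int,
                     in_elem_bytes: int, out_elem_bytes: int) -> str:
    """Hypothetical model of the fast-path / fallback decision.

    Addresses are byte addresses; n is the element count.
    """
    in_width = vec_size * in_elem_bytes    # alignment required for vector loads
    out_width = vec_size * out_elem_bytes  # alignment required for vector stores

    # Fast path: both pointers aligned to their own vector width, no ragged tail.
    if in_addr % in_width == 0 and out_addr % out_width == 0 and n % vec_size == 0:
        return "vector"

    # Scalar prefix: number of elements needed to bring `in` onto a boundary.
    misalign = in_addr % in_width
    prefix = ((in_width - misalign) // in_elem_bytes) % vec_size
    prefix = min(prefix, n)

    # Post-prefix co-alignment check: `out` must land on its own boundary too,
    # otherwise a vectorized store would be misaligned.
    if (out_addr + prefix * out_elem_bytes) % out_width != 0:
        return "scalar"
    return "prefix+vector+tail"


# 2-byte elements, 8-element vectors -> 16-byte vector width.
print(choose_copy_path(0, 0, 64, 8, 2, 2))   # both aligned
print(choose_copy_path(0, 8, 64, 8, 2, 2))   # `out` row 8 bytes off a boundary
print(choose_copy_path(8, 8, 64, 8, 2, 2))   # both off by the same amount
```

The second call is the case the commit message describes: `in` is aligned, so a check on `in` alone would take the vector path and issue a misaligned store. With the output check, that row is copied element by element.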
Intentionally SKIPPED this round (each adversarially analyzed; all are DEAD-PATH on this GB10/sm_121a + b12x deployment, not forced builds):

* b12x individual commits cb98da162 (SM120 dense FP8 GEMM) / c7089a418 / 0ff2847b0: b12x is a PREBUILT BINARY package here (import b12x.integration, flashinfer.b12x_fused_moe), not a source tree. These SHAs exist in no fetched ref (they target the newer b12x v0.23 generation, not the eb99b8b DSV4 base). Not cherry-pickable; full v0.23 ABI absorption remains a separate effort.
* flashinfer vllm-project#3640 (SM120 NVFP4 attention): DEAD PATH. DSV4 decode routes MLA through b12x_compressed_mla_decode (prebuilt) with a sparse_mla fallback; vLLM has ZERO call sites into flashinfer's nvfp4_attention_sm120. Also in no release tag yet (main-only, post-rc2-cut).
* flashinfer vllm-project#3309 (MLA decode num_heads<128 fold): DEAD PATH. Patches flashinfer cute_dsl.attention.mla_decode, but vLLM imports flashinfer cute_dsl ONLY for MoE/GEMM (blockscaled_gemm, fused_moe). DSV4 MLA-decode is b12x/sparse_mla. No call site.
* DeepGEMM vllm-project#324 (nv_dev, sm121 MQA-logits / HC-prenorm): DEAD PATH. OPEN (not merged), against deepseek-ai/DeepGEMM nv_dev. vLLM's is_device_capability_family(120) shunt in vllm/utils/deep_gemm.py returns BEFORE native DeepGEMM _lazy_init, sending MQA-logits + HC-prenorm to hand-written Triton sm12x kernels (sm12x_mqa.py, sm12x_deep_gemm_fallbacks.py). b12x covers the dense-GEMM/MoE surface. vllm-project#324's kernels would compile but never be called on GB10.
* vllm#44217 ([Perf] dsv3_router_gemm heuristic): DEAD PATH + out of csrc scope (Python-only). Gates the specialized kernel to is_hopper((9,0)) || is_blackwell(family 100); GB10 is sm_121a (CC 12.1) = NEITHER, so allow_dsv3_router_gemm is already False here.
* vllm#43557 (E8M0 scale MXFP4 W4A4 CUTLASS): cherry-picks clean but DEAD code on sm_121a: mxfp4_experts_quant.cu is gated to FP4_ARCHS=10.0a/10.1a/10.3a (ENABLE_NVFP4_SM100). GB10 MXFP4 experts use Marlin, not this kernel.
* vllm-project#42996/vllm-project#46006 (PDL for DeepGEMM), vllm-project#46070 (revert vllm-project#42379), vllm-project#44109 (weightless RMSNorm), vllm-project#45277 (build-infra), torch-stable-ABI migration series [6/n]-[12/n], vllm-project#43827 (DSv4 TRTLLM attn, the vllm-project#43162 nested-layout trap): conflict / ABI-refactor / deletion / nested b12x-v0.23 layout absent here.

Methodology: clean cherry-pick != effective. The decisive gate for nearly every SKIP was CODE ROUTING, not the diff applying: this DSV4-on-GB10 build sends its hot kernels through b12x (prebuilt) + Triton sm12x fallbacks + Marlin MXFP4, while upstream CUTLASS/DeepGEMM/specialized-kernel paths are arch-gated to SM90/SM100 and do not execute on sm_121a. Picks that target those paths are no-ops here regardless of how cleanly they apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
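The "code routing, not the diff applying" argument in the commit message above can be made concrete with a small model of the arch gate it cites for vllm#44217. This is a sketch under stated assumptions, not vLLM code: the helper names `capability_family` and `allows_dsv3_router_gemm` are hypothetical, and the family is assumed to be the compute-capability major version times ten (so CC 12.1 falls in family 120, matching the `is_device_capability_family(120)` shunt mentioned in the message).

```python
def capability_family(major: int, minor: int) -> int:
    """Hypothetical helper: collapse a compute capability to its family."""
    return major * 10


def allows_dsv3_router_gemm(major: int, minor: int) -> bool:
    """Hypothetical model of the gate: Hopper (9, 0) or Blackwell family 100."""
    is_hopper = (major, minor) == (9, 0)
    is_blackwell_100 = capability_family(major, minor) == 100
    return is_hopper or is_blackwell_100


# GB10 reports CC 12.1, so the gated kernel is never selected there,
# however cleanly a patch to that kernel applies.
for cc in [(9, 0), (10, 0), (12, 1)]:
    print(cc, capability_family(*cc), allows_dsv3_router_gemm(*cc))
```

Under these assumptions a pick that only touches the gated kernel changes nothing on a (12, 1) device, which is the sense in which the message calls such picks no-ops.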


jeejeelee and others added 13 commits May 18, 2026 10:08

init

acd0be9

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

Move

dc16cc6

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

Merge branch 'main' into dg-enable-pdl

8d0fbff

Merge remote-tracking branch 'origin/main' into dg-enable-pdl

ba65085

Merge remote-tracking branch 'origin/main' into dg-enable-pdl

32c7420

Move

391ed5c

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

Merge branch 'main' into dg-enable-pdl

a0c5370

Merge branch 'main' into dg-enable-pdl

0021ff1

Merge remote-tracking branch 'origin/main' into dg-enable-pdl

4aa1a89

Move

273eb8b

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

FIX

9043f86

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

OPT

3097386

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

Fix ROCM

a1ef9d5

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

jeejeelee requested a review from zyongye as a code owner June 18, 2026 06:44

Merge branch 'main' into dg-enable-pdl

d68bd44

jeejeelee added the ready label ("ONLY add when PR is ready to merge/full CI is needed") Jun 18, 2026

DarkLight1337 approved these changes Jun 18, 2026


Alex-ai-future mentioned this pull request Jun 18, 2026

fix(quantization): Fix AWQ dequantize on Intel XPU and refactor AutoAWQ config #42727

Merged

jeejeelee merged commit 22cc891 into main Jun 18, 2026
200 checks passed

jeejeelee deleted the dg-enable-pdl branch June 18, 2026 12:49

divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026

[Kernel] Add PDL support for DeepGEMM kernel (vllm-project#46006)

eef76b1

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: divineearthly <divineearthly@gmail.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026

[Kernel] Add PDL support for DeepGEMM kernel (vllm-project#46006)

2cc73e2

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Kernel] Add PDL support for DeepGEMM kernel (vllm-project#46006)

1f5ba7a

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Kernel] Add PDL support for DeepGEMM kernel (vllm-project#46006)

ff523ac

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

bobvious mentioned this pull request Jun 30, 2026

[Bug] v0.24.0: DeepGEMM "Unknown recipe" assertion in FP8 kernel warmup on Blackwell (sm_120) — regression vs 0.23.0 #47130

Open

jeejeelee merged 14 commits into main from dg-enable-pdl
