
[Kernel] Add PDL support for DeepGEMM kernel #46006

Merged: jeejeelee merged 14 commits into main from dg-enable-pdl on Jun 18, 2026

jeejeelee (Member) commented Jun 18, 2026

Purpose

Test Plan

Test Result


jeejeelee and others added 13 commits May 18, 2026 10:08
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
@jeejeelee jeejeelee requested a review from zyongye as a code owner June 18, 2026 06:44
@jeejeelee jeejeelee added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 18, 2026
@jeejeelee jeejeelee merged commit 22cc891 into main Jun 18, 2026
200 checks passed
@jeejeelee jeejeelee deleted the dg-enable-pdl branch June 18, 2026 12:49
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: divineearthly <divineearthly@gmail.com>
choiceoh added a commit to choiceoh/vllm-dsv4 that referenced this pull request Jun 20, 2026
…rize_with_alignment output-alignment crash guard)

Adds one more adversarially-verified pick on top of 2763c4d's vllm-project#44173 +
vllm-project#43014. The from-source build (real nvcc compile) recompiles this csrc header,
so the fix is runtime-effective (unlike the tiera2/tiera3 prebuilt-binary
overlay lineage that could not pick up csrc changes).

Applied (cherry-pick 4583630, --no-commit clean, exit 0; 2 files):

  * vllm#45466 ([Bugfix][Kernel], merged 2026-06-18) — Check output alignment
    in vectorize_with_alignment. The vector load/store path goes through
    vec_n_t<T,VEC_SIZE> (declared __align__(VEC_SIZE*sizeof(T))), so BOTH in and
    out must be aligned to their own vector width. Previously only `in` was
    checked ("output guaranteed same as input" assumption).
    reshape_and_cache_flash writes KV-cache rows at byte offsets that are a
    multiple of head_size;
    for head sizes not a multiple of VEC_SIZE this puts some `out` rows off the
    vector-width boundary -> vectorized store -> CUDA misaligned-address crash
    (issue vllm-project#41257). The fix adds an OUT_WIDTH alignment check to the fast-path
    predicate + a post-prefix co-alignment check that falls back to a fully
    scalar copy when in/out cannot be co-aligned. Bit-identical output (only
    chooses scalar vs vector path), strictly a hardening — never wrong, only
    slower on the rare unaligned row.
    HOT PATH confirmed in this tree: csrc/cache_kernels.cu
    (reshape_and_cache_flash, the KV-cache decode write path) includes
    vectorization_utils.cuh and
    calls vectorize_with_alignment; also used by w8a8/fp8 common.cu, int8
    scaled_quant.cu, layernorm_kernels.cu, layernorm_quant_kernels.cu,
    libtorch_stable per_token_group_quant.cu. Arch-portable header (compiles on
    sm_121a like every other arch). Zero downside even if DSV4's current head
    dims don't trip it today.
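
The path selection described above can be sketched in Python as follows. This is an illustrative model only, not the code in vectorization_utils.cuh: the function name `choose_copy_path`, its parameters, and the returned labels are hypothetical, and addresses are modelled as plain integers.

```python
def choose_copy_path(in_addr: int, out_addr: int, n: int,
                     in_elem: int, out_elem: int, vec_size: int) -> str:
    """Pick a copy strategy for n elements (hypothetical model).

    in_addr / out_addr are byte addresses, in_elem / out_elem are element
    sizes in bytes, vec_size is the number of elements per vector access.
    """
    in_width = vec_size * in_elem     # alignment needed by a vector load
    out_width = vec_size * out_elem   # alignment needed by a vector store

    # Fast path: BOTH pointers must sit on their own vector-width boundary.
    if (in_addr % in_width == 0 and out_addr % out_width == 0
            and n % vec_size == 0):
        return "vector"

    # Scalar prefix: number of elements needed to bring `in` onto a boundary.
    prefix = ((in_width - in_addr % in_width) % in_width) // in_elem
    prefix = min(prefix, n)

    # Co-alignment check: after the same prefix, is `out` aligned too?
    if (out_addr + prefix * out_elem) % out_width != 0:
        return "scalar"               # cannot co-align: fully scalar copy
    return "prefix+vector"            # scalar prefix, then vector body


# 2-byte elements, 8 per vector, so both widths are 16 bytes.
print(choose_copy_path(0, 0, 16, 2, 2, 8))   # both aligned
print(choose_copy_path(0, 2, 16, 2, 2, 8))   # only `in` aligned
print(choose_copy_path(4, 4, 32, 2, 2, 8))   # co-aligned after a prefix
```

Every branch copies the same elements; only the access width differs, which is why the message above calls the change bit-identical.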

Intentionally SKIPPED this round (each adversarially analyzed; all are
DEAD-PATH on this GB10/sm_121a + b12x deployment, not forced builds):

  * b12x individual commits cb98da162 (SM120 dense FP8 GEMM) / c7089a418 /
    0ff2847b0 — b12x is a PREBUILT BINARY package here (import b12x.integration,
    flashinfer.b12x_fused_moe), not a source tree. These SHAs exist in no
    fetched ref (they target the newer b12x v0.23 generation, not the eb99b8b
    DSV4 base). Not cherry-pickable; full v0.23 ABI absorption remains a
    separate effort.

  * flashinfer vllm-project#3640 (SM120 NVFP4 attention) — DEAD PATH. DSV4 decode routes
    MLA through b12x_compressed_mla_decode (prebuilt) with a sparse_mla
    fallback; vLLM has ZERO call sites into flashinfer's nvfp4_attention_sm120.
    Also in no release tag yet (main-only, post-rc2-cut).

  * flashinfer vllm-project#3309 (MLA decode num_heads<128 fold) — DEAD PATH. Patches
    flashinfer cute_dsl.attention.mla_decode, but vLLM imports flashinfer
    cute_dsl ONLY for MoE/GEMM (blockscaled_gemm, fused_moe). DSV4 MLA-decode
    is b12x/sparse_mla. No call site.

  * DeepGEMM vllm-project#324 (nv_dev, sm121 MQA-logits / HC-prenorm) — DEAD PATH. OPEN (not
    merged), against deepseek-ai/DeepGEMM nv_dev. vLLM's
    is_device_capability_family(120) shunt in vllm/utils/deep_gemm.py returns
    BEFORE native DeepGEMM _lazy_init, sending MQA-logits + HC-prenorm to
    hand-written Triton sm12x kernels (sm12x_mqa.py,
    sm12x_deep_gemm_fallbacks.py). b12x covers the dense-GEMM/MoE surface.
    vllm-project#324's kernels would
    compile but never be called on GB10.

  * vllm#44217 ([Perf] dsv3_router_gemm heuristic) — DEAD PATH + out of csrc
    scope (Python-only). Gates the specialized kernel to is_hopper((9,0)) ||
    is_blackwell(family 100); GB10 is sm_121a (CC 12.1) = NEITHER, so
    allow_dsv3_router_gemm is already False here.

  * vllm#43557 (E8M0 scale MXFP4 W4A4 CUTLASS) — cherry-picks clean but DEAD
    code on sm_121a: mxfp4_experts_quant.cu is gated to FP4_ARCHS=10.0a/10.1a/
    10.3a (ENABLE_NVFP4_SM100). GB10 MXFP4 experts use Marlin, not this kernel.

  * vllm-project#42996/vllm-project#46006 (PDL for DeepGEMM), vllm-project#46070 (revert vllm-project#42379), vllm-project#44109 (weightless
    RMSNorm), vllm-project#45277 (build-infra), torch-stable-ABI migration series
    [6/n]-[12/n], vllm-project#43827 (DSv4 TRTLLM attn — the vllm-project#43162 nested-layout trap) —
    conflict / ABI-refactor / deletion / nested b12x-v0.23 layout absent here.

Methodology: clean cherry-pick != effective. The decisive gate for nearly every
SKIP was CODE ROUTING, not the diff applying: this DSV4-on-GB10 build sends its
hot kernels through b12x (prebuilt) + Triton sm12x fallbacks + Marlin MXFP4,
while upstream CUTLASS/DeepGEMM/specialized-kernel paths are arch-gated to
SM90/SM100 and do not execute on sm_121a. Picks that target those paths are
no-ops here regardless of how cleanly they apply.
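
As a rough illustration of the routing argument, the arch gate described for the dsv3_router_gemm item can be modelled as below. This is a minimal sketch under assumptions: the helper name `allow_dsv3_router_gemm_sketch` is hypothetical, and "family 100" is assumed to mean compute-capability major version 10. It is not vLLM's actual implementation.

```python
def allow_dsv3_router_gemm_sketch(capability: tuple[int, int]) -> bool:
    """Hypothetical model of the gate: Hopper (9, 0) or Blackwell family 100."""
    major, minor = capability
    is_hopper = (major, minor) == (9, 0)
    is_blackwell_family_100 = major == 10   # assumption: family 100 == CC 10.x
    return is_hopper or is_blackwell_family_100


# GB10 reports compute capability 12.1 (sm_121a), so the gate stays closed
# and a cherry-pick that only changes the gated kernel has no effect there.
for cc in [(9, 0), (10, 0), (12, 1)]:
    print(cc, allow_dsv3_router_gemm_sketch(cc))
```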

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

2 participants