[Kernel] Add PDL support for DeepGEMM kernel #46006
Merged
DarkLight1337 approved these changes on Jun 18, 2026
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request on Jun 19, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: divineearthly <divineearthly@gmail.com>
choiceoh added a commit to choiceoh/vllm-dsv4 that referenced this pull request on Jun 20, 2026
…rize_with_alignment output-alignment crash guard)

Adds one more adversarially-verified pick on top of 2763c4d's vllm-project#44173 + vllm-project#43014. The from-source build (real nvcc compile) recompiles this csrc header, so the fix is runtime-effective (unlike the tiera2/tiera3 prebuilt-binary overlay lineage that could not pick up csrc changes).

Applied (cherry-pick 4583630, --no-commit clean, exit 0; 2 files):

* vllm#45466 ([Bugfix][Kernel], merged 2026-06-18): Check output alignment in vectorize_with_alignment. The vector load/store path goes through vec_n_t<T,VEC_SIZE> (declared __align__(VEC_SIZE*sizeof(T))), so BOTH in and out must be aligned to their own vector width. Previously only `in` was checked ("output guaranteed same as input" assumption). reshape_and_cache_flash writes KV-cache rows at byte offsets that are a multiple of head_size; for head sizes not a multiple of VEC_SIZE this puts some `out` rows off the vector-width boundary -> vectorized store -> CUDA misaligned-address crash (issue vllm-project#41257). The fix adds an OUT_WIDTH alignment check to the fast-path predicate + a post-prefix co-alignment check that falls back to a fully scalar copy when in/out cannot be co-aligned. Bit-identical output (only chooses scalar vs vector path), strictly a hardening: never wrong, only slower on the rare unaligned row.

  HOT PATH confirmed in this tree: csrc/cache_kernels.cu (reshape_and_cache_flash, the KV-cache decode write path) includes vectorization_utils.cuh and calls vectorize_with_alignment; also used by w8a8/fp8 common.cu, int8 scaled_quant.cu, layernorm_kernels.cu, layernorm_quant_kernels.cu, libtorch_stable per_token_group_quant.cu. Arch-portable header (compiles on sm_121a like every other arch). Zero downside even if DSV4's current head dims don't trip it today.

Intentionally SKIPPED this round (each adversarially analyzed; all are DEAD-PATH on this GB10/sm_121a + b12x deployment, not forced builds):

* b12x individual commits cb98da162 (SM120 dense FP8 GEMM) / c7089a418 / 0ff2847b0: b12x is a PREBUILT BINARY package here (import b12x.integration, flashinfer.b12x_fused_moe), not a source tree. These SHAs exist in no fetched ref (they target the newer b12x v0.23 generation, not the eb99b8b DSV4 base). Not cherry-pickable; full v0.23 ABI absorption remains a separate effort.
* flashinfer vllm-project#3640 (SM120 NVFP4 attention): DEAD PATH. DSV4 decode routes MLA through b12x_compressed_mla_decode (prebuilt) with a sparse_mla fallback; vLLM has ZERO call sites into flashinfer's nvfp4_attention_sm120. Also in no release tag yet (main-only, post-rc2-cut).
* flashinfer vllm-project#3309 (MLA decode num_heads<128 fold): DEAD PATH. Patches flashinfer cute_dsl.attention.mla_decode, but vLLM imports flashinfer cute_dsl ONLY for MoE/GEMM (blockscaled_gemm, fused_moe). DSV4 MLA-decode is b12x/sparse_mla. No call site.
* DeepGEMM vllm-project#324 (nv_dev, sm121 MQA-logits / HC-prenorm): DEAD PATH. OPEN (not merged), against deepseek-ai/DeepGEMM nv_dev. vLLM's is_device_capability_family(120) shunt in vllm/utils/deep_gemm.py returns BEFORE native DeepGEMM _lazy_init, sending MQA-logits + HC-prenorm to hand-written Triton sm12x kernels (sm12x_mqa.py, sm12x_deep_gemm_fallbacks.py). b12x covers the dense-GEMM/MoE surface. vllm-project#324's kernels would compile but never be called on GB10.
* vllm#44217 ([Perf] dsv3_router_gemm heuristic): DEAD PATH + out of csrc scope (Python-only). Gates the specialized kernel to is_hopper((9,0)) || is_blackwell(family 100); GB10 is sm_121a (CC 12.1) = NEITHER, so allow_dsv3_router_gemm is already False here.
* vllm#43557 (E8M0 scale MXFP4 W4A4 CUTLASS): cherry-picks clean but DEAD code on sm_121a: mxfp4_experts_quant.cu is gated to FP4_ARCHS=10.0a/10.1a/10.3a (ENABLE_NVFP4_SM100). GB10 MXFP4 experts use Marlin, not this kernel.
* vllm-project#42996/vllm-project#46006 (PDL for DeepGEMM), vllm-project#46070 (revert vllm-project#42379), vllm-project#44109 (weightless RMSNorm), vllm-project#45277 (build-infra), torch-stable-ABI migration series [6/n]-[12/n], vllm-project#43827 (DSv4 TRTLLM attn, the vllm-project#43162 nested-layout trap): conflict / ABI-refactor / deletion / nested b12x-v0.23 layout absent here.

Methodology: clean cherry-pick != effective. The decisive gate for nearly every SKIP was CODE ROUTING, not the diff applying: this DSV4-on-GB10 build sends its hot kernels through b12x (prebuilt) + Triton sm12x fallbacks + Marlin MXFP4, while upstream CUTLASS/DeepGEMM/specialized-kernel paths are arch-gated to SM90/SM100 and do not execute on sm_121a. Picks that target those paths are no-ops here regardless of how cleanly they apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
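
For readers unfamiliar with the alignment issue described in the vllm#45466 entry above, the path selection can be illustrated with plain address arithmetic. This is a minimal Python sketch, not the code in vectorization_utils.cuh. The function name `choose_copy_path`, its parameters, the path labels and the head size of 36 are all hypothetical, chosen only for illustration.

```python
def choose_copy_path(in_addr: int, out_addr: int,
                     in_elem: int, out_elem: int,
                     vec_size: int, length: int) -> str:
    """Illustrative only: decide which copy path a row could take.

    in_addr / out_addr are byte addresses, in_elem / out_elem are element
    sizes in bytes, vec_size is elements per vector, length is elements per row.
    All names here are hypothetical, not the real kernel's API.
    """
    in_width = vec_size * in_elem    # bytes per vector load
    out_width = vec_size * out_elem  # bytes per vector store

    # Fast path: BOTH pointers sit on their own vector boundary and the
    # row is a whole number of vectors.
    if (in_addr % in_width == 0 and out_addr % out_width == 0
            and length % vec_size == 0):
        return "vector"

    # Scalar prefix: elements needed to bring `in` up to its next boundary.
    prefix_bytes = (-in_addr) % in_width
    if prefix_bytes % in_elem != 0:
        return "scalar"              # cannot reach a boundary in whole elements
    prefix = prefix_bytes // in_elem
    if prefix >= length:
        return "scalar"              # the row ends before any vector body starts

    # Post-prefix co-alignment: `out` must land on its boundary as well,
    # otherwise a vector store would be misaligned.
    if (out_addr + prefix * out_elem) % out_width != 0:
        return "scalar"
    return "prefix+vector"           # scalar prefix (possibly empty), vector body


# Hypothetical 2-byte elements, 8 per vector, head size 36: rows start every
# 72 bytes, so every other output row is 8 bytes off the 16-byte boundary.
rows = [choose_copy_path(0, r * 36 * 2, 2, 2, 8, 36) for r in range(4)]
print(rows)
```

Under these assumed sizes, rows whose output offset is not a multiple of 16 bytes fall back to the scalar path. That matches the behaviour the commit message describes: the same bytes are written, only the path differs.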
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Jun 21, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Jun 22, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request on Jun 24, 2026
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>