Skip to content

[ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X/MI325X#45681

Merged
tjtanaa merged 26 commits into
vllm-project:mainfrom
tuukkjs:dsv4-rocm-mi300x-fixes
Jun 18, 2026
Merged

[ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X/MI325X#45681
tjtanaa merged 26 commits into
vllm-project:mainfrom
tuukkjs:dsv4-rocm-mi300x-fixes

Conversation

@tuukkjs

@tuukkjs tuukkjs commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Purpose

This PR fixes DeepSeek V4 / DeepSeek V4 Flash functional issues on ROCm gfx942 (MI300X), while preserving the gfx950 path.

The main fixes are:

  • Avoid unsupported FP8 arithmetic when converting UE8M0 block scales.
  • Make the fused DeepSeek V4 qnorm/RoPE/KV-insert path emit the correct FP8 bytes on ROCm:
    • gfx942 uses FNUZ encoding.
    • gfx950 uses OCP encoding.
    • FNUZ uses max value 224.0.
  • Propagate the FNUZ/OCP choice through DeepSeek V4 cache utility helpers and sparse MLA decode paths.
  • Keep the sparse decode cache formats explicit: the main/SWA cache follows the platform FP8 encoding, while the compressed extra cache remains OCP.
  • Provide the gfx942 Triton fallback for fp8_mqa_logits so the DSv4 ROCm path does not depend on an immediate aiter update for that kernel.
  • Fix the fused-kernel test references:
    • allow a narrow 1 bf16 ULP tolerance for the RoPE region, where the GPU kernel and PyTorch reference can round opposite ways at ties;
    • use the correct ROCm FP8 store dtype in full-cache per-tensor FP8 reference checks, so gfx942 FNUZ bytes are decoded under the same encoding used by the kernel.

Continues #42893. AI assistance was used for this PR.

Co-authored-by: maeehart, markus.hartikainen@amd.com
Co-authored-by: ganyi, ygan@amd.com
Co-authored-by: Jin Tao, jintao12@amd.com

Test Plan

Build branch native extensions for gfx942 from:

vllm/vllm-openai-rocm:base-nightly-54bbf5166842932fa7abc34a14df850594daeb5e

Focused kernel test:

python3 -m pytest -q tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py

End-to-end validation:

  • Start deepseek-ai/DeepSeek-V4-Flash with TP4, VLLM_ROCM_USE_AITER=1, --kv-cache-dtype fp8_e4m3, --tokenizer-mode deepseek_v4, --moe-backend triton_unfused, --max-model-len 32768, and --max-num-batched-tokens 8192.
  • Disable the tilelang MHC path for this validation so the run uses the coherent non-tilelang MHC path; tilelang MHC correctness on gfx942 is a separate issue from this PR.
  • Run a prompt sanity check against the OpenAI-compatible completions endpoint.
  • Run full GSM8K, 5-shot, 1319 questions, local completions, concurrency 128.
  • Run vllm bench serve with random 8192 input / 1024 output, --ignore-eos, and max concurrency 1,2,4,8,16,32.

Serve command:

export VLLM_ROCM_USE_AITER=1

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --trust-remote-code \
  --tokenizer-mode deepseek_v4 \
  --moe-backend triton_unfused \
  --gpu-memory-utilization 0.85 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 8192 \
  --host 0.0.0.0 \
  --port 8000 \
  --disable-log-stats

GSM8K lm_eval command:

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V4-Flash,base_url=http://localhost:8000/v1/completions,num_concurrent=128,tokenized_requests=False,tokenizer_backend=none,max_length=32768 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 128

Bench command:

for C in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend vllm \
    --base-url http://127.0.0.1:8000 \
    --model deepseek-ai/DeepSeek-V4-Flash \
    --tokenizer deepseek-ai/DeepSeek-V4-Flash \
    --trust-remote-code \
    --dataset-name random \
    --random-input-len 8192 \
    --random-output-len 1024 \
    --num-prompts $((3*C)) \
    --max-concurrency "${C}" \
    --ignore-eos \
    --seed 0
done

Test Result

Validation stack for the MI300X run:

vllm/vllm-openai-rocm:base-nightly-54bbf5166842932fa7abc34a14df850594daeb5e
  + this PR branch source
  + this PR branch native extensions rebuilt for gfx942
  + tilelang uninstalled for server/GSM8K validation to use the coherent non-tilelang MHC path

Focused Kernel Test

On gfx942:

tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py
155 passed

The full-cache fp8/bf16 cases are included in this result.

MI300X End-To-End Validation

Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend

GSM8K full 1319-question, 5-shot, local completions, concurrency 128:

Task Filter n-shot Metric Value Stderr
gsm8k flexible-extract 5 exact_match 0.9568 0.0056
gsm8k strict-match 5 exact_match 0.9568 0.0056

Clean vllm bench serve, random 8192 input / 1024 output, --ignore-eos.
All rows had zero failed requests:

Max concurrency Successful Failed Output tok/s Total tok/s Median TTFT (ms) Median TPOT (ms)
1 3 0 17.16 154.41 196.24 58.16
2 6 0 33.50 301.48 317.18 59.46
4 12 0 65.47 589.22 535.18 60.63
8 24 0 125.91 1133.18 912.01 62.71
16 48 0 230.80 2077.16 1676.96 67.79
32 96 0 404.87 3643.80 3122.19 76.03

Additional Validation With PR #43950 Applied

This run validates the same PR branch with vLLM PR #43950 applied locally, leaving TileLang installed. PR #43950 changes ROCm MHC dispatch so the MI300X/AITER path uses AITER MHC when available, rather than selecting the TileLang fused MHC path just because TileLang is importable.

Validation stack:

vllm/vllm-openai-rocm:base-nightly-b8336c3c7c298e0878f22a7bf70f4e295b2f4e01
  + this PR branch source
  + vLLM PR #43950 applied locally
  + native extensions rebuilt for gfx942
  + tilelang installed and importable: tilelang==0.1.10

Focused Kernel Test With PR #43950

On gfx942:

tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py
155 passed

MI300X End-To-End Validation With PR #43950

Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend, VLLM_ROCM_USE_AITER=1, tilelang installed

Prompt sanity check:

Q: What is 17 + 25? Answer briefly.
A: 42.

GSM8K full 1319-question, 5-shot, local completions, concurrency 128:

Task Filter n-shot Metric Value Stderr
gsm8k flexible-extract 5 exact_match 0.9560 0.0056
gsm8k strict-match 5 exact_match 0.9553 0.0057

Clean vllm bench serve, random 8192 input / 1024 output, --ignore-eos.
All rows had zero failed requests:

Max concurrency Successful Failed Output tok/s Total tok/s Median TTFT (ms) Median TPOT (ms)
1 3 0 31.09 279.84 1473.94 30.35
2 6 0 60.26 542.34 1570.09 31.21
4 12 0 115.31 1037.76 2394.49 31.62
8 24 0 204.51 1840.55 3127.39 36.02
16 48 0 327.45 2947.09 3288.08 43.95
32 96 0 478.61 4307.45 3390.08 61.28
ganyi1996ppo and others added 15 commits May 25, 2026 10:28
DSv4 on AMD MI300X (gfx942) hits several FP8-related issues that this
commit addresses:

1. **fp8_utils.py**: ``process_fp8_weight_block_strategy`` calls
   ``normalize_e4m3fn_to_e4m3fnuz`` which doubles ``weight_scale`` by
   ``weight_scale * 2.0``. On models with UE8M0 scales
   (``torch.float8_e8m0fnu``), that ``mul`` is not implemented on
   CUDA/HIP and load aborts with::

     NotImplementedError: "mul_cuda" not implemented for 'Float8_e8m0fnu'

   UE8M0 stores power-of-2 exponent values (2^(exp-127)) with no
   mantissa, so doubling the scale is equivalent to incrementing the
   exponent byte by 1. Handle the UE8M0 case explicitly and fall back
   to the float path otherwise.

2. **fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu**: gate ``kFp8Max``
   to match the FNUZ/OCP path actually taken on each ROCm arch
   (240 on gfx942 FNUZ, 448 on gfx950 OCP).

3. **deepseek_v4_attention.py** + **cache_utils.py**: small MI300X path
   fixes that go with the FNUZ scale handling above.

Co-authored-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
``fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu`` selected its FP8
type and ``kFp8Max`` based purely on the HIP build macro
``HIP_FP8_TYPE_OCP``. That macro is set by the HIP runtime version,
not by the target GPU arch -- on a HIP build that defines
``HIP_FP8_TYPE_OCP``, the kernel was using OCP E4M3 / ``448.0`` even on
gfx942 (MI300X), whose MFMA instructions only accept FNUZ E4M3.

The rest of vLLM's gfx942 path (Triton sparse-MLA, indexer Q quant,
``current_platform.fp8_dtype()``) all use FNUZ on this arch, so the C++
writer was producing K-cache entries the FNUZ readers misinterpret.

Gate the OCP branch on ``defined(__gfx950__)`` so:

* gfx942 (MI300X) -> ``__hip_fp8_e4m3_fnuz`` + ``kFp8Max = 240.0f``
* gfx950 (MI355X) -> ``__hip_fp8_e4m3``       + ``kFp8Max = 448.0f``

This matches the encoding chosen elsewhere on each arch.

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The DSv4 sparse MLA Triton kernels added in vllm-project#41812 (and the matching
turboquant store/decode kernels) bitcast uint8 to ``tl.float8e4b15``
when ``IS_FNUZ`` is true. ``float8e4b15`` is not a real Triton type;
on AMD gfx942 (MI300X) Triton only supports the FP8 dtypes listed in
the error from triton/compiler:

  ('fp8e4b8', 'fp8e4nv', 'fp8e5', 'fp8e5b16')

The correct FNUZ E4M3 type is ``tl.float8e4b8`` (bias 8, matches the
PyTorch ``torch.float8_e4m3fnuz`` used elsewhere on the MI300 path).
The non-FNUZ branch already correctly uses ``tl.float8e4nv``.

Without this fix, the very first profile run on MI300X with sparse
MLA fails inside the dequant/gather kernel:

  type fp8e4b15 not supported in this architecture.

This swaps all FNUZ branches to ``tl.float8e4b8``. Verified that
``IS_FNUZ`` is gated on ``current_platform.fp8_dtype() ==
torch.float8_e4m3fnuz`` so it never fires on OCP hardware.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill_attn_impl`` in
``vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse_dsv4.py`` is the
actual ROCm path reached from ``DeepseekV4MLAAttention.forward`` at
``deepseek_v4_attention.py:762`` (``current_platform.is_rocm()``).
``DeepseekV4MLAAttention._forward_prefill`` in the same file is dead
code on ROCm, so the previous ``kv.zero_()`` patch (commit 36a7037)
fixed only the generic path.

This ROCm-only forward also gets ``kv`` via
``current_workspace_manager().get_simultaneous(...)`` -- uninitialized
shared memory reused across requests and layers -- writes only the
compressed-K prefix and the SWA window for each chunk row, then reads
the entire ``kv.view(-1, 1, head_dim)`` through ragged indices that
can land on the holes for very short sequences. The result is exactly
the symptom we observe on MI300X DSv4-Flash: 10 identical temperature=0
``/v1/completions`` calls produce 10 distinct first tokens.

Apply the same zero-init here. Cost is one bf16 fill of the workspace
tile, dwarfed by the FP8 dequant + sparse attention.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
PR vllm-project#42893 fixed the C++ SWA-K-cache encoder so it writes FNUZ E4M3 bytes
on gfx942 (and OCP on gfx950) and updated the *generic*
``DeepseekV4MLAAttention._forward_prefill`` to call
``dequantize_and_gather_k_cache(..., use_fnuz=is_fp8_fnuz())`` for SWA
and ``use_fnuz=False`` for the Triton-OCP-encoded compressed K cache.
Two FP8-format mismatches remained on the actual ROCm DSv4 path
(``DeepseekV4ROCMAiterMLASparseImpl``):

1. The public ``dequantize_and_gather_k_cache`` wrapper in
   ``vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py`` did not
   accept ``use_fnuz`` -- it silently dropped the kwarg when forwarding
   to ``dequantize_and_gather_k_cache_triton`` (which defaults to
   False). The ROCm prefill called the wrapper without ``use_fnuz``,
   so the SWA K cache (FNUZ on gfx942) was being read as OCP, scaling
   every K vector by ~448/240 in prefill attention.

2. ``_sparse_attn_decode_ragged_kernel`` in
   ``vllm/v1/attention/ops/rocm_aiter_mla_sparse.py`` decoded both the
   SWA (FNUZ on gfx942) and the compressed (always OCP) K caches with
   a single ``IS_FNUZ`` constexpr, so on MI300X the compressed-side
   branch reinterpreted OCP bytes as FNUZ -- the same encoder/decoder
   mismatch as (1) in the opposite direction (~240/448) on the decode
   side.

Together these scrambled K vectors going into both prefill and decode
attention, producing the GSM8K=0.005 gibberish PR vllm-project#42893 documented
but could not explain with eager-vs-graphs.

This commit:

* Adds ``use_fnuz`` to the wrapper and forwards it to the Triton
  implementation (the cuteDSL path is dead on ROCm anyway).
* Splits ``_sparse_attn_decode_ragged_kernel``'s ``IS_FNUZ`` into
  per-cache flags ``IS_FNUZ_MAIN`` (SWA) and ``IS_FNUZ_EXTRA``
  (compressed) so each cache is decoded with its own encoder's format.
* Wires ``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill`` to pass
  ``use_fnuz=False`` for the compressed call (Triton-OCP encoder) and
  ``use_fnuz=current_platform.is_fp8_fnuz()`` for the SWA call (C++
  FNUZ-on-gfx942 encoder), matching the asymmetry that PR vllm-project#42893's
  "[ROCm][DSv4] Fix compressed K cache dequant to match Triton OCP
  encoder" introduced for the generic path.

Validated on 1 node x 4 x MI300X (gfx942), TP=4,
VLLM_ROCM_USE_AITER=1, ``deepseek-ai/DeepSeek-V4-Flash``, both eager
and CUDA-graphs ``FULL_AND_PIECEWISE`` configs from PR vllm-project#42810. GSM8K
5-shot, n=200, num_concurrent=32 against /v1/completions:

| Mode  | exact_match | Stderr   |
| ----- | ----------- | -------- |
| Eager | 0.955       | +/-0.0147 |
| Graph | 0.955       | +/-0.0147 |

vs. the pre-fix 0.005 PR vllm-project#42893 reported on the same configuration.
The two modes match each other to all three reported digits on both
strict-match and flexible-extract filters.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
…ly path)

PR review on vllm-project#42893 (gemini-code-assist) flagged that the three
turboquant changes in commit 2bef91e ("[ROCm][DSv4] Use tl.float8e4b8
for FNUZ on MI300X sparse MLA kernels") are dead code on MI300X: they
sit inside ``if FP8_E4B15:`` branches, and FP8_E4B15 is the constexpr
returned by ``_use_fp8_e4b15(device)`` -- which is 1 only when
``torch.cuda.get_device_capability() < (8, 9)``. MI300X (gfx942)
reports cap >= (9, x), so FP8_E4B15 = 0 on every AMD platform and the
patched FNUZ branch is never executed.

More importantly, the changes are *wrong* on the hardware where
FP8_E4B15 = 1 -- NVIDIA Ampere/Ada (sm < 8.9). On those cards
``tl.float8e4b15`` (E4M3 with bias 15) is the correct Triton FP8 type
for software emulation; ``tl.float8e4b8`` (E4M3 with bias 8) is the
AMD-FNUZ-specific type and Triton on NVIDIA Ampere/Ada will reject it
with the same "type not supported in this architecture" error the
original commit was trying to fix.

The original commit message conflated two unrelated gating constexprs
(``IS_FNUZ`` in rocm_aiter_mla_sparse.py vs ``FP8_E4B15`` in the
turboquant kernels). Only the rocm_aiter_mla_sparse.py hunks of
2bef91e are actually correct -- those are gated on
``current_platform.fp8_dtype() == torch.float8_e4m3fnuz`` /
``current_platform.is_fp8_fnuz()`` and are the ones that actually fix
the MI300X sparse-MLA decode failure.

Revert just the three turboquant lines back to ``tl.float8e4b15`` so
the NVIDIA Ampere/Ada FP8 path is preserved. The MI300X fix in
``_sparse_attn_decode_ragged_kernel`` (the dequant/gather kernel cited
in the original commit message) is unchanged.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The AITER wrapper bundled in the currently-pinned aiter wheel
launches fp8_mqa_logits with (BLOCK_KV=128, num_stages=2) on gfx942.
For the DSv4 sparse indexer shape (NUM_HEADS=64, HEAD_SIZE=128) this
double-buffered KV tile + fp32 scores accumulator + Q tile pushes
Triton's LDS request to 96 KiB, which exceeds MI300X's 64 KiB per
CU. The launch JIT-aborts with OutOfResources on the first
inference. The fix is upstreamed as ROCm/aiter#3257 but until vLLM
bumps to an AITER version that contains it, this patch ships the
same kernel + tile-size logic vendored into vllm/.

- Add vllm/v1/attention/ops/triton_fp8_mqa_logits.py with a
  byte-for-byte copy of AITER's @triton.jit kernel and a
  Python wrapper that selects (BLOCK_KV=64, num_stages=1) (~33 KiB)
  when the default tile would not fit on gfx942 (see module
  docstring for the LDS budget calculation).
- Route rocm_fp8_mqa_logits to the vendored kernel on gfx942 when
  AITER ops are enabled. gfx950+ and CUDA still use the upstream
  AITER wrapper (which has dedicated Gluon kernels this vendor
  copy does not include).
- Fix a latent broadcasting bug in the torch reference fallback:
  the per-KV-token scale arrives as [N, 1] (a [N, 4] uint8 buffer
  view-cast to fp32) and was being multiplied against an [H, M, N]
  score tensor where PyTorch right-aligns [N, 1] against the M
  dim. Flatten to [N] so the multiply lines up with the last
  axis. Also drop a hard-coded 'cuda' device on the index tensors
  so the fallback works on ROCm with HIP devices.

This entire patch is intended to be reverted once vLLM picks up an
AITER version that includes ROCm/aiter#3257.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
…lm-project#42893)

Switch the FNUZ branch of `kFp8Max` to 224.0 (was 240.0, the FNUZ dtype's raw representable max). 224.0 is what the rest of vLLM's FNUZ pipeline uses -- see `vllm/model_executor/layers/quantization/utils/fp8_utils.py:412-417`, which notes that 240.0 hurts dynamic-quant accuracy. The OCP branch (gfx950 + NVIDIA) keeps 448.0.

Make the unit test honor the same split: add an optional `use_fnuz` constexpr to `quantize_and_insert_k_kernel` (default False, no production caller affected) and pick the encoding from `current_platform.is_fp8_fnuz()`. Byte-exact comparison now succeeds on both gfx942 and gfx950.

Verified: 36/36 unit tests pass on MI300X (gfx942) and MI355X (gfx950).

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…review vllm-project#42893)

Replace the long rationale around ``kv.zero_()`` (in both prefill paths) with a brief TODO that names the proper fix: mask invalid rows in the indexer (score = -inf) or in the sparse-attention kernel (skip indices >= valid_len). The current zero is the minimal interim workaround; the underlying bug is arch-independent (uninitialized workspace + indexer that scores the entire M dim) so the call stays unchanged on every platform until the indexer/kernel fix lands. No behavior change.

Also condense the duplicate FNUZ-vs-OCP comments at the dequant call sites and in ``_sparse_attn_decode_ragged_kernel``: the wrapper docstring already explains the asymmetry, so per-call-site repetition was just noise.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve conflicts after the fused DeepSeek V4 kernel moved under libtorch_stable and upstream sparse MLA code changed.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Keep the upstream wrapper-constructor and reinterpret_cast FP8 byte emission in rocm_cvt_float_to_fp8_e4m3 so PR vllm-project#42893 stays focused on the FNUZ/OCP gating and kFp8Max=224 correctness fixes. The __hip_cvt_float_to_fp8-based helper is preserved in the issue notes and can be proposed separately if current ROCm builds require it.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The fused DeepSeek V4 K-cache test compared the kernel output to a PyTorch reference with exact bf16 byte equality on the RoPE region. The RoPE rotation is computed in fp32 on both sides and rounded to bf16; near a round-to-nearest tie, the fp32 GPU kernel and the fp32 torch reference can land on opposite sides and differ by one bf16 ULP.

Replace exact cache byte equality with a decoded parity check: the deterministic NoPE UE8M0 FP8 round-trip must stay bit-identical, while the bf16 RoPE region is compared within 1 ULP. Apply the same check to the KV-path, DP-padding, and combined tests.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The per-chunk sparse-MLA prefill workspace is torch.empty and only its compressed-K prefix and SWA window are written, leaving unwritten holes. The prefill path previously zeroed the whole workspace defensively, which adds overhead.

Combined indices produced by combine_topk_swa_indices only address written ranges: compressed topK entries outside [0, N) are dropped and SWA indices are bounded by gather_len. Runtime instrumentation on gfx942 across short and mixed-length sequences at low and high concurrency found zero out-of-range reads, so the defensive zeroing is unnecessary.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The full-cache per-tensor FP8 op writes FNUZ-encoded fp8 bytes on gfx942 into float8_e4m3fn-typed tensors, matching vLLM ROCm cache convention. The reference quantized with .to(torch.float8_e4m3fn), so on gfx942 it decoded the kernel FNUZ bytes under the wrong encoding.

Encode the reference under FP8_STORE_DTYPE, reinterpret the kernel e4m3fn-typed outputs under the same scheme, and compare the decoded values. Keep bit-exact checks for the deterministic NoPE K-cache region and allow 1 fp8 ULP for Q and RoPE K-cache regions that can fall on opposite sides of a round-to-nearest tie.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
@mergify mergify Bot added deepseek Related to DeepSeek models rocm Related to AMD ROCm labels Jun 15, 2026
@mergify mergify Bot added the v1 label Jun 15, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 15, 2026
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
@tuukkjs tuukkjs force-pushed the dsv4-rocm-mi300x-fixes branch from de244f5 to bb41eb9 Compare June 16, 2026 06:49
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
@tuukkjs tuukkjs force-pushed the dsv4-rocm-mi300x-fixes branch from bb41eb9 to 63e107f Compare June 16, 2026 06:56
@tuukkjs tuukkjs marked this pull request as ready for review June 16, 2026 09:10
tuukkjs added 2 commits June 17, 2026 13:29
Keep the temporary vendored-kernel notes focused on the removal condition and avoid implying ROCm/aiter#3257 has already merged.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Make gfx942 the explicit FNUZ case for the fused DeepSeek V4 KV insert kernel and use the OCP path for other ROCm targets.

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
@tuukkjs

tuukkjs commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author
  • Disable the tilelang MHC path for this validation so the run uses the coherent non-tilelang MHC path; tilelang MHC correctness on gfx942 is a separate issue from this PR.

I noticed this line. If this is the case we should add the _on_gfx942 condition to the mhc so that it is using torch implementation instead.

So that we can use it on mi300x out of the box, without the need to uninstall tilelang to make it work on mi300x. We must ensure that the mi355x still uses the tilelang code path.

Yes true. I opened a separate PR for the fix: #45931.

It adds an MHC-specific TileLang capability check that disables TileLang MHC only on gfx942, so MI300X/MI325X fall back to the existing torch/triton implementation while gfx950/MI355X keeps using TileLang.

@tjtanaa tjtanaa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks. Let's land this and I will quickly look at the tilelang PR. All coauthors are included.

@tjtanaa tjtanaa added ready ONLY add when PR is ready to merge/full CI is needed DSv4 labels Jun 17, 2026
@mergify

mergify Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@tjtanaa tjtanaa enabled auto-merge (squash) June 17, 2026 16:33
@mergify

mergify Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify

mergify Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify

mergify Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify

mergify Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify

mergify Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
import torch.nn as nn

from vllm.models.deepseek_v4.common.ops import fused_inv_rope_fp8_quant
from vllm.models.deepseek_v4.common.ops.fused_inv_rope_fp8_quant import (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A remark:

import needs to be made explicit else mypy fail to resolve it and cause precommit error.

$ mypy --python-version 3.10 
vllm/models/deepseek_v4/nvidia/ops/o_proj.py:48: error: Module not callable  [operator]
Found 1 error in 1 file (checked 686 source files)
@tjtanaa tjtanaa merged commit afdcbd5 into vllm-project:main Jun 18, 2026
195 of 196 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 18, 2026
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…project#45681)

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Jin Tao <jintao12@amd.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…project#45681)

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Jin Tao <jintao12@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…project#45681)

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Jin Tao <jintao12@amd.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…project#45681)

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Jin Tao <jintao12@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deepseek Related to DeepSeek models DSv4 ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

5 participants