[ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X/MI325X#45681
Conversation
DSv4 on AMD MI300X (gfx942) hits several FP8-related issues that this
commit addresses:
1. **fp8_utils.py**: ``process_fp8_weight_block_strategy`` calls
``normalize_e4m3fn_to_e4m3fnuz`` which doubles ``weight_scale`` by
``weight_scale * 2.0``. On models with UE8M0 scales
(``torch.float8_e8m0fnu``), that ``mul`` is not implemented on
CUDA/HIP and load aborts with::
NotImplementedError: "mul_cuda" not implemented for 'Float8_e8m0fnu'
UE8M0 stores power-of-2 exponent values (2^(exp-127)) with no
mantissa, so doubling the scale is equivalent to incrementing the
exponent byte by 1. Handle the UE8M0 case explicitly and fall back
to the float path otherwise.
2. **fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu**: gate ``kFp8Max``
to match the FNUZ/OCP path actually taken on each ROCm arch
(240 on gfx942 FNUZ, 448 on gfx950 OCP).
3. **deepseek_v4_attention.py** + **cache_utils.py**: small MI300X path
fixes that go with the FNUZ scale handling above.
Co-authored-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
``fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu`` selected its FP8 type and ``kFp8Max`` based purely on the HIP build macro ``HIP_FP8_TYPE_OCP``. That macro is set by the HIP runtime version, not by the target GPU arch -- on a HIP build that defines ``HIP_FP8_TYPE_OCP``, the kernel was using OCP E4M3 / ``448.0`` even on gfx942 (MI300X), whose MFMA instructions only accept FNUZ E4M3. The rest of vLLM's gfx942 path (Triton sparse-MLA, indexer Q quant, ``current_platform.fp8_dtype()``) all use FNUZ on this arch, so the C++ writer was producing K-cache entries the FNUZ readers misinterpret. Gate the OCP branch on ``defined(__gfx950__)`` so: * gfx942 (MI300X) -> ``__hip_fp8_e4m3_fnuz`` + ``kFp8Max = 240.0f`` * gfx950 (MI355X) -> ``__hip_fp8_e4m3`` + ``kFp8Max = 448.0f`` This matches the encoding chosen elsewhere on each arch. Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
The DSv4 sparse MLA Triton kernels added in vllm-project#41812 (and the matching turboquant store/decode kernels) bitcast uint8 to ``tl.float8e4b15`` when ``IS_FNUZ`` is true. ``float8e4b15`` is not a real Triton type; on AMD gfx942 (MI300X) Triton only supports the FP8 dtypes listed in the error from triton/compiler: ('fp8e4b8', 'fp8e4nv', 'fp8e5', 'fp8e5b16') The correct FNUZ E4M3 type is ``tl.float8e4b8`` (bias 8, matches the PyTorch ``torch.float8_e4m3fnuz`` used elsewhere on the MI300 path). The non-FNUZ branch already correctly uses ``tl.float8e4nv``. Without this fix, the very first profile run on MI300X with sparse MLA fails inside the dequant/gather kernel: type fp8e4b15 not supported in this architecture. This swaps all FNUZ branches to ``tl.float8e4b8``. Verified that ``IS_FNUZ`` is gated on ``current_platform.fp8_dtype() == torch.float8_e4m3fnuz`` so it never fires on OCP hardware. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill_attn_impl`` in ``vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse_dsv4.py`` is the actual ROCm path reached from ``DeepseekV4MLAAttention.forward`` at ``deepseek_v4_attention.py:762`` (``current_platform.is_rocm()``). ``DeepseekV4MLAAttention._forward_prefill`` in the same file is dead code on ROCm, so the previous ``kv.zero_()`` patch (commit 36a7037) fixed only the generic path. This ROCm-only forward also gets ``kv`` via ``current_workspace_manager().get_simultaneous(...)`` -- uninitialized shared memory reused across requests and layers -- writes only the compressed-K prefix and the SWA window for each chunk row, then reads the entire ``kv.view(-1, 1, head_dim)`` through ragged indices that can land on the holes for very short sequences. The result is exactly the symptom we observe on MI300X DSv4-Flash: 10 identical temperature=0 ``/v1/completions`` calls produce 10 distinct first tokens. Apply the same zero-init here. Cost is one bf16 fill of the workspace tile, dwarfed by the FP8 dequant + sparse attention. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
PR vllm-project#42893 fixed the C++ SWA-K-cache encoder so it writes FNUZ E4M3 bytes on gfx942 (and OCP on gfx950) and updated the *generic* ``DeepseekV4MLAAttention._forward_prefill`` to call ``dequantize_and_gather_k_cache(..., use_fnuz=is_fp8_fnuz())`` for SWA and ``use_fnuz=False`` for the Triton-OCP-encoded compressed K cache. Two FP8-format mismatches remained on the actual ROCm DSv4 path (``DeepseekV4ROCMAiterMLASparseImpl``): 1. The public ``dequantize_and_gather_k_cache`` wrapper in ``vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py`` did not accept ``use_fnuz`` -- it silently dropped the kwarg when forwarding to ``dequantize_and_gather_k_cache_triton`` (which defaults to False). The ROCm prefill called the wrapper without ``use_fnuz``, so the SWA K cache (FNUZ on gfx942) was being read as OCP, scaling every K vector by ~448/240 in prefill attention. 2. ``_sparse_attn_decode_ragged_kernel`` in ``vllm/v1/attention/ops/rocm_aiter_mla_sparse.py`` decoded both the SWA (FNUZ on gfx942) and the compressed (always OCP) K caches with a single ``IS_FNUZ`` constexpr, so on MI300X the compressed-side branch reinterpreted OCP bytes as FNUZ -- the same encoder/decoder mismatch as (1) in the opposite direction (~240/448) on the decode side. Together these scrambled K vectors going into both prefill and decode attention, producing the GSM8K=0.005 gibberish PR vllm-project#42893 documented but could not explain with eager-vs-graphs. This commit: * Adds ``use_fnuz`` to the wrapper and forwards it to the Triton implementation (the cuteDSL path is dead on ROCm anyway). * Splits ``_sparse_attn_decode_ragged_kernel``'s ``IS_FNUZ`` into per-cache flags ``IS_FNUZ_MAIN`` (SWA) and ``IS_FNUZ_EXTRA`` (compressed) so each cache is decoded with its own encoder's format. * Wires ``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill`` to pass ``use_fnuz=False`` for the compressed call (Triton-OCP encoder) and ``use_fnuz=current_platform.is_fp8_fnuz()`` for the SWA call (C++ FNUZ-on-gfx942 encoder), matching the asymmetry that PR vllm-project#42893's "[ROCm][DSv4] Fix compressed K cache dequant to match Triton OCP encoder" introduced for the generic path. Validated on 1 node x 4 x MI300X (gfx942), TP=4, VLLM_ROCM_USE_AITER=1, ``deepseek-ai/DeepSeek-V4-Flash``, both eager and CUDA-graphs ``FULL_AND_PIECEWISE`` configs from PR vllm-project#42810. GSM8K 5-shot, n=200, num_concurrent=32 against /v1/completions: | Mode | exact_match | Stderr | | ----- | ----------- | -------- | | Eager | 0.955 | +/-0.0147 | | Graph | 0.955 | +/-0.0147 | vs. the pre-fix 0.005 PR vllm-project#42893 reported on the same configuration. The two modes match each other to all three reported digits on both strict-match and flexible-extract filters. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
…ly path) PR review on vllm-project#42893 (gemini-code-assist) flagged that the three turboquant changes in commit 2bef91e ("[ROCm][DSv4] Use tl.float8e4b8 for FNUZ on MI300X sparse MLA kernels") are dead code on MI300X: they sit inside ``if FP8_E4B15:`` branches, and FP8_E4B15 is the constexpr returned by ``_use_fp8_e4b15(device)`` -- which is 1 only when ``torch.cuda.get_device_capability() < (8, 9)``. MI300X (gfx942) reports cap >= (9, x), so FP8_E4B15 = 0 on every AMD platform and the patched FNUZ branch is never executed. More importantly, the changes are *wrong* on the hardware where FP8_E4B15 = 1 -- NVIDIA Ampere/Ada (sm < 8.9). On those cards ``tl.float8e4b15`` (E4M3 with bias 15) is the correct Triton FP8 type for software emulation; ``tl.float8e4b8`` (E4M3 with bias 8) is the AMD-FNUZ-specific type and Triton on NVIDIA Ampere/Ada will reject it with the same "type not supported in this architecture" error the original commit was trying to fix. The original commit message conflated two unrelated gating constexprs (``IS_FNUZ`` in rocm_aiter_mla_sparse.py vs ``FP8_E4B15`` in the turboquant kernels). Only the rocm_aiter_mla_sparse.py hunks of 2bef91e are actually correct -- those are gated on ``current_platform.fp8_dtype() == torch.float8_e4m3fnuz`` / ``current_platform.is_fp8_fnuz()`` and are the ones that actually fix the MI300X sparse-MLA decode failure. Revert just the three turboquant lines back to ``tl.float8e4b15`` so the NVIDIA Ampere/Ada FP8 path is preserved. The MI300X fix in ``_sparse_attn_decode_ragged_kernel`` (the dequant/gather kernel cited in the original commit message) is unchanged. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
The AITER wrapper bundled in the currently-pinned aiter wheel launches fp8_mqa_logits with (BLOCK_KV=128, num_stages=2) on gfx942. For the DSv4 sparse indexer shape (NUM_HEADS=64, HEAD_SIZE=128) this double-buffered KV tile + fp32 scores accumulator + Q tile pushes Triton's LDS request to 96 KiB, which exceeds MI300X's 64 KiB per CU. The launch JIT-aborts with OutOfResources on the first inference. The fix is upstreamed as ROCm/aiter#3257 but until vLLM bumps to an AITER version that contains it, this patch ships the same kernel + tile-size logic vendored into vllm/. - Add vllm/v1/attention/ops/triton_fp8_mqa_logits.py with a byte-for-byte copy of AITER's @triton.jit kernel and a Python wrapper that selects (BLOCK_KV=64, num_stages=1) (~33 KiB) when the default tile would not fit on gfx942 (see module docstring for the LDS budget calculation). - Route rocm_fp8_mqa_logits to the vendored kernel on gfx942 when AITER ops are enabled. gfx950+ and CUDA still use the upstream AITER wrapper (which has dedicated Gluon kernels this vendor copy does not include). - Fix a latent broadcasting bug in the torch reference fallback: the per-KV-token scale arrives as [N, 1] (a [N, 4] uint8 buffer view-cast to fp32) and was being multiplied against an [H, M, N] score tensor where PyTorch right-aligns [N, 1] against the M dim. Flatten to [N] so the multiply lines up with the last axis. Also drop a hard-coded 'cuda' device on the index tensors so the fallback works on ROCm with HIP devices. This entire patch is intended to be reverted once vLLM picks up an AITER version that includes ROCm/aiter#3257. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
…lm-project#42893) Switch the FNUZ branch of `kFp8Max` to 224.0 (was 240.0, the FNUZ dtype's raw representable max). 224.0 is what the rest of vLLM's FNUZ pipeline uses -- see `vllm/model_executor/layers/quantization/utils/fp8_utils.py:412-417`, which notes that 240.0 hurts dynamic-quant accuracy. The OCP branch (gfx950 + NVIDIA) keeps 448.0. Make the unit test honor the same split: add an optional `use_fnuz` constexpr to `quantize_and_insert_k_kernel` (default False, no production caller affected) and pick the encoding from `current_platform.is_fp8_fnuz()`. Byte-exact comparison now succeeds on both gfx942 and gfx950. Verified: 36/36 unit tests pass on MI300X (gfx942) and MI355X (gfx950). Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…review vllm-project#42893) Replace the long rationale around ``kv.zero_()`` (in both prefill paths) with a brief TODO that names the proper fix: mask invalid rows in the indexer (score = -inf) or in the sparse-attention kernel (skip indices >= valid_len). The current zero is the minimal interim workaround; the underlying bug is arch-independent (uninitialized workspace + indexer that scores the entire M dim) so the call stays unchanged on every platform until the indexer/kernel fix lands. No behavior change. Also condense the duplicate FNUZ-vs-OCP comments at the dequant call sites and in ``_sparse_attn_decode_ragged_kernel``: the wrapper docstring already explains the asymmetry, so per-call-site repetition was just noise. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve conflicts after the fused DeepSeek V4 kernel moved under libtorch_stable and upstream sparse MLA code changed. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Keep the upstream wrapper-constructor and reinterpret_cast FP8 byte emission in rocm_cvt_float_to_fp8_e4m3 so PR vllm-project#42893 stays focused on the FNUZ/OCP gating and kFp8Max=224 correctness fixes. The __hip_cvt_float_to_fp8-based helper is preserved in the issue notes and can be proposed separately if current ROCm builds require it. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The fused DeepSeek V4 K-cache test compared the kernel output to a PyTorch reference with exact bf16 byte equality on the RoPE region. The RoPE rotation is computed in fp32 on both sides and rounded to bf16; near a round-to-nearest tie, the fp32 GPU kernel and the fp32 torch reference can land on opposite sides and differ by one bf16 ULP. Replace exact cache byte equality with a decoded parity check: the deterministic NoPE UE8M0 FP8 round-trip must stay bit-identical, while the bf16 RoPE region is compared within 1 ULP. Apply the same check to the KV-path, DP-padding, and combined tests. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The per-chunk sparse-MLA prefill workspace is torch.empty and only its compressed-K prefix and SWA window are written, leaving unwritten holes. The prefill path previously zeroed the whole workspace defensively, which adds overhead. Combined indices produced by combine_topk_swa_indices only address written ranges: compressed topK entries outside [0, N) are dropped and SWA indices are bounded by gather_len. Runtime instrumentation on gfx942 across short and mixed-length sequences at low and high concurrency found zero out-of-range reads, so the defensive zeroing is unnecessary. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
The full-cache per-tensor FP8 op writes FNUZ-encoded fp8 bytes on gfx942 into float8_e4m3fn-typed tensors, matching vLLM ROCm cache convention. The reference quantized with .to(torch.float8_e4m3fn), so on gfx942 it decoded the kernel FNUZ bytes under the wrong encoding. Encode the reference under FP8_STORE_DTYPE, reinterpret the kernel e4m3fn-typed outputs under the same scheme, and compare the decoded values. Keep bit-exact checks for the deterministic NoPE K-cache region and allow 1 fp8 ULP for Q and RoPE K-cache regions that can fall on opposite sides of a round-to-nearest tie. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
de244f5 to
bb41eb9
Compare
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
bb41eb9 to
63e107f
Compare
Keep the temporary vendored-kernel notes focused on the removal condition and avoid implying ROCm/aiter#3257 has already merged. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Make gfx942 the explicit FNUZ case for the fused DeepSeek V4 KV insert kernel and use the OCP path for other ROCm targets. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Yes true. I opened a separate PR for the fix: #45931. It adds an MHC-specific TileLang capability check that disables TileLang MHC only on gfx942, so MI300X/MI325X fall back to the existing torch/triton implementation while gfx950/MI355X keeps using TileLang. |
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
Hi @tuukkjs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
| import torch.nn as nn | ||
|
|
||
| from vllm.models.deepseek_v4.common.ops import fused_inv_rope_fp8_quant | ||
| from vllm.models.deepseek_v4.common.ops.fused_inv_rope_fp8_quant import ( |
There was a problem hiding this comment.
A remark:
import needs to be made explicit else mypy fail to resolve it and cause precommit error.
$ mypy --python-version 3.10
vllm/models/deepseek_v4/nvidia/ops/o_proj.py:48: error: Module not callable [operator]
Found 1 error in 1 file (checked 686 source files)
…project#45681) Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Jin Tao <jintao12@amd.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
…project#45681) Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Jin Tao <jintao12@amd.com>
…project#45681) Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Jin Tao <jintao12@amd.com>
…project#45681) Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Jin Tao <jintao12@amd.com>
Purpose
This PR fixes DeepSeek V4 / DeepSeek V4 Flash functional issues on ROCm gfx942 (MI300X), while preserving the gfx950 path.
The main fixes are:
224.0.fp8_mqa_logitsso the DSv4 ROCm path does not depend on an immediate aiter update for that kernel.Continues #42893. AI assistance was used for this PR.
Co-authored-by: maeehart, markus.hartikainen@amd.com
Co-authored-by: ganyi, ygan@amd.com
Co-authored-by: Jin Tao, jintao12@amd.com
Test Plan
Build branch native extensions for gfx942 from:
Focused kernel test:
End-to-end validation:
deepseek-ai/DeepSeek-V4-Flashwith TP4,VLLM_ROCM_USE_AITER=1,--kv-cache-dtype fp8_e4m3,--tokenizer-mode deepseek_v4,--moe-backend triton_unfused,--max-model-len 32768, and--max-num-batched-tokens 8192.vllm bench servewith random 8192 input / 1024 output,--ignore-eos, and max concurrency1,2,4,8,16,32.Serve command:
export VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V4-Flash \ --tensor-parallel-size 4 \ --kv-cache-dtype fp8_e4m3 \ --max-model-len 32768 \ --trust-remote-code \ --tokenizer-mode deepseek_v4 \ --moe-backend triton_unfused \ --gpu-memory-utilization 0.85 \ --distributed-executor-backend mp \ --max-num-batched-tokens 8192 \ --host 0.0.0.0 \ --port 8000 \ --disable-log-statsGSM8K lm_eval command:
Bench command:
Test Result
Validation stack for the MI300X run:
Focused Kernel Test
On gfx942:
The full-cache fp8/bf16 cases are included in this result.
MI300X End-To-End Validation
Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend
GSM8K full 1319-question, 5-shot, local completions, concurrency 128:
Clean
vllm bench serve, random 8192 input / 1024 output,--ignore-eos.All rows had zero failed requests:
Additional Validation With PR #43950 Applied
This run validates the same PR branch with vLLM PR #43950 applied locally, leaving TileLang installed. PR #43950 changes ROCm MHC dispatch so the MI300X/AITER path uses AITER MHC when available, rather than selecting the TileLang fused MHC path just because TileLang is importable.
Validation stack:
Focused Kernel Test With PR #43950
On gfx942:
MI300X End-To-End Validation With PR #43950
Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend, VLLM_ROCM_USE_AITER=1, tilelang installed
Prompt sanity check:
GSM8K full 1319-question, 5-shot, local completions, concurrency 128:
Clean
vllm bench serve, random 8192 input / 1024 output,--ignore-eos.All rows had zero failed requests: