[ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X/MI325X by tuukkjs · Pull Request #45681 · vllm-project/vllm

tuukkjs · 2026-06-15T09:53:27Z

Purpose

This PR fixes DeepSeek V4 / DeepSeek V4 Flash functional issues on ROCm gfx942 (MI300X), while preserving the gfx950 path.

The main fixes are:

Avoid unsupported FP8 arithmetic when converting UE8M0 block scales.
Make the fused DeepSeek V4 qnorm/RoPE/KV-insert path emit the correct FP8 bytes on ROCm:
- gfx942 uses FNUZ encoding.
- gfx950 uses OCP encoding.
- FNUZ uses max value 224.0.
Propagate the FNUZ/OCP choice through DeepSeek V4 cache utility helpers and sparse MLA decode paths.
Keep the sparse decode cache formats explicit: the main/SWA cache follows the platform FP8 encoding, while the compressed extra cache remains OCP.
Provide the gfx942 Triton fallback for fp8_mqa_logits so the DSv4 ROCm path does not depend on an immediate aiter update for that kernel.
Fix the fused-kernel test references:
- allow a narrow 1 bf16 ULP tolerance for the RoPE region, where the GPU kernel and PyTorch reference can round opposite ways at ties;
- use the correct ROCm FP8 store dtype in full-cache per-tensor FP8 reference checks, so gfx942 FNUZ bytes are decoded under the same encoding used by the kernel.

Continues #42893. AI assistance was used for this PR.

Co-authored-by: maeehart, markus.hartikainen@amd.com
Co-authored-by: ganyi, ygan@amd.com
Co-authored-by: Jin Tao, jintao12@amd.com

Test Plan

Build branch native extensions for gfx942 from:

vllm/vllm-openai-rocm:base-nightly-54bbf5166842932fa7abc34a14df850594daeb5e

Focused kernel test:

python3 -m pytest -q tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py

End-to-end validation:

Start deepseek-ai/DeepSeek-V4-Flash with TP4, VLLM_ROCM_USE_AITER=1, --kv-cache-dtype fp8_e4m3, --tokenizer-mode deepseek_v4, --moe-backend triton_unfused, --max-model-len 32768, and --max-num-batched-tokens 8192.
Disable the tilelang MHC path for this validation so the run uses the coherent non-tilelang MHC path; tilelang MHC correctness on gfx942 is a separate issue from this PR.
Run a prompt sanity check against the OpenAI-compatible completions endpoint.
Run full GSM8K, 5-shot, 1319 questions, local completions, concurrency 128.
Run vllm bench serve with random 8192 input / 1024 output, --ignore-eos, and max concurrency 1,2,4,8,16,32.

Serve command:

export VLLM_ROCM_USE_AITER=1

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --trust-remote-code \
  --tokenizer-mode deepseek_v4 \
  --moe-backend triton_unfused \
  --gpu-memory-utilization 0.85 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 8192 \
  --host 0.0.0.0 \
  --port 8000 \
  --disable-log-stats

GSM8K lm_eval command:

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V4-Flash,base_url=http://localhost:8000/v1/completions,num_concurrent=128,tokenized_requests=False,tokenizer_backend=none,max_length=32768 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 128

Bench command:

for C in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend vllm \
    --base-url http://127.0.0.1:8000 \
    --model deepseek-ai/DeepSeek-V4-Flash \
    --tokenizer deepseek-ai/DeepSeek-V4-Flash \
    --trust-remote-code \
    --dataset-name random \
    --random-input-len 8192 \
    --random-output-len 1024 \
    --num-prompts $((3*C)) \
    --max-concurrency "${C}" \
    --ignore-eos \
    --seed 0
done

Test Result

Validation stack for the MI300X run:

vllm/vllm-openai-rocm:base-nightly-54bbf5166842932fa7abc34a14df850594daeb5e
  + this PR branch source
  + this PR branch native extensions rebuilt for gfx942
  + tilelang uninstalled for server/GSM8K validation to use the coherent non-tilelang MHC path

Focused Kernel Test

On gfx942:

tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py
155 passed

The full-cache fp8/bf16 cases are included in this result.

MI300X End-To-End Validation

Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend

GSM8K full 1319-question, 5-shot, local completions, concurrency 128:

Task	Filter	n-shot	Metric	Value	Stderr
gsm8k	flexible-extract	5	exact_match	0.9568	0.0056
gsm8k	strict-match	5	exact_match	0.9568	0.0056

Clean vllm bench serve, random 8192 input / 1024 output, --ignore-eos.
All rows had zero failed requests:

Max concurrency	Successful	Output tok/s	Total tok/s	Median TTFT (ms)	Median TPOT (ms)
1	3	17.16	154.41	196.24	58.16
2	6	33.50	301.48	317.18	59.46
4	12	65.47	589.22	535.18	60.63
8	24	125.91	1133.18	912.01	62.71
16	48	230.80	2077.16	1676.96	67.79
32	96	404.87	3643.80	3122.19	76.03

Additional Validation With PR #43950 Applied

This run validates the same PR branch with vLLM PR #43950 applied locally, leaving TileLang installed. PR #43950 changes ROCm MHC dispatch so the MI300X/AITER path uses AITER MHC when available, rather than selecting the TileLang fused MHC path just because TileLang is importable.

Validation stack:

vllm/vllm-openai-rocm:base-nightly-b8336c3c7c298e0878f22a7bf70f4e295b2f4e01
  + this PR branch source
  + vLLM PR #43950 applied locally
  + native extensions rebuilt for gfx942
  + tilelang installed and importable: tilelang==0.1.10

Focused Kernel Test With PR #43950

On gfx942:

tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py
155 passed

MI300X End-To-End Validation With PR #43950

Model/server:
DeepSeek-V4-Flash, TP4, fp8_e4m3 KV cache, triton_unfused MoE backend, VLLM_ROCM_USE_AITER=1, tilelang installed

Prompt sanity check:

Q: What is 17 + 25? Answer briefly.
A: 42.

GSM8K full 1319-question, 5-shot, local completions, concurrency 128:

Task	Filter	n-shot	Metric	Value	Stderr
gsm8k	flexible-extract	5	exact_match	0.9560	0.0056
gsm8k	strict-match	5	exact_match	0.9553	0.0057

Clean vllm bench serve, random 8192 input / 1024 output, --ignore-eos.
All rows had zero failed requests:

Max concurrency	Successful	Output tok/s	Total tok/s	Median TTFT (ms)	Median TPOT (ms)
1	3	31.09	279.84	1473.94	30.35
2	6	60.26	542.34	1570.09	31.21
4	12	115.31	1037.76	2394.49	31.62
8	24	204.51	1840.55	3127.39	36.02
16	48	327.45	2947.09	3288.08	43.95
32	96	478.61	4307.45	3390.08	61.28

DSv4 on AMD MI300X (gfx942) hits several FP8-related issues that this commit addresses: 1. **fp8_utils.py**: ``process_fp8_weight_block_strategy`` calls ``normalize_e4m3fn_to_e4m3fnuz`` which doubles ``weight_scale`` by ``weight_scale * 2.0``. On models with UE8M0 scales (``torch.float8_e8m0fnu``), that ``mul`` is not implemented on CUDA/HIP and load aborts with:: NotImplementedError: "mul_cuda" not implemented for 'Float8_e8m0fnu' UE8M0 stores power-of-2 exponent values (2^(exp-127)) with no mantissa, so doubling the scale is equivalent to incrementing the exponent byte by 1. Handle the UE8M0 case explicitly and fall back to the float path otherwise. 2. **fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu**: gate ``kFp8Max`` to match the FNUZ/OCP path actually taken on each ROCm arch (240 on gfx942 FNUZ, 448 on gfx950 OCP). 3. **deepseek_v4_attention.py** + **cache_utils.py**: small MI300X path fixes that go with the FNUZ scale handling above. Co-authored-by: ganyi <ygan@amd.com> Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

``fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu`` selected its FP8 type and ``kFp8Max`` based purely on the HIP build macro ``HIP_FP8_TYPE_OCP``. That macro is set by the HIP runtime version, not by the target GPU arch -- on a HIP build that defines ``HIP_FP8_TYPE_OCP``, the kernel was using OCP E4M3 / ``448.0`` even on gfx942 (MI300X), whose MFMA instructions only accept FNUZ E4M3. The rest of vLLM's gfx942 path (Triton sparse-MLA, indexer Q quant, ``current_platform.fp8_dtype()``) all use FNUZ on this arch, so the C++ writer was producing K-cache entries the FNUZ readers misinterpret. Gate the OCP branch on ``defined(__gfx950__)`` so: * gfx942 (MI300X) -> ``__hip_fp8_e4m3_fnuz`` + ``kFp8Max = 240.0f`` * gfx950 (MI355X) -> ``__hip_fp8_e4m3`` + ``kFp8Max = 448.0f`` This matches the encoding chosen elsewhere on each arch. Signed-off-by: ganyi <ygan@amd.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

The DSv4 sparse MLA Triton kernels added in vllm-project#41812 (and the matching turboquant store/decode kernels) bitcast uint8 to ``tl.float8e4b15`` when ``IS_FNUZ`` is true. ``float8e4b15`` is not a real Triton type; on AMD gfx942 (MI300X) Triton only supports the FP8 dtypes listed in the error from triton/compiler: ('fp8e4b8', 'fp8e4nv', 'fp8e5', 'fp8e5b16') The correct FNUZ E4M3 type is ``tl.float8e4b8`` (bias 8, matches the PyTorch ``torch.float8_e4m3fnuz`` used elsewhere on the MI300 path). The non-FNUZ branch already correctly uses ``tl.float8e4nv``. Without this fix, the very first profile run on MI300X with sparse MLA fails inside the dequant/gather kernel: type fp8e4b15 not supported in this architecture. This swaps all FNUZ branches to ``tl.float8e4b8``. Verified that ``IS_FNUZ`` is gated on ``current_platform.fp8_dtype() == torch.float8_e4m3fnuz`` so it never fires on OCP hardware. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>

``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill_attn_impl`` in ``vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse_dsv4.py`` is the actual ROCm path reached from ``DeepseekV4MLAAttention.forward`` at ``deepseek_v4_attention.py:762`` (``current_platform.is_rocm()``). ``DeepseekV4MLAAttention._forward_prefill`` in the same file is dead code on ROCm, so the previous ``kv.zero_()`` patch (commit 36a7037) fixed only the generic path. This ROCm-only forward also gets ``kv`` via ``current_workspace_manager().get_simultaneous(...)`` -- uninitialized shared memory reused across requests and layers -- writes only the compressed-K prefix and the SWA window for each chunk row, then reads the entire ``kv.view(-1, 1, head_dim)`` through ragged indices that can land on the holes for very short sequences. The result is exactly the symptom we observe on MI300X DSv4-Flash: 10 identical temperature=0 ``/v1/completions`` calls produce 10 distinct first tokens. Apply the same zero-init here. Cost is one bf16 fill of the workspace tile, dwarfed by the FP8 dequant + sparse attention. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

PR vllm-project#42893 fixed the C++ SWA-K-cache encoder so it writes FNUZ E4M3 bytes on gfx942 (and OCP on gfx950) and updated the *generic* ``DeepseekV4MLAAttention._forward_prefill`` to call ``dequantize_and_gather_k_cache(..., use_fnuz=is_fp8_fnuz())`` for SWA and ``use_fnuz=False`` for the Triton-OCP-encoded compressed K cache. Two FP8-format mismatches remained on the actual ROCm DSv4 path (``DeepseekV4ROCMAiterMLASparseImpl``): 1. The public ``dequantize_and_gather_k_cache`` wrapper in ``vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py`` did not accept ``use_fnuz`` -- it silently dropped the kwarg when forwarding to ``dequantize_and_gather_k_cache_triton`` (which defaults to False). The ROCm prefill called the wrapper without ``use_fnuz``, so the SWA K cache (FNUZ on gfx942) was being read as OCP, scaling every K vector by ~448/240 in prefill attention. 2. ``_sparse_attn_decode_ragged_kernel`` in ``vllm/v1/attention/ops/rocm_aiter_mla_sparse.py`` decoded both the SWA (FNUZ on gfx942) and the compressed (always OCP) K caches with a single ``IS_FNUZ`` constexpr, so on MI300X the compressed-side branch reinterpreted OCP bytes as FNUZ -- the same encoder/decoder mismatch as (1) in the opposite direction (~240/448) on the decode side. Together these scrambled K vectors going into both prefill and decode attention, producing the GSM8K=0.005 gibberish PR vllm-project#42893 documented but could not explain with eager-vs-graphs. This commit: * Adds ``use_fnuz`` to the wrapper and forwards it to the Triton implementation (the cuteDSL path is dead on ROCm anyway). * Splits ``_sparse_attn_decode_ragged_kernel``'s ``IS_FNUZ`` into per-cache flags ``IS_FNUZ_MAIN`` (SWA) and ``IS_FNUZ_EXTRA`` (compressed) so each cache is decoded with its own encoder's format. * Wires ``DeepseekV4ROCMAiterMLASparseImpl._forward_prefill`` to pass ``use_fnuz=False`` for the compressed call (Triton-OCP encoder) and ``use_fnuz=current_platform.is_fp8_fnuz()`` for the SWA call (C++ FNUZ-on-gfx942 encoder), matching the asymmetry that PR vllm-project#42893's "[ROCm][DSv4] Fix compressed K cache dequant to match Triton OCP encoder" introduced for the generic path. Validated on 1 node x 4 x MI300X (gfx942), TP=4, VLLM_ROCM_USE_AITER=1, ``deepseek-ai/DeepSeek-V4-Flash``, both eager and CUDA-graphs ``FULL_AND_PIECEWISE`` configs from PR vllm-project#42810. GSM8K 5-shot, n=200, num_concurrent=32 against /v1/completions: | Mode | exact_match | Stderr | | ----- | ----------- | -------- | | Eager | 0.955 | +/-0.0147 | | Graph | 0.955 | +/-0.0147 | vs. the pre-fix 0.005 PR vllm-project#42893 reported on the same configuration. The two modes match each other to all three reported digits on both strict-match and flexible-extract filters. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>

…ly path) PR review on vllm-project#42893 (gemini-code-assist) flagged that the three turboquant changes in commit 2bef91e ("[ROCm][DSv4] Use tl.float8e4b8 for FNUZ on MI300X sparse MLA kernels") are dead code on MI300X: they sit inside ``if FP8_E4B15:`` branches, and FP8_E4B15 is the constexpr returned by ``_use_fp8_e4b15(device)`` -- which is 1 only when ``torch.cuda.get_device_capability() < (8, 9)``. MI300X (gfx942) reports cap >= (9, x), so FP8_E4B15 = 0 on every AMD platform and the patched FNUZ branch is never executed. More importantly, the changes are *wrong* on the hardware where FP8_E4B15 = 1 -- NVIDIA Ampere/Ada (sm < 8.9). On those cards ``tl.float8e4b15`` (E4M3 with bias 15) is the correct Triton FP8 type for software emulation; ``tl.float8e4b8`` (E4M3 with bias 8) is the AMD-FNUZ-specific type and Triton on NVIDIA Ampere/Ada will reject it with the same "type not supported in this architecture" error the original commit was trying to fix. The original commit message conflated two unrelated gating constexprs (``IS_FNUZ`` in rocm_aiter_mla_sparse.py vs ``FP8_E4B15`` in the turboquant kernels). Only the rocm_aiter_mla_sparse.py hunks of 2bef91e are actually correct -- those are gated on ``current_platform.fp8_dtype() == torch.float8_e4m3fnuz`` / ``current_platform.is_fp8_fnuz()`` and are the ones that actually fix the MI300X sparse-MLA decode failure. Revert just the three turboquant lines back to ``tl.float8e4b15`` so the NVIDIA Ampere/Ada FP8 path is preserved. The MI300X fix in ``_sparse_attn_decode_ragged_kernel`` (the dequant/gather kernel cited in the original commit message) is unchanged. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

The AITER wrapper bundled in the currently-pinned aiter wheel launches fp8_mqa_logits with (BLOCK_KV=128, num_stages=2) on gfx942. For the DSv4 sparse indexer shape (NUM_HEADS=64, HEAD_SIZE=128) this double-buffered KV tile + fp32 scores accumulator + Q tile pushes Triton's LDS request to 96 KiB, which exceeds MI300X's 64 KiB per CU. The launch JIT-aborts with OutOfResources on the first inference. The fix is upstreamed as ROCm/aiter#3257 but until vLLM bumps to an AITER version that contains it, this patch ships the same kernel + tile-size logic vendored into vllm/. - Add vllm/v1/attention/ops/triton_fp8_mqa_logits.py with a byte-for-byte copy of AITER's @triton.jit kernel and a Python wrapper that selects (BLOCK_KV=64, num_stages=1) (~33 KiB) when the default tile would not fit on gfx942 (see module docstring for the LDS budget calculation). - Route rocm_fp8_mqa_logits to the vendored kernel on gfx942 when AITER ops are enabled. gfx950+ and CUDA still use the upstream AITER wrapper (which has dedicated Gluon kernels this vendor copy does not include). - Fix a latent broadcasting bug in the torch reference fallback: the per-KV-token scale arrives as [N, 1] (a [N, 4] uint8 buffer view-cast to fp32) and was being multiplied against an [H, M, N] score tensor where PyTorch right-aligns [N, 1] against the M dim. Flatten to [N] so the multiply lines up with the last axis. Also drop a hard-coded 'cuda' device on the index tensors so the fallback works on ROCm with HIP devices. This entire patch is intended to be reverted once vLLM picks up an AITER version that includes ROCm/aiter#3257. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>

…lm-project#42893) Switch the FNUZ branch of `kFp8Max` to 224.0 (was 240.0, the FNUZ dtype's raw representable max). 224.0 is what the rest of vLLM's FNUZ pipeline uses -- see `vllm/model_executor/layers/quantization/utils/fp8_utils.py:412-417`, which notes that 240.0 hurts dynamic-quant accuracy. The OCP branch (gfx950 + NVIDIA) keeps 448.0. Make the unit test honor the same split: add an optional `use_fnuz` constexpr to `quantize_and_insert_k_kernel` (default False, no production caller affected) and pick the encoding from `current_platform.is_fp8_fnuz()`. Byte-exact comparison now succeeds on both gfx942 and gfx950. Verified: 36/36 unit tests pass on MI300X (gfx942) and MI355X (gfx950). Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

…review vllm-project#42893) Replace the long rationale around ``kv.zero_()`` (in both prefill paths) with a brief TODO that names the proper fix: mask invalid rows in the indexer (score = -inf) or in the sparse-attention kernel (skip indices >= valid_len). The current zero is the minimal interim workaround; the underlying bug is arch-independent (uninitialized workspace + indexer that scores the entire M dim) so the call stays unchanged on every platform until the indexer/kernel fix lands. No behavior change. Also condense the duplicate FNUZ-vs-OCP comments at the dequant call sites and in ``_sparse_attn_decode_ragged_kernel``: the wrapper docstring already explains the asymmetry, so per-call-site repetition was just noise. Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Resolve conflicts after the fused DeepSeek V4 kernel moved under libtorch_stable and upstream sparse MLA code changed. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

Keep the upstream wrapper-constructor and reinterpret_cast FP8 byte emission in rocm_cvt_float_to_fp8_e4m3 so PR vllm-project#42893 stays focused on the FNUZ/OCP gating and kFp8Max=224 correctness fixes. The __hip_cvt_float_to_fp8-based helper is preserved in the issue notes and can be proposed separately if current ROCm builds require it. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

The fused DeepSeek V4 K-cache test compared the kernel output to a PyTorch reference with exact bf16 byte equality on the RoPE region. The RoPE rotation is computed in fp32 on both sides and rounded to bf16; near a round-to-nearest tie, the fp32 GPU kernel and the fp32 torch reference can land on opposite sides and differ by one bf16 ULP. Replace exact cache byte equality with a decoded parity check: the deterministic NoPE UE8M0 FP8 round-trip must stay bit-identical, while the bf16 RoPE region is compared within 1 ULP. Apply the same check to the KV-path, DP-padding, and combined tests. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

The per-chunk sparse-MLA prefill workspace is torch.empty and only its compressed-K prefix and SWA window are written, leaving unwritten holes. The prefill path previously zeroed the whole workspace defensively, which adds overhead. Combined indices produced by combine_topk_swa_indices only address written ranges: compressed topK entries outside [0, N) are dropped and SWA indices are bounded by gather_len. Runtime instrumentation on gfx942 across short and mixed-length sequences at low and high concurrency found zero out-of-range reads, so the defensive zeroing is unnecessary. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

The full-cache per-tensor FP8 op writes FNUZ-encoded fp8 bytes on gfx942 into float8_e4m3fn-typed tensors, matching vLLM ROCm cache convention. The reference quantized with .to(torch.float8_e4m3fn), so on gfx942 it decoded the kernel FNUZ bytes under the wrong encoding. Encode the reference under FP8_STORE_DTYPE, reinterpret the kernel e4m3fn-typed outputs under the same scheme, and compare the decoded values. Keep bit-exact checks for the deterministic NoPE K-cache region and allow 1 fp8 ULP for Q and RoPE K-cache regions that can fall on opposite sides of a round-to-nearest tie. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

Keep the temporary vendored-kernel notes focused on the removal condition and avoid implying ROCm/aiter#3257 has already merged. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

Make gfx942 the explicit FNUZ case for the fused DeepSeek V4 KV insert kernel and use the OCP path for other ROCm targets. Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>

tuukkjs · 2026-06-17T13:51:33Z

Disable the tilelang MHC path for this validation so the run uses the coherent non-tilelang MHC path; tilelang MHC correctness on gfx942 is a separate issue from this PR.

I noticed this line. If this is the case we should add the _on_gfx942 condition to the mhc so that it is using torch implementation instead.

So that we can use it on mi300x out of the box, without the need to uninstall tilelang to make it work on mi300x. We must ensure that the mi355x still uses the tilelang code path.

Yes true. I opened a separate PR for the fix: #45931.

It adds an MHC-specific TileLang capability check that disables TileLang MHC only on gfx942, so MI300X/MI325X fall back to the existing torch/triton implementation while gfx950/MI355X keeps using TileLang.

tjtanaa

LGTM. Thanks. Let's land this and I will quickly look at the tilelang PR. All coauthors are included.

mergify · 2026-06-17T16:32:59Z

Hi @tuukkjs, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files