[Attention] add triton diff-kv backend for mimo #41797

Merged
mgoin merged 6 commits into vllm-project:main from ZJY0516:triton-diff-kv on Jun 11, 2026
ZJY0516 (Member) commented May 6, 2026

Purpose

Fix #41519

Test Plan

vllm serve XiaomiMiMo/MiMo-V2.5 -tp 4 --trust-remote-code
vllm serve XiaomiMiMo/MiMo-V2.5 -tp 4 --trust-remote-code --attention-backend TRITON_ATTN_DIFFKV
lm_eval --model local-completions --model_args "model=XiaomiMiMo/MiMo-V2.5,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256,timeout=5000,max_length=4096" --tasks gsm8k --num_fewshot 5

Test Result

FA

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9378 | ± 0.0067 |
| | | strict-match | 5 | exact_match | 0.9371 | ± 0.0067 |

triton

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9325 | ± 0.0069 |
| | | strict-match | 5 | exact_match | 0.9340 | ± 0.0068 |
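
As a rough sanity check on the FA and triton results above, the sketch below expresses each score gap in units of the combined standard error. It assumes the two runs are independent and that the reported Stderr values can be combined in quadrature; the helper name `gap_in_stderr_units` is illustrative and not part of lm_eval.

```python
import math


def gap_in_stderr_units(a: float, a_err: float, b: float, b_err: float) -> float:
    """Absolute score gap divided by the quadrature-combined standard error."""
    combined = math.sqrt(a_err**2 + b_err**2)
    return abs(a - b) / combined


# Values copied from the gsm8k results above (FA first, triton second).
flexible = gap_in_stderr_units(0.9378, 0.0067, 0.9325, 0.0069)
strict = gap_in_stderr_units(0.9371, 0.0067, 0.9340, 0.0068)

print(f"flexible-extract gap: {flexible:.2f} combined stderr")
print(f"strict-match gap:     {strict:.2f} combined stderr")
```

Under those assumptions both gaps come out below one combined standard error, so this benchmark does not separate the two backends.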

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify Bot (Contributor) commented May 6, 2026

mergify Bot added the documentation and v1 labels May 6, 2026

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces the TRITON_ATTN_DIFFKV backend and a corresponding Triton kernel to support models with differing K and V head dimensions, such as MiMo-V2. It also updates the model executor to dynamically select between FlashAttention and Triton DiffKV backends based on device compatibility. Review feedback suggests including float16 in the supported KV cache data types and explicitly overriding supports_attn_type to restrict the backend to decoder-only attention, aligning with the kernel's implementation.

Comment on lines +42 to +45
supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
    "auto",
    "bfloat16",
]

high

The supported_kv_cache_dtypes list is missing float16, although the comment above explicitly states that fp16 is supported. This will cause validation errors if a user explicitly sets kv_cache_dtype="float16" for a model using this backend.

Suggested change
- supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
-     "auto",
-     "bfloat16",
- ]
+ supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
+     "auto",
+     "float16",
+     "bfloat16",
+ ]
Comment thread vllm/v1/attention/backends/triton_attn_diffkv.py

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be535f6d0e

Comment on lines +42 to +45
supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
    "auto",
    "bfloat16",
]

P2: Don't advertise explicit bfloat16 KV cache

When MiMo uses TRITON_ATTN_DIFFKV with --kv-cache-dtype bfloat16, this entry lets the backend pass capability checks, but do_kv_cache_update() immediately calls triton_reshape_and_cache_flash_diffkv(), whose assertion only accepts "auto" or quantized cache dtype strings; quantized modes are rejected by this impl as well. In that explicit-bfloat16 configuration the first cache update will fail at runtime, so this backend should either only advertise "auto" or update the cache helper to accept explicit bfloat16.
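
The mismatch described here can be illustrated with a small consistency check between what a backend advertises and what its later stages accept. This is a minimal sketch, not vLLM code: `runtime_unsafe_dtypes`, `helper_accepts`, and `impl_rejects` are hypothetical stand-ins, and the `"fp8"` prefix used to model "quantized cache dtype strings" is an assumption.

```python
from typing import Callable, Iterable


def runtime_unsafe_dtypes(
    advertised: Iterable[str],
    helper_accepts: Callable[[str], bool],
    impl_rejects: Callable[[str], bool],
) -> list[str]:
    """Advertised dtypes that pass the capability check but would fail later."""
    return sorted(
        d for d in advertised if not helper_accepts(d) or impl_rejects(d)
    )


# The list under review in this comment.
ADVERTISED = ["auto", "bfloat16"]


def helper_accepts(dtype: str) -> bool:
    # Stand-in for the cache helper's assertion: "auto" or a quantized dtype.
    # The "fp8" prefix is an assumption used only for illustration.
    return dtype == "auto" or dtype.startswith("fp8")


def impl_rejects(dtype: str) -> bool:
    # Stand-in for the backend impl refusing quantized KV cache modes.
    return dtype.startswith("fp8")


unsafe = runtime_unsafe_dtypes(ADVERTISED, helper_accepts, impl_rejects)
print(unsafe)
```

With these stand-ins only `"bfloat16"` is flagged, which corresponds to the two options the comment offers: advertise only `"auto"`, or make the cache helper accept explicit bfloat16.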


ZJY0516 added 2 commits May 6, 2026 08:56
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
idonati commented May 9, 2026

@ZJY0516 — testing this PR on our 8× DGX Spark / TP=8 stack tonight. Wanted to flag a couple of practical observations + a likely blocker for other people trying it.

What works (confirmed before runtime test)

  • Auto-detection logic in mimo_v2.py:296-310 is exactly right for sm_12x / GB10 — we built ZJY0516/vllm@triton-diff-kv and verified is_supported_on_current_device correctly returns False for FlashAttentionDiffKVBackend (FA3 only on SM 9.0), so the dispatch falls through to TRITON_ATTN_DIFFKV automatically.
  • The new --attention-backend triton_attn_diffkv CLI override path works too (we pass it explicitly in our recipe).
  • Backend registry change in registry.py looks clean.
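
The dispatch behaviour described in the bullets above (try FlashAttention first, fall through to Triton when the device is unsupported, allow an explicit override) can be sketched as follows. This is an illustration under assumptions, not the code in mimo_v2.py: `BackendCandidate`, `pick_backend`, and the name `FLASH_ATTN_DIFFKV` are hypothetical, and the SM 9.0 rule is taken from the comment above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Capability = Tuple[int, int]  # (major, minor) compute capability


@dataclass(frozen=True)
class BackendCandidate:
    name: str
    is_supported: Callable[[Capability], bool]


def pick_backend(
    candidates: Sequence[BackendCandidate],
    capability: Capability,
    override: Optional[str] = None,
) -> str:
    """Return the override if given, else the first supported candidate."""
    if override is not None:
        for cand in candidates:
            if cand.name.lower() == override.lower():
                return cand.name
        raise ValueError(f"unknown backend override: {override}")
    for cand in candidates:
        if cand.is_supported(capability):
            return cand.name
    raise RuntimeError(f"no diff-KV backend supports capability {capability}")


# Preference order: FlashAttention first, Triton as the fallback.
CANDIDATES = [
    # Assumed rule from the comment above: FA3 only on SM 9.0.
    BackendCandidate("FLASH_ATTN_DIFFKV", lambda cap: cap == (9, 0)),
    BackendCandidate("TRITON_ATTN_DIFFKV", lambda cap: True),
]

print(pick_backend(CANDIDATES, (12, 1)))  # GB10 / sm_121
print(pick_backend(CANDIDATES, (9, 0)))   # SM 9.0
print(pick_backend(CANDIDATES, (9, 0), override="triton_attn_diffkv"))
```

Keeping the candidates in an ordered list makes the fallback explicit and keeps the override path separate from capability probing.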

Practical blocker for users running NVFP4 quants

This isn't your PR's fault, but worth flagging here for anyone reading and trying the same thing:

lukealonso/MiMo-V2.5-NVFP4 has empty auto_map in config.json (the NVFP4 quantization process apparently strips it). With trust_remote_code=True, vLLM still fails because:

  1. transformers (5.8.0 / git main) doesn't natively register mimo_v2 model_type yet — CONFIG_MAPPING_NAMES doesn't have it.
  2. vLLM's _CONFIG_REGISTRY in transformers_utils/config.py has mimo_v2_omni but not plain text-only mimo_v2 — only the omni/vision variant has a registered config class in vLLM.
  3. Result: AutoConfig.from_pretrained fails with Value error, The checkpoint you are trying to load has model type mimo_v2 but Transformers does not recognize this architecture BEFORE vLLM's MimoV2ModelArchConfigConvertor ever gets to run.

Workaround we used: copied configuration_mimo_v2.py and modeling_mimo_v2.py from XiaomiMiMo/MiMo-V2.5 (the original, which has them) into the lukealonso snapshot directory + patched config.json to re-add the auto_map pointing at those files. Then trust_remote_code=True works.
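
A minimal sketch of that workaround, assuming both snapshots are available as local directories. The helper name `restore_auto_map` is hypothetical; rather than guessing class names, it copies the `auto_map` entry from the original repository's config.json.

```python
import json
import shutil
from pathlib import Path

REMOTE_CODE_FILES = ("configuration_mimo_v2.py", "modeling_mimo_v2.py")


def restore_auto_map(quant_dir, original_dir) -> dict:
    """Copy the remote-code files and auto_map from original_dir into quant_dir."""
    quant_dir, original_dir = Path(quant_dir), Path(original_dir)

    # Bring over the Python files that auto_map points at.
    for name in REMOTE_CODE_FILES:
        shutil.copy2(original_dir / name, quant_dir / name)

    # Reuse the original auto_map verbatim instead of inventing its contents.
    original_cfg = json.loads((original_dir / "config.json").read_text())
    quant_cfg_path = quant_dir / "config.json"
    quant_cfg = json.loads(quant_cfg_path.read_text())
    quant_cfg["auto_map"] = original_cfg["auto_map"]
    quant_cfg_path.write_text(json.dumps(quant_cfg, indent=2))
    return quant_cfg
```

Call it with the quantized snapshot directory first and the original XiaomiMiMo/MiMo-V2.5 snapshot directory second. The function edits config.json in place, so working on a copy of the snapshot is safer.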

For shadowlilac/MiMo-V2.5-NVFP4 (alternative quant) the auto_map is intact, so that one Just Works™.

A small suggestion

Could be worth adding a MiMoV2Config to vllm/transformers_utils/configs/ and registering it in _CONFIG_REGISTRY under mimo_v2 (mirroring what's already done for mimo_v2_omni). That would let any MiMo NVFP4 quant (whether or not the publisher kept auto_map) load on vLLM without needing transformers to add native support. It's literally a parallel of mimo_v2_omni.py for the text-only path — should be small.

On deck

Will report back with full launch + 30-prompt sweep + accuracy spot-check shortly. Pairing this PR's branch with NCCL 2.30.4 (per @jasl's suggestion in #40969 that resolved the Ray Compiled-DAG wedge for our 8× Spark setup), so MiMo on this stack gets a clean baseline.

cc @jasl, @shadowlilac-oss, @haosdent — sharing context.

idonati commented May 9, 2026

Update from our 8× DGX Spark / TP=8 test of lukealonso/MiMo-V2.5-NVFP4. Auto_map workaround worked, the new TRITON_ATTN_DIFFKV backend dispatch worked correctly on sm_121, but we hit a third blocker — NVFP4 quant + TP=8 + MiMo's num_key_value_heads=4 is structurally incompatible with vLLM's KV-head replication path. Posting in case it helps the next person trying this.

Boot trace + error

Engine got past auto_map, past arch resolution (Resolved architecture: MiMoV2OmniForCausalLM), past memory profiling, all the way to weight loading per-rank. Failed at mimo_v2.py:627, in default_weight_loader:

AssertionError: Attempted to load weight (torch.Size([1696, 4096])) into parameter (torch.Size([1856, 4096]))

Shape math

MiMo-V2.5 config: num_attention_heads=64, num_key_value_heads=4, head_dim=192, v_head_dim=128 (DiffKV).

At TP=8, num_key_value_heads=4 doesn't divide cleanly. There are two ways to handle this:

  1. KV head replication (vLLM's standard): num_kv_heads_per_rank = max(1, kv_heads // tp) → 1 replicated KV head per rank. Per-rank QKV combined: 8*192 + 1*192 + 1*128 = 1856.
  2. Fractional KV sharding: 0.5 KV head per rank. Per-rank QKV combined: 8*192 + 0.5*192 + 0.5*128 = 1536 + 96 + 64 = 1696.

The lukealonso NVFP4 quant was apparently prepared assuming approach (2) — fractional sharding. vLLM at TP=8 expects approach (1) — replication. Off by 160 (the missing 0.5 KV head's worth of K+V dims).
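
The shape arithmetic above can be reproduced with a few lines. This is only a restatement of the numbers in this comment; `qkv_rows_per_rank` is a hypothetical helper, not a vLLM function.

```python
def qkv_rows_per_rank(
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    v_head_dim: int,
    tp_size: int,
    replicate_kv: bool,
) -> int:
    """Rows of the combined QKV projection held by one tensor-parallel rank."""
    q_heads = num_heads // tp_size
    if replicate_kv:
        # Approach (1): at least one whole KV head per rank.
        kv_heads = max(1, num_kv_heads // tp_size)
    else:
        # Approach (2): allow a fractional KV head per rank.
        kv_heads = num_kv_heads / tp_size
    return int(q_heads * head_dim + kv_heads * head_dim + kv_heads * v_head_dim)


# MiMo-V2.5 values quoted above, at TP=8.
cfg = dict(num_heads=64, num_kv_heads=4, head_dim=192, v_head_dim=128, tp_size=8)
replicated = qkv_rows_per_rank(**cfg, replicate_kv=True)
fractional = qkv_rows_per_rank(**cfg, replicate_kv=False)
print(replicated, fractional, replicated - fractional)  # 1856 1696 160
```

The two results match the parameter and loaded-weight sizes in the assertion error, and their difference is the 160 rows mentioned above.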

Practical workarounds for users

  • Use TP that divides num_kv_heads=4: TP=1, 2, or 4. At TP=4 across 4 of our 8 GB10s: per-rank weight = ~22 GB which fits comfortably in 121 GiB unified memory, but leaves 4 nodes idle.
  • Re-quantize from XiaomiMiMo/MiMo-V2.5 with TP=8 in mind (replicated KV layout). lukealonso may regenerate.
  • Try shadowlilac/MiMo-V2.5-NVFP4 to see if their quant uses replication. (We haven't tested this yet — the auto_map at least is intact in shadowlilac's version.)

Note: this isn't your PR's issue (#41797 is purely the FA3 → Triton diff-KV backend fallback, which appears to dispatch correctly based on the boot logs). It's a quant-vs-vllm-loader layout mismatch for the specific MiMo arch + TP combination.

Net impact

For us: defer MiMo-V2.5 deployment on this 8-node cluster until either a TP=8-compatible NVFP4 quant exists, or until we're willing to run TP=4 with 4 idle nodes. Will keep tracking.

Posting cross-reference to #41519 since this is likely something other DGX Spark / TP=8 users will hit.

idonati commented May 16, 2026

Quick follow-up for anyone tracking this PR: thanks @ZJY0516 — the DiffKV Triton backend in this PR cleanly resolves the first two blockers I flagged earlier (model_type discovery + attention-backend dispatch for unequal Q/V head_dim).

The third blocker (degenerate logits at inference, あるい... / *\0... style token loops) turned out not to be a kernel-side problem. It is a separate bug in the same file: mimo_v2.py:load_weights uses loaded_weight.chunk(tp_size, dim=0)[tp_rank] for fused-qkv_proj checkpoints, which mis-slots Q values into K/V parameter slots on tp_size - 1 of tp_size ranks.
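
To see why an even chunk of a fused QKV tensor goes wrong, the toy sketch below labels each row by the projection it belongs to and compares two slicing strategies. It uses plain Python lists and small made-up sizes; `naive_chunk` mimics `chunk(tp_size, dim=0)[tp_rank]` for the evenly divisible case, and `per_projection_shard` is a hypothetical illustration, not the patch proposed in #42803.

```python
def build_rows(num_q_heads, num_kv_heads, head_dim, v_head_dim):
    """Label every row of Q, K and V with the projection it belongs to."""
    q = [("q", i) for i in range(num_q_heads * head_dim)]
    k = [("k", i) for i in range(num_kv_heads * head_dim)]
    v = [("v", i) for i in range(num_kv_heads * v_head_dim)]
    return q, k, v


def naive_chunk(fused, tp_size, tp_rank):
    """Even split of the fused [Q; K; V] rows, ignoring projection boundaries."""
    assert len(fused) % tp_size == 0
    step = len(fused) // tp_size
    return fused[tp_rank * step:(tp_rank + 1) * step]


def per_projection_shard(q, k, v, tp_size, tp_rank):
    """Split Q, K and V separately, then concatenate this rank's pieces."""
    def shard(rows):
        step = len(rows) // tp_size
        return rows[tp_rank * step:(tp_rank + 1) * step]
    return shard(q) + shard(k) + shard(v)


# Toy sizes: 8 Q heads, 2 KV heads, head_dim=3, v_head_dim=2, TP=2.
q, k, v = build_rows(8, 2, 3, 2)
naive0 = naive_chunk(q + k + v, tp_size=2, tp_rank=0)
correct0 = per_projection_shard(q, k, v, tp_size=2, tp_rank=0)

kinds_naive0 = {kind for kind, _ in naive0}
kinds_correct0 = {kind for kind, _ in correct0}
print(len(naive0), len(correct0))    # same row count on rank 0
print(kinds_naive0, kinds_correct0)  # but different contents
```

Both slices have the same number of rows, so a shape check alone cannot catch the problem: the naive slice for rank 0 holds only Q rows, while the rank's parameter expects Q, K and V rows. With the config values quoted earlier in this thread (64 heads, 4 KV heads, head_dim 192, v_head_dim 128), a fused QKV weight would have 13568 rows, and an even 8-way chunk of that is 1696 rows, the same size as the loaded weight in the assertion error reported above.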

Filed separately at #42803 with empirical evidence + a proposed patch. Verified working end-to-end on the same 8× DGX Spark / TP=8 setup with festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 (coherent chat + completion output).

mgoin (Member) left a comment

Looks reasonable to me, although a few items for cleanup and a unit test would be good to get in. cc @LucasWilkinson @MatthewBonanni as another attention backend

Comment thread vllm/v1/attention/backends/triton_attn_diffkv.py Outdated
Comment thread vllm/v1/attention/ops/triton_unified_attention_diffkv.py
Comment thread vllm/v1/attention/ops/triton_unified_attention_diffkv.py
ZJY0516 added 2 commits June 11, 2026 12:48
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
ZJY0516 added the ready label Jun 11, 2026
mergify Bot (Contributor) commented Jun 11, 2026

Hi @ZJY0516, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify mergify Bot added the ci/build label Jun 11, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

mgoin (Member) left a comment

Okay since this is pretty separable and thus easy to delete if there are issues, I'm good with merging. Thanks for keeping it clean!

@mgoin mgoin merged commit f81daf8 into vllm-project:main Jun 11, 2026
106 checks passed
@ZJY0516 ZJY0516 deleted the triton-diff-kv branch June 11, 2026 15:53
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Labels

ci/build, documentation, ready, v1

3 participants