[Attention] add triton diff-kv backend for mimo by ZJY0516 · Pull Request #41797 · vllm-project/vllm

ZJY0516 · 2026-05-06T08:25:42Z

Purpose

Test Plan

vllm serve XiaomiMiMo/MiMo-V2.5 -tp 4 --trust-remote-code
vllm serve XiaomiMiMo/MiMo-V2.5 -tp 4 --trust-remote-code --attention-backend TRITON_ATTN_DIFFKV

lm_eval --model local-completions --model_args "model=XiaomiMiMo/MiMo-V2.5,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256,timeout=5000,max_length=4096" --tasks gsm8k --num_fewshot 5

Test Result

FA

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.9378	±	0.0067
		strict-match	5	exact_match	↑	0.9371	±	0.0067

triton

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.9325	±	0.0069
		strict-match	5	exact_match	↑	0.9340	±	0.0068
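
A quick way to read the two tables side by side is to express each FA-vs-Triton gap in units of the combined standard error. The sketch below only uses the numbers reported above; treating the two runs as independent samples is an assumption (both runs score the same GSM8K questions, so a paired comparison would be tighter), and the helper name `gap_in_stderrs` is illustrative.

```python
import math

# Scores and standard errors copied from the two lm_eval tables above.
results = {
    "flexible-extract": {"fa": (0.9378, 0.0067), "triton": (0.9325, 0.0069)},
    "strict-match": {"fa": (0.9371, 0.0067), "triton": (0.9340, 0.0068)},
}


def gap_in_stderrs(a, b):
    """Return (a - b) divided by the combined standard error.

    Assumption: the two runs are independent, so the variances add.
    """
    (value_a, stderr_a), (value_b, stderr_b) = a, b
    combined = math.sqrt(stderr_a**2 + stderr_b**2)
    return (value_a - value_b) / combined


for filter_name, run in results.items():
    gap = run["fa"][0] - run["triton"][0]
    z = gap_in_stderrs(run["fa"], run["triton"])
    print(f"{filter_name}: gap={gap:+.4f}, gap/combined_stderr={z:.2f}")
```

Under that independence assumption both gaps are smaller than one combined standard error, which is consistent with the two backends scoring the same within noise on this task.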

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify · 2026-05-06T08:26:42Z

Documentation preview: https://vllm--41797.org.readthedocs.build/en/41797/

gemini-code-assist

Code Review

This pull request introduces the TRITON_ATTN_DIFFKV backend and a corresponding Triton kernel to support models with differing K and V head dimensions, such as MiMo-V2. It also updates the model executor to dynamically select between FlashAttention and Triton DiffKV backends based on device compatibility. Review feedback suggests including float16 in the supported KV cache data types and explicitly overriding supports_attn_type to restrict the backend to decoder-only attention, aligning with the kernel's implementation.

gemini-code-assist · 2026-05-06T08:28:20Z

+    supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
+        "auto",
+        "bfloat16",
+    ]


The supported_kv_cache_dtypes list is missing float16, although the comment above explicitly states that fp16 is supported. This will cause validation errors if a user explicitly sets kv_cache_dtype="float16" for a model using this backend.

Suggested change

-    supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
-        "auto",
-        "bfloat16",
-    ]
+    supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
+        "auto",
+        "float16",
+        "bfloat16",
+    ]

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be535f6d0e


chatgpt-codex-connector · 2026-05-06T08:29:42Z

+    supported_kv_cache_dtypes: ClassVar[list[CacheDType]] = [
+        "auto",
+        "bfloat16",
+    ]


Don't advertise explicit bfloat16 KV cache

When MiMo uses TRITON_ATTN_DIFFKV with --kv-cache-dtype bfloat16, this entry lets the backend pass capability checks, but do_kv_cache_update() immediately calls triton_reshape_and_cache_flash_diffkv(), whose assertion only accepts "auto" or quantized cache dtype strings; quantized modes are rejected by this impl as well. In that explicit-bfloat16 configuration the first cache update will fail at runtime, so this backend should either only advertise "auto" or update the cache helper to accept explicit bfloat16.



idonati · 2026-05-09T22:16:59Z

@ZJY0516 — testing this PR on our 8× DGX Spark / TP=8 stack tonight. Wanted to flag a couple of practical observations + a likely blocker for other people trying it.

What works (confirmed before runtime test)

Auto-detection logic in mimo_v2.py:296-310 is exactly right for sm_12x / GB10 — we built ZJY0516/vllm@triton-diff-kv and verified is_supported_on_current_device correctly returns False for FlashAttentionDiffKVBackend (FA3 only on SM 9.0), so the dispatch falls through to TRITON_ATTN_DIFFKV automatically.
The new --attention-backend triton_attn_diffkv CLI override path works too (we pass it explicitly in our recipe).
Backend registry change in registry.py looks clean.
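
The fallback described in the points above can be pictured as a first-match search over candidate backends, with an optional explicit override. This is a standalone sketch and not the code in mimo_v2.py: `BackendSketch`, `pick_diffkv_backend`, and the name `FLASH_ATTN_DIFFKV` are hypothetical placeholders, and the only facts taken from the comment are that the FlashAttention DiffKV path is limited to SM 9.0 and that TRITON_ATTN_DIFFKV is the fallback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (12, 1) for sm_121


@dataclass(frozen=True)
class BackendSketch:
    """Hypothetical stand-in for an attention backend entry."""
    name: str
    is_supported: Callable[[Capability], bool]


# Assumption from the comment above: the FA3-based path only runs on SM 9.0.
FLASH_DIFFKV = BackendSketch("FLASH_ATTN_DIFFKV", lambda cc: cc == (9, 0))
# Assumption for this sketch: the Triton path has no capability restriction.
TRITON_DIFFKV = BackendSketch("TRITON_ATTN_DIFFKV", lambda cc: True)
CANDIDATES = [FLASH_DIFFKV, TRITON_DIFFKV]  # ordered by preference


def pick_diffkv_backend(cc: Capability, override: Optional[str] = None) -> BackendSketch:
    """Honor an explicit override, else return the first supported candidate."""
    if override is not None:
        wanted = override.upper()
        for backend in CANDIDATES:
            if backend.name == wanted:
                return backend
        raise ValueError(f"unknown backend override: {override}")
    for backend in CANDIDATES:
        if backend.is_supported(cc):
            return backend
    raise RuntimeError("no DiffKV attention backend supports this device")


print(pick_diffkv_backend((12, 1)).name)
print(pick_diffkv_backend((9, 0), override="triton_attn_diffkv").name)
```

Keeping the candidates in a preference-ordered list means the automatic path and the explicit `--attention-backend` style override share one lookup.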

Practical blocker for users running NVFP4 quants

This isn't your PR's fault, but worth flagging here for anyone reading and trying the same thing:

lukealonso/MiMo-V2.5-NVFP4 has empty auto_map in config.json (the NVFP4 quantization process apparently strips it). With trust_remote_code=True, vLLM still fails because:

transformers (5.8.0 / git main) doesn't natively register mimo_v2 model_type yet — CONFIG_MAPPING_NAMES doesn't have it.
vLLM's _CONFIG_REGISTRY in transformers_utils/config.py has mimo_v2_omni but not plain text-only mimo_v2 — only the omni/vision variant has a registered config class in vLLM.
Result: AutoConfig.from_pretrained fails with "Value error, The checkpoint you are trying to load has model type mimo_v2 but Transformers does not recognize this architecture" before vLLM's MimoV2ModelArchConfigConvertor ever gets to run.

Workaround we used: copied configuration_mimo_v2.py and modeling_mimo_v2.py from XiaomiMiMo/MiMo-V2.5 (the original, which has them) into the lukealonso snapshot directory + patched config.json to re-add the auto_map pointing at those files. Then trust_remote_code=True works.
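
For anyone repeating that workaround, it might be scripted roughly as follows. This is a sketch under assumptions: the function name `restore_auto_map` is made up, both directories are assumed to be plain local copies of the two repositories, and the `auto_map` value is copied from the original model's config.json rather than written by hand, so no class names are guessed here.

```python
import json
import shutil
from pathlib import Path

# File names as given in the comment above.
REMOTE_CODE_FILES = ("configuration_mimo_v2.py", "modeling_mimo_v2.py")


def restore_auto_map(original_dir, quant_dir):
    """Copy the remote-code files and auto_map from the original model
    directory into the quantized checkpoint directory."""
    original_dir, quant_dir = Path(original_dir), Path(quant_dir)

    original_cfg = json.loads((original_dir / "config.json").read_text())
    auto_map = original_cfg.get("auto_map")
    if not auto_map:
        raise ValueError("original config.json has no auto_map to copy")

    # The auto_map entries point at these modules, so they must sit
    # next to config.json in the quantized directory.
    for filename in REMOTE_CODE_FILES:
        shutil.copy2(original_dir / filename, quant_dir / filename)

    quant_cfg_path = quant_dir / "config.json"
    quant_cfg = json.loads(quant_cfg_path.read_text())
    quant_cfg["auto_map"] = auto_map
    quant_cfg_path.write_text(json.dumps(quant_cfg, indent=2) + "\n")
    return auto_map
```

Files inside a Hugging Face cache snapshot are often symlinks into a shared blob store, so running this against a copy of the quantized directory is likely the safer choice.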

For shadowlilac/MiMo-V2.5-NVFP4 (alternative quant) the auto_map is intact, so that one Just Works™.

A small suggestion

Could be worth adding a MiMoV2Config to vllm/transformers_utils/configs/ and registering it in _CONFIG_REGISTRY under mimo_v2 (mirroring what's already done for mimo_v2_omni). That would let any MiMo NVFP4 quant (whether or not the publisher kept auto_map) load on vLLM without needing transformers to add native support. It's literally a parallel of mimo_v2_omni.py for the text-only path — should be small.

On deck

Will report back with full launch + 30-prompt sweep + accuracy spot-check shortly. Pairing this PR's branch with NCCL 2.30.4 (per @jasl's suggestion in #40969 that resolved the Ray Compiled-DAG wedge for our 8× Spark setup), so MiMo on this stack gets a clean baseline.

cc @jasl, @shadowlilac-oss, @haosdent — sharing context.

idonati · 2026-05-09T22:19:56Z

Update from our 8× DGX Spark / TP=8 test of lukealonso/MiMo-V2.5-NVFP4. Auto_map workaround worked, the new TRITON_ATTN_DIFFKV backend dispatch worked correctly on sm_121, but we hit a third blocker — NVFP4 quant + TP=8 + MiMo's num_key_value_heads=4 is structurally incompatible with vLLM's KV-head replication path. Posting in case it helps the next person trying this.

Boot trace + error

Engine got past auto_map, past arch resolution (Resolved architecture: MiMoV2OmniForCausalLM), past memory profiling, all the way to weight loading per-rank. Failed at mimo_v2.py:627 → default_weight_loader:

AssertionError: Attempted to load weight (torch.Size([1696, 4096])) into parameter (torch.Size([1856, 4096]))

Shape math

MiMo-V2.5 config: num_attention_heads=64, num_key_value_heads=4, head_dim=192, v_head_dim=128 (DiffKV).

At TP=8, num_key_value_heads=4 doesn't divide cleanly. There are two ways to handle this:

KV head replication (vLLM's standard): num_kv_heads_per_rank = max(1, kv_heads // tp) → 1 replicated KV head per rank. Per-rank QKV combined: 8*192 + 1*192 + 1*128 = 1856.
Fractional KV sharding: 0.5 KV head per rank. Per-rank QKV combined: 8*192 + 0.5*192 + 0.5*128 = 1536 + 96 + 64 = 1696.

The lukealonso NVFP4 quant was apparently prepared assuming approach (2) — fractional sharding. vLLM at TP=8 expects approach (1) — replication. Off by 160 (the missing 0.5 KV head's worth of K+V dims).
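
The arithmetic above can be reproduced with a few lines of Python. The config values are the ones quoted in this comment; the function name `qkv_rows_per_rank` and the `replicate_kv` switch are illustrative, and the sketch assumes the number of attention heads divides evenly by the TP size.

```python
from fractions import Fraction


def qkv_rows_per_rank(num_heads, num_kv_heads, head_dim, v_head_dim, tp, replicate_kv):
    """Rows of the fused per-rank QKV projection under the two conventions.

    replicate_kv=True  -> max(1, num_kv_heads // tp) whole KV heads per rank
    replicate_kv=False -> num_kv_heads / tp KV heads per rank (may be fractional)
    """
    q_heads = num_heads // tp  # assumes num_heads is divisible by tp
    if replicate_kv:
        kv_heads = max(1, num_kv_heads // tp)
    else:
        kv_heads = Fraction(num_kv_heads, tp)
    # K uses head_dim, V uses v_head_dim (the DiffKV case).
    return q_heads * head_dim + kv_heads * head_dim + kv_heads * v_head_dim


# MiMo-V2.5 values as quoted above.
cfg = dict(num_heads=64, num_kv_heads=4, head_dim=192, v_head_dim=128)

for tp in (1, 2, 4, 8):
    replicated = qkv_rows_per_rank(**cfg, tp=tp, replicate_kv=True)
    fractional = qkv_rows_per_rank(**cfg, tp=tp, replicate_kv=False)
    print(f"tp={tp}: replicated={replicated}, fractional={fractional}, "
          f"diff={replicated - fractional}")
```

With these numbers the two conventions agree at TP=1, 2 and 4 and differ by 160 rows at TP=8 (1856 versus 1696), which matches the shapes in the assertion error.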

Practical workarounds for users

Use TP that divides num_kv_heads=4: TP=1, 2, or 4. At TP=4 across 4 of our 8 GB10s: per-rank weight = ~22 GB which fits comfortably in 121 GiB unified memory, but leaves 4 nodes idle.
Re-quantize from XiaomiMiMo/MiMo-V2.5 with TP=8 in mind (replicated KV layout). lukealonso may regenerate.
Try shadowlilac/MiMo-V2.5-NVFP4 to see if their quant uses replication. (We haven't tested this yet — the auto_map at least is intact in shadowlilac's version.)

Note: this isn't your PR's issue (#41797 is purely the FA3 → Triton diff-KV backend fallback, which appears to dispatch correctly based on the boot logs). It's a quant-vs-vllm-loader layout mismatch for the specific MiMo arch + TP combination.

Net impact

For us: defer MiMo-V2.5 deployment on this 8-node cluster until either a TP=8-compatible NVFP4 quant exists, or until we're willing to run TP=4 with 4 idle nodes. Will keep tracking.

Posting cross-reference to #41519 since this is likely something other DGX Spark / TP=8 users will hit.

idonati · 2026-05-16T02:38:45Z

Quick follow-up for anyone tracking this PR: thanks @ZJY0516 — the DiffKV Triton backend in this PR cleanly resolves the first two blockers I flagged earlier (model_type discovery + attention-backend dispatch for unequal Q/V head_dim).

The third blocker (degenerate logits at inference, あるい... / *\0... style token loops) turned out not to be a kernel-side problem. It's a separate bug in the same file: mimo_v2.py:load_weights uses loaded_weight.chunk(tp_size, dim=0)[tp_rank] for fused-qkv_proj checkpoints, which mis-slots Q values into K/V parameter slots on tp_size - 1 of the tp_size ranks.
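
To see why equal chunking of a fused tensor can put Q rows where K/V rows belong, here is a toy model using row labels instead of real weights. It assumes the fused tensor stacks all Q rows, then all K rows, then all V rows along dim 0, and that each rank expects its own Q, K and V slices back to back. That layout is an assumption made for illustration; the actual checkpoint layout and the set of affected ranks are described in #42803 and may differ. All function names are made up.

```python
def fused_rows(q_rows, k_rows, v_rows):
    """Label every row of a fused [Q; K; V] weight by its section."""
    return ([f"q{i}" for i in range(q_rows)]
            + [f"k{i}" for i in range(k_rows)]
            + [f"v{i}" for i in range(v_rows)])


def naive_chunk(rows, tp_size, tp_rank):
    """Mimic tensor.chunk(tp_size, dim=0)[tp_rank] for an evenly divisible row count."""
    size = len(rows) // tp_size
    return rows[tp_rank * size:(tp_rank + 1) * size]


def section_aware_shard(q_rows, k_rows, v_rows, tp_size, tp_rank):
    """Slice Q, K and V separately, then concatenate this rank's pieces."""
    rows = fused_rows(q_rows, k_rows, v_rows)
    shard, offset = [], 0
    for section_rows in (q_rows, k_rows, v_rows):
        per_rank = section_rows // tp_size  # assumes even divisibility
        start = offset + tp_rank * per_rank
        shard.extend(rows[start:start + per_rank])
        offset += section_rows
    return shard


rows = fused_rows(q_rows=4, k_rows=2, v_rows=2)
for rank in range(2):
    print(rank, naive_chunk(rows, 2, rank), section_aware_shard(4, 2, 2, 2, rank))
```

In this toy layout the naive chunk for rank 0 contains only Q rows, so the last positions of the rank's parameter, where K and V belong, end up holding Q values.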

Filed separately at #42803 with empirical evidence + a proposed patch. Verified working end-to-end on the same 8× DGX Spark / TP=8 setup with festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 (coherent chat + completion output).

mgoin

Looks reasonable to me, although a few items for cleanup and a unit test would be good to get in. cc @LucasWilkinson @MatthewBonanni as another attention backend


mergify · 2026-06-11T13:24:41Z

Hi @ZJY0516, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


mgoin

Okay since this is pretty separable and thus easy to delete if there are issues, I'm good with merging. Thanks for keeping it clean!


init

be535f6

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

ZJY0516 requested review from LucasWilkinson and MatthewBonanni as code owners May 6, 2026 08:25

claude Bot reviewed May 6, 2026

mergify Bot added the documentation and v1 labels May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026

chatgpt-codex-connector Bot reviewed May 6, 2026

ZJY0516 added 2 commits May 6, 2026 08:56

update

3c5a184

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

update

0891716

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

idonati mentioned this pull request May 9, 2026

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) #40969

Open

1 task

idonati mentioned this pull request May 9, 2026

[Bug]: Xiaomi MiMo v2.5 broken on SM12x #41519

Closed

1 task

amd-satre mentioned this pull request May 11, 2026

[models] MiMo V2: Pro fused-QKV FP8 loader + fix SWA wrong-data on V2.5 base #42270

Open

mgoin reviewed Jun 11, 2026


Comment thread vllm/v1/attention/backends/triton_attn_diffkv.py Outdated

Comment thread vllm/v1/attention/ops/triton_unified_attention_diffkv.py

Comment thread vllm/v1/attention/ops/triton_unified_attention_diffkv.py

ZJY0516 added 2 commits June 11, 2026 12:48

Merge branch 'main' into triton-diff-kv

ce3ab80

update

5e54468

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

ZJY0516 requested review from AndreasKaratzas, Harry-Chen, WoosukKwon, khluu, tlrmchlsmth, yewentao256 and zyongye as code owners June 11, 2026 13:23

ZJY0516 added the ready label Jun 11, 2026

mergify Bot added the ci/build label Jun 11, 2026

update

ab10add

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

mgoin approved these changes Jun 11, 2026


mgoin merged commit f81daf8 into vllm-project:main Jun 11, 2026
106 checks passed

ZJY0516 deleted the triton-diff-kv branch June 11, 2026 15:53

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Attention] add triton diff-kv backend for mimo (vllm-project#41797)

806b2cd

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026

[Attention] add triton diff-kv backend for mimo (vllm-project#41797)

8ae6436

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Attention] add triton diff-kv backend for mimo (vllm-project#41797)

46db873

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Attention] add triton diff-kv backend for mimo (vllm-project#41797)

e9e5a7a

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

A1c0r-Z mentioned this pull request Jun 25, 2026

[Bugfix][Model] MiMo-V2: support TP > num_kv_heads for the fused FP8 QKV projection #46755

Open

mgoin merged 6 commits into vllm-project:main from ZJY0516:triton-diff-kv
