[Model] Add MiniMax M3 support by youkaichao · Pull Request #45381 · vllm-project/vllm

youkaichao · 2026-06-12T07:39:34Z

Summary

Add MiniMax M3 model support across config, processors, model registry, AMD/NVIDIA model implementations, MTP, sparse attention, and warmup paths.
Add MiniMax M3 reasoning and tool parsers, including Rust frontend registrations and Python-facing parser wrappers.
Add supporting kernels, quantization paths, router GEMM shape support, and targeted tests.

Duplicate-work check

Open PR searches for MiniMax M3 and minimax_m3 found no duplicates. Broader M3 model results were unrelated.

FIX #45360

Tests

cargo fmt --manifest-path rust/Cargo.toml --all -- --check
cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat

Notes

AI assistance was used to prepare this one-commit release branch and resolve conflicts against current main.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>

mergify · 2026-06-12T07:40:13Z

Documentation preview: https://vllm--45381.org.readthedocs.build/en/45381/

Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0.

gsm8k as the spec-decode correctness gate; vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline).

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

…odel_config

VllmConfig may have model_config=None (e.g. backend-selector tests), which made get_supported_kernel_block_sizes() raise AttributeError. Fall back to the base [16, 32, 64] sizes when model_config is unavailable.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongye@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
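The fallback described in this commit message can be illustrated with a minimal sketch. The stub classes and the `extra_block_sizes` field are hypothetical stand-ins, not vLLM's real types; only the `None` check and the base `[16, 32, 64]` sizes come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional

# Base sizes quoted in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ModelConfigStub:
    # Hypothetical field: extra sizes a model might request.
    extra_block_sizes: list[int] = field(default_factory=list)


@dataclass
class VllmConfigStub:
    # May be None, e.g. in backend-selector tests.
    model_config: Optional[ModelConfigStub] = None


def get_supported_kernel_block_sizes(cfg: VllmConfigStub) -> list[int]:
    """Return block sizes, falling back to the base list without a model config."""
    if cfg.model_config is None:
        # Previously an attribute access on None raised AttributeError here.
        return list(BASE_KERNEL_BLOCK_SIZES)
    # Assumed model-dependent branch, for illustration only.
    merged = set(BASE_KERNEL_BLOCK_SIZES) | set(cfg.model_config.extra_block_sizes)
    return sorted(merged)
```

Returning a copy of the base list keeps callers from mutating the shared default.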

…nector

Test recipe for vLLM v1's native CPU KV-cache offloading connector on MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace replay via aiperf inferencex-agentx-mvp) at a single large concurrency.

Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic): TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s, on the day-zero vllm/vllm-openai:minimax-m3 image.

New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the M2.5 H200 agentic sibling with M3-specific serve flags:
- --block-size 128 (mandatory for MSA sparse attention)
- --language-model-only (text-only; frees VRAM for KV)
- BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8 KV corrupts output, vllm-project/vllm#45381)
- prefix caching ENABLED (coding traces share large prefixes; offloading that cache to CPU is the point of the test)
- CPU offload via vLLM native connector: --kv-offloading-backend native --kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager (TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
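As a rough sketch, the serve flags listed in this commit message could be assembled as follows. The model path is a placeholder, and mapping "TEP8" to `--tensor-parallel-size 8` plus `--enable-expert-parallel` is an assumed reading of the message, not a copy of the actual script.

```python
import shlex


def build_serve_args(model: str, cpu_dram_gb: int = 600) -> list[str]:
    """Assemble a `vllm serve` argv from the flags named in the commit message."""
    return [
        "vllm", "serve", model,
        # Assumed mapping of "TEP8 (TP8 + expert parallel)".
        "--tensor-parallel-size", "8",
        "--enable-expert-parallel",
        # Quoted from the commit message: mandatory for MSA sparse attention.
        "--block-size", "128",
        # Text-only serving; frees VRAM for KV cache.
        "--language-model-only",
        # CPU KV offload through the native connector.
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", str(cpu_dram_gb),
        "--disable-hybrid-kv-cache-manager",
        # Deliberately no --kv-cache-dtype fp8: KV cache stays at BF16.
    ]


if __name__ == "__main__":
    # Placeholder model path, for illustration only.
    print(shlex.join(build_serve_args("path/to/MiniMax-M3-MXFP8")))
```

Building the argv as a list keeps the omission of `--kv-cache-dtype` explicit and easy to assert on in a test.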

gaby · 2026-06-15T22:48:52Z

@youkaichao The model is not working without these two PRs:

[BugFix] Fix MXFP8 checkpoint loading when quant_method collides with online shorthand #45582
[Bugfix] Parse MiniMax M3 streaming reasoning by text markers #45718
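To illustrate why marker-based streaming parsing needs care, here is a minimal sketch of a splitter that separates reasoning from content when a marker is cut across chunk boundaries. It is not the parser from #45718; the `<mm:think>` start marker appears in the linked bug title, while the `</mm:think>` end marker and the class name are assumptions.

```python
class MarkerStreamSplitter:
    """Split streamed text into (reasoning, content) deltas by text markers."""

    def __init__(self, start: str = "<mm:think>", end: str = "</mm:think>"):
        self.start, self.end = start, end  # end marker is an assumption
        self.buf = ""
        self.in_reasoning = False

    def feed(self, chunk: str) -> tuple[str, str]:
        self.buf += chunk
        reasoning: list[str] = []
        content: list[str] = []
        while self.buf:
            marker = self.end if self.in_reasoning else self.start
            out = reasoning if self.in_reasoning else content
            idx = self.buf.find(marker)
            if idx != -1:
                # Full marker seen: emit text before it and switch mode.
                out.append(self.buf[:idx])
                self.buf = self.buf[idx + len(marker):]
                self.in_reasoning = not self.in_reasoning
                continue
            # Hold back the longest suffix that could be a marker prefix.
            keep = 0
            for k in range(min(len(marker) - 1, len(self.buf)), 0, -1):
                if self.buf.endswith(marker[:k]):
                    keep = k
                    break
            cut = len(self.buf) - keep
            out.append(self.buf[:cut])
            self.buf = self.buf[cut:]
            break
        return "".join(reasoning), "".join(content)

    def flush(self) -> tuple[str, str]:
        """Emit any held-back text at end of stream."""
        rest, self.buf = self.buf, ""
        return (rest, "") if self.in_reasoning else ("", rest)


def split_stream(chunks: list[str]) -> tuple[str, str]:
    splitter = MarkerStreamSplitter()
    parts = [splitter.feed(c) for c in chunks] + [splitter.flush()]
    return "".join(r for r, _ in parts), "".join(c for _, c in parts)
```

Holding back a possible marker prefix is what keeps a partial `<mm:` from being emitted in the content field before the rest of the marker arrives.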

fxmarty-amd · 2026-06-24T14:29:43Z

+    "VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD": lambda: (
+        os.getenv("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", "True").lower()
+        in ("true", "1")
+    ),


When a performant mixed-precision BF16-MXFP8 GEMM/MoE is not available and the memory overhead is acceptable (roughly doubling the memory occupied by model weights here), it can make sense to dequantize ahead of time, which is why this was proposed previously in #35855 for MXFP4/MXFP6 (but rejected back then).
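A small sketch of the two points above: how the flag in the diff resolves to a boolean, and where the "roughly doubling" figure comes from. The helper names are hypothetical. The memory estimate assumes the OCP MX layout (one 8-bit scale shared by each block of 32 one-byte elements) and 2-byte BF16 weights.

```python
import os
from typing import Mapping


def mxfp8_dequant_at_load(environ: Mapping[str, str] = os.environ) -> bool:
    """Mirror the lambda in the diff: defaults to True, accepts 'true' or '1'."""
    raw = environ.get("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", "True")
    return raw.lower() in ("true", "1")


def bf16_over_mxfp8_bytes(block_size: int = 32) -> float:
    """Ratio of BF16 weight bytes to MXFP8 weight bytes per element."""
    bf16_bytes = 2.0
    # One byte per FP8 element plus one shared scale byte per block.
    mxfp8_bytes = 1.0 + 1.0 / block_size
    return bf16_bytes / mxfp8_bytes


if __name__ == "__main__":
    print(mxfp8_dequant_at_load())
    print(round(bf16_over_mxfp8_bytes(), 3))
```

Note that any value other than `true` or `1` (case-insensitive), including `yes`, disables the flag under this parsing rule.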

cc @mgoin fyi

I think the code fragmentation between MXFP4/MXFP6/MXFP8 emulation logic could be reduced in the future

e.g. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/mxfp8_emulation_moe.py and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/ocp_mx_emulation_moe.py

[Model] Add MiniMax M3 support

c983460

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>

functionstackx mentioned this pull request Jun 15, 2026

[Klaud Cold] minimaxm3-fp8-h200-vllm-agentic: test CPU KV-offload connector on M3 (agentx, single large conc 64) SemiAnalysisAI/InferenceX#1763

Closed

Merge branch 'main' into m3_release

b53c2cf

gau-nernst mentioned this pull request Jun 15, 2026

[Roadmap] Minimax M3 #45668

Open

8 tasks

Merge branch 'main' into m3_release

91e8f14

tjtanaa mentioned this pull request Jun 15, 2026

[ROCm][Quant] Minmax-M3: enable fp8_per_channel and fix SwiGLU-OAI fp8 MoE for bf16 weights on mi300x #45590

Closed

4 tasks

Merge branch 'main' into m3_release

37fe719

youkaichao merged commit 0a1c503 into main Jun 15, 2026
190 of 196 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 15, 2026

github-project-automation Bot moved this from Ready to Done in gpt-oss Issues & Enhancements Jun 15, 2026

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 15, 2026

This was referenced Jun 15, 2026

[Bugfix] Parse MiniMax M3 streaming reasoning by text markers #45718

Merged

[Bug]: Minimax m3 reasoning parser sending <mm:think> in content field in streaming #45687

Closed

cquil11 mentioned this pull request Jun 15, 2026

[Bugfix][ROCm] Fix MiniMax-M3 FP8 KV cache dtype #45720

Merged

JustinTong0323 mentioned this pull request Jun 15, 2026

Fix MiniMax-M3 allreduce fusion correctness JustinTong0323/sglang#40

Open

soaringk mentioned this pull request Jun 16, 2026

[Model][MiniMax-M3] Add pipeline parallelism support #45810

Merged

fxmarty-amd reviewed Jun 24, 2026


timothystewart6 mentioned this pull request Jun 25, 2026

feat: bump VLLM_REF past vllm-project/vllm#45381 to enable MiniMax M3 support timothystewart6/vllm-gb10#30

Closed

matdou mentioned this pull request Jun 27, 2026

[Kernel]: _flash_attn_fwd() got an unexpected keyword argument 'dynamic_causal' #46914

Open

timothystewart6 mentioned this pull request Jun 30, 2026

chore(deps): bump vLLM v0.24.0, uv 0.11.26, apt snapshot 2026-06-30 timothystewart6/vllm-gb10#35

Merged

youkaichao merged 28 commits into main from m3_release


18 participants
