
[Model] Add MiniMax M3 support#45381

Merged
youkaichao merged 28 commits into main from m3_release
Jun 15, 2026
Conversation

@youkaichao

@youkaichao youkaichao commented Jun 12, 2026

Member

Summary

  • Add MiniMax M3 model support across config, processors, model registry, AMD/NVIDIA model implementations, MTP, sparse attention, and warmup paths.
  • Add MiniMax M3 reasoning and tool parsers, including Rust frontend registrations and Python-facing parser wrappers.
  • Add supporting kernels, quantization paths, router GEMM shape support, and targeted tests.

Duplicate-work check

  • Open PR searches for MiniMax M3 and minimax_m3 found no duplicates. Broader M3 model results were unrelated.

FIX #45360

Tests

  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat

Notes

  • AI assistance was used to prepare this one-commit release branch and resolve conflicts against current main.
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
@mergify

mergify Bot commented Jun 12, 2026

Contributor
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4),
TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter
Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env
VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate;
vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline).

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales,
so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries
the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server
bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
…odel_config

VllmConfig may have model_config=None (e.g. backend-selector tests), which
made get_supported_kernel_block_sizes() raise AttributeError. Fall back to
the base [16, 32, 64] sizes when model_config is unavailable.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongye@inferact.ai>

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
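
The fallback described in the commit message above can be illustrated with a minimal sketch. The two dataclasses and the `extra_kernel_block_sizes` field are hypothetical stand-ins, not vLLM's actual types; only the `model_config=None` case and the base `[16, 32, 64]` sizes come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Base sizes named in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ModelConfigSketch:
    # Hypothetical field: extra block sizes a specific model might allow.
    extra_kernel_block_sizes: list[int]


@dataclass
class VllmConfigSketch:
    # May be None, e.g. in backend-selector tests.
    model_config: Optional[ModelConfigSketch] = None


def get_supported_kernel_block_sizes(cfg: VllmConfigSketch) -> list[int]:
    model_config = cfg.model_config
    if model_config is None:
        # Guard: without a model config, return the base sizes instead of
        # dereferencing None and raising AttributeError.
        return list(BASE_KERNEL_BLOCK_SIZES)
    merged = set(BASE_KERNEL_BLOCK_SIZES) | set(model_config.extra_kernel_block_sizes)
    return sorted(merged)
```

Returning a copy of the base list keeps callers from mutating the shared constant.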
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 15, 2026
…nector

Test recipe for vLLM v1's native CPU KV-cache offloading connector on
MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace
replay via aiperf inferencex-agentx-mvp) at a single large concurrency.

Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic):
  TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s,
  on the day-zero vllm/vllm-openai:minimax-m3 image.

New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the
M2.5 H200 agentic sibling with M3-specific serve flags:
  - --block-size 128 (mandatory for MSA sparse attention)
  - --language-model-only (text-only; frees VRAM for KV)
  - BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8
    KV corrupts output, vllm-project/vllm#45381)
  - prefix caching ENABLED (coding traces share large prefixes; offloading that
    cache to CPU is the point of the test)
  - CPU offload via vLLM native connector: --kv-offloading-backend native
    --kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager
    (TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
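
As a rough illustration of the serve flags listed in the commit message above, the sketch below assembles an argument list. The helper name `build_m3_serve_args` and the model identifier passed to it are hypothetical; the flags and the 600 GB default are taken from the commit message, and the actual benchmark script may pass additional options.

```python
def build_m3_serve_args(model: str, cpu_dram_gb: int = 600) -> list[str]:
    """Assemble a `vllm serve` argument list with the M3-specific flags."""
    return [
        "vllm", "serve", model,
        # Block size 128 is described as mandatory for MSA sparse attention.
        "--block-size", "128",
        # Text-only serving, which frees VRAM for the KV cache.
        "--language-model-only",
        # CPU offload through the native connector.
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", str(cpu_dram_gb),
        "--disable-hybrid-kv-cache-manager",
        # Deliberately no --kv-cache-dtype fp8: the KV cache stays BF16.
    ]


# Example with a placeholder model path (an assumption, not from the PR).
args = build_m3_serve_args("path/to/MiniMax-M3-MXFP8")
```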
@gau-nernst gau-nernst mentioned this pull request Jun 15, 2026
8 tasks
Comment thread vllm/envs.py
"VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD": lambda: (
os.getenv("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", "True").lower()
in ("true", "1")
),
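
The lambda under review reads the variable as a boolean that defaults to true. A self-contained sketch of the same parsing, wrapped in a hypothetical helper named `mxfp8_dequant_at_load`, shows which values enable it:

```python
import os


def mxfp8_dequant_at_load() -> bool:
    # Same expression as the reviewed lambda: unset means "True", and only
    # "true" (any case) or "1" count as enabled.
    value = os.getenv("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", "True")
    return value.lower() in ("true", "1")


os.environ.pop("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", None)
default_enabled = mxfp8_dequant_at_load()  # unset: falls back to "True"

os.environ["VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD"] = "0"
disabled = mxfp8_dequant_at_load()  # "0" is not in the accepted set

os.environ.pop("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", None)
```

Note that with this parsing, values such as "yes" or "on" are treated as disabled.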

Contributor

When a performant mixed-precision BF16-MXFP8 GEMM/MoE kernel is not available and the memory overhead is acceptable (roughly doubling the memory occupied by model weights here), it can make sense to dequantize ahead of time, which is why this was proposed previously in #35855 for MXFP4/MXFP6 (but rejected back then).

cc @mgoin fyi

I think the code fragmentation between the MXFP4/MXFP6/MXFP8 emulation logic could be reduced in the future,

e.g. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/mxfp8_emulation_moe.py and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/ocp_mx_emulation_moe.py
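
To make the ahead-of-time dequantization discussed in this comment concrete, here is a minimal pure-Python sketch. It assumes the OCP MX layout (FP8 E4M3 elements, one shared E8M0 power-of-two scale per block of 32 elements) and is not the code in vLLM's emulation modules; all function names are hypothetical.

```python
import math

MX_BLOCK = 32  # elements sharing one scale in the OCP MX formats


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return math.nan  # E4M3 has NaN but no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def decode_e8m0(byte: int) -> float:
    """Decode a shared E8M0 scale: a pure power of two with bias 127."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


def dequantize_mxfp8(elements: bytes, scales: bytes) -> list[float]:
    """Expand MXFP8 data to plain floats, one scale per 32 elements."""
    if len(elements) != len(scales) * MX_BLOCK:
        raise ValueError("expected exactly one scale per 32 elements")
    return [
        decode_e4m3(b) * decode_e8m0(scales[i // MX_BLOCK])
        for i, b in enumerate(elements)
    ]
```

Each MXFP8 element costs 1 byte plus 1/32 byte of shared scale, while a BF16 element costs 2 bytes, so storing dequantized weights takes about 1.94x the memory, in line with the "~doubling" estimate in the comment.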


Labels

  • ci/build
  • documentation: Improvements or additions to documentation
  • gpt-oss: Related to GPT-OSS models
  • multi-modality: Related to multi-modality (#4194)
  • new-model: Requests to new models
  • nvidia
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • rust
  • speculative-decoding
  • tool-calling
  • v1