[Model] Add MiniMax M3 support #45381
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Documentation preview: https://vllm--45381.org.readthedocs.build/en/45381/
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0.
gsm8k as the spec-decode correctness gate; vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline).
KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).
nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then.
AI-assisted (Claude Code).
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
…odel_config
VllmConfig may have model_config=None (e.g. backend-selector tests), which made get_supported_kernel_block_sizes() raise AttributeError. Fall back to the base [16, 32, 64] sizes when model_config is unavailable.
AI assistance (Claude) was used for this change.
Co-authored-by: Claude
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongye@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
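The fallback described in this commit can be illustrated with a minimal sketch. It is not the actual vLLM implementation: the `ModelConfig` and `VllmConfig` classes below are hypothetical stand-ins, and the `extra_block_sizes` field is invented purely to give the non-fallback branch something to do. Only the base sizes [16, 32, 64] and the `model_config=None` case come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Base sizes named in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ModelConfig:
    # Hypothetical field, used only for illustration.
    extra_block_sizes: Tuple[int, ...] = (128,)


@dataclass
class VllmConfig:
    # Hypothetical stand-in; may be None, e.g. in backend-selector tests.
    model_config: Optional[ModelConfig] = None


def get_supported_kernel_block_sizes(vllm_config: VllmConfig) -> List[int]:
    sizes = list(BASE_KERNEL_BLOCK_SIZES)
    model_config = vllm_config.model_config
    if model_config is None:
        # Guard first: touching attributes of None is what raised AttributeError.
        return sizes
    # Model-dependent sizes are appended only when a model config exists.
    sizes.extend(s for s in model_config.extra_block_sizes if s not in sizes)
    return sizes
```

Checking for `None` before any attribute access keeps the model-dependent logic untouched while making config-less callers safe.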
…nector
Test recipe for vLLM v1's native CPU KV-cache offloading connector on
MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace
replay via aiperf inferencex-agentx-mvp) at a single large concurrency.
Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic):
TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s,
on the day-zero vllm/vllm-openai:minimax-m3 image.
New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the
M2.5 H200 agentic sibling with M3-specific serve flags:
- --block-size 128 (mandatory for MSA sparse attention)
- --language-model-only (text-only; frees VRAM for KV)
- BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8
KV corrupts output, vllm-project/vllm#45381)
- prefix caching ENABLED (coding traces share large prefixes; offloading that
cache to CPU is the point of the test)
- CPU offload via vLLM native connector: --kv-offloading-backend native
--kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager
(TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
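As a rough illustration of how the M3-specific serve flags listed in this commit fit together, the sketch below assembles them into an argument list. It is an assumption-laden sketch, not the contents of minimaxm3_fp8_h200.sh: the helper name `build_serve_args` and the model identifier passed to it are hypothetical, and parallelism flags (TP8 plus expert parallel) are deliberately left out because the commit message does not spell them out.

```python
import os
from typing import List, Optional


def build_serve_args(model: str, total_cpu_dram_gb: Optional[int] = None) -> List[str]:
    """Assemble the serve flags listed in the commit message (illustrative only)."""
    if total_cpu_dram_gb is None:
        # TOTAL_CPU_DRAM_GB defaults to 600 per the commit message.
        total_cpu_dram_gb = int(os.environ.get("TOTAL_CPU_DRAM_GB", "600"))
    return [
        "vllm", "serve", model,
        "--block-size", "128",              # mandatory for MSA sparse attention
        "--language-model-only",            # text-only; frees VRAM for KV
        # No --kv-cache-dtype flag: the KV cache stays BF16 because MXFP8
        # ships no calibrated KV scales.
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", str(total_cpu_dram_gb),
        "--disable-hybrid-kv-cache-manager",
    ]


if __name__ == "__main__":
    # "MiniMax-M3-MXFP8" is a placeholder model identifier, not a verified path.
    print(" ".join(build_serve_args("MiniMax-M3-MXFP8")))
```

Prefix caching needs no flag here because the recipe leaves it enabled.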
@youkaichao The model is not working without these two PRs:
"VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD": lambda: (
    os.getenv("VLLM_MXFP8_EMULATION_DEQUANT_AT_LOAD", "True").lower()
    in ("true", "1")
),
If no performant mixed-precision BF16-MXFP8 GEMM/MOE is available and the memory overhead is acceptable (dequantizing roughly doubles the memory occupied by the model weights here), it can make sense to dequantize ahead of time. That is why it was proposed previously in #35855 for MXFP4/MXFP6 (but rejected back then).
cc @mgoin fyi
I think the code fragmentation between MXFP4/MXFP6/MXFP8 emulation logic could be reduced in the future
e.g. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/mxfp8_emulation_moe.py and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/experts/ocp_mx_emulation_moe.py
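The "~doubling" estimate in the comment above can be checked with a short back-of-the-envelope calculation. This sketch assumes the OCP Microscaling layout for MXFP8 (8-bit elements with one shared 8-bit scale per block of 32 elements) and assumes that every weight is quantized, which overstates the effect for models that keep some layers in higher precision. The parameter count in the usage example is hypothetical.

```python
MX_BLOCK_SIZE = 32       # elements sharing one scale (OCP MX formats)
MXFP8_ELEMENT_BITS = 8   # FP8 element
MX_SCALE_BITS = 8        # shared E8M0 scale per block
BF16_BITS = 16


def mxfp8_bits_per_weight() -> float:
    """Effective storage per weight, including the amortized block scale."""
    return MXFP8_ELEMENT_BITS + MX_SCALE_BITS / MX_BLOCK_SIZE


def dequant_overhead_ratio() -> float:
    """How much larger the weights become once dequantized to BF16."""
    return BF16_BITS / mxfp8_bits_per_weight()


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Weight memory in GiB for a given parameter count and bit width."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    params = 1e9  # hypothetical parameter count, for illustration only
    print(f"MXFP8: {weight_gib(params, mxfp8_bits_per_weight()):.2f} GiB")
    print(f"BF16:  {weight_gib(params, BF16_BITS):.2f} GiB")
    print(f"ratio: {dequant_overhead_ratio():.2f}x")
```

Under these assumptions the ratio comes out slightly below 2x, because the per-block scales add a small amount to the MXFP8 footprint.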
Summary
Duplicate-work check
MiniMax M3 and minimax_m3 found no duplicates. Broader M3 model results were unrelated.
FIX #45360
Tests
cargo fmt --manifest-path rust/Cargo.toml --all -- --check
cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat
Notes