[Kernel][Perf] Tune fused_moe FP8 config for Qwen3-Next-80B tp=4 on H100 (+25% at batch 96-512)#44830

Merged
ZJY0516 merged 3 commits into vllm-project:main from qyYue1389:tune-qwen3-next-fp8-h100
Jun 9, 2026

@qyYue1389 (Contributor) commented Jun 8, 2026

Purpose

Add a tuned Triton fused_moe config for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on NVIDIA H100 80GB HBM3 with tp=4 and FP8 w8a8 blockwise quantization (block_shape=[128, 128]).

Qwen3-Next-80B has num_experts=512, num_experts_per_tok=10, and moe_intermediate_size=512. With tp=4 this gives the kernel shape E=512, N=128 (512 / 4). For this shape, the current get_default_config in fused_moe.py is roughly 21-25% slower than the tuned configuration at batch sizes 96-512, the range that dominates production serving.
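
The shape arithmetic above can be sketched as follows. This is only an illustration: `moe_kernel_shape` and `config_file_name` are hypothetical helpers written for this note, not vLLM functions, and the file-name layout is simply copied from the JSON name used in the reproduction commands of this PR.

```python
def moe_kernel_shape(num_experts: int, moe_intermediate_size: int, tp_size: int) -> tuple[int, int]:
    """Return (E, N) as described in the PR text: E is the expert count,
    N is the per-rank slice of the MoE intermediate size."""
    if moe_intermediate_size % tp_size != 0:
        raise ValueError("moe_intermediate_size must be divisible by tp_size")
    return num_experts, moe_intermediate_size // tp_size


def config_file_name(E: int, N: int, device_name: str, dtype: str, block_shape: list[int]) -> str:
    """Build a config file name in the layout seen in this PR (assumed, not taken from vLLM code)."""
    shape = ",".join(str(b) for b in block_shape)
    return f"E={E},N={N},device_name={device_name},dtype={dtype},block_shape=[{shape}].json"


# Values quoted in the PR description.
E, N = moe_kernel_shape(num_experts=512, moe_intermediate_size=512, tp_size=4)
name = config_file_name(E, N, "NVIDIA_H100_80GB_HBM3", "fp8_w8a8", [128, 128])
print(E, N)
print(name)
```

Note that num_experts_per_tok=10 does not enter the (E, N) shape; it only affects how many experts each token is routed to.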

The file contains 5 batch-size keys (96, 128, 256, 512, 1024), the ones for which the tuned config beats the default by >=5%. Other batch sizes fall back to get_default_config via the existing vLLM fallback path.

Test plan

Tuned with benchmarks/kernels/benchmark_moe.py --tune on a single H100 80GB HBM3 (precompiled vLLM cu128 + torch 2.11.0+cu128 + triton 3.6.0).

Test Result

Tuning swept 10,368 valid configurations across 18 batch sizes ({1,2,4,8,16,24,32,48,64,96,128,256,512,1024,1536,2048,3072,4096}) in 77 minutes of total wall time.

After tuning, the file was copied into the configs/ directory and the benchmark was re-run for all 18 batch sizes. Per-batch-size median kernel time:

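| batch_size | baseline (us) | tuned (us) | speedup |
|---|---|---|---|
| 1 | 29.45 | 29.26 | +0.6% |
| 2 | 30.35 | 30.13 | +0.7% |
| 4 | 36.38 | 35.24 | +3.2% |
| 8 | 54.81 | 54.93 | -0.2% |
| 16 | 80.74 | 79.29 | +1.8% |
| 24 | 95.90 | 96.65 | -0.8% |
| 32 | 111.83 | 109.90 | +1.8% |
| 48 | 140.51 | 140.38 | +0.1% |
| 64 | 152.18 | 151.30 | +0.6% |
| **96** | **215.09** | **171.56** | **+25.4%** |
| **128** | **229.10** | **183.06** | **+25.1%** |
| **256** | **248.41** | **199.60** | **+24.5%** |
| **512** | **262.77** | **217.58** | **+20.8%** |
| **1024** | **290.12** | **268.59** | **+8.0%** |
| 1536 | 318.21 | 317.86 | +0.1% |
| 2048 | 345.23 | 344.86 | +0.1% |
| 3072 | 429.06 | 429.53 | -0.1% |
| 4096 | 568.93 | 567.50 | +0.3% |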

The JSON ships only 5 keys (96, 128, 256, 512, 1024), the batch sizes where the tuned config wins by >=5%. Batch sizes outside this set fall back to get_default_config, which my benchmark shows is within about 3% of the tuned config (the largest gap is +3.2% at batch size 4).
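
The key selection can be reproduced from the table with a few lines of Python. This is a minimal sketch, assuming speedup is computed as baseline / tuned - 1 (which matches the percentages reported above); the variable names are illustrative and not part of benchmark_moe.py.

```python
# Median kernel times in microseconds, copied from the table in this PR:
# batch_size -> (baseline, tuned)
timings_us = {
    1: (29.45, 29.26), 2: (30.35, 30.13), 4: (36.38, 35.24),
    8: (54.81, 54.93), 16: (80.74, 79.29), 24: (95.90, 96.65),
    32: (111.83, 109.90), 48: (140.51, 140.38), 64: (152.18, 151.30),
    96: (215.09, 171.56), 128: (229.10, 183.06), 256: (248.41, 199.60),
    512: (262.77, 217.58), 1024: (290.12, 268.59), 1536: (318.21, 317.86),
    2048: (345.23, 344.86), 3072: (429.06, 429.53), 4096: (568.93, 567.50),
}

THRESHOLD = 0.05  # keep a key only if the tuned config wins by >= 5%

# Assumed definition of speedup: how much faster tuned is relative to baseline.
speedup = {bs: base / tuned - 1.0 for bs, (base, tuned) in timings_us.items()}
kept_keys = [bs for bs, s in speedup.items() if s >= THRESHOLD]
largest_dropped_gain = max(s for bs, s in speedup.items() if bs not in kept_keys)

print(kept_keys)
print(f"largest gain among dropped keys: {largest_dropped_gain:+.1%}")
```

Shipping only the keys that clear the threshold keeps the JSON small and avoids committing entries whose measured differences are within benchmark noise.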

The wins on the kept keys are primarily driven by GROUP_SIZE_M=64 (vs GROUP_SIZE_M=32 in get_default_config), which gives a more L2-cache-friendly tile traversal order for the E=512, N=128 FP8 workload.
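
To illustrate what GROUP_SIZE_M controls, here is a pure-Python sketch of the grouped program-id ordering from the Triton matrix-multiplication tutorial. That the fused_moe kernel maps program ids in exactly this way is an assumption made for illustration; `grouped_tile_order` is a hypothetical helper, and the tiny grid sizes are chosen only to keep the output readable.

```python
def grouped_tile_order(num_pid_m: int, num_pid_n: int, group_size_m: int) -> list[tuple[int, int]]:
    """Return the (pid_m, pid_n) tile visited by each linear program id,
    using the grouped ordering from the Triton matmul tutorial."""
    order = []
    num_pid_in_group = group_size_m * num_pid_n
    for pid in range(num_pid_m * num_pid_n):
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * group_size_m
        # The last group may contain fewer than group_size_m rows of tiles.
        rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
        pid_m = first_pid_m + (pid % num_pid_in_group) % rows_in_group
        pid_n = (pid % num_pid_in_group) // rows_in_group
        order.append((pid_m, pid_n))
    return order


# 4 row tiles x 2 column tiles, grouped 2 rows at a time.
order = grouped_tile_order(num_pid_m=4, num_pid_n=2, group_size_m=2)
print(order)
```

Within a group, consecutive program ids walk down the rows of one column tile before moving to the next column, so neighbouring programs reuse the same weight tile. A larger group size lengthens that run, which is the kind of reuse the L2-cache argument above relies on.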

The closest H100 PR is #35808 (E=256, N=512 on H100), which targets a different shape. Recent tuned-config PRs #44273 / #44152 / #44553 all target H20, not H100. No open or merged PR addresses E=512, N=128 FP8 blockwise on H100.

Reproduction commands

# Baseline (move the new JSON aside first to force the default config)
# The JSON path is quoted so the shell does not treat [128,128] as a glob pattern.
mv "vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json" /tmp/
.venv/bin/python benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    -tp 4 --dtype fp8_w8a8 \
    --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096

# Tuned (restore the JSON)
mv "/tmp/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json" \
   vllm/model_executor/layers/fused_moe/configs/
.venv/bin/python benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    -tp 4 --dtype fp8_w8a8 \
    --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096
Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the qwen Related to Qwen models label Jun 8, 2026
@qyYue1389 qyYue1389 marked this pull request as draft June 8, 2026 04:52
@qyYue1389 qyYue1389 marked this pull request as ready for review June 8, 2026 04:58
@ZJY0516 ZJY0516 enabled auto-merge (squash) June 9, 2026 04:05
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026
@ZJY0516 ZJY0516 merged commit 59401ac into vllm-project:main Jun 9, 2026
77 checks passed
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
…100 (+25% at batch 96-512) (vllm-project#44830)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Labels

qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

2 participants