[Kernel][Perf] Tune fused_moe FP8 config for Qwen3-Next-80B tp=4 on H100 (+25% at batch 96-512) by qyYue1389 · Pull Request #44830 · vllm-project/vllm

qyYue1389 · 2026-06-08T04:51:22Z

Purpose

Add a tuned Triton fused_moe config for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on a single NVIDIA H100 80GB HBM3 with tp=4 and FP8 w8a8 blockwise quantization (block_shape=[128, 128]).

Qwen3-Next-80B has num_experts=512, num_experts_per_tok=10, and moe_intermediate_size=512. With tp=4, each rank holds 512 / 4 = 128 of the intermediate dimension, which gives the kernel shape E=512, N=128. For the batch-size range that dominates production serving, the current get_default_config in fused_moe.py is up to ~25% slower than the tuned configuration (batch sizes 96-512 in the table below).
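The shape arithmetic above can be sketched in a few lines of Python. This is only an illustration: `MoEShape` and `config_file_name` are hypothetical helpers written for this note, not vLLM APIs, and the file-name pattern is simply copied from the JSON path used in the reproduction commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEShape:
    """Hypothetical container for the model values quoted in this PR."""

    num_experts: int
    moe_intermediate_size: int
    tp_size: int

    @property
    def E(self) -> int:
        # The expert count is not sharded by tensor parallelism in this sketch.
        return self.num_experts

    @property
    def N(self) -> int:
        # Each tensor-parallel rank holds an equal slice of the intermediate dim.
        if self.moe_intermediate_size % self.tp_size != 0:
            raise ValueError("moe_intermediate_size must be divisible by tp_size")
        return self.moe_intermediate_size // self.tp_size


def config_file_name(shape: MoEShape, device_name: str, dtype: str,
                     block_shape: list[int]) -> str:
    """Rebuild the file name pattern seen in this PR (assumed, not a vLLM call)."""
    block = ",".join(str(b) for b in block_shape)
    return (f"E={shape.E},N={shape.N},device_name={device_name},"
            f"dtype={dtype},block_shape=[{block}].json")


shape = MoEShape(num_experts=512, moe_intermediate_size=512, tp_size=4)
name = config_file_name(shape, "NVIDIA_H100_80GB_HBM3", "fp8_w8a8", [128, 128])
print(shape.E, shape.N)
print(name)
```

With the values from this PR, the helper reproduces the name of the JSON file that the reproduction commands move in and out of the configs/ directory.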

The file contains 5 batch-size keys (96, 128, 256, 512, 1024), the ones for which the tuned config beats the default by >=5%. Other batch sizes fall back to get_default_config via the existing vLLM fallback path.

Test plan

Tuned with benchmark_moe.py --tune on a single H100 80GB HBM3 (precompiled vLLM cu128, torch 2.11.0+cu128, triton 3.6.0).

Test Result

Tuning walked 10,368 valid configurations across 18 batch sizes ({1,2,4,8,16,24,32,48,64,96,128,256,512,1024,1536,2048,3072,4096}), for a total wall time of 77 minutes.

After tuning, the file was copied into the configs/ directory and the benchmark was re-run for all 18 batch sizes. Per-batch-size median kernel time:

batch_size	baseline (us)	tuned (us)	speedup
1	29.45	29.26	+0.6%
2	30.35	30.13	+0.7%
4	36.38	35.24	+3.2%
8	54.81	54.93	-0.2%
16	80.74	79.29	+1.8%
24	95.90	96.65	-0.8%
32	111.83	109.90	+1.8%
48	140.51	140.38	+0.1%
64	152.18	151.30	+0.6%
96	215.09	171.56	+25.4%
128	229.10	183.06	+25.1%
256	248.41	199.60	+24.5%
512	262.77	217.58	+20.8%
1024	290.12	268.59	+8.0%
1536	318.21	317.86	+0.1%
2048	345.23	344.86	+0.1%
3072	429.06	429.53	-0.1%
4096	568.93	567.50	+0.3%

The JSON ships only the 5 keys with a speedup of at least 5% (96, 128, 256, 512, 1024). Batch sizes outside this range fall back to get_default_config, which my benchmark shows is within about 3% of the tuned config (the largest gap is +3.2% at batch size 4; most are under 1%).
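As a cross-check of the key selection, the following sketch recomputes the speedup column from the measured medians in the table and keeps the batch sizes at or above the 5% threshold. It assumes speedup is defined as baseline / tuned - 1, which matches the published percentages; the variable and function names are illustrative only.

```python
# Median kernel times in microseconds, copied from the table:
# batch_size -> (baseline_us, tuned_us)
MEASUREMENTS = {
    1: (29.45, 29.26), 2: (30.35, 30.13), 4: (36.38, 35.24),
    8: (54.81, 54.93), 16: (80.74, 79.29), 24: (95.90, 96.65),
    32: (111.83, 109.90), 48: (140.51, 140.38), 64: (152.18, 151.30),
    96: (215.09, 171.56), 128: (229.10, 183.06), 256: (248.41, 199.60),
    512: (262.77, 217.58), 1024: (290.12, 268.59), 1536: (318.21, 317.86),
    2048: (345.23, 344.86), 3072: (429.06, 429.53), 4096: (568.93, 567.50),
}

THRESHOLD_PCT = 5.0  # keep a key only if tuned beats the default by >= 5%


def speedup_pct(baseline_us: float, tuned_us: float) -> float:
    """Speedup in percent, assuming the definition baseline / tuned - 1."""
    return 100.0 * (baseline_us / tuned_us - 1.0)


speedups = {bs: speedup_pct(base, tuned)
            for bs, (base, tuned) in MEASUREMENTS.items()}
kept_keys = [bs for bs, pct in speedups.items() if pct >= THRESHOLD_PCT]
dropped_max = max(pct for bs, pct in speedups.items() if bs not in kept_keys)

for bs, pct in speedups.items():
    marker = "keep" if bs in kept_keys else "drop"
    print(f"{bs:>5}  {pct:+6.1f}%  {marker}")
```

The filter selects exactly the five keys shipped in the JSON, and `dropped_max` gives the largest speedup among the batch sizes left to the default config.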

The wins on the kept keys are driven primarily by GROUP_SIZE_M=64 (get_default_config uses GROUP_SIZE_M=32), which gives a more L2-cache-friendly tile traversal order for the E=512, N=128 FP8 workload.
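To illustrate what GROUP_SIZE_M changes, here is a small pure-Python sketch of the grouped program-id ordering used in Triton's matrix-multiplication tutorial. It is assumed here that the fused_moe kernel follows the same scheme; the function name `grouped_tile_order` is hypothetical, and the sketch only shows the traversal order, not any cache measurement.

```python
def grouped_tile_order(num_pid_m: int, num_pid_n: int,
                       group_size_m: int) -> list[tuple[int, int]]:
    """Map linear program ids to (pid_m, pid_n) tiles in grouped order.

    Tiles are visited in bands of `group_size_m` rows; inside a band the
    walk goes down the rows first, then moves to the next column.
    """
    order = []
    num_pid_in_group = group_size_m * num_pid_n
    for pid in range(num_pid_m * num_pid_n):
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * group_size_m
        # The last band may contain fewer rows than group_size_m.
        rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
        pid_in_group = pid % num_pid_in_group
        pid_m = first_pid_m + pid_in_group % rows_in_group
        pid_n = pid_in_group // rows_in_group
        order.append((pid_m, pid_n))
    return order


# Toy grid: 4 row tiles x 2 column tiles, bands of 2 rows.
small = grouped_tile_order(num_pid_m=4, num_pid_n=2, group_size_m=2)
print(small)

# A partial last band (3 row tiles, bands of 2) still covers every tile once.
partial = grouped_tile_order(num_pid_m=3, num_pid_n=2, group_size_m=2)
print(partial)
```

In this ordering, up to `group_size_m` consecutive programs share the same column tile, so a larger group size lets more neighbouring programs reuse the same weight tile before the walk moves on. Whether that helps depends on the shape and hardware, which is why the value was found by tuning rather than derived.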

The closest H100 PR is #35808 (E=256, N=512 on H100), which targets a different shape. Recent tuned-config PRs #44273, #44152, and #44553 all target H20, not H100. No open or merged PR addresses E=512, N=128 FP8 blockwise on H100.

Reproduction commands

# Baseline (move the new JSON aside first to force the default config)
mv "vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json" /tmp/
.venv/bin/python benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    -tp 4 --dtype fp8_w8a8 \
    --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096

# Tuned (restore the JSON)
mv "/tmp/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json" \
   vllm/model_executor/layers/fused_moe/configs/
.venv/bin/python benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    -tp 4 --dtype fp8_w8a8 \
    --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>

Tune fused_moe FP8 blockwise config for Qwen3-Next-80B tp=4 on H100

3f57ce7

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>

qyYue1389 requested review from mgoin, pavanimajety and zyongye as code owners June 8, 2026 04:51

claude Bot reviewed Jun 8, 2026

mergify Bot added the qwen label (Related to Qwen models) Jun 8, 2026

qyYue1389 marked this pull request as draft June 8, 2026 04:52

qyYue1389 marked this pull request as ready for review June 8, 2026 04:58

ZJY0516 approved these changes Jun 9, 2026

ZJY0516 enabled auto-merge (squash) June 9, 2026 04:05

github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 9, 2026

ZJY0516 added 2 commits June 9, 2026 12:06

Merge branch 'main' into tune-qwen3-next-fp8-h100

468599d

Merge branch 'main' into tune-qwen3-next-fp8-h100

726f361

ZJY0516 merged commit 59401ac into vllm-project:main Jun 9, 2026
77 checks passed

