[DeepEP V2] Bound num_max_tokens_per_rank in do_expand=False by WoosukKwon · Pull Request #46404 · vllm-project/vllm

WoosukKwon · 2026-06-22T19:22:27Z

Purpose

In do_expand=False (decode / cudagraph) mode, DeepEPV2PrepareAndFinalize._do_dispatch left num_max_tokens_per_rank unset, so ElasticBuffer.dispatch fell back to the buffer's init capacity (= max_num_batched_tokens). The recv buffer was therefore sized to the worst case of R * max_num_batched_tokens rows, where R is the number of ranks, and the expert kernels processed ~R * 8192 rows even when only a handful of decode tokens were present, which dominated decode step time.

This PR bounds num_max_tokens_per_rank to the DP-padded batch size (max(num_tokens_across_dp), uniform across ranks), rounded up to a power of two.
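As a rough illustration of the sizing effect, the sketch below computes the bound and the resulting worst-case recv row count. It is not the code from this PR: the helper names, NUM_RANKS = 4, MAX_NUM_BATCHED_TOKENS = 8192, and the per-rank token counts are all assumptions chosen to mirror the numbers quoted above.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def bound_tokens_per_rank(num_tokens_across_dp: list[int]) -> int:
    """Hypothetical helper: DP-padded batch size, rounded up to a power of two."""
    # max() gives the DP-padded size, so every rank computes the same value.
    padded = max(num_tokens_across_dp)
    return next_power_of_two(padded)


def worst_case_recv_rows(num_ranks: int, tokens_per_rank: int) -> int:
    """Rows the recv buffer must hold if every rank sends its maximum."""
    return num_ranks * tokens_per_rank


# Assumed values for illustration only.
NUM_RANKS = 4
MAX_NUM_BATCHED_TOKENS = 8192
num_tokens_across_dp = [3, 1, 2, 1]  # a small decode step

bound = bound_tokens_per_rank(num_tokens_across_dp)
rows_before = worst_case_recv_rows(NUM_RANKS, MAX_NUM_BATCHED_TOKENS)
rows_after = worst_case_recv_rows(NUM_RANKS, bound)

print(f"bound={bound}, rows before={rows_before}, rows after={rows_after}")
```

Under these assumed inputs the expert kernels would see 16 rows instead of 32768, which is the kind of reduction the decode-step numbers in the test plan reflect.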

The pow2 bucketing is important: DeepEP JIT-compiles a separate dispatch kernel per distinct num_max_tokens_per_rank. Feeding the raw per-step size makes it recompile for every batch size — a cicc recompile storm that starves the GPU at high concurrency. Rounding up to a power of two bounds the compiled set to ~log2(max_num_batched_tokens) values (compiled once, then cached), while staying small for decode (e.g. 1 token → 1) and capped at the buffer's init capacity for prefill.
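The effect of bucketing on the number of distinct JIT keys can be checked with a short sketch. The `bucket` function and INIT_CAPACITY = 8192 are assumptions for illustration; the actual vLLM and DeepEP code may differ in how the cap is applied.

```python
def bucket(num_tokens: int, init_capacity: int) -> int:
    """Hypothetical bucketing: round up to a power of two, capped at init capacity."""
    pow2 = 1 if num_tokens <= 1 else 1 << (num_tokens - 1).bit_length()
    return min(pow2, init_capacity)


# Assumed buffer init capacity (= max_num_batched_tokens in this sketch).
INIT_CAPACITY = 8192

# Every batch size the engine could see in a step.
raw_sizes = range(1, INIT_CAPACITY + 1)

# Without bucketing, each distinct size would be its own compile key.
distinct_raw = len(set(raw_sizes))

# With bucketing, only the power-of-two values remain.
distinct_bucketed = sorted({bucket(n, INIT_CAPACITY) for n in raw_sizes})

print(f"raw keys: {distinct_raw}, bucketed keys: {len(distinct_bucketed)}")
print(distinct_bucketed)
```

With these assumed values, 8192 possible raw sizes collapse to 14 bucketed values (1, 2, 4, ..., 8192), so each dispatch kernel variant is compiled at most once and then served from cache.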

Prefill (do_expand=True) is unchanged — it keeps the existing num_max_tokens_per_rank=None / CPU-sync path.

Not a duplicate

Searched open PRs for num_max_tokens_per_rank, DeepEPV2, DeepEP v2, and deepep_v2. The related open DeepEP v2 PRs — #45282 (NVFP4 dispatch), #40718 (combine_v2), #45193 (topk_ids optional for do_expand), #45321 (Dockerfile version bump) — address different areas. None touch decode-mode recv-buffer sizing.

Test plan

Lint/type: ruff check, ruff format, and mypy (via pre-commit) pass on the touched file.
Benchmarked DeepSeek-V4-Flash on GB200 ×4, dp=4 -ep, DeepEP v2 + Triton MoE:
- concurrency=1 decode step time: 100.7 ms → 16.4 ms (6.1×)
- concurrency=64: no cicc recompile storm (previously ~0% GPU utilization during the storm)
- concurrency=1024: ~+6% throughput
- GSM8K accuracy unchanged (~95%)

Note

AI assistance (Claude) was used in preparing this change.

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

In do_expand=False (decode/cudagraph) mode, DeepEPV2 dispatch left num_max_tokens_per_rank unset, so the recv buffer defaulted to the buffer's init capacity (= max_num_batched_tokens). The experts then processed ~R * 8192 rows even for a handful of decode tokens, which dominated decode step time.

Bound it to the DP-padded batch size (max(num_tokens_across_dp), uniform across ranks), rounded up to a power of 2. The pow2 bucketing keeps DeepEP's per-size dispatch-kernel JIT to a small bounded set of values (compiled once, then cached) instead of recompiling for every per-step size, which otherwise causes a cicc recompile storm that starves the GPU at high concurrency.

Measured on DeepSeek-V4-Flash, GB200 x4, dp=4 -ep: concurrency=1 decode step time 100.7ms -> 16.4ms (6.1x), no recompile storm at concurrency=64, GSM8K accuracy unchanged.

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-Authored-By: Roy Wang <jasonailu87@gmail.com>
Co-Authored-By: gnovack <novackgm@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>

…oject#46404) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: gnovack <novackgm@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

…oject#46404) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: gnovack <novackgm@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

WoosukKwon requested review from mgoin, pavanimajety and zyongye as code owners June 22, 2026 19:22

mergify Bot added the "bug" label (Something isn't working) Jun 22, 2026

WoosukKwon force-pushed the woosuk/deepep-v2-bound-num-max branch from 42691d1 to 55feb4b Compare June 22, 2026 19:25

WoosukKwon requested a review from esmeetu June 22, 2026 19:28

WoosukKwon changed the title from ~~[Bugfix] Bound DeepEPV2 num_max_tokens_per_rank in decode mode~~ to [DeepEP V2] Bound num_max_tokens_per_rank in do_expand=False Jun 22, 2026

esmeetu approved these changes Jun 23, 2026

esmeetu added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 23, 2026

Merge branch 'main' into woosuk/deepep-v2-bound-num-max

3a9fdcf

WoosukKwon merged commit e485920 into main Jun 23, 2026
25 of 93 checks passed

WoosukKwon deleted the woosuk/deepep-v2-bound-num-max branch June 23, 2026 01:14

WoosukKwon merged 2 commits into main from woosuk/deepep-v2-bound-num-max
