[DeepEP V2] Bound num_max_tokens_per_rank in do_expand=False#46404

Merged
WoosukKwon merged 2 commits into main from woosuk/deepep-v2-bound-num-max
Jun 23, 2026

@WoosukKwon

Collaborator

Purpose

In do_expand=False (decode / cudagraph) mode, DeepEPV2PrepareAndFinalize._do_dispatch left num_max_tokens_per_rank unset, so ElasticBuffer.dispatch fell back to the buffer's init capacity (= max_num_batched_tokens). The recv buffer was therefore sized to the worst case R * max_num_batched_tokens, and the expert kernels processed ~R * 8192 rows even when only a handful of decode tokens were present — which dominated decode step time.
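To make the sizing problem concrete, here is a rough back-of-the-envelope sketch. It assumes R is the number of ranks taking part in dispatch and that the recv buffer holds R * num_max_tokens_per_rank rows, as described above. `recv_rows` is a hypothetical helper, not a DeepEP or vLLM function, and R = 4 is only borrowed from the dp=4 benchmark setup.

```python
def recv_rows(num_ranks: int, num_max_tokens_per_rank: int) -> int:
    """Rows the expert kernels must cover if every rank may send the per-rank maximum."""
    return num_ranks * num_max_tokens_per_rank


R = 4  # assumption: one rank per GPU in the dp=4 benchmark

# Unset bound: falls back to the init capacity (max_num_batched_tokens = 8192 here).
unbounded = recv_rows(R, 8192)

# Bounded to a single decode token per rank.
bounded = recv_rows(R, 1)

print(unbounded, bounded)
```

Under these assumptions the expert kernels cover 32768 rows in the unbounded case versus 4 rows in the bounded one, which is the gap the PR targets.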

This PR bounds num_max_tokens_per_rank to the DP-padded batch size (max(num_tokens_across_dp), uniform across ranks), rounded up to a power of two.
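A minimal sketch of how such a bound could be computed, for illustration only. The function names and the `init_capacity` parameter are hypothetical and are not taken from the vLLM or DeepEP source; the cap at the buffer's init capacity follows the description in this PR.

```python
def round_up_to_pow2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def bound_num_max_tokens_per_rank(
    num_tokens_across_dp: list[int], init_capacity: int
) -> int:
    """Hypothetical helper: DP-padded batch size, rounded up to a power of two, capped."""
    # Every rank uses the same maximum, so the result is uniform across ranks.
    dp_padded = max(num_tokens_across_dp)
    # Never exceed what the buffer was initialized to hold.
    return min(round_up_to_pow2(dp_padded), init_capacity)


print(bound_num_max_tokens_per_rank([1, 1, 1, 1], 8192))
print(bound_num_max_tokens_per_rank([3, 5, 2, 7], 8192))
```

Because the input is the maximum over all DP ranks, every rank derives the same value without any extra communication beyond what already produced `num_tokens_across_dp`.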

The pow2 bucketing is important: DeepEP JIT-compiles a separate dispatch kernel per distinct num_max_tokens_per_rank. Feeding the raw per-step size makes it recompile for every batch size — a cicc recompile storm that starves the GPU at high concurrency. Rounding up to a power of two bounds the compiled set to ~log2(max_num_batched_tokens) values (compiled once, then cached), while staying small for decode (e.g. 1 token → 1) and capped at the buffer's init capacity for prefill.
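The effect on the number of compiled kernels can be estimated with a small simulation. This sketch models the JIT cache as a set keyed by num_max_tokens_per_rank and does not call DeepEP; the value 8192 for max_num_batched_tokens is an assumption taken from the row count quoted above.

```python
def round_up_to_pow2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


max_num_batched_tokens = 8192  # assumption for illustration

# Worst case without bucketing: every batch size is a distinct JIT cache key.
raw_keys = set(range(1, max_num_batched_tokens + 1))

# With bucketing: many batch sizes collapse onto the same power-of-two key.
bucketed_keys = {round_up_to_pow2(n) for n in raw_keys}

print(len(raw_keys), len(bucketed_keys))
```

In this model the bucketed set holds the powers of two from 1 through 8192, so the number of distinct kernels grows with the logarithm of the capacity rather than with the capacity itself.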

Prefill (do_expand=True) is unchanged — it keeps the existing num_max_tokens_per_rank=None / CPU-sync path.

Not a duplicate

Searched open PRs for num_max_tokens_per_rank, DeepEPV2, DeepEP v2, and deepep_v2. The related open DeepEP v2 PRs — #45282 (NVFP4 dispatch), #40718 (combine_v2), #45193 (topk_ids optional for do_expand), #45321 (Dockerfile version bump) — address different areas. None touch decode-mode recv-buffer sizing.

Test plan

  • Lint/type: ruff check, ruff format, and mypy (via pre-commit) pass on the touched file.
  • Benchmarked DeepSeek-V4-Flash on GB200 ×4, dp=4 -ep, DeepEP v2 + Triton MoE:
    • concurrency=1 decode step time: 100.7 ms → 16.4 ms (6.1×)
    • concurrency=64: no cicc recompile storm (previously ~0% GPU utilization during the storm)
    • concurrency=1024: ~+6% throughput
    • GSM8K accuracy unchanged (~95%)

Note

AI assistance (Claude) was used in preparing this change.

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

@mergify bot added the bug label Jun 22, 2026

In do_expand=False (decode/cudagraph) mode, DeepEPV2 dispatch left
num_max_tokens_per_rank unset, so the recv buffer defaulted to the
buffer's init capacity (= max_num_batched_tokens). The experts then
processed ~R * 8192 rows even for a handful of decode tokens, which
dominated decode step time.

Bound it to the DP-padded batch size (max(num_tokens_across_dp), uniform
across ranks), rounded up to a power of 2. The pow2 bucketing keeps
DeepEP's per-size dispatch-kernel JIT to a small bounded set of values
(compiled once, then cached) instead of recompiling for every per-step
size, which otherwise causes a cicc recompile storm that starves the GPU
at high concurrency.

Measured on DeepSeek-V4-Flash, GB200 x4, dp=4 -ep: concurrency=1 decode
step time 100.7ms -> 16.4ms (6.1x), no recompile storm at concurrency=64,
GSM8K accuracy unchanged.

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-Authored-By: Roy Wang <jasonailu87@gmail.com>
Co-Authored-By: gnovack <novackgm@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>

@WoosukKwon force-pushed the woosuk/deepep-v2-bound-num-max branch from 42691d1 to 55feb4b June 22, 2026 19:25
@WoosukKwon requested a review from esmeetu June 22, 2026 19:28
@WoosukKwon changed the title [Bugfix] Bound DeepEPV2 num_max_tokens_per_rank in decode mode Jun 22, 2026
@esmeetu added the ready label Jun 23, 2026
@WoosukKwon merged commit e485920 into main Jun 23, 2026
25 of 93 checks passed
@WoosukKwon deleted the woosuk/deepep-v2-bound-num-max branch June 23, 2026 01:14
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…oject#46404)

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Roy Wang <jasonailu87@gmail.com>
Co-authored-by: gnovack <novackgm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…oject#46404)

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Roy Wang <jasonailu87@gmail.com>
Co-authored-by: gnovack <novackgm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed)
