[WideEP] Integrate DeepEP v2 #41183

Merged
WoosukKwon merged 34 commits into vllm-project:main from tlrmchlsmth:deepep-v2-integration
Jun 9, 2026

@tlrmchlsmth tlrmchlsmth commented Apr 29, 2026

Member

Integrating deepseek-ai/DeepEP#605

Unit tests are passing, and it works end to end. GSM8K results look good; I will follow up with more thorough e2e tests.

Notes:

  • I couldn't get this working on an 8xB200 system because DeepEP v2's ElasticBuffer unconditionally asserts NCCL GIN availability, even for intra-node, NVLink-only setups. This is a TODO.
  • This requires NCCL >= 2.30.4, but PyTorch pins NCCL 2.28.9, so for now users must manually upgrade NCCL after installing torch: uv pip install "nvidia-nccl-cu13>=2.30.4"

I'm using this Containerfile for now: https://github.com/tlrmchlsmth/j-llm-d/blob/deepep-v2-dev-image/dev/Containerfile.deepep-v2

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for the DeepEP v2 (ElasticBuffer) all2all backend, including a new DeepEPV2PrepareAndFinalize implementation for MoE kernels, a dedicated All2AllManager, and associated configuration and environment variables. A comprehensive test suite for DeepEP v2 MoE is also introduced. Feedback identifies a critical issue where a strict bfloat16 assertion in the finalization logic would cause crashes for float16 models, recommending a cast to bfloat16 instead to maintain compatibility.

Comment thread vllm/model_executor/layers/fused_moe/prepare_finalize/deepep_v2.py Outdated
Add a new `deepep_v2` all2all backend that uses the DeepEP v2
ElasticBuffer API (NCCL GIN backend). This provides a unified
dispatch/combine interface that works for both intra-node and
inter-node expert parallelism with analytical SM calculation.

Key changes:
- New DeepEPV2PrepareAndFinalize class using do_expand=True for
  per-expert-contiguous layout with weighted reduction in combine
- DeepEPV2All2AllManager with ElasticBuffer handle caching and
  theoretical SM calculation via get_theoretical_num_sms()
- NCCL >= 2.30.4 version gating in has_deep_ep_v2() since the
  GIN backend requires a newer NCCL than PyTorch typically bundles
- FP8 block-quantized dispatch support
- DBO (micro-batching) support with async prepare/finalize
- Environment variables: VLLM_DEEPEP_V2_ALLOW_HYBRID_MODE,
  VLLM_DEEPEP_V2_PREFER_OVERLAP, VLLM_DEEPEP_V2_ALLOW_MULTIPLE_REDUCTION
- Update DeepEP install script to pin v2.0 release (b306af06af)
- Comprehensive multi-process test suite

Usage: --all2all-backend=deepep_v2 --enable-expert-parallel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

tlrmchlsmth and others added 4 commits April 29, 2026 22:56
use_fp8_dispatch requires the ElasticBuffer to receive FP8 input.
In production, this is ensured by pre-quantizing via
moe_kernel_quantize_input when is_block_quantized=True.

The test was parametrizing use_fp8_dispatch independently of dtype,
allowing bf16 input with use_fp8_dispatch=True which triggers a
buffer size assertion in DeepEP v2.

Fix:
- Derive use_fp8_dispatch from dtype (True only for FP8 weights)
- Add block_shape=[128, 128] to quant config for FP8 to enable
  the block quantization path that pre-quantizes input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
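
The fix above ties `use_fp8_dispatch` to the weight dtype instead of parametrizing the two independently. A minimal sketch of that derivation follows. The names `MoETestCase` and `make_case` are hypothetical, and plain strings stand in for torch dtypes so the sketch runs without torch; the actual test code may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-in for the FP8 torch dtypes used by the real test.
FP8_DTYPES = {"float8_e4m3fn"}


@dataclass
class MoETestCase:
    dtype: str
    use_fp8_dispatch: bool
    block_shape: Optional[List[int]]


def make_case(dtype: str) -> MoETestCase:
    """Derive the FP8 dispatch settings from the weight dtype."""
    is_fp8 = dtype in FP8_DTYPES
    return MoETestCase(
        dtype=dtype,
        # FP8 dispatch is only valid when the input is pre-quantized,
        # which only happens on the block-quantized FP8 path.
        use_fp8_dispatch=is_fp8,
        # [128, 128] is the block shape named in the commit message.
        block_shape=[128, 128] if is_fp8 else None,
    )


# One case per dtype; the invalid bf16 + FP8-dispatch combination
# can no longer be generated.
cases = [make_case(d) for d in ("bfloat16", "float8_e4m3fn")]
```

Deriving the flag rather than parametrizing it keeps the test matrix limited to combinations that production code can actually reach.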
Test DeepEPV2All2AllManager init, ElasticBuffer handle creation
and caching, SM calculation, and destroy/re-create cycle.
Skipped when DeepEP v2 or NCCL >= 2.30.4 is not available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
torch.cuda.nccl.version() returns the compile-time NCCL version
baked into the PyTorch wheel, not the runtime library. Use ctypes
to load the actual libnccl.so and call ncclGetVersion() directly,
which respects VLLM_NCCL_SO_PATH and LD_LIBRARY_PATH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
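
The runtime check described above can be sketched roughly as follows. This is an illustration, not the code from the PR: the helper names are hypothetical, the default library name `libnccl.so.2` is an assumption, and `VLLM_NCCL_SO_PATH` is assumed to hold the path of the library file. `ncclGetVersion(int*)` is the NCCL C API call; the decoding follows NCCL's version-code convention.

```python
import ctypes
import os
from typing import Optional, Tuple

MIN_NCCL = (2, 30, 4)  # minimum stated in the PR description


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Decode NCCL's integer version code into (major, minor, patch).

    NCCL >= 2.9 encodes versions as major*10000 + minor*100 + patch;
    older releases used major*1000 + minor*100 + patch.
    """
    if code >= 10000:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100


def runtime_nccl_version(
    so_path: Optional[str] = None,
) -> Optional[Tuple[int, int, int]]:
    """Ask the loaded libnccl for its version; None if unavailable."""
    # Assumption: VLLM_NCCL_SO_PATH, when set, points at the library file.
    path = so_path or os.environ.get("VLLM_NCCL_SO_PATH") or "libnccl.so.2"
    try:
        lib = ctypes.CDLL(path)
        version = ctypes.c_int()
        # ncclGetVersion returns ncclSuccess (0) on success.
        if lib.ncclGetVersion(ctypes.byref(version)) != 0:
            return None
    except (OSError, AttributeError):
        return None
    return decode_nccl_version(version.value)


def nccl_is_new_enough() -> bool:
    """True only if a runtime NCCL was found and meets the minimum."""
    found = runtime_nccl_version()
    return found is not None and found >= MIN_NCCL
```

Comparing tuples avoids string comparison pitfalls such as "2.9" sorting after "2.30".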
@tlrmchlsmth tlrmchlsmth force-pushed the deepep-v2-integration branch from 5f78797 to 75149ae on April 30, 2026 02:59
Comment thread vllm/model_executor/layers/fused_moe/prepare_finalize/deepep_v2.py Outdated
Comment thread tests/kernels/moe/test_deepep_v2_moe.py Outdated
Comment thread tests/kernels/moe/test_deepep_v2_moe.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/prepare_finalize/deepep_v2.py
tlrmchlsmth and others added 2 commits April 30, 2026 21:51
- Remove explicit two-stream DBO switching (dbo_yield_and_switch_*),
  use synchronous dispatch/combine (async_with_compute_stream=False).
  The ElasticBuffer handles comm internally on its comm_stream.
- Switch from do_expand=True to do_expand=False for cudagraph compat.
  do_expand=True requires do_cpu_sync=True (CPU polling loop) which
  can't be captured in a cudagraph. do_expand=False with do_cpu_sync=False
  is fully capturable.
- Handle worst-case padding from do_cpu_sync=False: use
  handle.psum_num_recv_tokens_per_scaleup_rank to get real token count,
  zero out padding rows in recv_x, recv_topk_weights, and expert_x_scale.
- Add explicitly_destroy=True to ElasticBuffer creation in all2all.py.
- Add cudagraph capture/replay unit test (test_deep_ep_v2_moe_cudagraph).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
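
With do_cpu_sync=False the receive buffers are sized for the worst case, so rows past the real token count hold padding that must be zeroed. The sketch below shows the masking idea with plain Python lists in place of tensors. It assumes that the last entry of the prefix-sum array is the total number of received tokens, which the commit message does not state explicitly, and `zero_padding_rows` is a hypothetical helper rather than a function from the PR.

```python
from typing import List


def zero_padding_rows(
    rows: List[List[float]],
    psum_num_recv_tokens: List[int],
) -> List[List[float]]:
    """Return rows with everything past the real token count zeroed.

    The output keeps the worst-case shape of the input, which is what a
    captured cudagraph needs: only the contents change, never the size.
    """
    # Assumption: the final prefix-sum entry is the total real token count.
    num_real = psum_num_recv_tokens[-1] if psum_num_recv_tokens else 0
    return [
        list(row) if i < num_real else [0.0] * len(row)
        for i, row in enumerate(rows)
    ]


# Three worst-case rows, two of which carry real tokens.
recv_x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cleaned = zero_padding_rows(recv_x, [1, 2])
```

In the real code the same masking would apply to recv_x, recv_topk_weights, and expert_x_scale, as listed in the commit message.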
Document the four key design decisions (do_expand=False,
do_cpu_sync=False, async_with_compute_stream=False,
expert_tokens_meta=None) and why each is necessary for
cudagraph + DBO compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@mergify

mergify Bot commented May 1, 2026

Contributor

Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
Prefill (use_cudagraph=False):
  do_expand=True + do_cpu_sync=True — exact memory allocation,
  per-expert-contiguous layout. Saves GPU memory for large batches.

Decode (use_cudagraph=True):
  do_expand=False + do_cpu_sync=False — worst-case allocation,
  scattered layout. Fully cudagraph-capturable.

Mode selected based on enforce_eager config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
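
The two modes above can be summarized as a small selection function. This is a sketch under stated assumptions: `DispatchMode` and `select_dispatch_mode` are hypothetical names, and cudagraphs are assumed to be in use exactly when `enforce_eager` is False.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchMode:
    do_expand: bool
    do_cpu_sync: bool

    def __post_init__(self) -> None:
        # Per the earlier commit message, do_expand=True requires
        # do_cpu_sync=True (a CPU polling loop).
        if self.do_expand and not self.do_cpu_sync:
            raise ValueError("do_expand=True requires do_cpu_sync=True")


def select_dispatch_mode(enforce_eager: bool) -> DispatchMode:
    """Pick the dispatch flags from the enforce_eager setting."""
    if enforce_eager:
        # No cudagraph: exact allocation, per-expert-contiguous layout.
        return DispatchMode(do_expand=True, do_cpu_sync=True)
    # Cudagraph path: worst-case allocation, scattered layout,
    # no CPU synchronization so the graph can be captured.
    return DispatchMode(do_expand=False, do_cpu_sync=False)
```

Checking the invariant in one place makes an invalid flag combination fail at construction time rather than during graph capture.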
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
input_ids: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Compute routing using fused top-k with bias."""
# The topk kernel dispatches dtype based on topk_ids (set by

Collaborator

Is this logic only applicable to this router?

Member Author

This is a targeted fix for an issue I ran into with the hash routing layers (the first few layers of DSv4).

I'm inclined to leave this here as a special case, but do you think we should generalize it to other routers?

Collaborator

I'm fine with a spot fix but I think it would be simple enough to move the code to base_router.py.

Would type mismatches lead to a crash for other routers if the fix stays here?

@bnellnm bnellnm left a comment

Collaborator

LGTM. Just had a few questions.

 - Rev DeepEP
 - Envs changes
 - Disable DBO for deepepv2 (it doesn't work yet)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026
@mergify

mergify Bot commented Jun 8, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 8, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
#	vllm/model_executor/layers/fused_moe/layer.py
@mergify mergify Bot removed the needs-rebase label Jun 8, 2026
The topk kernel dispatches on topk_ids dtype and assumes
input_tokens/hash_indices_table match. Move the cast from the
FusedTopKBiasRouter method into fused_topk_bias() so any caller
gets it automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
tlrmchlsmth and others added 2 commits June 8, 2026 16:48
- Fix invalid kwarg `intermediate_size_per_partition` → `intermediate_size`
  in test_deepep_v2_moe.py
- Replace `FusedMoE.make_expert_params_mapping()` with the standalone
  `fused_moe_make_expert_params_mapping()` in XPU model/mtp (FusedMoE is
  now a function, not a class)
- Collapse multi-line if to single line in fused_topk_bias_router.py

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
#	vllm/models/deepseek_v4/xpu/mtp.py
@WoosukKwon WoosukKwon merged commit e2f993d into vllm-project:main Jun 9, 2026
96 of 97 checks passed
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Jun 9, 2026
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

6 participants