[WideEP] Integrate DeepEP v2 #41183
Code Review
This pull request adds support for the DeepEP v2 (ElasticBuffer) all2all backend, including a new DeepEPV2PrepareAndFinalize implementation for MoE kernels, a dedicated All2AllManager, and associated configuration and environment variables. A comprehensive test suite for DeepEP v2 MoE is also introduced. Feedback identifies a critical issue where a strict bfloat16 assertion in the finalization logic would cause crashes for float16 models, recommending a cast to bfloat16 instead to maintain compatibility.
Add a new `deepep_v2` all2all backend that uses the DeepEP v2 ElasticBuffer API (NCCL GIN backend). This provides a unified dispatch/combine interface that works for both intra-node and inter-node expert parallelism with analytical SM calculation.

Key changes:
- New DeepEPV2PrepareAndFinalize class using do_expand=True for per-expert-contiguous layout with weighted reduction in combine
- DeepEPV2All2AllManager with ElasticBuffer handle caching and theoretical SM calculation via get_theoretical_num_sms()
- NCCL >= 2.30.4 version gating in has_deep_ep_v2(), since the GIN backend requires a newer NCCL than PyTorch typically bundles
- FP8 block-quantized dispatch support
- DBO (micro-batching) support with async prepare/finalize
- Environment variables: VLLM_DEEPEP_V2_ALLOW_HYBRID_MODE, VLLM_DEEPEP_V2_PREFER_OVERLAP, VLLM_DEEPEP_V2_ALLOW_MULTIPLE_REDUCTION
- Update DeepEP install script to pin v2.0 release (b306af06af)
- Comprehensive multi-process test suite

Usage: --all2all-backend=deepep_v2 --enable-expert-parallel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
48622cb to a2a4b00
use_fp8_dispatch requires the ElasticBuffer to receive FP8 input. In production, this is ensured by pre-quantizing via moe_kernel_quantize_input when is_block_quantized=True. The test was parametrizing use_fp8_dispatch independently of dtype, allowing bf16 input with use_fp8_dispatch=True, which triggers a buffer size assertion in DeepEP v2.

Fix:
- Derive use_fp8_dispatch from dtype (True only for FP8 weights)
- Add block_shape=[128, 128] to quant config for FP8 to enable the block quantization path that pre-quantizes input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Test DeepEPV2All2AllManager init, ElasticBuffer handle creation and caching, SM calculation, and destroy/re-create cycle. Skipped when DeepEP v2 or NCCL >= 2.30.4 is not available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
torch.cuda.nccl.version() returns the compile-time NCCL version baked into the PyTorch wheel, not the runtime library. Use ctypes to load the actual libnccl.so and call ncclGetVersion() directly, which respects VLLM_NCCL_SO_PATH and LD_LIBRARY_PATH. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
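The runtime check described in this commit can be sketched roughly as follows. This is an illustration, not the code in the PR: the helper names `decode_nccl_version` and `runtime_nccl_version`, the `libnccl.so.2` fallback name, and the `MIN_NCCL` constant (taken from the `nvidia-nccl-cu13>=2.30.4` requirement in the PR notes) are assumptions. `ncclGetVersion(int*)` is the NCCL C API entry point, and it returns a packed integer version code.

```python
from __future__ import annotations

import ctypes
import os

# Assumed minimum, mirroring the nvidia-nccl-cu13>=2.30.4 pin in the PR notes.
MIN_NCCL = (2, 30, 4)


def decode_nccl_version(code: int) -> tuple[int, int, int]:
    """Split NCCL's packed integer version into (major, minor, patch)."""
    # NCCL 2.9 and newer pack the version as major*10000 + minor*100 + patch;
    # older releases used major*1000 + minor*100 + patch.
    if code >= 10000:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100


def runtime_nccl_version(so_path: str | None = None) -> tuple[int, int, int] | None:
    """Return the version of the NCCL library actually loaded, or None."""
    # Hypothetical lookup order: explicit path, VLLM_NCCL_SO_PATH, then the
    # dynamic loader's search path (which honors LD_LIBRARY_PATH).
    so_path = so_path or os.environ.get("VLLM_NCCL_SO_PATH") or "libnccl.so.2"
    try:
        lib = ctypes.CDLL(so_path)
        get_version = lib.ncclGetVersion
    except (OSError, AttributeError):
        return None
    get_version.restype = ctypes.c_int
    get_version.argtypes = [ctypes.POINTER(ctypes.c_int)]
    version = ctypes.c_int()
    if get_version(ctypes.byref(version)) != 0:  # 0 is ncclSuccess
        return None
    return decode_nccl_version(version.value)


def nccl_is_new_enough(version: tuple[int, int, int] | None) -> bool:
    """Tuple comparison gives the usual major/minor/patch ordering."""
    return version is not None and version >= MIN_NCCL
```

Returning `None` when the library cannot be loaded lets a caller treat "NCCL missing" and "NCCL too old" the same way, for example to skip a test.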
5f78797 to 75149ae
- Remove explicit two-stream DBO switching (dbo_yield_and_switch_*), use synchronous dispatch/combine (async_with_compute_stream=False). The ElasticBuffer handles comm internally on its comm_stream.
- Switch from do_expand=True to do_expand=False for cudagraph compat. do_expand=True requires do_cpu_sync=True (CPU polling loop), which can't be captured in a cudagraph. do_expand=False with do_cpu_sync=False is fully capturable.
- Handle worst-case padding from do_cpu_sync=False: use handle.psum_num_recv_tokens_per_scaleup_rank to get the real token count, and zero out padding rows in recv_x, recv_topk_weights, and expert_x_scale.
- Add explicitly_destroy=True to ElasticBuffer creation in all2all.py.
- Add cudagraph capture/replay unit test (test_deep_ep_v2_moe_cudagraph).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Document the four key design decisions (do_expand=False, do_cpu_sync=False, async_with_compute_stream=False, expert_tokens_meta=None) and why each is necessary for cudagraph + DBO compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Prefill (use_cudagraph=False): do_expand=True + do_cpu_sync=True. Exact memory allocation, per-expert-contiguous layout. Saves GPU memory for large batches.

Decode (use_cudagraph=True): do_expand=False + do_cpu_sync=False. Worst-case allocation, scattered layout. Fully cudagraph-capturable.

Mode selected based on enforce_eager config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
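As a rough illustration of the two modes in this commit, the selection could look like the sketch below. The `DispatchMode` dataclass and `select_dispatch_mode` helper are hypothetical names, and deriving `use_cudagraph` as the negation of `enforce_eager` is an assumption based on the commit message rather than the PR's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchMode:
    """Flag pair passed to dispatch (hypothetical container)."""

    do_expand: bool
    do_cpu_sync: bool


def select_dispatch_mode(enforce_eager: bool) -> DispatchMode:
    # Assumption: cudagraphs are in use exactly when eager mode is not enforced.
    use_cudagraph = not enforce_eager
    if use_cudagraph:
        # Worst-case allocation, scattered layout; no CPU polling loop,
        # so the dispatch can be captured in a cudagraph.
        return DispatchMode(do_expand=False, do_cpu_sync=False)
    # Exact allocation, per-expert-contiguous layout; needs a CPU sync,
    # so it only works outside cudagraph capture.
    return DispatchMode(do_expand=True, do_cpu_sync=True)
```

Keeping the two flags in one immutable value makes it harder to end up with the unsupported combination of do_expand=True and do_cpu_sync=False, which the earlier commit notes cannot work.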
    input_ids: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Compute routing using fused top-k with bias."""
    # The topk kernel dispatches dtype based on topk_ids (set by
Is this logic only applicable to this router?
This is a targeted fix that I ran into for the hash routing layers (first few layers of DSv4)
I'm inclined to leave this here as a special case but do you think we should generalize to other routers?
I'm fine with a spot fix but I think it would be simple enough to move the code to base_router.py.
Would type mismatches lead to a crash for other routers if the fix stays here?
bnellnm left a comment
LGTM. Just had a few questions.
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> # Conflicts: # vllm/model_executor/layers/fused_moe/layer.py
The topk kernel dispatches on topk_ids dtype and assumes input_tokens/hash_indices_table match. Move the cast from the FusedTopKBiasRouter method into fused_topk_bias() so any caller gets it automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
- Fix invalid kwarg `intermediate_size_per_partition` → `intermediate_size` in test_deepep_v2_moe.py
- Replace `FusedMoE.make_expert_params_mapping()` with the standalone `fused_moe_make_expert_params_mapping()` in XPU model/mtp (FusedMoE is now a function, not a class)
- Collapse multi-line if to single line in fused_topk_bias_router.py

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> # Conflicts: # vllm/models/deepseek_v4/xpu/mtp.py
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
Integrating deepseek-ai/DeepEP#605
Unit tests are passing, and it is working e2e. GSM8k results look good; I will follow up with more thorough e2e tests.
Notes:
uv pip install "nvidia-nccl-cu13>=2.30.4"
I'm using this Containerfile for now: https://github.com/tlrmchlsmth/j-llm-d/blob/deepep-v2-dev-image/dev/Containerfile.deepep-v2