[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 #40398

Merged: vllm-bot merged 2 commits into vllm-project:main from tomeras91:fix-ray-v2-dp-instance-id-collision on May 3, 2026

@tomeras91 tomeras91 commented Apr 20, 2026

Member

Summary

  • RayExecutorV2 names its TP worker actors as vllm_Worker_{instance_id}[_TP{n}] (see vllm/v1/executor/ray_utils.py::build_actor_name). When data_parallel_size > 1, CoreEngineActorManager.__init__ produces each DP engine's VllmConfig via copy.deepcopy(vllm_config), which preserves the original instance_id across all DP replicas.
  • With a single shared instance_id, every DP engine attempts to create Ray actors with the same names and all but the first crash with:
    ray.exceptions.ActorAlreadyExistsError: Actor with name
    'vllm_Worker_<id>_TP0' already exists in the namespace ...
    
  • Fix: append the global DP rank to instance_id in each per-engine config copy, matching the existing precedent that does the same for kv_transfer_config.engine_id in the same function. Gated on dp_size > 1 so single-DP deployments are unaffected.
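
The collision and the fix described above can be reproduced without Ray. The following is a minimal sketch, not vLLM code: `ToyConfig` is a hypothetical stand-in for `VllmConfig`, `build_actor_name` is an illustrative re-creation of the `vllm_Worker_{instance_id}[_TP{n}]` pattern, and `per_engine_configs` only mimics the deep-copy loop in `CoreEngineActorManager.__init__`.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for VllmConfig; only instance_id matters here."""
    instance_id: str


def build_actor_name(instance_id: str, tp_rank: Optional[int] = None) -> str:
    """Illustrative re-creation of the vllm_Worker_{instance_id}[_TP{n}] pattern."""
    name = f"vllm_Worker_{instance_id}"
    return name if tp_rank is None else f"{name}_TP{tp_rank}"


def per_engine_configs(base: ToyConfig, dp_size: int, suffix: bool) -> list:
    """Deep-copy the base config once per DP engine, optionally applying the fix."""
    configs = []
    for index in range(dp_size):
        cfg = copy.deepcopy(base)  # deepcopy preserves instance_id verbatim
        if suffix and dp_size > 1:
            # The fix: make instance_id unique per DP rank.
            cfg.instance_id = f"{cfg.instance_id}_dp{index}"
        configs.append(cfg)
    return configs


def actor_names(configs: list, tp_size: int) -> list:
    """Names every DP engine would request for its TP worker actors."""
    return [
        build_actor_name(cfg.instance_id, tp)
        for cfg in configs
        for tp in range(tp_size)
    ]


base = ToyConfig(instance_id="abc12")
before = actor_names(per_engine_configs(base, dp_size=2, suffix=False), tp_size=2)
after = actor_names(per_engine_configs(base, dp_size=2, suffix=True), tp_size=2)

print("without fix:", before)  # names repeat across DP engines
print("with fix:   ", after)   # every name is distinct
```

Without the suffix, both DP engines ask for the same `_TP0` and `_TP1` names, which is the condition under which Ray rejects the second named actor. With the suffix, the four names are distinct, and a `dp_size` of 1 leaves the name untouched.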

The bug reproduces only when VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 is set (added in #36836); the legacy RayDistributedExecutor does not use named Ray actors and is unaffected.

Test plan

  • Reproduced on Nemotron-Super NVFP4, TP=2, DP=32, 16-node GB200 cluster with `VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1` + `VLLM_RAY_DP_PACK_STRATEGY=strict` — server previously crashed during actor creation with `ActorAlreadyExistsError`.
  • With this patch, all 32 DP engines start and the server serves requests normally.
  • No behavior change when DP=1 (guarded by `if dp_size > 1`).
  • Existing unit tests in `tests/distributed/test_ray_v2_executor*.py` still pass.

AI assistance disclosure

AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from njhill as a code owner April 20, 2026 19:58

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 and bug labels Apr 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the engine configuration in vllm/v1/engine/utils.py to append the DP rank to the instance_id, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial startup, feedback indicates that similar logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.

Comment thread vllm/v1/engine/utils.py
Comment on lines +391 to +395
    if dp_size > 1:
        # Append the DP rank to instance_id so that per-engine
        # identifiers (e.g. Ray actor names in RayExecutorV2) are
        # unique across DP replicas.
        dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"

Contributor

Priority: high

The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

  1. Elastic EP Scale-up: In CoreEngineActorManager.scale_up_elastic_ep (around line 766), new engines are launched but their instance_id is not updated with the new DP rank. Additionally, the kv_transfer_config.engine_id update (present in __init__ at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster using RayExecutorV2 or KV transfer.
  2. Multiprocessing DP Path: In vllm/v1/engine/core.py::run_engine_core (around line 1083), the vllm_config is modified for kv_transfer_config, but instance_id is not updated. If data_parallel_backend="mp" is used in conjunction with RayExecutorV2, collisions will occur.

To ensure full coverage and consistency, please apply the instance_id update in these locations as well, using the global DP rank (rank and dp_rank respectively). You should also fix the missing kv_transfer_config update in scale_up_elastic_ep.
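
One way to keep the three paths consistent is to route them through a single helper. The sketch below is only an illustration under stated assumptions: `ToyConfig`, `ToyKVTransferConfig`, and `apply_dp_rank_suffix` are hypothetical names that do not exist in vLLM, and the `_dp{rank}` suffix on `engine_id` is assumed to mirror the one this PR uses for `instance_id`.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    """Hypothetical stand-in for the KV transfer config; only engine_id is modeled."""
    engine_id: str


@dataclass
class ToyConfig:
    """Hypothetical stand-in for VllmConfig."""
    instance_id: str
    kv_transfer_config: Optional[ToyKVTransferConfig] = None


def apply_dp_rank_suffix(cfg: ToyConfig, dp_rank: int, dp_size: int) -> ToyConfig:
    """Hypothetical shared helper: make per-engine identifiers unique per DP rank.

    Intended to be called on a fresh per-engine copy exactly once; calling it
    twice on the same object would append the suffix twice.
    """
    if dp_size <= 1:
        return cfg  # single-DP deployments keep their identifiers unchanged
    cfg.instance_id = f"{cfg.instance_id}_dp{dp_rank}"
    if cfg.kv_transfer_config is not None:
        # Assumed suffix format, chosen to match instance_id above.
        kv = cfg.kv_transfer_config
        kv.engine_id = f"{kv.engine_id}_dp{dp_rank}"
    return cfg


base = ToyConfig("abc12", ToyKVTransferConfig("kv0"))

# Initial startup: DP ranks 0 and 1.
initial = [apply_dp_rank_suffix(copy.deepcopy(base), r, 2) for r in range(2)]

# Simulated scale-up from 2 to 4 engines: only the new ranks 2 and 3 are created.
scaled = [apply_dp_rank_suffix(copy.deepcopy(base), r, 4) for r in range(2, 4)]

instance_ids = [c.instance_id for c in initial + scaled]
engine_ids = [c.kv_transfer_config.engine_id for c in initial + scaled]
print(instance_ids)
print(engine_ids)
```

Because the helper takes the global DP rank as an argument, the same call would serve the initial startup loop, a scale-up path, and a per-process entry point, provided each caller passes the rank it already knows.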

@jeffreywang88

Contributor

Thanks for the fix!

@tomeras91 tomeras91 added the ready label May 3, 2026
@tomeras91 tomeras91 enabled auto-merge (squash) May 3, 2026 09:50
@vllm-bot vllm-bot merged commit cb03fee into vllm-project:main May 3, 2026
52 of 54 checks passed
@tomeras91 tomeras91 deleted the fix-ray-v2-dp-instance-id-collision branch May 3, 2026 20:22
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request May 4, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
MingqiWang-coder added a commit to vLLM-HUST/vllm-hust that referenced this pull request Jun 30, 2026
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main
(2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner,
worker, attention, KV cache, compilation, and structured output fixes.

Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252
Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726
vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549
vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709
vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808
vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195
vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673
Runner fix (2): vllm-project#44568 vllm-project#44603

Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU)

Conflict resolutions:
- Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560
- Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195
- Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982

Co-authored-by: GitHub Copilot
Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
MingqiWang-coder added a commit to vLLM-HUST/vllm-hust that referenced this pull request Jul 2, 2026
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main
(2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner,
worker, attention, KV cache, compilation, and structured output fixes.

Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252
Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726
Runner fix (2): vllm-project#44568 vllm-project#44603

Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU)

Conflict resolutions:
- Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560
- Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195
- Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982

Co-authored-by: GitHub Copilot
Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Labels

bug, ready, v1

4 participants