[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 by tomeras91 · Pull Request #40398 · vllm-project/vllm

tomeras91 · 2026-04-20T19:58:40Z

Summary

RayExecutorV2 names its TP worker actors as vllm_Worker_{instance_id}[_TP{n}] (see vllm/v1/executor/ray_utils.py::build_actor_name). When data_parallel_size > 1, CoreEngineActorManager.__init__ produces each DP engine's VllmConfig via copy.deepcopy(vllm_config), which preserves the original instance_id across all DP replicas.
With a single shared instance_id, every DP engine attempts to create Ray actors with the same names and all but the first crash with:
```
ray.exceptions.ActorAlreadyExistsError: Actor with name
'vllm_Worker_<id>_TP0' already exists in the namespace ...
```
Fix: append the global DP rank to instance_id in each per-engine config copy, matching the existing precedent that does the same for kv_transfer_config.engine_id in the same function. Gated on dp_size > 1 so single-DP deployments are unaffected.

Bug only reproduces when VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 (added in #36836); the legacy RayDistributedExecutor doesn't use named Ray actors and is unaffected.
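The collision and the fix can be reproduced in isolation. The sketch below is illustrative only: `ToyConfig` and `toy_actor_name` are hypothetical stand-ins for `VllmConfig` and `build_actor_name`, and the name pattern is taken from the summary above (the `_TP{n}` suffix is always appended here for simplicity). It is not the vLLM implementation.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for VllmConfig; only instance_id is modeled."""
    instance_id: str


def toy_actor_name(instance_id: str, tp_rank: int) -> str:
    # Simplified stand-in for build_actor_name, following the pattern
    # quoted in the summary: vllm_Worker_{instance_id}_TP{n}.
    return f"vllm_Worker_{instance_id}_TP{tp_rank}"


def per_engine_configs(base: ToyConfig, dp_size: int, fix: bool) -> list[ToyConfig]:
    configs = []
    for index in range(dp_size):
        cfg = copy.deepcopy(base)  # deepcopy preserves instance_id
        if fix and dp_size > 1:
            # Same shape as the patch: suffix the id with the DP rank.
            cfg.instance_id = f"{cfg.instance_id}_dp{index}"
        configs.append(cfg)
    return configs


def actor_names(configs: list[ToyConfig], tp_size: int) -> list[str]:
    return [toy_actor_name(c.instance_id, tp) for c in configs for tp in range(tp_size)]


base = ToyConfig(instance_id="abc12")
before = actor_names(per_engine_configs(base, dp_size=2, fix=False), tp_size=2)
after = actor_names(per_engine_configs(base, dp_size=2, fix=True), tp_size=2)

print(len(before), len(set(before)))  # 4 names requested, only 2 distinct
print(len(after), len(set(after)))    # 4 names requested, 4 distinct
```

Without the suffix, the second DP engine requests names that the first engine has already registered, which is the condition under which Ray raises `ActorAlreadyExistsError` for named actors in a shared namespace.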

Test plan

- Reproduced on Nemotron-Super NVFP4, TP=2, DP=32, 16-node GB200 cluster with `VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1` + `VLLM_RAY_DP_PACK_STRATEGY=strict`: the server previously crashed during actor creation with `ActorAlreadyExistsError`.
- With this patch, all 32 DP engines start and the server serves requests normally.
- No behavior change when DP=1 (guarded by `if dp_size > 1`).
- Existing unit tests in `tests/distributed/test_ray_v2_executor*.py` still pass.

AI assistance disclosure

AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request updates the engine configuration in vllm/v1/engine/utils.py to append the DP rank to the instance_id, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial startup, feedback indicates that similar logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.

gemini-code-assist · 2026-04-20T20:00:13Z

```
+            if dp_size > 1:
+                # Append the DP rank to instance_id so that per-engine
+                # identifiers (e.g. Ray actor names in RayExecutorV2) are
+                # unique across DP replicas.
+                dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
```

The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

Elastic EP Scale-up: In CoreEngineActorManager.scale_up_elastic_ep (around line 766), new engines are launched but their instance_id is not updated with the new DP rank. Additionally, the kv_transfer_config.engine_id update (present in __init__ at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster using RayExecutorV2 or KV transfer.

Multiprocessing DP Path: In vllm/v1/engine/core.py::run_engine_core (around line 1083), the vllm_config is modified for kv_transfer_config, but instance_id is not updated. If data_parallel_backend="mp" is used in conjunction with RayExecutorV2, collisions will occur.

To ensure full coverage and consistency, please apply the instance_id update in these locations as well, using the global DP rank (rank and dp_rank respectively). You should also fix the missing kv_transfer_config update in scale_up_elastic_ep.
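One way to keep the three code paths consistent would be to route them through a single helper. The sketch below assumes a hypothetical function `dp_scoped_instance_id` (it does not exist in vLLM) and only models the string handling; whether the scale-up and multiprocessing paths actually need the change is the reviewer's claim and is not verified here.

```python
def dp_scoped_instance_id(base_id: str, dp_rank: int, dp_size: int) -> str:
    """Hypothetical helper: suffix base_id with the global DP rank when DP > 1."""
    if dp_size <= 1:
        return base_id  # single-DP deployments keep the original id
    return f"{base_id}_dp{dp_rank}"


# Initial startup: ranks 0 and 1 of a DP=2 deployment.
initial = [dp_scoped_instance_id("abc12", rank, 2) for rank in range(2)]

# Scale-up to DP=4: only the new ranks 2 and 3 are launched;
# the existing engines keep the ids they already have.
new_dp_size = 4
added = [
    dp_scoped_instance_id("abc12", rank, new_dp_size)
    for rank in range(2, new_dp_size)
]

all_ids = initial + added
print(all_ids)
```

Because the suffix is derived from the global DP rank rather than a local index, engines added later cannot reuse an id held by an engine that is already running.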

jeffreywang88 · 2026-04-30T23:18:27Z

Thanks for the fix!

Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes.

Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252

Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673

Runner fix (2): vllm-project#44568 vllm-project#44603

Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU)

Conflict resolutions:
- Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560
- Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195
- Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982

Co-authored-by: GitHub Copilot
Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>

Fix RayExecutorV2 actor name collision with DP > 1

5048dcb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

tomeras91 requested a review from njhill as a code owner April 20, 2026 19:58

claude Bot reviewed Apr 20, 2026

mergify Bot added the v1 and bug labels Apr 20, 2026

gemini-code-assist Bot reviewed Apr 20, 2026

jeffreywang88 approved these changes Apr 30, 2026

tomeras91 added the ready label May 3, 2026

Merge branch 'main' into fix-ray-v2-dp-instance-id-collision

8e3c9d6

tomeras91 enabled auto-merge (squash) May 3, 2026 09:50

mgoin approved these changes May 3, 2026

vllm-bot merged commit cb03fee into vllm-project:main May 3, 2026
52 of 54 checks passed

tomeras91 deleted the fix-ray-v2-dp-instance-id-collision branch May 3, 2026 20:22

chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026

[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (vll…

bc18774

…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (vll…

12e06df

…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (vll…

7b154ab

…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (vll…

a49930f

…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (vll…

7185705

…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

MingqiWang-coder mentioned this pull request Jul 1, 2026

[Sync] Upstream V1 engine core — 89 PRs (bugfix, scheduler, runner, worker, hardware) vLLM-HUST/vllm-hust#82

Open
