[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1#40398
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request updates the engine configuration in vllm/v1/engine/utils.py to append the DP rank to the instance_id, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial startup, feedback indicates that similar logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.
| if dp_size > 1: | ||
| # Append the DP rank to instance_id so that per-engine | ||
| # identifiers (e.g. Ray actor names in RayExecutorV2) are | ||
| # unique across DP replicas. | ||
| dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}" |
There was a problem hiding this comment.
The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:
- Elastic EP Scale-up: In
CoreEngineActorManager.scale_up_elastic_ep(around line 766), new engines are launched but theirinstance_idis not updated with the new DP rank. Additionally, thekv_transfer_config.engine_idupdate (present in__init__at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster usingRayExecutorV2or KV transfer. - Multiprocessing DP Path: In
vllm/v1/engine/core.py::run_engine_core(around line 1083), thevllm_configis modified forkv_transfer_config, butinstance_idis not updated. Ifdata_parallel_backend="mp"is used in conjunction withRayExecutorV2, collisions will occur.
To ensure full coverage and consistency, please apply the instance_id update in these locations as well, using the global DP rank (rank and dp_rank respectively). You should also fix the missing kv_transfer_config update in scale_up_elastic_ep.
|
Thanks for the fix! |
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…m-project#40398) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Summary
RayExecutorV2names its TP worker actors asvllm_Worker_{instance_id}[_TP{n}](seevllm/v1/executor/ray_utils.py::build_actor_name). Whendata_parallel_size > 1,CoreEngineActorManager.__init__produces each DP engine'sVllmConfigviacopy.deepcopy(vllm_config), which preserves the originalinstance_idacross all DP replicas.instance_id, every DP engine attempts to create Ray actors with the same names and all but the first crash with:instance_idin each per-engine config copy, matching the existing precedent that does the same forkv_transfer_config.engine_idin the same function. Gated ondp_size > 1so single-DP deployments are unaffected.Bug only reproduces when
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1(added in #36836); the legacyRayDistributedExecutordoesn't use named Ray actors and is unaffected.Test plan
AI assistance disclosure
AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.