[train] Fix exclude_resources regression for V1 Train + V2 cluster autoscaler#62827
Conversation
…toscaler PR ray-project#61703 gated `DataConfig.configure()`'s exclude_resources augmentation on the Ray Data cluster autoscaler version. Under V1 Train + V2 cluster autoscaler, that skips exclude_resources AND there is no ScalingPolicy to register training resources with the AutoscalingCoordinator (V1 Train has no scaling policy). Ray Data then over-books the cluster and can deadlock when multiple datasets run concurrently with training. Change the gate to require both V2 Train AND the V2 cluster autoscaler, which is the only combination that wires coordinator registration end-to-end. Any other cell in the matrix must still set exclude_resources. Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request updates the resource reservation logic in Ray Train's DataConfig to ensure that training resources are correctly excluded from Ray Data's resource pool when using Ray Train V1 in conjunction with the V2 cluster autoscaler. It introduces the _scaling_policy_reserves_train_resources method to determine if resource reservation is handled by the scaling policy or if it must be manually configured via exclude_resources. Additionally, a regression test was added to verify this behavior. Feedback indicates that the current check for Ray Train V2 enablement relies on a global environment variable, which may incorrectly skip resource reservation for V1 trainers if the V2 flag is active, potentially leading to deadlocks.
…toscaler (ray-project#62827) ## Summary ray-project#61703 replaced Train's use of `exclude_resources` with resource requests to Ray Data's V2 Cluster Autoscaler. This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no ScalingPolicy to register training's footprint with the autoscaler, and the new logic also skipped `exclude_resources`, so Ray Data no longer accounted for training's CPUs/GPUs and over-booked the cluster. This PR restores `exclude_resources` whenever Train V1 is in use, gating the new coordinator path on Train V2 + V2 Autoscaler only. ## Changes - `python/ray/train/_internal/data_config.py`: - Replace the `not self._is_v2_autoscaler()` gate with `not self._scaling_policy_reserves_train_resources()`. - Add `_scaling_policy_reserves_train_resources()` helper ### Matrix | Train | Cluster autoscaler | `exclude_resources` set? | Why | |---|---|---|---| | V1 | V1 | yes | V1 autoscaler subtracts from global limits | | V1 | V2 | yes (**fixed**) | V1 Train has no coordinator registration path | | V2 | V1 | yes | V1 autoscaler ignores coordinator registration | | V2 | V2 | no | `ScalingPolicy` registers with `AutoscalingCoordinator` | ## Tests - `test_v1_train_with_v2_data_autoscaler_sets_exclude_resources` — new regression test for Train V1 + V2 Cluster Autoscaler Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
|
@JasonLi1909 is this worth backporting to a 2.55 patch version? |
|
@andrewsykim yes that would be good. |
…toscaler (ray-project#62827) ## Summary ray-project#61703 replaced Train's use of `exclude_resources` with resource requests to Ray Data's V2 Cluster Autoscaler. This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no ScalingPolicy to register training's footprint with the autoscaler, and the new logic also skipped `exclude_resources`, so Ray Data no longer accounted for training's CPUs/GPUs and over-booked the cluster. This PR restores `exclude_resources` whenever Train V1 is in use, gating the new coordinator path on Train V2 + V2 Autoscaler only. ## Changes - `python/ray/train/_internal/data_config.py`: - Replace the `not self._is_v2_autoscaler()` gate with `not self._scaling_policy_reserves_train_resources()`. - Add `_scaling_policy_reserves_train_resources()` helper ### Matrix | Train | Cluster autoscaler | `exclude_resources` set? | Why | |---|---|---|---| | V1 | V1 | yes | V1 autoscaler subtracts from global limits | | V1 | V2 | yes (**fixed**) | V1 Train has no coordinator registration path | | V2 | V1 | yes | V1 autoscaler ignores coordinator registration | | V2 | V2 | no | `ScalingPolicy` registers with `AutoscalingCoordinator` | ## Tests - `test_v1_train_with_v2_data_autoscaler_sets_exclude_resources` — new regression test for Train V1 + V2 Cluster Autoscaler Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Summary
#61703 replaced Train's use of
exclude_resourceswith resource requests to Ray Data's V2 Cluster Autoscaler.This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no ScalingPolicy to register training's footprint with the autoscaler, and the new logic also skipped
exclude_resources, so Ray Data no longer accounted for training's CPUs/GPUs and over-booked the cluster.This PR restores
exclude_resourceswhenever Train V1 is in use, gating the new coordinator path on Train V2 + V2 Autoscaler only.Changes
python/ray/train/_internal/data_config.py:not self._is_v2_autoscaler()gate withnot self._scaling_policy_reserves_train_resources()._scaling_policy_reserves_train_resources()helperMatrix
exclude_resourcesset?ScalingPolicyregisters withAutoscalingCoordinatorTests
test_v1_train_with_v2_data_autoscaler_sets_exclude_resources— new regression test for Train V1 + V2 Cluster Autoscaler