[train] Fix exclude_resources regression for V1 Train + V2 cluster autoscaler by JasonLi1909 · Pull Request #62827 · ray-project/ray

JasonLi1909 · 2026-04-21T19:29:19Z

Summary

#61703 replaced Train's use of exclude_resources with resource requests to Ray Data's V2 Cluster Autoscaler.

This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no ScalingPolicy to register training's footprint with the autoscaler, and the new logic also skipped exclude_resources, so Ray Data no longer accounted for training's CPUs/GPUs and over-booked the cluster.

This PR restores exclude_resources whenever Train V1 is in use, gating the new coordinator path on Train V2 + V2 Autoscaler only.
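To make the over-booking concrete, here is a minimal sketch in plain Python. It is not Ray's implementation: the `Resources` dataclass, the `data_budget` function, and the cluster sizes are hypothetical and chosen only for illustration. It assumes that excluding resources amounts to subtracting training's footprint from what Ray Data may schedule against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for a CPU/GPU resource bundle."""

    cpu: float = 0.0
    gpu: float = 0.0

    def minus(self, other: "Resources") -> "Resources":
        # Clamp at zero so a budget never goes negative.
        return Resources(
            max(self.cpu - other.cpu, 0.0),
            max(self.gpu - other.gpu, 0.0),
        )


def data_budget(cluster: Resources, train: Resources, exclude_train: bool) -> Resources:
    """Resources Ray Data believes it may use, under the assumption above."""
    return cluster.minus(train) if exclude_train else cluster


# Illustrative numbers only.
cluster = Resources(cpu=16, gpu=4)
train = Resources(cpu=8, gpu=4)

with_exclude = data_budget(cluster, train, exclude_train=True)
without_exclude = data_budget(cluster, train, exclude_train=False)
```

With the exclusion, Ray Data plans against 8 CPUs and no GPUs. Without it, Ray Data plans against the full 16 CPUs and 4 GPUs while training already holds 8 CPUs and 4 GPUs, so the combined demand exceeds the cluster.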

Changes

`python/ray/train/_internal/data_config.py`:
- Replace the `not self._is_v2_autoscaler()` gate with `not self._scaling_policy_reserves_train_resources()`.
- Add the `_scaling_policy_reserves_train_resources()` helper.

Matrix

| Train | Cluster autoscaler | `exclude_resources` set? | Why |
|---|---|---|---|
| V1 | V1 | yes | V1 autoscaler subtracts from global limits |
| V1 | V2 | yes (**fixed**) | V1 Train has no coordinator registration path |
| V2 | V1 | yes | V1 autoscaler ignores coordinator registration |
| V2 | V2 | no | `ScalingPolicy` registers with `AutoscalingCoordinator` |
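The matrix reduces to a single boolean rule, sketched below in standalone Python. This is an illustration of the described gate, not the code in `data_config.py`: the `Version` enum and the free functions are hypothetical stand-ins for the private methods named under Changes, and the "old" gate is reconstructed only from the description of the `not self._is_v2_autoscaler()` check.

```python
from enum import Enum


class Version(Enum):
    V1 = 1
    V2 = 2


def scaling_policy_reserves_train_resources(train: Version, autoscaler: Version) -> bool:
    """True only when both Train and the cluster autoscaler are V2."""
    return train is Version.V2 and autoscaler is Version.V2


def should_set_exclude_resources(train: Version, autoscaler: Version) -> bool:
    """New gate: set exclude_resources unless the scaling policy reserves them."""
    return not scaling_policy_reserves_train_resources(train, autoscaler)


def old_should_set_exclude_resources(train: Version, autoscaler: Version) -> bool:
    """Regressed gate as described: only the autoscaler version was checked."""
    return autoscaler is not Version.V2


# Evaluate every cell of the matrix under both gates.
new_matrix = {(t, a): should_set_exclude_resources(t, a) for t in Version for a in Version}
old_matrix = {(t, a): old_should_set_exclude_resources(t, a) for t in Version for a in Version}

# Cells where the two gates disagree.
changed_cells = [cell for cell in new_matrix if new_matrix[cell] != old_matrix[cell]]
```

Under these assumptions the two gates disagree in exactly one cell, Train V1 with the V2 autoscaler, which is the row marked as fixed.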

Tests

`test_v1_train_with_v2_data_autoscaler_sets_exclude_resources`: new regression test for Train V1 + V2 Cluster Autoscaler

…toscaler

PR ray-project#61703 gated `DataConfig.configure()`'s `exclude_resources` augmentation on the Ray Data cluster autoscaler version. Under V1 Train + V2 cluster autoscaler, that skips `exclude_resources` AND there is no `ScalingPolicy` to register training resources with the `AutoscalingCoordinator` (V1 Train has no scaling policy). Ray Data then over-books the cluster and can deadlock when multiple datasets run concurrently with training.

Change the gate to require both V2 Train AND the V2 cluster autoscaler, which is the only combination that wires coordinator registration end-to-end. Any other cell in the matrix must still set `exclude_resources`.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

gemini-code-assist

Code Review

This pull request updates the resource reservation logic in Ray Train's DataConfig to ensure that training resources are correctly excluded from Ray Data's resource pool when using Ray Train V1 in conjunction with the V2 cluster autoscaler. It introduces the _scaling_policy_reserves_train_resources method to determine if resource reservation is handled by the scaling policy or if it must be manually configured via exclude_resources. Additionally, a regression test was added to verify this behavior. Feedback indicates that the current check for Ray Train V2 enablement relies on a global environment variable, which may incorrectly skip resource reservation for V1 trainers if the V2 flag is active, potentially leading to deadlocks.
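The concern in the last sentence can be illustrated with a small sketch. Everything here is hypothetical: the flag name `EXAMPLE_TRAIN_V2_ENABLED` and both functions are invented for illustration and are not Ray's actual environment variable or helper. The sketch only contrasts a process-wide flag with a check based on which trainer is actually running.

```python
import os
from typing import Mapping, Optional

# Hypothetical flag name, used only for this illustration.
FLAG = "EXAMPLE_TRAIN_V2_ENABLED"


def is_train_v2_from_env(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Process-wide check: says nothing about the trainer instance in use."""
    env = os.environ if environ is None else environ
    return env.get(FLAG, "0") == "1"


def reserves_via_policy_global(
    is_v2_autoscaler: bool, environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Gate driven by the global flag."""
    return is_train_v2_from_env(environ) and is_v2_autoscaler


def reserves_via_policy_explicit(trainer_is_v2: bool, is_v2_autoscaler: bool) -> bool:
    """Gate driven by the trainer that is actually running."""
    return trainer_is_v2 and is_v2_autoscaler


# Scenario from the review: a V1 trainer in a process where the V2 flag is on.
env = {FLAG: "1"}
global_says_reserved = reserves_via_policy_global(True, env)
explicit_says_reserved = reserves_via_policy_explicit(False, True)
```

In this scenario the flag-based gate reports that the scaling policy reserves training resources, so `exclude_resources` would be skipped, while the explicit gate reports that it does not. Whether a V1 trainer can actually run with the V2 flag active is not settled in this thread.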

andrewsykim · 2026-05-07T14:57:42Z

@JasonLi1909 is this worth backporting to a 2.55 patch version?

matthewdeng · 2026-05-07T19:06:13Z

@andrewsykim yes that would be good.

JasonLi1909 requested a review from a team as a code owner April 21, 2026 19:29

gemini-code-assist Bot reviewed Apr 21, 2026

Comment thread python/ray/train/_internal/data_config.py

matthewdeng approved these changes Apr 21, 2026

matthewdeng enabled auto-merge (squash) April 21, 2026 20:22

github-actions Bot added the `go` label (add ONLY when ready to merge, run all tests) Apr 21, 2026

Merge branch 'master' into jasonli/fix-v1-train-exclude-resources-regression (29db8d0)

github-actions Bot disabled auto-merge April 21, 2026 21:36

matthewdeng merged commit 971180d into ray-project:master Apr 21, 2026
6 checks passed

andrewsykim added the `backport-candidate` label (Label to identify PRs that should be considered for backport to older versions.) May 13, 2026

andrewsykim added the `backport-approved` label (Label indicating a backport is approved.) and removed the `backport-candidate` label May 29, 2026

matthewdeng merged 2 commits into ray-project:master from JasonLi1909:jasonli/fix-v1-train-exclude-resources-regression
