Skip to content

[train] Fix exclude_resources regression for V1 Train + V2 cluster autoscaler#62827

Merged
matthewdeng merged 2 commits into
ray-project:masterfrom
JasonLi1909:jasonli/fix-v1-train-exclude-resources-regression
Apr 21, 2026
Merged

[train] Fix exclude_resources regression for V1 Train + V2 cluster autoscaler#62827
matthewdeng merged 2 commits into
ray-project:masterfrom
JasonLi1909:jasonli/fix-v1-train-exclude-resources-regression

Conversation

@JasonLi1909

@JasonLi1909 JasonLi1909 commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Summary

#61703 replaced Train's use of exclude_resources with resource requests to Ray Data's V2 Cluster Autoscaler.

This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no ScalingPolicy to register training's footprint with the autoscaler, and the new logic also skipped exclude_resources, so Ray Data no longer accounted for training's CPUs/GPUs and over-booked the cluster.

This PR restores exclude_resources whenever Train V1 is in use, gating the new coordinator path on Train V2 + V2 Autoscaler only.

Changes

  • python/ray/train/_internal/data_config.py:
    • Replace the not self._is_v2_autoscaler() gate with not self._scaling_policy_reserves_train_resources().
    • Add _scaling_policy_reserves_train_resources() helper

Matrix

Train Cluster autoscaler exclude_resources set? Why
V1 V1 yes V1 autoscaler subtracts from global limits
V1 V2 yes (fixed) V1 Train has no coordinator registration path
V2 V1 yes V1 autoscaler ignores coordinator registration
V2 V2 no ScalingPolicy registers with AutoscalingCoordinator

Tests

  • test_v1_train_with_v2_data_autoscaler_sets_exclude_resources — new regression test for Train V1 + V2 Cluster Autoscaler
…toscaler

PR ray-project#61703 gated `DataConfig.configure()`'s exclude_resources augmentation
on the Ray Data cluster autoscaler version. Under V1 Train + V2 cluster
autoscaler, that skips exclude_resources AND there is no ScalingPolicy to
register training resources with the AutoscalingCoordinator (V1 Train has
no scaling policy). Ray Data then over-books the cluster and can deadlock
when multiple datasets run concurrently with training.

Change the gate to require both V2 Train AND the V2 cluster autoscaler,
which is the only combination that wires coordinator registration
end-to-end. Any other cell in the matrix must still set exclude_resources.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner April 21, 2026 19:29

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the resource reservation logic in Ray Train's DataConfig to ensure that training resources are correctly excluded from Ray Data's resource pool when using Ray Train V1 in conjunction with the V2 cluster autoscaler. It introduces the _scaling_policy_reserves_train_resources method to determine if resource reservation is handled by the scaling policy or if it must be manually configured via exclude_resources. Additionally, a regression test was added to verify this behavior. Feedback indicates that the current check for Ray Train V2 enablement relies on a global environment variable, which may incorrectly skip resource reservation for V1 trainers if the V2 flag is active, potentially leading to deadlocks.

Comment thread python/ray/train/_internal/data_config.py
@matthewdeng matthewdeng enabled auto-merge (squash) April 21, 2026 20:22
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Apr 21, 2026
@github-actions github-actions Bot disabled auto-merge April 21, 2026 21:36
@matthewdeng matthewdeng merged commit 971180d into ray-project:master Apr 21, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…toscaler (ray-project#62827)

## Summary

ray-project#61703 replaced Train's use of `exclude_resources` with resource
requests to Ray Data's V2 Cluster Autoscaler.

This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no
ScalingPolicy to register training's footprint with the autoscaler, and
the new logic also skipped `exclude_resources`, so Ray Data no longer
accounted for training's CPUs/GPUs and over-booked the cluster.

This PR restores `exclude_resources` whenever Train V1 is in use, gating
the new coordinator path on Train V2 + V2 Autoscaler only.

## Changes

- `python/ray/train/_internal/data_config.py`:
- Replace the `not self._is_v2_autoscaler()` gate with `not
self._scaling_policy_reserves_train_resources()`.
    - Add `_scaling_policy_reserves_train_resources()` helper
    
### Matrix

| Train | Cluster autoscaler | `exclude_resources` set? | Why |
|---|---|---|---|
| V1 | V1 | yes | V1 autoscaler subtracts from global limits |
| V1 | V2 | yes (**fixed**) | V1 Train has no coordinator registration
path |
| V2 | V1 | yes | V1 autoscaler ignores coordinator registration |
| V2 | V2 | no | `ScalingPolicy` registers with `AutoscalingCoordinator`
|

## Tests

- `test_v1_train_with_v2_data_autoscaler_sets_exclude_resources` — new
regression test for Train V1 + V2 Cluster Autoscaler

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@andrewsykim

Copy link
Copy Markdown
Member

@JasonLi1909 is this worth backporting to a 2.55 patch version?

@matthewdeng

Copy link
Copy Markdown
Contributor

@andrewsykim yes that would be good.

@andrewsykim andrewsykim added the backport-candidate Label to identify PRs that should be considered for backport to older versions. label May 13, 2026
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…toscaler (ray-project#62827)

## Summary

ray-project#61703 replaced Train's use of `exclude_resources` with resource
requests to Ray Data's V2 Cluster Autoscaler.

This regressed the Train V1 + V2 Autoscaler combination: Train V1 has no
ScalingPolicy to register training's footprint with the autoscaler, and
the new logic also skipped `exclude_resources`, so Ray Data no longer
accounted for training's CPUs/GPUs and over-booked the cluster.

This PR restores `exclude_resources` whenever Train V1 is in use, gating
the new coordinator path on Train V2 + V2 Autoscaler only.

## Changes

- `python/ray/train/_internal/data_config.py`:
- Replace the `not self._is_v2_autoscaler()` gate with `not
self._scaling_policy_reserves_train_resources()`.
    - Add `_scaling_policy_reserves_train_resources()` helper
    
### Matrix

| Train | Cluster autoscaler | `exclude_resources` set? | Why |
|---|---|---|---|
| V1 | V1 | yes | V1 autoscaler subtracts from global limits |
| V1 | V2 | yes (**fixed**) | V1 Train has no coordinator registration
path |
| V2 | V1 | yes | V1 autoscaler ignores coordinator registration |
| V2 | V2 | no | `ScalingPolicy` registers with `AutoscalingCoordinator`
|

## Tests

- `test_v1_train_with_v2_data_autoscaler_sets_exclude_resources` — new
regression test for Train V1 + V2 Cluster Autoscaler

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@andrewsykim andrewsykim added backport-approved Label indicating a backport is approved. and removed backport-candidate Label to identify PRs that should be considered for backport to older versions. labels May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-approved Label indicating a backport is approved. go add ONLY when ready to merge, run all tests

3 participants