[train][data] Forward label_selector to AutoscalingCoordinator by liulehui · Pull Request #63287 · ray-project/ray

liulehui · 2026-05-12T00:12:22Z

Description

Previously, when label_selector support was added in "[train] Add bundle_label_selector to ScalingConfig" (#58845), FixedScalingPolicy did not issue any resource_requests. Those were added later in "[train] Register training resources with AutoscalingCoordinator in FixedScalingPolicy" (#61703).
After #61703, FixedScalingPolicy also sends an autoscaling request, but without forwarding the label selectors, because the AutoscalingCoordinator did not support a label selector field.
This caused an issue: the Ray autoscaler sees a bare, unlabeled {"CPU": N} demand alongside the labeled PlacementGroup demand, which can cause it to scale up the wrong worker group. See the repro under "Additional information".
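
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not Ray's autoscaler: the node type names, sizes, labels, and the `pick_node_type` helper are all hypothetical, and the "first node type that fits" rule is an assumption made only to show why an unlabeled demand can resolve to a different worker group than a labeled one.

```python
from typing import Dict, Optional

# Hypothetical node types for illustration only (names, sizes, and labels are assumptions).
NODE_TYPES = {
    "large": {"resources": {"CPU": 8}, "labels": {"size": "large"}},
    "xlarge": {"resources": {"CPU": 16}, "labels": {"size": "xlarge"}},
}


def pick_node_type(
    bundle: Dict[str, float],
    label_selector: Optional[Dict[str, str]] = None,
    node_types: Dict[str, dict] = NODE_TYPES,
) -> Optional[str]:
    """Return the first node type that fits the bundle and matches the selector."""
    selector = label_selector or {}
    for name, spec in node_types.items():
        fits = all(spec["resources"].get(r, 0) >= amt for r, amt in bundle.items())
        matches = all(spec["labels"].get(k) == v for k, v in selector.items())
        if fits and matches:
            return name
    return None


# A bare {"CPU": N} demand carries no selector, so the smaller group is chosen.
unlabeled = pick_node_type({"CPU": 4})
# The same demand with the selector forwarded resolves to the intended group.
labeled = pick_node_type({"CPU": 4}, {"size": "xlarge"})
print(unlabeled, labeled)
```

In this toy model, the two demands for the same job resolve to different worker groups unless the selector travels with the resource request, which is the mismatch this PR is meant to remove.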

Related issues

Related to #63241. Note that this PR does not yet address the fractional resource usage.

Additional information

repro: https://gist.github.com/liulehui/040e4ff57840ffebca48b826ca5a3b00

Before this change, the repro hangs: the autoscaler scales up a large worker instead of an xlarge one, which leaves the PlacementGroup pending.

After this change, it works correctly.

gemini-code-assist

Code Review

This pull request introduces support for label_selectors in the Ray Autoscaling Coordinator, enabling more granular resource requests for Ray Train v2. Key changes include updating the request_resources interface across the base, default, and fake coordinators, implementing selector validation and merging logic in the DefaultAutoscalingCoordinator, and refactoring ScalingConfig to provide per-worker label selectors. Review feedback suggests refactoring a complex lambda function in the _AutoscalingCoordinatorActor constructor into a separate helper function to improve code readability and maintainability.
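
To make the review summary more concrete, here is a minimal sketch of what per-worker selector expansion and selector validation could look like. It is not the code from DefaultAutoscalingCoordinator or ScalingConfig: the function names `expand_per_worker` and `validate_label_selectors` are hypothetical, the specific checks are assumptions, and the merging logic mentioned in the review is not shown.

```python
from typing import Dict, List, Optional, Tuple

Bundle = Dict[str, float]
Selector = Dict[str, str]


def expand_per_worker(
    num_workers: int,
    resources_per_worker: Bundle,
    label_selector: Optional[Selector] = None,
) -> Tuple[List[Bundle], List[Selector]]:
    """Build one bundle and one selector per worker (hypothetical helper)."""
    bundles = [dict(resources_per_worker) for _ in range(num_workers)]
    selectors = [dict(label_selector or {}) for _ in range(num_workers)]
    return bundles, selectors


def validate_label_selectors(
    bundles: List[Bundle],
    label_selectors: Optional[List[Selector]] = None,
) -> List[Selector]:
    """Check that selectors line up one-to-one with bundles (assumed rule)."""
    if label_selectors is None:
        # No selectors given: every bundle is treated as unlabeled.
        return [{} for _ in bundles]
    if len(label_selectors) != len(bundles):
        raise ValueError(
            f"Expected {len(bundles)} label selectors, got {len(label_selectors)}"
        )
    for selector in label_selectors:
        for key, value in selector.items():
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError("Label selector keys and values must be strings")
    return [dict(s) for s in label_selectors]


bundles, selectors = expand_per_worker(2, {"CPU": 4}, {"size": "xlarge"})
checked = validate_label_selectors(bundles, selectors)
print(checked)
```

Keeping the selectors as a list parallel to the bundles is one simple design choice for a sketch like this; the actual interface may differ.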

TimothySeah

LGTM, thanks!

rayhhome

LGTM overall; left some comments on readability

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3d2fa8a890adbdfc298b2e7699517ccf1f80a87c.

TimothySeah

Removing my approval for now due to the _tick loop forwarding that we discussed offline.

TimothySeah

Approving this PR as discussed offline - this doesn't introduce any regressions, and a future PR will make sure that ray data won't schedule tasks on ray-train-requested nodes.

justinvyu

thanks for the fix!

Signed-off-by: Lehui Liu <lehui@anyscale.com>

…roject#63287)

Fix the following issue:

1. previously when we add `label_selector` support in ray-project#58845, there was no `resource_requests` in [`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137), which is added in ray-project#61703
2. after this change, the normal FixScalingPolicy will also send autoscaling request for fixedScalingPolicy, but without forwarding the label selectors, since the AutoscalingCoordinator does not support label selector field.
3. this caused an issue: `ray autoscaler` sees a bare unlabeled `{"CPU": N}` demand along side the **labeled** PlacementGroup demand, which can cause it to scale up the wrong worker group, see a repro in the additional information.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

liulehui requested review from a team as code owners May 12, 2026 00:12

liulehui added the "go" label (add ONLY when ready to merge, run all tests) May 12, 2026

gemini-code-assist Bot reviewed May 12, 2026

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated

liulehui changed the title ~~[train] Forward ScalingConfig.label_selector to AutoscalingCoordinator~~ May 12, 2026

ray-gardener Bot added the "train" (Ray Train Related Issue) and "data" (Ray Data-related issues) labels May 12, 2026

liulehui changed the title ~~[train][data] Forward ScalingConfig.label_selector to AutoscalingCoordinator~~ May 12, 2026

TimothySeah approved these changes May 12, 2026

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated

Comment thread python/ray/train/v2/tests/test_data_integration.py Outdated

rayhhome approved these changes May 13, 2026

cursor Bot reviewed May 14, 2026

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah requested changes May 14, 2026

TimothySeah approved these changes May 14, 2026

justinvyu approved these changes May 14, 2026

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated

liulehui added 6 commits May 14, 2026 16:44

quick fix

1724a81

Signed-off-by: Lehui Liu <lehui@anyscale.com>

add label selector for scaling policy

4e62af4

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing comments

2ecb7a7

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing comments

e4a6dec

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing comments

8c7cfd3

Signed-off-by: Lehui Liu <lehui@anyscale.com>

address comment

22a09d3

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui force-pushed the fractional_gpus branch from c7fd61b to 22a09d3 Compare May 15, 2026 00:01

fix lint

24412c2

Signed-off-by: Lehui Liu <lehui@anyscale.com>

justinvyu merged commit 68d4897 into ray-project:master May 15, 2026
6 checks passed

liulehui mentioned this pull request May 18, 2026

[Train] Ray Train not respecting fractional resource requests and label selectors #63241

Closed

TimothySeah mentioned this pull request Jun 8, 2026

[data] Support multiple datasets in a cluster (2/2): partition cluster resources by subcluster label #63375

Merged

justinvyu merged 7 commits into ray-project:master from liulehui:fractional_gpus

4 participants
