[train][data] Forward label_selector to AutoscalingCoordinator #63287

Merged: justinvyu merged 7 commits into ray-project:master from liulehui:fractional_gpus on May 15, 2026

Conversation

@liulehui

Contributor

Description

  1. When label_selector support was added in "[train] Add bundle_label_selector to ScalingConfig" (#58845), FixedScalingPolicy did not yet issue resource_requests. Those were added later in "[train] Register training resources with AutoscalingCoordinator in FixedScalingPolicy" (#61703).
  2. After #61703, FixedScalingPolicy also sends an autoscaling request, but without forwarding the label selectors, because the AutoscalingCoordinator did not support a label selector field.
  3. As a result, the Ray autoscaler sees a bare, unlabeled {"CPU": N} demand alongside the labeled PlacementGroup demand, which can cause it to scale up the wrong worker group. See the repro under "Additional information".
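
The mismatch in item 3 can be illustrated with a toy model. This is not Ray autoscaler code: the `NodeGroup` type, the `pick_group` function, the group names, the CPU counts, and the first-fit selection rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeGroup:
    """Hypothetical worker group: a name, per-node resources, and node labels."""

    name: str
    resources: Dict[str, float]
    labels: Dict[str, str] = field(default_factory=dict)


def pick_group(
    groups: List[NodeGroup],
    demand: Dict[str, float],
    label_selector: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return the first group that fits the demand and matches every label.

    First-fit is an assumption for this sketch, not the real autoscaler policy.
    """
    selector = label_selector or {}
    for group in groups:
        fits = all(group.resources.get(k, 0) >= v for k, v in demand.items())
        matches = all(group.labels.get(k) == v for k, v in selector.items())
        if fits and matches:
            return group.name
    return None


groups = [
    NodeGroup("large", {"CPU": 8}, {"size": "large"}),
    NodeGroup("xlarge", {"CPU": 16}, {"size": "xlarge"}),
]

# Bare demand, as described in item 2: nothing ties it to the xlarge group.
unlabeled_choice = pick_group(groups, {"CPU": 4})
# Same demand with the selector forwarded alongside it.
labeled_choice = pick_group(groups, {"CPU": 4}, {"size": "xlarge"})

print(unlabeled_choice, labeled_choice)  # prints: large xlarge
```

In this sketch the unlabeled demand is satisfied by the first group that fits, while the labeled demand can only be satisfied by the group whose labels match, which is the kind of divergence the description attributes to the missing selector.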

Related issues

Related to #63241. Note that this PR does not yet address fractional usage.

Additional information

repro: https://gist.github.com/liulehui/040e4ff57840ffebca48b826ca5a3b00

Before this change, the repro hangs: the autoscaler scales up the large worker group instead of xlarge, which causes the PlacementGroup to hang.

After this change, the repro works correctly.

@liulehui liulehui requested review from a team as code owners May 12, 2026 00:12
@liulehui liulehui added the go add ONLY when ready to merge, run all tests label May 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for label_selectors in the Ray Autoscaling Coordinator, enabling more granular resource requests for Ray Train v2. Key changes include updating the request_resources interface across the base, default, and fake coordinators, implementing selector validation and merging logic in the DefaultAutoscalingCoordinator, and refactoring ScalingConfig to provide per-worker label selectors. Review feedback suggests refactoring a complex lambda function in the _AutoscalingCoordinatorActor constructor into a separate helper function to improve code readability and maintainability.
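
As a rough sketch of what "selector validation and merging" could look like, the snippet below pairs each resource bundle with a label selector and flattens requests from several requesters. It does not reproduce the actual `DefaultAutoscalingCoordinator` code: `validate_selectors`, `merge_requests`, the one-selector-per-bundle rule, and the requester names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Bundle = Dict[str, float]
Selector = Dict[str, str]


def validate_selectors(
    bundles: List[Bundle], selectors: Optional[List[Selector]]
) -> List[Selector]:
    """Return one selector per bundle, defaulting to empty selectors.

    Assumed rule: when selectors are given, there must be exactly one per
    bundle so that each demand keeps its own constraint.
    """
    if selectors is None:
        return [{} for _ in bundles]
    if len(selectors) != len(bundles):
        raise ValueError(
            f"Expected {len(bundles)} label selectors, got {len(selectors)}"
        )
    return [dict(s) for s in selectors]


def merge_requests(
    requests: Dict[str, Tuple[List[Bundle], Optional[List[Selector]]]]
) -> List[Tuple[Bundle, Selector]]:
    """Flatten per-requester requests into (bundle, selector) pairs."""
    merged: List[Tuple[Bundle, Selector]] = []
    for requester_id in sorted(requests):  # sorted for a deterministic order
        bundles, selectors = requests[requester_id]
        merged.extend(zip(bundles, validate_selectors(bundles, selectors)))
    return merged


merged = merge_requests(
    {
        "train": ([{"CPU": 4}, {"CPU": 4}], [{"size": "xlarge"}] * 2),
        "data": ([{"CPU": 1}], None),
    }
)
```

Keeping the selector next to its bundle through the merge step means a downstream consumer never sees a labeled request stripped down to a bare resource dictionary.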

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated
@liulehui liulehui changed the title [train] Forward ScalingConfig.label_selector to AutoscalingCoordinator May 12, 2026
@ray-gardener ray-gardener Bot added train Ray Train Related Issue data Ray Data-related issues labels May 12, 2026
@liulehui liulehui changed the title [train][data] Forward ScalingConfig.label_selector to AutoscalingCoordinator May 12, 2026

@TimothySeah TimothySeah left a comment

LGTM, thanks!

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated
Comment thread python/ray/train/v2/tests/test_data_integration.py Outdated

@rayhhome rayhhome left a comment

LGTM overall; left some comments on readability

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated
Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated
Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3d2fa8a890adbdfc298b2e7699517ccf1f80a87c.

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py

@TimothySeah TimothySeah left a comment

Removing my approval for now due to the _tick loop forwarding that we discussed offline.

@TimothySeah TimothySeah left a comment

Approving this PR as discussed offline. It doesn't introduce any regressions, and a future PR will make sure that Ray Data won't schedule tasks on nodes requested by Ray Train.

@justinvyu justinvyu left a comment

thanks for the fix!

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated
liulehui added 6 commits May 14, 2026 16:44
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@justinvyu justinvyu merged commit 68d4897 into ray-project:master May 15, 2026
6 checks passed
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…roject#63287)

Fix the following issue:
1. When `label_selector` support was added in ray-project#58845, there was
no `resource_requests` in
[`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137),
which was added later in ray-project#61703.
2. After ray-project#61703, `FixedScalingPolicy` also sends an autoscaling
request, but without forwarding the label selectors, because the
`AutoscalingCoordinator` did not support a label selector field.
3. This caused an issue: the Ray autoscaler sees a bare, unlabeled
`{"CPU": N}` demand alongside the **labeled** PlacementGroup demand, which
can cause it to scale up the wrong worker group. See the repro in the
additional information.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests
train: Ray Train Related Issue

4 participants