[train] Add bundle_label_selector to ScalingConfig #58845

Merged
justinvyu merged 8 commits into ray-project:master from TimothySeah:tseah/bundle-selector-to-scaling-config on Nov 30, 2025

@TimothySeah TimothySeah commented Nov 20, 2025

Summary

This PR adds a bundle_label_selector argument to the ScalingConfig that allows Ray Train workers to be placed on nodes with particular labels. The previous workaround, namely using resources_per_worker, is less flexible.

bundle_label_selector can either be a single dict, in which case it will apply to all the workers, or a list of length num_workers, in which case each item in the list will correspond to one of the workers.
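
The two accepted shapes can be sketched in plain Python. This is only an illustration under stated assumptions: `expand_label_selector` is a hypothetical helper, not a Ray function, and the label keys and values are invented. It shows a single dict being broadcast to every worker, and a list being accepted only when it has exactly `num_workers` entries.

```python
from typing import Dict, List, Optional, Union

LabelSelector = Dict[str, str]


def expand_label_selector(
    selector: Optional[Union[LabelSelector, List[LabelSelector]]],
    num_workers: int,
) -> Optional[List[LabelSelector]]:
    """Hypothetical helper (not part of Ray): one selector dict per worker."""
    if selector is None:
        return None
    if isinstance(selector, dict):
        # A single dict applies to all workers; give each worker its own copy.
        return [dict(selector) for _ in range(num_workers)]
    if len(selector) != num_workers:
        raise ValueError(
            f"Expected {num_workers} selectors, got {len(selector)}."
        )
    # A list maps item i to worker i.
    return [dict(s) for s in selector]


# Invented labels, for illustration only.
shared = expand_label_selector({"accelerator": "A100"}, num_workers=2)
per_worker = expand_label_selector(
    [{"zone": "us-west-1a"}, {"zone": "us-west-1b"}], num_workers=2
)
```

In the PR itself the value is passed as the `bundle_label_selector` argument of `ScalingConfig`; the expansion above is just one plausible way to reason about the two forms.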

I added verification to the controller rather than validating up front that no callback implements on_controller_start_worker_group when bundle_label_selector is set, because we might change on_controller_start_worker_group in the future. We can revisit this then.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner November 20, 2025 03:48

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a bundle_label_selector to ScalingConfig, enabling more granular control over worker placement by using node labels. The implementation is solid, with appropriate validation and comprehensive unit tests. I've identified a significant issue in the controller logic where creating copies of the bundle selector for multiple workers results in all workers sharing the same dictionary instance. This could lead to hard-to-debug side effects if the selector is modified. I've provided suggestions to correct this by using list comprehensions to create truly independent copies. Apart from this, the changes are well-structured and improve the flexibility of worker placement.
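
The aliasing problem described in this review can be reproduced in a few lines. The original controller code is not shown in this conversation, so the list-multiplication pattern below is an assumption about what it looked like; the label key and values are invented.

```python
num_workers = 3

# List multiplication repeats references to the SAME dict object.
selector = {"region": "us-west"}
aliased = [selector] * num_workers
aliased[0]["region"] = "eu-central"  # mutates every entry at once
aliased_regions = [s["region"] for s in aliased]

# A list comprehension with a shallow copy gives each worker its own dict.
selector = {"region": "us-west"}
independent = [selector.copy() for _ in range(num_workers)]
independent[0]["region"] = "eu-central"  # affects only the first worker
independent_regions = [s["region"] for s in independent]
```

A shallow copy is enough here because the selector values are strings; nested mutable values would call for `copy.deepcopy`.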

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py Outdated
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py Outdated
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py Outdated
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener ray-gardener Bot added the train Ray Train Related Issue label Nov 20, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@liulehui liulehui left a comment

ty!

@justinvyu justinvyu left a comment

😎

Comment thread python/ray/train/v2/tests/test_config.py Outdated
Comment thread python/ray/train/v2/tests/test_controller.py Outdated
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py
Comment thread python/ray/train/v2/api/config.py

@ryanaoleary ryanaoleary left a comment

LGTM

Signed-off-by: Timothy Seah <tseah@anyscale.com>
…warning

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Nov 26, 2025
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py

@justinvyu justinvyu left a comment

🚢

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@justinvyu justinvyu merged commit 2ab5d20 into ray-project:master Nov 30, 2025
6 checks passed
matthewdeng added a commit that referenced this pull request Dec 13, 2025
…59414)

## Description
Rename `ScalingConfig.bundle_label_selector` to
`ScalingConfig.label_selector` for a cleaner API.

This matches the `@ray.remote` API, as opposed to the `PlacementGroup`
API which uses `bundle_label_selector`.

## Related issues

API was introduced in #58845.

## Additional information

This change is technically backwards incompatible, but
`bundle_label_selector` was just introduced and not part of any minor
version releases yet.

Also made the same changes to `WorkerGroupContext`, and renamed local
variables in `TrainController` and `TPUReservationCallback`.

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
justinvyu pushed a commit that referenced this pull request May 15, 2026
Fix the following issue:
1. When `label_selector` support was added in #58845, there was
no `resource_requests` in
[`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137);
that was added in #61703.
2. After that change, `FixedScalingPolicy` also sends an
autoscaling request, but without forwarding the label selectors,
because the `AutoscalingCoordinator` does not support a label
selector field.
3. This caused an issue: `ray autoscaler` sees a bare, unlabeled `{"CPU":
N}` demand alongside the **labeled** PlacementGroup demand, which can
cause it to scale up the wrong worker group. See the repro in the
additional information.
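
The mismatch in step 3 can be illustrated with a toy model. This is not the Ray autoscaler's actual logic; the node group names, labels, and the `matching_groups` helper are all invented. The point is only that a demand without a selector is satisfiable by every group, so nothing steers it toward the labeled one.

```python
from typing import Dict, List, Optional

# Hypothetical node groups; names and labels are invented for illustration.
NODE_GROUPS = [
    {"name": "cpu-default", "labels": {}},
    {"name": "cpu-labeled", "labels": {"pool": "train"}},
]


def matching_groups(label_selector: Optional[Dict[str, str]]) -> List[str]:
    """Return the names of groups whose labels satisfy the selector."""
    selector = label_selector or {}
    return [
        group["name"]
        for group in NODE_GROUPS
        if all(group["labels"].get(k) == v for k, v in selector.items())
    ]


# The placement group demand carries the selector, so one group qualifies.
labeled = matching_groups({"pool": "train"})
# The extra request carries only {"CPU": N}, so every group qualifies.
unlabeled = matching_groups(None)
```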

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…roject#63287)

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

4 participants