[train] Add bundle_label_selector to ScalingConfig #58845
Merged
justinvyu merged 8 commits on Nov 30, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request introduces a bundle_label_selector to ScalingConfig, enabling more granular control over worker placement by using node labels. The implementation is solid, with appropriate validation and comprehensive unit tests. I've identified a significant issue in the controller logic where creating copies of the bundle selector for multiple workers results in all workers sharing the same dictionary instance. This could lead to hard-to-debug side effects if the selector is modified. I've provided suggestions to correct this by using list comprehensions to create truly independent copies. Apart from this, the changes are well-structured and improve the flexibility of worker placement.
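The aliasing problem the review describes can be reproduced in isolation. The sketch below is not the controller code from this PR; it assumes the selector was expanded with list multiplication, and the helper names and label keys are hypothetical. It contrasts that pattern with the list-comprehension fix the review suggests.

```python
import copy


def replicate_shared(selector, num_workers):
    # Anti-pattern (assumed, for illustration): list multiplication repeats
    # a reference to the SAME dict object num_workers times.
    return [selector] * num_workers


def replicate_independent(selector, num_workers):
    # Fix: the comprehension evaluates deepcopy once per worker, so every
    # entry is a separate dict.
    return [copy.deepcopy(selector) for _ in range(num_workers)]


# Hypothetical label key and value, used only for this demonstration.
shared = replicate_shared({"accelerator-type": "A100"}, 3)
shared[0]["zone"] = "us-west-2a"  # mutating one entry changes all of them

independent = replicate_independent({"accelerator-type": "A100"}, 3)
independent[0]["zone"] = "us-west-2a"  # only the first entry changes
```

With the shared version, a later per-worker modification silently changes every worker's selector, which is the hard-to-debug side effect the review warns about.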
justinvyu reviewed on Nov 21, 2025
…warning Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu approved these changes on Nov 26, 2025
matthewdeng added a commit that referenced this pull request on Dec 13, 2025
…59414)

## Description
Rename `ScalingConfig.bundle_label_selector` to `ScalingConfig.label_selector` for a cleaner API. This matches the `@ray.remote` API, as opposed to the `PlacementGroup` API, which uses `bundle_label_selector`.

## Related issues
API was introduced in #58845.

## Additional information
This change is technically backwards incompatible, but `bundle_label_selector` was just introduced and is not part of any minor version release yet. Also made the same changes to `WorkerGroupContext`, and renamed local variables in `TrainController` and `TPUReservationCallback`.

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
justinvyu pushed a commit that referenced this pull request on May 15, 2026
Fix the following issue:

1. Previously, when `label_selector` support was added in #58845, [`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137) had no `resource_requests`; that was added in #61703.
2. After that change, `FixedScalingPolicy` also sends an autoscaling request, but without forwarding the label selectors, because the `AutoscalingCoordinator` does not support a label selector field.
3. As a result, the Ray autoscaler sees a bare, unlabeled `{"CPU": N}` demand alongside the **labeled** PlacementGroup demand, which can cause it to scale up the wrong worker group. See the repro in the additional information.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
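Why an unlabeled demand can pull in the wrong worker group is easier to see with a toy model. The sketch below is not the Ray autoscaler's matching logic: the demand shape, the `groups_satisfying` helper, the group names, and the `pool` label are all hypothetical, and selectors are treated as exact-match only, which is a simplification.

```python
def groups_satisfying(demand, worker_groups):
    """Return the names of worker groups whose labels satisfy the demand.

    Toy rule (an assumption, not Ray's implementation): a group qualifies
    when every key in the demand's selector equals the group's label.
    """
    selector = demand.get("label_selector") or {}
    return [
        name
        for name, labels in worker_groups.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical worker groups, keyed by name, with their node labels.
worker_groups = {
    "cpu-general": {"pool": "general"},
    "cpu-train": {"pool": "train"},
}

# The labeled demand (as from the placement group) and the bare demand
# (as from a request that dropped the label selector).
labeled_demand = {"resources": {"CPU": 4}, "label_selector": {"pool": "train"}}
bare_demand = {"resources": {"CPU": 4}}

labeled_candidates = groups_satisfying(labeled_demand, worker_groups)
bare_candidates = groups_satisfying(bare_demand, worker_groups)
```

In this model the labeled demand is satisfied only by `cpu-train`, while the bare demand is satisfied by either group, so a scheduler is free to scale up `cpu-general` for it.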
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request on May 27, 2026
…roject#63287)

Fix the following issue:

1. Previously, when `label_selector` support was added in ray-project#58845, [`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137) had no `resource_requests`; that was added in ray-project#61703.
2. After that change, `FixedScalingPolicy` also sends an autoscaling request, but without forwarding the label selectors, because the `AutoscalingCoordinator` does not support a label selector field.
3. As a result, the Ray autoscaler sees a bare, unlabeled `{"CPU": N}` demand alongside the **labeled** PlacementGroup demand, which can cause it to scale up the wrong worker group. See the repro in the additional information.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
Summary
This PR adds a `bundle_label_selector` argument to the `ScalingConfig` that allows Ray Train workers to be placed on nodes with particular labels. The previous workaround, namely using `resources_per_worker`, is less flexible. `bundle_label_selector` can either be a single dict, in which case it will apply to all the workers, or a list of length `num_workers`, in which case each item in the list will correspond to one of the workers.

I added verification to the controller instead of validating that none of the callbacks have `on_controller_start_worker_group` when `bundle_label_selector` is set because we might change `on_controller_start_worker_group` in the future. We can revisit this issue then.

Testing
Unit tests
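The dict-or-list behavior described in the summary could be normalized roughly as follows. This is a minimal sketch under stated assumptions, not the validation code from this PR: `normalize_label_selector` is a hypothetical helper, and the error messages and the `pool` label are made up for illustration.

```python
from typing import Dict, List, Optional, Union

LabelSelector = Dict[str, str]


def normalize_label_selector(
    selector: Optional[Union[LabelSelector, List[LabelSelector]]],
    num_workers: int,
) -> Optional[List[LabelSelector]]:
    """Expand a selector into one dict per worker (hypothetical helper)."""
    if selector is None:
        return None
    if isinstance(selector, dict):
        # A single dict applies to all workers; copy it per worker so the
        # entries do not share one dict instance.
        return [dict(selector) for _ in range(num_workers)]
    if isinstance(selector, list):
        # A list must supply exactly one selector per worker.
        if len(selector) != num_workers:
            raise ValueError(
                f"Expected {num_workers} selectors, got {len(selector)}."
            )
        return [dict(item) for item in selector]
    raise TypeError("selector must be a dict, a list of dicts, or None.")


# A single dict is expanded to one independent copy per worker.
per_worker = normalize_label_selector({"pool": "train"}, num_workers=2)
```

Copying per worker keeps the expanded entries independent of each other and of the caller's input.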