[train][data] Forward label_selector to AutoscalingCoordinator#63287
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for label_selectors in the Ray Autoscaling Coordinator, enabling more granular resource requests for Ray Train v2. Key changes include updating the request_resources interface across the base, default, and fake coordinators, implementing selector validation and merging logic in the DefaultAutoscalingCoordinator, and refactoring ScalingConfig to provide per-worker label selectors. Review feedback suggests refactoring a complex lambda function in the _AutoscalingCoordinatorActor constructor into a separate helper function to improve code readability and maintainability.
rayhhome
left a comment
There was a problem hiding this comment.
LGTM overall; left some comments on readability
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 3d2fa8a890adbdfc298b2e7699517ccf1f80a87c. Configure here.
TimothySeah
left a comment
There was a problem hiding this comment.
Removing my approval for now due to the _tick loop forwarding that we discussed offline.
TimothySeah
left a comment
There was a problem hiding this comment.
Approving this PR as discussed offline - this doesn't introduce any regressions, and a future PR will make sure that ray data won't schedule tasks on ray-train-requested nodes.
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
…roject#63287) Fix the following issue: 1. previously when we add `label_selector` support in ray-project#58845 , there was no `resource_requests` in [`FixedScalingPolicy`](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/scaling_policy/scaling_policy.py#L111-L137), which is added in ray-project#61703 2. after this change, the normal FixScalingPolicy will also send autoscaling request for fixedScalingPolicy, but without forwarding the label selectors, since the AutoscalingCoordinator does not support label selector field. 3. this caused an issue: `ray autoscaler` sees a bare unlabeled `{"CPU": N}` demand along side the **labeled** PlacementGroup demand, which can cause it to scale up the wrong worker group, see a repro in the additional information. --------- Signed-off-by: Lehui Liu <lehui@anyscale.com> Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

Description
label_selectorsupport in [train] Add bundle_label_selector to ScalingConfig #58845 , there was noresource_requestsinFixedScalingPolicy, which is added in [train] Register training resources with AutoscalingCoordinator in FixedScalingPolicy #61703ray autoscalersees a bare unlabeled{"CPU": N}demand along side the labeled PlacementGroup demand, which can cause it to scale up the wrong worker group, see a repro in the additional information.Related issues
Related to #63241, note that we are not addressing the fractional usage in this PR yet.
Additional information
repro: https://gist.github.com/liulehui/040e4ff57840ffebca48b826ca5a3b00
before this change, it will hang because autoscaler scale up
largeworker instead ofxlarge, cause PlacementGroup to hang.after this change, it works correctly.