[Data] Make Auto-downscaling Resource Aware#62574
Conversation
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces budget-aware downscaling for actor pools, ensuring that pools scale down when they exceed resource allocations regardless of current utilization. The changes include a new budget check in the scaling logic and a helper function to calculate the required scale-down amount. Review feedback highlights a potential assertion failure in upscaling calculations when budgets are negative, suggests a less conservative approach to downscaling when actors are pending to enforce budgets more strictly, and recommends using a tolerance threshold for floating-point budget comparisons to prevent unnecessary churn. Additionally, there is a suggestion to refactor duplicated helper logic in the test suite to improve maintainability.
There was a problem hiding this comment.
Pull request overview
Updates Ray Data’s DefaultActorAutoscaler to incorporate operator resource budgets into downscaling decisions so actor pools can shrink when they are over their allocated resources (even if utilization is high).
Changes:
- Add an over-budget downscaling path to
DefaultActorAutoscaler._derive_target_scaling_config. - Introduce
_get_required_scale_down()to compute the necessary actor reduction based on remaining budget. - Add
test_actor_pool_scaling_over_budgetto validate over-budget downscaling behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py | Adds resource-budget-aware downscaling and helper to compute required scale-down. |
| python/ray/data/tests/test_autoscaler.py | Adds a unit test covering downscaling behavior when the actor pool is over budget. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements resource budget enforcement within the DefaultActorAutoscaler, enabling downscaling when an actor pool exceeds its allocated resources. Key changes include the addition of a _get_required_scale_down helper function, budget clamping for scale-up operations, and updated unit tests for budget enforcement. Review feedback highlights a redundant call to fetch the resource budget and suggests refining the scale-down calculation by subtracting the epsilon value before applying math.ceil to prevent over-aggressive downscaling due to floating-point noise.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 5b3123a. Configure here.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
f3a5ed0 to
a1bd435
Compare
|
|
||
| Args: | ||
| actor_pool: The actor pool to scale down. | ||
| budget: The raw remaining budget (allocation - usage) (Note this is different |
There was a problem hiding this comment.
Nit: I don't think it's clear what "raw" means in this context
There was a problem hiding this comment.
Changed to "net" in new comment. I also think (allocation - usage) could give future some idea what this budget represent.
| Args: | ||
| actor_pool: The actor pool to scale down. | ||
| budget: The raw remaining budget (allocation - usage) (Note this is different | ||
| from the budget returned by `ResourceManager.get_budget()` and can be negative). |
There was a problem hiding this comment.
Don't think this is accurate anymore
There was a problem hiding this comment.
Addressed in new commit.
| per_actor = actor_pool.per_actor_resource_usage() | ||
|
|
||
| required_cpu_scale_down = 0 | ||
| if per_actor.cpu > 0 and budget.cpu < 0: | ||
| required_cpu_scale_down = math.ceil(abs(budget.cpu) / per_actor.cpu) | ||
|
|
||
| required_gpu_scale_down = 0 | ||
| if per_actor.gpu > 0 and budget.gpu < 0: | ||
| required_gpu_scale_down = math.ceil(abs(budget.gpu) / per_actor.gpu) | ||
|
|
||
| required_memory_scale_down = 0 | ||
| if per_actor.memory > 0 and budget.memory < 0: | ||
| required_memory_scale_down = math.ceil(abs(budget.memory) / per_actor.memory) | ||
|
|
||
| return max( | ||
| required_cpu_scale_down, required_gpu_scale_down, required_memory_scale_down | ||
| ) |
There was a problem hiding this comment.
Would it make sense to re-use ExecutionOptions.floordiv here?
There was a problem hiding this comment.
For scaling down, we have to use ceil instead of floor to conservatively cut down resource usage for correctness. It actually makes more sense to use ExecutionOptions.floordiv in _get_max_scale_up than in _get_required_scale_down. Should I just use it in _get_max_scale_up, or would it be better if I generalize ExecutionOptions.floordiv to be able to apply either floor or ceil based on the use case?
There was a problem hiding this comment.
Temporarily applied ExecutionOptions.floordiv for _get_max_scale_up.
| @@ -235,7 +235,8 @@ def assert_autoscaling_action( | |||
| @pytest.fixture | |||
| def autoscaler_max_upscaling_delta_setup(): | |||
There was a problem hiding this comment.
Would it be possible to still add tests? I think we'd mock get_allocation anyway, so I don't think there are issues if it's off?
There was a problem hiding this comment.
Added tests in new commit.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
## Description The `DefaultActorAutoscaler` in Ray Data does not consider resource allocation when making downscaling decisions. If an actor pool operator's resource allocation is reduced (e.g., by the `ResourceManager` rebalancing budgets), the actor pool can remain over-budget indefinitely as long as its utilization stays above the downscaling threshold. This PR adds logic so that when the actor pool exceeds the resource allocation, it scales down regardless of its overall utilization. ## Additional information Added `test_actor_pool_scaling_over_budget` to check actor pools downscale when over their resource allocation. **Follow up: the current implementation of `get_allocation()` is buggy because the operator budget is clamped to 0 in `update_budget`. PR required to fix this!** --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
## Description The `DefaultActorAutoscaler` in Ray Data does not consider resource allocation when making downscaling decisions. If an actor pool operator's resource allocation is reduced (e.g., by the `ResourceManager` rebalancing budgets), the actor pool can remain over-budget indefinitely as long as its utilization stays above the downscaling threshold. This PR adds logic so that when the actor pool exceeds the resource allocation, it scales down regardless of its overall utilization. ## Additional information Added `test_actor_pool_scaling_over_budget` to check actor pools downscale when over their resource allocation. **Follow up: the current implementation of `get_allocation()` is buggy because the operator budget is clamped to 0 in `update_budget`. PR required to fix this!** --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Description
The
DefaultActorAutoscalerin Ray Data does not consider resource allocation when making downscaling decisions. If an actor pool operator's resource allocation is reduced (e.g., by theResourceManagerrebalancing budgets), the actor pool can remain over-budget indefinitely as long as its utilization stays above the downscaling threshold.This PR adds logic so that when the actor pool exceeds the resource allocation, it scales down regardless of its overall utilization.
Additional information
Added
test_actor_pool_scaling_over_budgetto check actor pools downscale when over their resource allocation.Follow up: the current implementation of
get_allocation()is buggy because the operator budget is clamped to 0 inupdate_budget. PR targeting the fix: #62649