Skip to content

[Data] Make Auto-downscaling Resource Aware#62574

Merged
bveeramani merged 14 commits into
ray-project:masterfrom
rayhhome:downscale-actor
Apr 15, 2026
Merged

[Data] Make Auto-downscaling Resource Aware#62574
bveeramani merged 14 commits into
ray-project:masterfrom
rayhhome:downscale-actor

Conversation

@rayhhome

@rayhhome rayhhome commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Description

The DefaultActorAutoscaler in Ray Data does not consider resource allocation when making downscaling decisions. If an actor pool operator's resource allocation is reduced (e.g., by the ResourceManager rebalancing budgets), the actor pool can remain over-budget indefinitely as long as its utilization stays above the downscaling threshold.

This PR adds logic so that when the actor pool exceeds the resource allocation, it scales down regardless of its overall utilization.

Additional information

Added test_actor_pool_scaling_over_budget to check actor pools downscale when over their resource allocation.

Follow up: the current implementation of get_allocation() is buggy because the operator budget is clamped to 0 in update_budget. PR targeting the fix: #62649

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@rayhhome rayhhome self-assigned this Apr 13, 2026
@rayhhome rayhhome requested a review from a team as a code owner April 13, 2026 21:11
@rayhhome rayhhome added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Apr 13, 2026
Copilot AI review requested due to automatic review settings April 13, 2026 21:11

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces budget-aware downscaling for actor pools, ensuring that pools scale down when they exceed resource allocations regardless of current utilization. The changes include a new budget check in the scaling logic and a helper function to calculate the required scale-down amount. Review feedback highlights a potential assertion failure in upscaling calculations when budgets are negative, suggests a less conservative approach to downscaling when actors are pending to enforce budgets more strictly, and recommends using a tolerance threshold for floating-point budget comparisons to prevent unnecessary churn. Additionally, there is a suggestion to refactor duplicated helper logic in the test suite to improve maintainability.

Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/tests/test_autoscaler.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates Ray Data’s DefaultActorAutoscaler to incorporate operator resource budgets into downscaling decisions so actor pools can shrink when they are over their allocated resources (even if utilization is high).

Changes:

  • Add an over-budget downscaling path to DefaultActorAutoscaler._derive_target_scaling_config.
  • Introduce _get_required_scale_down() to compute the necessary actor reduction based on remaining budget.
  • Add test_actor_pool_scaling_over_budget to validate over-budget downscaling behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Adds resource-budget-aware downscaling and helper to compute required scale-down.
python/ray/data/tests/test_autoscaler.py Adds a unit test covering downscaling behavior when the actor pool is over budget.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/tests/test_autoscaler.py Outdated
@rayhhome

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements resource budget enforcement within the DefaultActorAutoscaler, enabling downscaling when an actor pool exceeds its allocated resources. Key changes include the addition of a _get_required_scale_down helper function, budget clamping for scale-up operations, and updated unit tests for budget enforcement. Review feedback highlights a redundant call to fetch the resource budget and suggests refining the scale-down calculation by subtracting the epsilon value before applying math.ceil to prevent over-aggressive downscaling due to floating-point noise.

Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 5b3123a. Configure here.

Comment thread python/ray/data/tests/test_autoscaler.py Outdated
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@rayhhome rayhhome changed the title [Data] Make Autoscaling Resource Aware Apr 15, 2026
@rayhhome rayhhome changed the title [Data] Make Auto Downscaling Resource Aware Apr 15, 2026

Args:
actor_pool: The actor pool to scale down.
budget: The raw remaining budget (allocation - usage) (Note this is different

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't think it's clear what "raw" means in this context

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "net" in new comment. I also think (allocation - usage) could give future some idea what this budget represent.

Args:
actor_pool: The actor pool to scale down.
budget: The raw remaining budget (allocation - usage) (Note this is different
from the budget returned by `ResourceManager.get_budget()` and can be negative).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is accurate anymore

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in new commit.

Comment on lines +328 to +344
per_actor = actor_pool.per_actor_resource_usage()

required_cpu_scale_down = 0
if per_actor.cpu > 0 and budget.cpu < 0:
required_cpu_scale_down = math.ceil(abs(budget.cpu) / per_actor.cpu)

required_gpu_scale_down = 0
if per_actor.gpu > 0 and budget.gpu < 0:
required_gpu_scale_down = math.ceil(abs(budget.gpu) / per_actor.gpu)

required_memory_scale_down = 0
if per_actor.memory > 0 and budget.memory < 0:
required_memory_scale_down = math.ceil(abs(budget.memory) / per_actor.memory)

return max(
required_cpu_scale_down, required_gpu_scale_down, required_memory_scale_down
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to re-use ExecutionOptions.floordiv here?

@rayhhome rayhhome Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For scaling down, we have to use ceil instead of floor to conservatively cut down resource usage for correctness. It actually makes more sense to use ExecutionOptions.floordiv in _get_max_scale_up than in _get_required_scale_down. Should I just use it in _get_max_scale_up, or would it be better if I generalize ExecutionOptions.floordiv to be able to apply either floor or ceil based on the use case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporarily applied ExecutionOptions.floordiv for _get_max_scale_up.

@@ -235,7 +235,8 @@ def assert_autoscaling_action(
@pytest.fixture
def autoscaler_max_upscaling_delta_setup():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to still add tests? I think we'd mock get_allocation anyway, so I don't think there are issues if it's off?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests in new commit.

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@bveeramani bveeramani enabled auto-merge (squash) April 15, 2026 18:43
@bveeramani bveeramani merged commit 37e5527 into ray-project:master Apr 15, 2026
6 of 7 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
## Description
The `DefaultActorAutoscaler` in Ray Data does not consider resource
allocation when making downscaling decisions. If an actor pool
operator's resource allocation is reduced (e.g., by the
`ResourceManager` rebalancing budgets), the actor pool can remain
over-budget indefinitely as long as its utilization stays above the
downscaling threshold.

This PR adds logic so that when the actor pool exceeds the resource
allocation, it scales down regardless of its overall utilization.

## Additional information
Added `test_actor_pool_scaling_over_budget` to check actor pools
downscale when over their resource allocation.

**Follow up: the current implementation of `get_allocation()` is buggy
because the operator budget is clamped to 0 in `update_budget`. PR
required to fix this!**

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Description
The `DefaultActorAutoscaler` in Ray Data does not consider resource
allocation when making downscaling decisions. If an actor pool
operator's resource allocation is reduced (e.g., by the
`ResourceManager` rebalancing budgets), the actor pool can remain
over-budget indefinitely as long as its utilization stays above the
downscaling threshold.

This PR adds logic so that when the actor pool exceeds the resource
allocation, it scales down regardless of its overall utilization.

## Additional information
Added `test_actor_pool_scaling_over_budget` to check actor pools
downscale when over their resource allocation.

**Follow up: the current implementation of `get_allocation()` is buggy
because the operator budget is clamped to 0 in `update_budget`. PR
required to fix this!**

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants