[Data] Make Auto-downscaling Resource Aware by rayhhome · Pull Request #62574 · ray-project/ray

rayhhome · 2026-04-13T21:11:16Z

Description

The DefaultActorAutoscaler in Ray Data does not consider resource allocation when making downscaling decisions. If an actor pool operator's resource allocation is reduced (e.g., by the ResourceManager rebalancing budgets), the actor pool can remain over-budget indefinitely as long as its utilization stays above the downscaling threshold.

This PR adds logic so that when the actor pool exceeds the resource allocation, it scales down regardless of its overall utilization.
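The behavior described above can be sketched as a small decision function. This is a minimal illustration, not the code in `DefaultActorAutoscaler`: the `PoolState` class, its fields, and the `0.5` threshold are hypothetical names and values chosen for the example, and only CPU is modeled.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of an actor pool (not a Ray Data class)."""

    utilization: float    # fraction of actor capacity currently in use
    cpu_allocated: float  # CPUs the operator is allowed to use
    cpu_used: float       # CPUs the operator's actors currently hold


def should_scale_down(state: PoolState, downscale_threshold: float = 0.5) -> bool:
    """Return True if the pool should shrink.

    Being over the allocation forces a scale-down even when utilization
    is high; otherwise the utilization check decides on its own.
    """
    over_budget = state.cpu_used > state.cpu_allocated
    if over_budget:
        return True
    return state.utilization < downscale_threshold


busy_over_budget = PoolState(utilization=0.9, cpu_allocated=4, cpu_used=6)
busy_within_budget = PoolState(utilization=0.9, cpu_allocated=4, cpu_used=4)
```

Under these assumptions, `busy_over_budget` scales down despite 90% utilization, while `busy_within_budget` does not.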

Additional information

Added test_actor_pool_scaling_over_budget to check that actor pools downscale when they are over their resource allocation.

Follow up: the current implementation of get_allocation() is buggy because the operator budget is clamped to 0 in update_budget. PR targeting the fix: #62649
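One way to see why clamping matters: a budget clamped at zero cannot express how far over its allocation an operator is. The sketch below is only an arithmetic illustration. `remaining_budget` is a hypothetical helper, and it does not reproduce what `update_budget` or `get_allocation()` actually do.

```python
def remaining_budget(allocation: float, usage: float, clamp: bool) -> float:
    """Hypothetical helper: allocation minus usage, optionally clamped at 0."""
    net = allocation - usage
    return max(net, 0.0) if clamp else net


# An operator allowed 4 CPUs but currently using 6.
clamped = remaining_budget(4.0, 6.0, clamp=True)   # 0.0, overage is hidden
net = remaining_budget(4.0, 6.0, clamp=False)      # -2.0, overage is visible
```

With the clamped value, an operator that is exactly at its allocation and one that is 2 CPUs over look identical, so any quantity derived from it can no longer say how much to scale down.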

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces budget-aware downscaling for actor pools, ensuring that pools scale down when they exceed resource allocations regardless of current utilization. The changes include a new budget check in the scaling logic and a helper function to calculate the required scale-down amount. Review feedback highlights a potential assertion failure in upscaling calculations when budgets are negative, suggests a less conservative approach to downscaling when actors are pending to enforce budgets more strictly, and recommends using a tolerance threshold for floating-point budget comparisons to prevent unnecessary churn. Additionally, there is a suggestion to refactor duplicated helper logic in the test suite to improve maintainability.

Copilot

Pull request overview

Updates Ray Data’s DefaultActorAutoscaler to incorporate operator resource budgets into downscaling decisions so actor pools can shrink when they are over their allocated resources (even if utilization is high).

Changes:

Add an over-budget downscaling path to DefaultActorAutoscaler._derive_target_scaling_config.
Introduce _get_required_scale_down() to compute the necessary actor reduction based on remaining budget.
Add test_actor_pool_scaling_over_budget to validate over-budget downscaling behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py	Adds resource-budget-aware downscaling and helper to compute required scale-down.
python/ray/data/tests/test_autoscaler.py	Adds a unit test covering downscaling behavior when the actor pool is over budget.


rayhhome · 2026-04-13T23:12:49Z

/gemini review

gemini-code-assist

Code Review

This pull request implements resource budget enforcement within the DefaultActorAutoscaler, enabling downscaling when an actor pool exceeds its allocated resources. Key changes include the addition of a _get_required_scale_down helper function, budget clamping for scale-up operations, and updated unit tests for budget enforcement. Review feedback highlights a redundant call to fetch the resource budget and suggests refining the scale-down calculation by subtracting the epsilon value before applying math.ceil to prevent over-aggressive downscaling due to floating-point noise.
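The epsilon suggestion can be illustrated as follows. This is a sketch of the reviewer's idea only: `EPSILON` and `actors_to_remove` are hypothetical names, and the commit list later in this PR shows the epsilon was subsequently removed, so this is not the merged behavior.

```python
import math

EPSILON = 1e-9  # hypothetical tolerance for floating-point noise


def actors_to_remove(deficit: float, per_actor: float, eps: float = EPSILON) -> int:
    """Actors to remove to cover `deficit`, ignoring noise smaller than `eps`.

    `deficit` is how far usage exceeds the allocation for one resource;
    `per_actor` is that resource's usage per actor.
    """
    if per_actor <= 0 or deficit <= eps:
        return 0
    # Subtract eps before ceil so a deficit of "3 actors plus rounding
    # noise" does not round up to 4.
    return math.ceil((deficit - eps) / per_actor)


noisy_deficit = 0.1 + 0.2  # 0.30000000000000004 in binary floating point
naive = math.ceil(noisy_deficit / 0.1)
tolerant = actors_to_remove(noisy_deficit, 0.1)
```

Here the naive `ceil` asks for 4 actors because `0.1 + 0.2` is slightly above `0.3`, while the tolerant version asks for 3.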


cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 5b3123a.

bveeramani · 2026-04-15T03:20:48Z

+
+    Args:
+        actor_pool: The actor pool to scale down.
+        budget: The raw remaining budget (allocation - usage) (Note this is different


Nit: I don't think it's clear what "raw" means in this context

Changed to "net" in the new comment. I also think "(allocation - usage)" gives future readers some idea of what this budget represents.

bveeramani · 2026-04-15T03:20:56Z

+    Args:
+        actor_pool: The actor pool to scale down.
+        budget: The raw remaining budget (allocation - usage) (Note this is different
+            from the budget returned by `ResourceManager.get_budget()` and can be negative).


Don't think this is accurate anymore

Addressed in new commit.

bveeramani · 2026-04-15T03:21:32Z

+    per_actor = actor_pool.per_actor_resource_usage()
+
+    required_cpu_scale_down = 0
+    if per_actor.cpu > 0 and budget.cpu < 0:
+        required_cpu_scale_down = math.ceil(abs(budget.cpu) / per_actor.cpu)
+
+    required_gpu_scale_down = 0
+    if per_actor.gpu > 0 and budget.gpu < 0:
+        required_gpu_scale_down = math.ceil(abs(budget.gpu) / per_actor.gpu)
+
+    required_memory_scale_down = 0
+    if per_actor.memory > 0 and budget.memory < 0:
+        required_memory_scale_down = math.ceil(abs(budget.memory) / per_actor.memory)
+
+    return max(
+        required_cpu_scale_down, required_gpu_scale_down, required_memory_scale_down
+    )


Would it make sense to re-use ExecutionOptions.floordiv here?

For scaling down, we have to use ceil instead of floor to conservatively cut down resource usage for correctness. It actually makes more sense to use ExecutionOptions.floordiv in _get_max_scale_up than in _get_required_scale_down. Should I just use it in _get_max_scale_up, or would it be better if I generalize ExecutionOptions.floordiv to be able to apply either floor or ceil based on the use case?

Temporarily applied ExecutionOptions.floordiv for _get_max_scale_up.
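The rounding argument in this thread can be checked with a small single-resource example. The two functions below are illustrative stand-ins, not the PR's `_get_required_scale_down` or `_get_max_scale_up`, and they do not call `ExecutionOptions.floordiv`.

```python
import math


def required_scale_down(deficit: float, per_actor: float) -> int:
    """Actors to remove so usage drops by at least `deficit`.

    Rounds up: removing too few actors would leave the pool over budget.
    `per_actor` must be positive.
    """
    return math.ceil(deficit / per_actor) if deficit > 0 else 0


def max_scale_up(headroom: float, per_actor: float) -> int:
    """Actors that can be added without exceeding `headroom`.

    Rounds down: adding a partial actor's worth would overshoot.
    `per_actor` must be positive.
    """
    return math.floor(headroom / per_actor) if headroom > 0 else 0


# Each actor uses 2 CPUs.
down = required_scale_down(deficit=3.0, per_actor=2.0)  # ceil(1.5)
up = max_scale_up(headroom=3.0, per_actor=2.0)          # floor(1.5)
```

With a 3-CPU deficit, removing one actor frees only 2 CPUs, so two actors must go; with 3 CPUs of headroom, only one 2-CPU actor fits.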

bveeramani · 2026-04-15T03:38:40Z

@@ -235,7 +235,8 @@ def assert_autoscaling_action(
 @pytest.fixture
 def autoscaler_max_upscaling_delta_setup():


Would it be possible to still add tests? I think we'd mock get_allocation anyway, so I don't think there are issues if it's off?

Added tests in new commit.
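A test along the lines suggested here might mock the resource manager so the result does not depend on how `get_allocation` is computed. The sketch below is not the test added in the PR: `is_over_budget` is a hypothetical function, and the assumption that `get_allocation` and `get_op_usage` take an operator and return an object with a `cpu` attribute is made only for illustration.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def is_over_budget(resource_manager, op) -> bool:
    """Hypothetical check: does the operator use more CPU than allocated?"""
    allocation = resource_manager.get_allocation(op)
    usage = resource_manager.get_op_usage(op)
    return usage.cpu > allocation.cpu


# Mock the resource manager so the test controls both values directly.
resource_manager = MagicMock()
resource_manager.get_allocation.return_value = SimpleNamespace(cpu=4.0)
resource_manager.get_op_usage.return_value = SimpleNamespace(cpu=6.0)
op = MagicMock()  # stand-in for the actor pool operator

over = is_over_budget(resource_manager, op)
```

Because both return values are fixed by the mock, the check exercises only the autoscaling decision, independent of any bug in the real allocation calculation.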


Initial change by adding scale down logic and test

59d0f23

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

rayhhome self-assigned this Apr 13, 2026

rayhhome requested a review from a team as a code owner April 13, 2026 21:11

rayhhome added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Apr 13, 2026

Copilot AI review requested due to automatic review settings April 13, 2026 21:11

Copilot started reviewing on behalf of rayhhome April 13, 2026 21:11

gemini-code-assist Bot reviewed Apr 13, 2026


Copilot AI reviewed Apr 13, 2026


Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated

Comment thread python/ray/data/tests/test_autoscaler.py Outdated

rayhhome added 3 commits April 13, 2026 14:43

Merge branch 'master' into downscale-actor

574d9fe

Merge branch 'master' into downscale-actor

f7c1da2

Address comments and minor fixes

c33d23d

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

gemini-code-assist Bot reviewed Apr 13, 2026


Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated

Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated

Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py Outdated

rayhhome added 7 commits April 13, 2026 16:53

Account for logical memory usage + Minor fixes + Remove epsilon

701b558

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Switch to using raw budget for measuring autoscaling factor

3606242

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Added test for cross resource case

b0f0821

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into downscale-actor

6b8593c

Make get_raw_budget correctly abstract

792c483

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'downscale-actor' of github.com:rayhhome/ray into downsc…

e20bd5a

…ale-actor

Switch back to using get_allocation and get_op_usage

5b3123a

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

cursor Bot reviewed Apr 15, 2026


Comment thread python/ray/data/_internal/actor_autoscaler/default_actor_autoscaler.py

Comment thread python/ray/data/tests/test_autoscaler.py Outdated

Revert autoscaler changes

a1bd435

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

rayhhome force-pushed the downscale-actor branch from f3a5ed0 to a1bd435 Compare April 15, 2026 02:21

rayhhome changed the title ~~[Data] Make Autoscaling Resource Aware~~ Apr 15, 2026

rayhhome changed the title ~~[Data] Make Auto Downscaling Resource Aware~~ Apr 15, 2026

bveeramani reviewed Apr 15, 2026


rayhhome added 2 commits April 15, 2026 11:06

Address comments

496fbca

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into downscale-actor

651852b

bveeramani approved these changes Apr 15, 2026


bveeramani enabled auto-merge (squash) April 15, 2026 18:43

bveeramani merged commit 37e5527 into ray-project:master Apr 15, 2026
6 of 7 checks passed

rayhhome mentioned this pull request Apr 16, 2026

[Data] Fix Incorrect get_allocation Calculation and Refactor Resource Manager Allocation Strategy #62649

Open

bveeramani merged 14 commits into ray-project:master from rayhhome:downscale-actor
