[Data] Non-blocking Default Autoscaling Coordinator by rayhhome · Pull Request #62725 · ray-project/ray

rayhhome · 2026-04-17T19:20:59Z

Description

Problem: get_allocated_resources was called every ~1s from the scheduling loop but used a blocking ray.get(), so any actor queue delay or result transfer latency directly stalled dataset execution.

This PR makes all three DefaultAutoscalingCoordinator public methods non-blocking:

get_allocated_resources: fires a background request and immediately returns the last cached value; the cache is updated when the response arrives on the next loop iteration.
request_resources and cancel_request: fire-and-forget; only send the request to the backing AutoscalingCoordinatorActor without any result observation.
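The cached, non-blocking read described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `NonBlockingCachedGetter`, `fetch`, and the use of `concurrent.futures` in place of Ray actor calls are all assumptions made for the example, and error handling is omitted.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class NonBlockingCachedGetter:
    """Hypothetical helper: serve the last cached value, keep one request in flight."""

    def __init__(
        self,
        fetch: Callable[[], List[Dict[str, float]]],
        executor: ThreadPoolExecutor,
    ) -> None:
        self._fetch = fetch
        self._executor = executor
        # Returned until the first response has been harvested.
        self._cached: List[Dict[str, float]] = []
        self._pending: Optional[Future] = None

    def get(self) -> List[Dict[str, float]]:
        # Harvest a finished request without waiting on it.
        if self._pending is not None and self._pending.done():
            finished, self._pending = self._pending, None
            self._cached = finished.result(timeout=0)
        # Fire a new background request only if none is in flight.
        if self._pending is None:
            self._pending = self._executor.submit(self._fetch)
        # Never blocks: the caller sees the previous value until the next call.
        return self._cached
```

The first call returns the empty default, and a later call picks up the response once it has arrived, which mirrors the "updated on the next loop iteration" behavior in the description.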

Additional information

This change is backed by #62838: AutoscalingCoordinator abstract methods no longer take requester_id; DefaultAutoscalingCoordinator.__init__ now requires it.

The underlying _AutoscalingCoordinatorActor is unchanged.

Unit tests cover behaviors independently: in-flight caching, cache update on success, actor-error fallback to cached value, non-Ray error propagation, and client-side state cleanup on cancel.

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Copilot

Pull request overview

This PR updates Ray Data’s DefaultAutoscalingCoordinator.get_allocated_resources() to avoid blocking the scheduling loop by issuing the actor RPC asynchronously and returning the last cached allocation until the in-flight response is ready.

Changes:

Add per-requester in-flight tracking (_pending_allocated_resources) and cached fallback behavior for non-blocking get_allocated_resources().
Introduce unit tests for in-flight caching, cache update on success, and failure-counter behavior for actor errors and timeouts.
Update an existing test to poll until the async get_allocated_resources() result arrives.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py	Implements non-blocking `get_allocated_resources()` with pending-request tracking, cancellation on timeout, and failure tracking/logging.
python/ray/data/tests/test_autoscaling_coordinator.py	Adds unit tests for the new async/cached behavior and updates an existing test to poll for eventual allocation.

Copilot · 2026-04-17T19:25:07Z

+                    result = ray.get(ref)
+                    self._cached_allocated_resources[requester_id] = result
+                    self._consecutive_failures_get_allocated_resources = 0
+                except Exception as exc:


get_allocated_resources() currently catches Exception from ray.get(ref) and converts it into cached fallback behavior via _record_get_allocated_resources_failure(). This swallows non-Ray/programming errors (e.g., TypeError/ValueError) that should surface to the caller, and it also conflicts with the PR description’s claim that non-Ray errors are propagated. Consider only treating Ray-originated failures as recoverable here (e.g., catch ray.exceptions.RayError / relevant Ray exceptions), and re-raise unexpected exceptions after cleaning up _pending_allocated_resources so real bugs aren’t masked.

Suggested change

-                except Exception as exc:
+                except ray.exceptions.RayError as exc:
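The distinction Copilot draws, recoverable backend failures versus programming errors, can be illustrated without Ray. In this sketch `BackendError` is a hypothetical stand-in for `ray.exceptions.RayError`, and `refresh_or_fallback` is an invented helper name; only the narrow `except` clause is the point.

```python
from typing import Callable, Dict, List


class BackendError(Exception):
    """Hypothetical stand-in for a transport-level failure such as ray.exceptions.RayError."""


def refresh_or_fallback(
    fetch: Callable[[], List[Dict[str, float]]],
    cached: List[Dict[str, float]],
) -> List[Dict[str, float]]:
    try:
        return fetch()
    except BackendError:
        # Recoverable: keep serving the previously cached value.
        return cached
    # Anything else (TypeError, ValueError, ...) is not caught here,
    # so real bugs surface to the caller instead of being masked.
```

With a bare `except Exception`, the `TypeError` case would silently return the cached value, which is the masking behavior the comment warns about.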

gemini-code-assist

Code Review

This pull request refactors get_allocated_resources in DefaultAutoscalingCoordinator to be non-blocking, utilizing a cache and tracking in-flight asynchronous requests. The review feedback recommends using typing.Tuple for Python 3.8 compatibility and time.monotonic() for reliable duration tracking.
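Both recommendations are easy to show in isolation. The helper below is a hypothetical example, not code from the PR: it annotates its return type with `typing.Tuple` (the built-in `tuple[...]` generic syntax requires Python 3.9+) and measures elapsed time with `time.monotonic()`, which is unaffected by wall-clock adjustments.

```python
import time
from typing import Callable, Tuple


def timed_call(fn: Callable[[], int]) -> Tuple[int, float]:
    """Run fn and return (result, elapsed seconds)."""
    # time.monotonic() never goes backwards, unlike time.time(),
    # so the difference is always a valid non-negative duration.
    start = time.monotonic()
    result = fn()
    elapsed_s = time.monotonic() - start
    return result, elapsed_s
```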

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 7e0cb1e.

bveeramani

Implementation LGTM. Left comments on tests

bveeramani · 2026-04-22T17:55:41Z

+    Single-tenant: every instance is owned by exactly one
+    DefaultClusterAutoscalerV2 and is called with a single, fixed requester_id
+    for its entire lifetime. Violating this is undefined behavior.


I felt confused by this sentence. What does multi-tenancy mean in this context? I don't think the jargon helps clarify the use of the class

I don't think we should state that it's owned by DefaultClusterAutoscalerV2 because the caller can be any Ray Data autoscaler implementation or even Ray Train. In general, I don't think we should make strong assumptions about who the caller is

Updated comments to more accurately and directly describe the use of the DefaultAutoscalingCoordinator class.

bveeramani · 2026-04-22T17:55:53Z

+    DefaultClusterAutoscalerV2 and is called with a single, fixed requester_id
+    for its entire lifetime. Violating this is undefined behavior.
+    ``get_allocated_resources`` tracks a single in-flight ref and falls back
+    to the cached value on actor errors; ``request_resources`` and


On actor error or if the value isn't ready?

Both; the new docstring should explain the behavior more clearly.

bveeramani · 2026-04-22T17:58:00Z

+        """Fire-and-forget: submit a resource request to the coordinator actor.
+
+        Returns immediately without observing the result or errors. Actor-side
+        errors (e.g. type mismatches) are not surfaced to the caller.


What type mismatch can happen? Like invalid inputs?

Invalid inputs could lead to a type mismatch. On review, this example seems too specific for the docstring, considering that _AutoscalingCoordinatorActor can raise far more meaningful, deliberately designed ValueErrors. Since all errors from the actor are swallowed in the current implementation, I've removed the example in the new commit.

bveeramani · 2026-04-22T18:05:03Z

+            if ready:
+                self._pending_allocated_resources = None
+                try:
+                    self._cached_allocated_resources = ray.get(ref)


We need a timeout here.

If the wait returns the reference as ready, and then the actor dies from CPU overload or something, then this ray.get will hang until the actor gets reconstructed.

Unlikely but good to be safe.

Fixed this using a timeout of 0 for get.
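The effect of a zero timeout can be demonstrated with a standard-library future. This is an analogy only: `concurrent.futures.Future` stands in for a Ray object ref, and `poll_result` is a hypothetical helper. The idea is the same as passing `timeout=0` to `ray.get`: a read that would otherwise wait fails immediately, and the caller falls back to the cached value.

```python
import concurrent.futures
from concurrent.futures import Future
from typing import Dict, List


def poll_result(
    future: Future, cached: List[Dict[str, float]]
) -> List[Dict[str, float]]:
    """Return the future's value if it is available right now, else the cached value."""
    try:
        # timeout=0 means "do not wait at all": an unavailable result
        # raises TimeoutError immediately instead of hanging the loop.
        return future.result(timeout=0)
    except concurrent.futures.TimeoutError:
        return cached
```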

bveeramani · 2026-04-22T19:42:32Z

-    call_method,
-    counter_attr,
-    error_msg_prefix,
+def _make_coordinator_with_mock_actor():


The tests in this file will be brittle because they assert against and mock several internal attributes. They're also a bit hard to read without understanding the implementation.

Here's a sketch of how you could refactor these:

Expose a seam so that we can pass in the autoscaling coordinator actor to the client. This allows us to test against an actual actor implementation (and avoid a heavy mock) without using the shared state of the default named actor.

Test only against the public methods (e.g., call request_resources and wait for that to be consistent rather than mocking _cached_allocated_resources)

Minimize mocking to just ray.wait/ray.get. This still makes some assumptions about implementation, but I think that might be unavoidable

@pytest.fixture
def autoscaling_coordinator_actor(ray_start_regular_shared):
    actor_cls = ray.remote(num_cpus=0)(_AutoscalingCoordinatorActor)
    actor = actor_cls.remote(
        send_resources_request=lambda b: None,
        get_cluster_nodes=lambda: [
            {"Alive": True, "Resources": {"CPU": 4}, "NodeID": "n1"}
        ],
    )
    yield actor
    ray.kill(actor)


def test_get_allocated_resources_eventually_consistent(autoscaling_coordinator_actor):
    coordinator = DefaultAutoscalingCoordinator(
        requester_id="test",
        autoscaling_coordinator_actor=autoscaling_coordinator_actor,
    )
    coordinator.request_resources(resources=[{"CPU": 1}], expire_after_s=60)
    wait_for_condition(lambda: coordinator.get_allocated_resources("test") == [{"CPU": 1}], timeout=5)


def test_get_allocated_resources_returns_cached_while_pending(autoscaling_coordinator_actor, monkeypatch):
    coordinator = DefaultAutoscalingCoordinator(
        requester_id="test",
        autoscaling_coordinator_actor=autoscaling_coordinator_actor,
    )
    coordinator.request_resources(resources=[{"CPU": 1}], expire_after_s=30)
    wait_for_condition(
        lambda: coordinator.get_allocated_resources("test") == [{"CPU": 1}], timeout=5
    )

    # Make ray.wait report all refs as pending.
    def fake_wait(refs, *args, **kwargs):
        return [], refs

    monkeypatch.setattr(ray, "wait", fake_wait)

    coordinator.request_resources(resources=[{"CPU": 2}], expire_after_s=30)

    # Should return the stale cached value, not block or error.
    result = coordinator.get_allocated_resources("test")
    assert result == [{"CPU": 1}]

etc. for other interface (not implementation) behaviors

bveeramani · 2026-04-24T00:32:58Z

+        if autoscaling_coordinator_actor is not None:
+            # Bypass the cached_property by injecting the actor directly.
+            # Used in tests to avoid the shared named actor.
+            self.__dict__["_autoscaling_coordinator"] = autoscaling_coordinator_actor


Think we should avoid Python magic unless absolutely necessary.

This could be written alternatively like this:

def __init__(...):
    self._autoscaling_coordinator = autoscaling_coordinator

def _get_or_create_autoscaling_coordinator(...):
    if self._autoscaling_coordinator is None:
        self._autoscaling_coordinator = # Create named actor
    return self._autoscaling_coordinator

def request_resources(...):
    autoscaling_coordinator = self._get_or_create_autoscaling_coordinator(...)
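A runnable version of the reviewer's get-or-create sketch might look like the following. The names `CoordinatorClient`, `create_backend`, and `backend` are hypothetical, and a plain factory callable stands in for creating the named actor; the point is that an injected dependency is used as-is, while the default is created lazily on first use, with no `__dict__` manipulation.

```python
from typing import Callable, Optional


class CoordinatorClient:
    """Hypothetical client with an optional injected backend (e.g. for tests)."""

    def __init__(
        self,
        create_backend: Callable[[], object],
        backend: Optional[object] = None,
    ) -> None:
        self._create_backend = create_backend
        self._backend = backend

    def _get_or_create_backend(self) -> object:
        # Create the default backend only on first use; an injected one wins.
        if self._backend is None:
            self._backend = self._create_backend()
        return self._backend
```

Tests can pass their own backend through the constructor, and production code pays the creation cost only when the first request is made.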

## Description

Problem: `get_allocated_resources` was called every ~1s from the scheduling loop but used a blocking `ray.get()`, so any actor queue delay or result transfer latency directly stalled dataset execution.

This PR makes `get_allocated_resources` non-blocking: it fires the remote call in the background and immediately returns the last cached value, updating the cache when the response arrives on the next loop step. The first call for a new requester returns [] while the initial response is in-flight, resolving ~1s later.

## Additional information

Currently, only `DefaultAutoscalingCoordinator.get_allocated_resources` is made non-blocking. `request_resources` and `cancel_request` remain blocking since they are not on the hot path.

Unit tests cover each behavior independently: in-flight caching, cache update on success, non-Ray error propagation, and failure counter escalation for both actor exceptions and timeouts.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>

… persistently starved (#63969)

## Description

A dataset's resource allocator depends on the `AutoscalingCoordinator` server to get its share of allocated resources. To improve reliability, #62725 made calls to the server non-blocking. One consequence of this change is that the dataset gets zero resources at the very start of execution while it waits for the first response from the autoscaling coordinator. As a result, we'd consistently emit spurious warnings like this at the start of execution:

```
Cluster resources are not enough to run any task from TaskPoolMapOperator[ReadRange]. The job may hang forever unless the cluster scales up.
```

To avoid this confusion, I've made it so that we only emit the warning after the first eligible operator has been starved for a minute.

## Related issues

## Additional information

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
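The "warn only after a minute of starvation" rule from the follow-up change can be sketched as a small state tracker. This is an illustration under assumptions, not the code from #63969: `StarvationWarning`, `grace_period_s`, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional


class StarvationWarning:
    """Hypothetical tracker: report starvation only once it has persisted."""

    def __init__(
        self,
        grace_period_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._grace_period_s = grace_period_s
        self._clock = clock
        self._starved_since: Optional[float] = None

    def should_warn(self, is_starved: bool) -> bool:
        if not is_starved:
            # Any progress resets the timer.
            self._starved_since = None
            return False
        now = self._clock()
        if self._starved_since is None:
            self._starved_since = now
        # Warn only when starvation has lasted the full grace period.
        return now - self._starved_since >= self._grace_period_s
```

Under this scheme the transient zero-resource window at startup, which lasts only until the first coordinator response is cached, stays well inside the grace period and produces no warning.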

rayhhome added 2 commits April 17, 2026 11:57

Make get_allocated_resources async + new test cases

0626871

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Remove unrealistic error handling

028bbef

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

rayhhome self-assigned this Apr 17, 2026

rayhhome added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Apr 17, 2026

rayhhome requested a review from a team as a code owner April 17, 2026 19:21

Copilot AI review requested due to automatic review settings April 17, 2026 19:21

Merge branch 'master' into ray-get-time-out

ff58fb4

Copilot started reviewing on behalf of rayhhome April 17, 2026 19:21

Copilot AI reviewed Apr 17, 2026

gemini-code-assist Bot reviewed Apr 17, 2026

rayhhome added 2 commits April 17, 2026 17:25

Make cancel_request and request_resources async

c7b682f

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Handle potential failure in ray_cancel

7e0cb1e

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

cursor Bot reviewed Apr 18, 2026

Comment thread python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py Outdated

rayhhome added 9 commits April 17, 2026 17:56

Merge branch 'master' into ray-get-time-out

5abf0db

Clear all requester states in cancel_request + comment improvements

4f721bb

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into ray-get-time-out

151e5ce

Merge branch 'master' into ray-get-time-out

2df2546

Refactor request methods to fire-and-forget and use polling for get_a…

d27b9d5

…llocated_resources Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'ray-get-time-out' of github.com:rayhhome/ray into ray-g…

867d2e5

…et-time-out

Further Refactor request methods to fire-and-forget and use polling f…

95274d1

…or get_allocated_resources Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Condense get_allocated_resource

8f23b56

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into ray-get-time-out

42b2967

rayhhome changed the title from ~~[Data] Non-blocking get_allocated_resources~~ to [Data] Non-blocking Default Autoscaling Coordinator Apr 21, 2026

rayhhome added 6 commits April 21, 2026 12:15

Remove environment variable and timeout for get_allocated_resources

3f37955

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into ray-get-time-out

7fc8001

Get rid of extraneous requester_id in autoscaler

88a964d

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into requester-id-in-dac

0454a47

Update _cached_allocated_resources to list + simplify FakeAutoscaling…

c5032e9

…Coordinator Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'requester-id-in-dac' into ray-get-time-out

d9ff6e7

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

rayhhome added 2 commits April 21, 2026 15:45

Merge branch 'master' into ray-get-time-out

5b46895

Signed-off-by: HFFuture <ray.huang@anyscale.com>

Merge branch 'master' into ray-get-time-out

9021bb9

bveeramani reviewed Apr 22, 2026

rayhhome added 3 commits April 22, 2026 17:15

Address comments for refactoring tests and better docstrings

994ec78

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into ray-get-time-out

fd00fd0

Merge branch 'ray-get-time-out' of github.com:rayhhome/ray into ray-g…

2ec26f1

…et-time-out

bveeramani approved these changes Apr 24, 2026

bveeramani merged commit 1112d90 into ray-project:master Apr 24, 2026
5 of 6 checks passed

rayhhome deleted the ray-get-time-out branch April 27, 2026 20:46

bveeramani mentioned this pull request Jun 9, 2026

[Data] Delay 'cluster resources not enough' warning until operator is persistently starved #63969

Merged

bveeramani mentioned this pull request Jun 11, 2026

[Data] AutoscalingCoordinator._reallocate_resources is O(R·N²) and holds its lock, starving concurrent datasets at scale (op_budget→0) #63924

Open

bveeramani merged 25 commits into ray-project:master from rayhhome:ray-get-time-out
