Skip to content

[Serve] Add broadcast API for deployment handles that broadcasts the same RPC across all live replicas of a deployment#61472

Merged
kouroshHakha merged 38 commits into
ray-project:masterfrom
Crystora:feat/serve-broadcast-deployment-handle
Apr 7, 2026
Merged

[Serve] Add broadcast API for deployment handles that broadcasts the same RPC across all live replicas of a deployment#61472
kouroshHakha merged 38 commits into
ray-project:masterfrom
Crystora:feat/serve-broadcast-deployment-handle

Conversation

@Crystora

@Crystora Crystora commented Mar 4, 2026

Copy link
Copy Markdown
Contributor

Description

Adds a broadcast() method to DeploymentHandle that calls a method on every replica of a deployment. This is useful when you need to hit all replicas - e.g. clearing a cache, reloading config, or resetting state - without relying on private internals.

Currently there's no public way to do this. Users have to dig into _router._asyncio_router._request_router to get replica references, which is fragile and breaks across versions.

Related issues

Fixes #55654

Additional information

API:

handle = serve.get_deployment_handle("MyDeployment", "default")
response = handle.broadcast("reset_cache")
results = response.results(timeout_s=10)

What changed:

  • DeploymentHandle.broadcast(method_name, *args, **kwargs) - public method that bypasses load balancing and sends to all replicas directly
  • DeploymentBroadcastResponse - response wrapper with results() (sync) and results_async() (async) to collect results from every replica
  • Router.broadcast() - abstract method implemented across all router subclasses (SingletonThreadRouter, CurrentLoopRouter, LocalRouter)
  • Added integration tests covering basic broadcast, args/kwargs, stateful mutations, async path, and single-replica edge case

Follow-up fixes (co-author: @kouroshHakha):

  • Fixed coroutine single-use bug in DeploymentBroadcastResponse._fetch_replica_results_sync: in local testing mode (loop=None), if the broadcast coroutine raised an exception, calling results() a second time would crash with RuntimeError: cannot reuse already awaited coroutine. The fix caches the outcome in a concurrent.futures.Future so retries replay the cached exception.
  • Added .. warning:: to broadcast() docstring noting it bypasses max_queued_requests backpressure and is intended for infrequent control-plane operations only.
  • Added note to DeploymentBroadcastResponse docstring clarifying results only include replicas live at the time broadcast() was called.
  • Added unit tests for the coroutine-reuse fix in TestDeploymentBroadcastResponseLocalTesting.
@Crystora Crystora requested a review from a team as a code owner March 4, 2026 01:55
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora Crystora force-pushed the feat/serve-broadcast-deployment-handle branch from 0f0d5ff to 887ef8f Compare March 4, 2026 01:56

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a broadcast() method to DeploymentHandle, allowing a method to be called on all replicas of a deployment. This is a valuable feature for operations like cache clearing or state synchronization. However, the current implementation introduces significant security and stability concerns. Specifically, DeploymentBroadcastResponse.results() can block the asyncio event loop and may crash when used with CurrentLoopRouter. More critically, the broadcast() implementation in the router bypasses max_queued_requests and replica-side rejection, posing a risk of Denial of Service by allowing an attacker to overwhelm all replicas. Additionally, the broadcast implementation in the local testing router is blocking, which is inconsistent with its asynchronous interface. The main AsyncioRouter implementation of broadcast also fails to resolve arguments from other handle calls and does not propagate the tracing context, which are significant omissions.

Comment thread python/ray/serve/_private/local_testing_mode.py
Comment thread python/ray/serve/_private/router.py
Comment thread python/ray/serve/handle.py Outdated
Comment thread python/ray/serve/handle.py Outdated
Comment thread python/ray/serve/handle.py Outdated
@Crystora Crystora changed the title feat: Add to natively call a method on all replicas of a deployment Mar 4, 2026
bittoby added 3 commits March 4, 2026 02:22
…, and event loop safety

Signed-off-by: bittoby <bittoby@users.noreply.github.com>
Comment thread python/ray/serve/_private/router.py
Comment thread python/ray/serve/handle.py
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
Comment thread python/ray/serve/handle.py Outdated
Comment thread python/ray/serve/_private/router.py
…rg copies, and request counter

Signed-off-by: bittoby <bittoby@users.noreply.github.com>
Comment thread python/ray/serve/_private/router.py
bittoby added 2 commits March 4, 2026 02:48
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora

Crystora commented Mar 4, 2026

Copy link
Copy Markdown
Contributor Author

@kouroshHakha @bveeramani could you please review this PR? Welcome to any feedback.
thanks

@bveeramani

Copy link
Copy Markdown
Member

Hey @bittoby , thanks for the PR! I don't have context on the Serve codebase so I can't review this.

Will defer to the Serve code owners

@kouroshHakha kouroshHakha requested a review from abrarsheikh March 4, 2026 03:15
@kouroshHakha

Copy link
Copy Markdown
Contributor

deferring to @abrarsheikh for review

bittoby added 2 commits March 4, 2026 03:36
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora Crystora force-pushed the feat/serve-broadcast-deployment-handle branch from 8a92e5c to 0ba79be Compare March 4, 2026 03:37
Comment thread python/ray/serve/_private/router.py
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Mar 4, 2026
@Crystora

Crystora commented Mar 6, 2026

Copy link
Copy Markdown
Contributor Author

@abrarsheikh could you please review this PR? I'd appreciate any feedback

@harshit-anyscale

Copy link
Copy Markdown
Contributor

@bittoby can you please fix the lint issues and test coverage failures?

@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Mar 6, 2026
@Crystora Crystora requested a review from eicherseiji March 19, 2026 05:02
Comment thread python/ray/serve/handle.py
Comment thread python/ray/serve/handle.py
bittoby added 2 commits March 19, 2026 11:19
…ponse

Signed-off-by: bittoby <bittoby@users.noreply.github.com>
Comment thread python/ray/serve/handle.py
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora Crystora force-pushed the feat/serve-broadcast-deployment-handle branch from 1b93735 to 4dc3f7f Compare March 19, 2026 11:48
@Crystora

Copy link
Copy Markdown
Contributor Author

Update! Please review again @eicherseiji
thank you

@eicherseiji eicherseiji left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, but would like @abrarsheikh's feedback as well. @bittoby please fix lint.

bittoby added 2 commits March 19, 2026 18:49
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora Crystora force-pushed the feat/serve-broadcast-deployment-handle branch from 70208f2 to b23256e Compare March 19, 2026 18:50
bittoby added 2 commits March 20, 2026 12:28
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
@Crystora Crystora force-pushed the feat/serve-broadcast-deployment-handle branch from c9b5357 to 6a46e00 Compare March 20, 2026 12:31
@kouroshHakha kouroshHakha changed the title [Serve] Add to natively call a method on all replicas of a deployment Apr 6, 2026
@kouroshHakha

Copy link
Copy Markdown
Contributor

@cursoragent can you update the serve docs to reflect this feature?

@Crystora

Crystora commented Apr 6, 2026

Copy link
Copy Markdown
Contributor Author

Hi, @kouroshHakha what else should I need to update? It took long time since I submitted this PR. would you please merge this PR? thanks

@kouroshHakha kouroshHakha force-pushed the feat/serve-broadcast-deployment-handle branch from 3b80c4c to 63a58aa Compare April 6, 2026 21:03
…docs

- Cache the coroutine outcome in _scheduled_future for the local testing
  path (loop=None) so a second call to results() replays the cached
  exception instead of crashing with "cannot reuse already awaited
  coroutine".
- Add warning admonition to broadcast() docstring about bypassing
  max_queued_requests backpressure.
- Add note to DeploymentBroadcastResponse docstring clarifying that
  results only include replicas live at call time.
- Add WORKER_STARTUP_FAILED to ERROR_TYPE to fix proto enum sync.

Made-with: Cursor
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Made-with: Cursor
@kouroshHakha kouroshHakha force-pushed the feat/serve-broadcast-deployment-handle branch from 63a58aa to a6952c7 Compare April 6, 2026 23:29

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit a6952c7. Configure here.

finally:
tmp_loop.close()
self._scheduled_future = cached_future
self._replica_results = self._scheduled_future.result(timeout=timeout_s)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout ignored during local-testing-mode initial fetch

Low Severity

In _fetch_replica_results_sync, when self._loop is None (local testing mode), tmp_loop.run_until_complete(self._coro) blocks indefinitely without honoring timeout_s. The timeout is only applied to self._scheduled_future.result(timeout=timeout_s), but by that point the future is already resolved. This means results(timeout_s=...) can hang forever in local testing if the broadcast coroutine itself blocks.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a6952c7. Configure here.

@kouroshHakha kouroshHakha enabled auto-merge (squash) April 7, 2026 00:36
@kouroshHakha kouroshHakha merged commit ede2730 into ray-project:master Apr 7, 2026
7 checks passed
@Crystora

Crystora commented Apr 7, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @kouroshHakha 👍

@Crystora Crystora deleted the feat/serve-broadcast-deployment-handle branch April 7, 2026 02:02
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…same RPC across all live replicas of a deployment (ray-project#61472)

Signed-off-by: bittoby <bittoby@users.noreply.github.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: bittoby <bittoby@users.noreply.github.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

5 participants