[serve][1/3] Add replica-side slot reservation primitive #63252

Merged
kouroshHakha merged 6 commits into master from decouple-routing-primitives-1
May 10, 2026

@jeffreywang88 jeffreywang88 commented May 9, 2026

Contributor

Description

The choose_replica / dispatch flow needs the router to hold a slot on the replica before dispatching, so that when dispatch finally sends the request it can use with_rejection=False (slot already reserved on the actor). This PR adds the underlying primitive: an actor-side reserve_slot / release_slot pair backed by the existing semaphore, plus the wire format and ReplicaSelection wrapper. There are no callers yet — the router integration is the next PR.

Lifecycle

This PR adds the primitives a future caller (implemented in #63254 and #63255) will use as a four-step lifecycle:

  1. Reserve: Grab a permit on the replica's max_ongoing_requests semaphore before the request is built; get back a token and ground-truth (accepted, num_ongoing_requests).
  2. Hand off: Wrap the token in a ReplicaSelection (address + node + AZ) the caller can pass around.
  3. Consume on dispatch: The token rides on RequestMetadata; the replica skips re-acquiring the semaphore on the way in.
  4. Release on abort: Early return / exception / cancellation returns the token and frees the permit.
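
The reserve, consume, and release steps above can be sketched with a toy replica. This is a minimal illustration under stated assumptions, not the code in this PR: `ToyReplica`, its `handle_request` signature, and the use of `uuid` for tokens are all hypothetical, and the hand-off step (wrapping the token in a `ReplicaSelection`) is left out.

```python
import asyncio
import uuid
from typing import Optional, Set, Tuple


class ToyReplica:
    """Hypothetical stand-in for a replica; not the Ray Serve class."""

    def __init__(self, max_ongoing_requests: int):
        self._semaphore = asyncio.Semaphore(max_ongoing_requests)
        self._reserved_slots: Set[str] = set()
        self._num_running = 0

    def get_num_ongoing_requests(self) -> int:
        # Running requests plus reserved-but-undispatched slots.
        return self._num_running + len(self._reserved_slots)

    async def reserve_slot(self) -> Tuple[Optional[str], bool, int]:
        # Step 1: take a permit up front, or reject if at capacity.
        if self._semaphore.locked():
            return None, False, self.get_num_ongoing_requests()
        await self._semaphore.acquire()
        token = uuid.uuid4().hex  # Assumed token format, for illustration only.
        self._reserved_slots.add(token)
        return token, True, self.get_num_ongoing_requests()

    async def handle_request(self, payload: str, reserved_slot_token: str) -> str:
        # Step 3: consume the reservation; the permit is already held.
        if reserved_slot_token not in self._reserved_slots:
            raise RuntimeError(f"Unknown reserved slot {reserved_slot_token}.")
        self._reserved_slots.remove(reserved_slot_token)
        self._num_running += 1
        try:
            return payload.upper()
        finally:
            self._num_running -= 1
            self._semaphore.release()

    def release_slot(self, token: str) -> None:
        # Step 4: abort path; give the permit back at most once per token.
        if token in self._reserved_slots:
            self._reserved_slots.remove(token)
            self._semaphore.release()


async def demo() -> Tuple[str, int, bool, int]:
    replica = ToyReplica(max_ongoing_requests=1)
    token, _, _ = await replica.reserve_slot()             # reserve
    _, second_accepted, _ = await replica.reserve_slot()   # rejected: at capacity
    result = await replica.handle_request("ping", token)   # consume on dispatch
    token2, _, ongoing = await replica.reserve_slot()      # reserve again
    replica.release_slot(token2)                           # release on abort
    return result, ongoing, second_accepted, replica.get_num_ongoing_requests()
```

Running `asyncio.run(demo())` exercises both exits from a reservation: one token is consumed by a dispatch, the other is returned without ever being dispatched.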

Primitives

  • Replica.reserve_slot / release_slot (exposed through ReplicaActor): the actor-side primitive, leveraging the existing _start_request semaphore so reservations count against the same capacity bound and show up in get_num_ongoing_requests().
    • Cross-language (Java) replicas raise RuntimeError("Slot reservation not supported for Java.") since there's no actor-side semaphore on the Java replica.
  • RunningReplica.reserve_slot / release_slot: async wrappers over the actor RPC that return ReplicaQueueLengthInfo for cache updates.
  • ReplicaSelection: wraps the token plus replica metadata; enforces single dispatch via _mark_dispatched.
  • ReplicaUnavailableError: raised when a selection is invalidated before dispatch.
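
A rough sketch of the single-dispatch guard described for ReplicaSelection follows. The field names, the `ToyReplicaSelection` class, and the `SelectionAlreadyDispatchedError` type are assumptions made for illustration; only the `_mark_dispatched` name comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


class SelectionAlreadyDispatchedError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


@dataclass
class ToyReplicaSelection:
    """Hypothetical wrapper: a reservation token plus replica metadata."""

    replica_id: str
    node_ip: str
    availability_zone: Optional[str]
    reserved_slot_token: str
    _dispatched: bool = field(default=False, repr=False)

    def _mark_dispatched(self) -> None:
        # A selection owns exactly one permit, so it may be dispatched once.
        if self._dispatched:
            raise SelectionAlreadyDispatchedError(
                f"Selection for {self.replica_id} was already dispatched."
            )
        self._dispatched = True
```

Guarding on the wrapper rather than on the replica lets a double dispatch fail on the caller side, before a second RPC is sent with a token that has already been consumed.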

FAQ

1. What's still missing in this PR?

| Gap | Fixed in |
| --- | --- |
| Nothing calls reserve_slot yet; the primitive exists but is unwired. | #63254 |
| RunningReplica.reserve_slot doesn't yet retry on ActorDiedError / ActorUnavailableError. | #63254 (caller in AsyncioRouter.choose_replica) |
| Router abstract base, SingletonThreadRouter, CurrentLoopRouter, and DeploymentHandle have no choose_replica / dispatch methods yet. | #63255 |

2. Why do we need both token and replica-side semaphore?

Reservation creates a gap: reserve_slot and dispatch are two separate router → actor RPCs, and the design needs to handle both capacity and identity across that gap.

The semaphore is the capacity gate. It's anonymous, shared with every other entry path (handle_request, handle_request_with_rejection), so all paths agree on what "at capacity" means.

The token is the reservation's identity. The semaphore alone can't tell you:

  • Whether an incoming dispatch should acquire a fresh slot or consume one already held — _start_request branches on the token's presence.
  • Whether a request that asked to skip semaphore acquisition is legitimately consuming a reservation, vs. forging past the capacity gate.
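
The split between capacity (semaphore) and identity (token) can be sketched as below. This is a simplified model, not the `_start_request` implementation from this PR: `ToyCapacityGate` and its method names are hypothetical, and callers supply their own token strings here purely to keep the example short.

```python
import asyncio
from typing import Optional, Set, Tuple


class ToyCapacityGate:
    """Hypothetical model of the two entry paths; not Ray Serve code."""

    def __init__(self, max_ongoing_requests: int):
        self._semaphore = asyncio.Semaphore(max_ongoing_requests)
        self._reserved_slots: Set[str] = set()

    async def reserve_slot(self, token: str) -> None:
        # The permit is taken here, ahead of dispatch.
        await self._semaphore.acquire()
        self._reserved_slots.add(token)

    async def start_request(self, reserved_slot_token: Optional[str] = None) -> None:
        if reserved_slot_token is None:
            # Classic path: no reservation, so take a fresh permit.
            await self._semaphore.acquire()
        elif reserved_slot_token in self._reserved_slots:
            # Reserved path: the permit is already held; just consume the token.
            self._reserved_slots.remove(reserved_slot_token)
        else:
            # Unknown token: refuse to skip the capacity gate.
            raise RuntimeError(
                "Request tried to consume an unknown reserved slot "
                f"{reserved_slot_token}."
            )

    def finish_request(self) -> None:
        # Exactly one release per permit, whichever path acquired it.
        self._semaphore.release()

    def at_capacity(self) -> bool:
        return self._semaphore.locked()


async def demo_gate() -> Tuple[bool, bool, bool]:
    gate = ToyCapacityGate(max_ongoing_requests=1)
    await gate.reserve_slot("tok-1")
    try:
        await gate.start_request("forged")
        forged_rejected = False
    except RuntimeError:
        forged_rejected = True
    await gate.start_request("tok-1")  # Consumes the reservation.
    still_full = gate.at_capacity()
    gate.finish_request()
    return forged_rejected, still_full, gate.at_capacity()
```

In this model the semaphore never learns who holds a permit; the token set is what lets the gate tell a legitimate "skip acquisition" request from one that never reserved anything.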

Related issues

RFC: #59792
Original PR: #60865
Next PR: #63254


@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label May 9, 2026
@jeffreywang88 jeffreywang88 force-pushed the decouple-routing-primitives-1 branch from 738e50b to 18d7ec5 Compare May 9, 2026 03:58

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a slot reservation mechanism for Ray Serve replicas to manage capacity during the choose-and-dispatch process. Key changes include adding reserve_slot and release_slot methods to the replica implementation, updating RequestMetadata to track reservation tokens, and introducing a ReplicaSelection class to manage the reservation lifecycle. Additionally, a new ReplicaUnavailableError is added. Feedback was provided regarding a potential blocking call in the reserve_slot method that could occur if a race condition happens between the capacity check and semaphore acquisition.

Comment on lines +1200 to +1203
if not self._can_accept_request(request_metadata):
    return False, self.get_num_ongoing_requests()

await self._semaphore.acquire()

Contributor

medium

The reserve_slot method checks _can_accept_request before awaiting the semaphore. While this provides an early exit, await self._semaphore.acquire() can still block if a race condition occurs between the check and the acquisition. If the intention is for the router to never block on a specific replica during selection, consider using a non-blocking acquisition or a timeout.
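
For reference, the two alternatives the comment mentions could look roughly like this with a plain `asyncio.Semaphore`. The helper names `try_acquire` and `acquire_with_timeout` are hypothetical and are not part of this PR. The first helper relies on an assumption about CPython's asyncio implementation: `acquire()` returns without suspending when a permit is free, so no other task on the same event loop can run between the `locked()` check and the acquisition.

```python
import asyncio
from typing import Tuple


async def try_acquire(semaphore: asyncio.Semaphore) -> bool:
    """Take a permit only if one is free right now; never wait for one."""
    if semaphore.locked():
        return False
    # Assumed not to suspend here, because a permit was just seen to be free.
    await semaphore.acquire()
    return True


async def acquire_with_timeout(semaphore: asyncio.Semaphore, timeout_s: float) -> bool:
    """Wait for a permit, but give up after timeout_s seconds."""
    try:
        await asyncio.wait_for(semaphore.acquire(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def demo_acquire() -> Tuple[bool, bool, bool, bool]:
    semaphore = asyncio.Semaphore(1)
    first = await try_acquire(semaphore)                     # permit free
    second = await try_acquire(semaphore)                    # at capacity
    timed_out = await acquire_with_timeout(semaphore, 0.01)  # still at capacity
    semaphore.release()
    after_release = await acquire_with_timeout(semaphore, 0.01)
    return first, second, timed_out, after_release
```

The non-blocking variant fits a router that would rather try another replica than queue on this one; the timeout variant bounds the wait instead of eliminating it.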

Co-Authored-By: machichima <nary12321@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…thInfo wrapper)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 marked this pull request as ready for review May 9, 2026 05:58
@jeffreywang88 jeffreywang88 requested a review from a team as a code owner May 9, 2026 05:58

# Internal fields (not part of public API)
_replica: RunningReplica
_deployment_id: Optional[DeploymentID]

Contributor Author

will be used in #63255

Comment on lines +403 to +404
_request_metadata: RequestMetadata
_method_name: str

Contributor Author

will be used in #63254

Comment thread python/ray/serve/_private/replica.py
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 9, 2026
@jeffreywang88 jeffreywang88 changed the title [Serve] Add replica-side slot reservation primitive May 9, 2026
Comment on lines +1192 to +1194
return self._metrics_manager.get_num_ongoing_requests() + len(
    self._reserved_slots
)

Contributor Author

Note: This is the sum of requests that have actually entered _start_request and capacity already held by reserve_slot() but not yet dispatched.
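
A small sketch of why the reserved count matters for draining, assuming a shutdown check that waits for the ongoing-request count to reach zero. The function names and the drain check itself are hypothetical and only illustrate the accounting.

```python
from typing import Set


def num_ongoing_requests(num_running: int, reserved_slots: Set[str]) -> int:
    # Running requests plus reservations that have not been dispatched yet.
    return num_running + len(reserved_slots)


def safe_to_exit(num_running: int, reserved_slots: Set[str]) -> bool:
    # Hypothetical drain check: exit only when nothing is running or reserved.
    return num_ongoing_requests(num_running, reserved_slots) == 0
```

With an outstanding reservation and no running requests, `safe_to_exit(0, {"tok"})` is false, so a draining replica in this model would wait for the token to be consumed or released.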

"Request tried to consume an unknown reserved slot "
f"{reserved_slot_token}."
)
self._reserved_slots.remove(reserved_slot_token)

@jeffreywang88 jeffreywang88 May 9, 2026

Contributor Author

Note: semaphore has already been acquired at reserve_slot()

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>


@PublicAPI(stability="alpha")
class ReplicaUnavailableError(RayServeException):

Contributor Author

will be used in #63254

@jeffreywang88 jeffreywang88 requested a review from kouroshHakha May 9, 2026 21:58

@kouroshHakha kouroshHakha left a comment

Contributor

Clean, well-scoped foundation. The semaphore accounting in _start_request is correct — the refactor maintains the invariant that exactly one acquire maps to exactly one release across both the reserved-slot and classic paths. The drain fix (counting _reserved_slots in get_num_ongoing_requests) is the right call; without it a replica could exit between reserve_slot and dispatch. Tests are solid and the test structure (isolated FakeServeReplicaForSlotReservation, no Ray runtime) is exactly right for this layer.

Three items below, two of which I'd want addressed before merge.

Note

This review was co-written with AI assistance (Claude Code).

Comment thread python/ray/serve/_private/replica.py
Comment thread python/ray/serve/_private/request_router/replica_wrapper.py
Comment thread python/ray/serve/exceptions.py
Comment thread python/ray/serve/tests/unit/test_router.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e036f17.

"""Returns the replica address in host:port format."""
if self.port:
    return f"{self.node_ip}:{self.port}"
return self.node_ip


Falsy port check excludes valid port zero

Low Severity

The address property uses if self.port: which is falsy for both None and 0. Since port is typed as Optional[int], the intended check is likely if self.port is not None: to distinguish "no port configured" (None) from a valid port number. While port 0 is uncommon in production, this is a new public-facing API where the semantics of the truthiness check may surprise future callers.
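
The suggested check could look like the following. `ToyReplicaAddress` is a hypothetical stand-in used only to show the `is not None` comparison; it is not the class touched by this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyReplicaAddress:
    """Hypothetical holder for the two fields the property reads."""

    node_ip: str
    port: Optional[int] = None

    @property
    def address(self) -> str:
        """Returns the replica address in host:port format."""
        # Compare against None so that port 0 is not treated as "no port".
        if self.port is not None:
            return f"{self.node_ip}:{self.port}"
        return self.node_ip
```

With this version `ToyReplicaAddress("10.0.0.1", 0).address` includes the `:0` suffix, while a `None` port yields the bare IP.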


@kouroshHakha

Contributor

LGTM

@kouroshHakha kouroshHakha enabled auto-merge (squash) May 10, 2026 06:02
@kouroshHakha kouroshHakha merged commit 4cc69e0 into master May 10, 2026
7 checks passed
@kouroshHakha kouroshHakha deleted the decouple-routing-primitives-1 branch May 10, 2026 06:30
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…#63252)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: machichima <nary12321@gmail.com>

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants