[serve][1/3] Add replica-side slot reservation primitive#63252
Conversation
738e50b to
18d7ec5
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a slot reservation mechanism for Ray Serve replicas to manage capacity during the choose-and-dispatch process. Key changes include adding reserve_slot and release_slot methods to the replica implementation, updating RequestMetadata to track reservation tokens, and introducing a ReplicaSelection class to manage the reservation lifecycle. Additionally, a new ReplicaUnavailableError is added. Feedback was provided regarding a potential blocking call in the reserve_slot method that could occur if a race condition happens between the capacity check and semaphore acquisition.
| if not self._can_accept_request(request_metadata): | ||
| return False, self.get_num_ongoing_requests() | ||
|
|
||
| await self._semaphore.acquire() |
There was a problem hiding this comment.
The reserve_slot method checks _can_accept_request before awaiting the semaphore. While this provides an early exit, await self._semaphore.acquire() can still block if a race condition occurs between the check and the acquisition. If the intention is for the router to never block on a specific replica during selection, consider using a non-blocking acquisition or a timeout.
Co-Authored-By: machichima <nary12321@gmail.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
18d7ec5 to
b22b7d9
Compare
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…thInfo wrapper) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
|
|
||
| # Internal fields (not part of public API) | ||
| _replica: RunningReplica | ||
| _deployment_id: Optional[DeploymentID] |
| _request_metadata: RequestMetadata | ||
| _method_name: str |
| return self._metrics_manager.get_num_ongoing_requests() + len( | ||
| self._reserved_slots | ||
| ) |
There was a problem hiding this comment.
Note: Requests that have actually entered _start_request + capacity already held by reserve_slot() but not yet dispatched.
| "Request tried to consume an unknown reserved slot " | ||
| f"{reserved_slot_token}." | ||
| ) | ||
| self._reserved_slots.remove(reserved_slot_token) |
There was a problem hiding this comment.
Note: semaphore has already been acquired at reserve_slot()
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
|
|
||
|
|
||
| @PublicAPI(stability="alpha") | ||
| class ReplicaUnavailableError(RayServeException): |
kouroshHakha
left a comment
There was a problem hiding this comment.
Clean, well-scoped foundation. The semaphore accounting in _start_request is correct — the refactor maintains the invariant that exactly one acquire maps to exactly one release across both the reserved-slot and classic paths. The drain fix (counting _reserved_slots in get_num_ongoing_requests) is the right call; without it a replica could exit between reserve_slot and dispatch. Tests are solid and the test structure (isolated FakeServeReplicaForSlotReservation, no Ray runtime) is exactly right for this layer.
Three items below, two of which I'd want addressed before merge.
Note
This review was co-written with AI assistance (Claude Code).
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit e036f17. Configure here.
| """Returns the replica address in host:port format.""" | ||
| if self.port: | ||
| return f"{self.node_ip}:{self.port}" | ||
| return self.node_ip |
There was a problem hiding this comment.
Falsy port check excludes valid port zero
Low Severity
The address property uses if self.port: which is falsy for both None and 0. Since port is typed as Optional[int], the intended check is likely if self.port is not None: to distinguish "no port configured" (None) from a valid port number. While port 0 is uncommon in production, this is a new public-facing API where the semantics of the truthiness check may surprise future callers.
Reviewed by Cursor Bugbot for commit e036f17. Configure here.
|
LGTM |
…#63252) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: machichima <nary12321@gmail.com>


Description
The
choose_replica/dispatchflow needs the router to hold a slot on the replica before dispatching, so that when dispatch finally sends the request it can usewith_rejection=False(slot already reserved on the actor). This PR adds the underlying primitive: an actor-sidereserve_slot/release_slotpair backed by the existing semaphore, plus the wire format andReplicaSelectionwrapper. There are no callers yet — the router integration is the next PR.Lifecycle
This PR adds the primitives a future caller (implemented in #63254 and #63255) will use as a four-step lifecycle:
max_ongoing_requestssemaphore before the request is built; get back a token and ground-truth(accepted, num_ongoing_requests).ReplicaSelection(address + node + AZ) the caller can pass around.RequestMetadata; the replica skips re-acquiring the semaphore on the way in.Primitives
Replica.reserve_slot/release_slot(exposed throughReplicaActor): the actor-side primitive, leveraging the existing_start_requestsemaphore so reservations count against the same capacity bound and show up inget_num_ongoing_requests().RuntimeError("Slot reservation not supported for Java.")since there's no actor-side semaphore on the Java replica.RunningReplica.reserve_slot/release_slot: async wrappers over the actor RPC that returnReplicaQueueLengthInfofor cache updates.ReplicaSelection: wraps the token plus replica metadata; enforces single dispatch via_mark_dispatched.ReplicaUnavailableError: raised when a selection is invalidated before dispatch.FAQ
1. What's still missing in this PR?
reserve_slotyet — the primitive exists but is unwired.RunningReplica.reserve_slotdoesn't yet retry onActorDiedError/ActorUnavailableError.AsyncioRouter.choose_replicaRouterabstract base,SingletonThreadRouter,CurrentLoopRouter, andDeploymentHandlehave nochoose_replica/dispatchmethods yet.2. Why do we need both token and replica-side semaphore?
Reservation creates a gap:
reserve_slotanddispatchare two separate router → actor RPCs, and the design needs to handle both capacity and identity across that gap.The semaphore is the capacity gate. It's anonymous, shared with every other entry path (handle_request, handle_request_with_rejection), so all paths agree on what "at capacity" means.
The token is the reservation's identity. The semaphore alone can't tell you:
Related issues
RFC: #59792
Original PR: #60865
Next PR: #63254
Additional information