## Description
- #62323
- #62330
- #62366
---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
1. The Serve controller creates the `CapacityQueue` deployment actor **before**
   any replicas start. `CapacityQueue` subscribes to replica updates via long poll.
2. As the controller starts replicas, it sends deployment-target updates. The
   queue's long-poll callback automatically registers each replica with its
   `max_ongoing_requests` capacity and unregisters replicas that are removed
   during scale-down or crash recovery.
3. The `CapacityQueueRouter` running in each proxy discovers the singleton `CapacityQueue`
   deployment actor, acquires a token for every incoming request, and routes to the replica
   identified by the token.
4. When the request completes, `CapacityQueueRouter.on_request_completed` fires and the token is
   released back to the queue.

Because the queue is a deployment actor, the controller handles its lifecycle
automatically: health checks, cleanup on app deletion, and versioning during
rolling updates.
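
The register, acquire, and release cycle described in the steps above can be illustrated with a toy, single-process model. This is a minimal sketch, not the actual `CapacityQueue` implementation: the class name `ToyCapacityQueue`, its method names, the use of the replica ID as the token, and the least-loaded selection policy are all assumptions made for illustration.

```python
from typing import Dict, Optional


class ToyCapacityQueue:
    """Toy model of a token-issuing capacity queue (hypothetical, not Ray Serve code)."""

    def __init__(self) -> None:
        self._capacity: Dict[str, int] = {}   # replica ID -> max_ongoing_requests
        self._in_flight: Dict[str, int] = {}  # replica ID -> outstanding tokens

    def update_replicas(self, targets: Dict[str, int]) -> None:
        """Stand-in for the long-poll callback: sync the set of known replicas."""
        # Unregister replicas that disappeared (scale-down or crash).
        for replica_id in list(self._capacity):
            if replica_id not in targets:
                del self._capacity[replica_id]
                del self._in_flight[replica_id]
        # Register new replicas and refresh capacities of existing ones.
        for replica_id, max_ongoing in targets.items():
            self._capacity[replica_id] = max_ongoing
            self._in_flight.setdefault(replica_id, 0)

    def acquire(self) -> Optional[str]:
        """Return a token (here simply a replica ID), or None if all replicas are full."""
        candidates = [
            r for r, cap in self._capacity.items() if self._in_flight[r] < cap
        ]
        if not candidates:
            return None
        # Assumed policy: prefer the replica with the fewest outstanding tokens.
        replica_id = min(candidates, key=lambda r: self._in_flight[r])
        self._in_flight[replica_id] += 1
        return replica_id

    def release(self, replica_id: str) -> None:
        """Return a token; ignore tokens for replicas that were unregistered."""
        if self._in_flight.get(replica_id, 0) > 0:
            self._in_flight[replica_id] -= 1
```

In this model a router would call `acquire()` per request, route to the returned replica, and call `release()` on completion. Tokens for a removed replica are silently dropped, which mirrors the statement that tokens are only issued for live replicas.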

### Fault tolerance

The `CapacityQueueRouter` handles failures gracefully:

- **Queue unavailable**: if the queue actor is dead, not yet discovered, or
  returns errors, the router retries with exponential backoff and falls back to
  power-of-two-choices after `MAX_FAULT_RETRIES` consecutive failures.
  Requests never raise exceptions due to queue issues.
- **Capacity exhausted**: when all replicas are at capacity, the router
  backs off and retries until capacity frees up.
- **Queue restart**: a restarted queue has no knowledge of pre-crash
  in-flight counts and may temporarily over-provision. This self-heals:
  replicas reject excess requests, and the router intentionally does not
  release the tokens of rejected requests, ratcheting up `in_flight` on the
  queue until it matches reality. `token_ttl_s` (if configured) auto-reclaims
  any remaining leaked tokens.
- **Replica death**: the controller sends a long-poll update, the queue
  unregisters the dead replica, and tokens are issued only for live replicas.
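
The retry-then-fallback behavior described for the "queue unavailable" and "capacity exhausted" cases can be sketched as follows. This is an illustrative model only, not the router's real code. The value `MAX_FAULT_RETRIES = 3`, the backoff parameters, the `QueueUnavailable` exception, and the function names are hypothetical. The fallback assumes at least one replica is known.

```python
import random
import time
from typing import Callable, Dict, Optional

MAX_FAULT_RETRIES = 3  # assumed value; the source does not state the real one


class QueueUnavailable(Exception):
    """Hypothetical error: the queue actor is dead, undiscovered, or failing."""


def power_of_two_choices(queue_lengths: Dict[str, int], rng: random.Random) -> str:
    """Sample two replicas at random and return the one with the shorter queue."""
    replicas = list(queue_lengths)
    if len(replicas) == 1:
        return replicas[0]
    a, b = rng.sample(replicas, 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


def choose_replica(
    acquire_token: Callable[[], Optional[str]],
    queue_lengths: Dict[str, int],
    rng: random.Random,
    base_backoff_s: float = 0.01,
    max_backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Acquire a token with exponential backoff, falling back after repeated faults."""
    consecutive_faults = 0
    backoff_s = base_backoff_s
    while True:
        try:
            token = acquire_token()
        except QueueUnavailable:
            consecutive_faults += 1
            if consecutive_faults >= MAX_FAULT_RETRIES:
                # Give up on the queue for this request instead of raising.
                return power_of_two_choices(queue_lengths, rng)
        else:
            consecutive_faults = 0
            if token is not None:
                return token
            # token is None: every replica is at capacity, so keep retrying.
        sleep(backoff_s)
        backoff_s = min(backoff_s * 2, max_backoff_s)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A capacity miss (`None`) resets the fault counter here because only consecutive queue failures trigger the fallback.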

### Usage

The centralized capacity queue request router can improve performance, particularly in supply-constrained deployments, i.e. those with `max_ongoing_requests` set to `1` or `2`.

### Benchmark

#### Benchmark Setup

- Deployment topology: Client -> `ParentDeployment` -> `ChildDeployment`. The selected request router is applied to both deployments,
  controlling how the HTTP proxy selects parent replicas and how the parent's `DeploymentHandle` selects child replicas.
- Scale: small (8 replicas), medium (32 replicas), large (128 replicas), xlarge (512 replicas).
- Workload: Replica processing latency is drawn from an exponential distribution with mean 1s and capped at 10s.
- `max_ongoing_requests` is set to `2`.
- Load generation: Closed-loop load generation that keeps replicas consistently saturated at `max_ongoing_requests` concurrency.
- Warmup: 10s; metrics within the warmup window are discarded entirely.
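
The workload's latency distribution can be reproduced with the standard library. This is a minimal sketch of one plausible way to sample it, not the benchmark's actual code. The function and constant names are hypothetical.

```python
import random

MEAN_LATENCY_S = 1.0   # mean of the exponential distribution, from the setup above
MAX_LATENCY_S = 10.0   # cap applied to each sample, from the setup above


def sample_processing_latency_s(rng: random.Random) -> float:
    """Draw one replica processing latency: exponential with mean 1s, capped at 10s."""
    # random.expovariate takes the rate (lambda), which is 1 / mean.
    return min(rng.expovariate(1.0 / MEAN_LATENCY_S), MAX_LATENCY_S)
```

Because the cap sits far in the tail of the distribution, it bounds the worst-case request time while barely changing the mean.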

#### Benchmark Metrics

- Throughput: Requests per second, i.e. `num_requests / duration`.
- Utilization: The fraction of a replica's total processing capacity consumed by actual work during the experiment.
  Concretely, `sum(replica_processing_latency_s) / (duration_s * max_ongoing_requests)`. For GPU deployments, utilization serves as
  a proxy for GPU utilization.
253
+
- Latency: Measures the client-side end-to-end latency, covering the full round-trip --
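
The throughput and utilization definitions above translate directly into code. The sketch below applies the formulas as written, for a single replica. The function names are hypothetical and this is not the benchmark's own implementation.

```python
from typing import Sequence


def throughput_rps(num_requests: int, duration_s: float) -> float:
    """Requests per second over the measurement window."""
    return num_requests / duration_s


def utilization(
    replica_processing_latencies_s: Sequence[float],
    duration_s: float,
    max_ongoing_requests: int,
) -> float:
    """Fraction of one replica's capacity spent on actual work.

    Capacity is duration_s * max_ongoing_requests, because the replica can
    process up to max_ongoing_requests requests concurrently.
    """
    return sum(replica_processing_latencies_s) / (duration_s * max_ongoing_requests)
```

For example, a replica with `max_ongoing_requests=2` that accumulates 15s of processing time over a 10s window has a utilization of `15 / (10 * 2) = 0.75`.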

If you see the following error when the `CapacityQueue` actor faults and routing decisions fall back to the power-of-two-choices router,
set `RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S` to a higher value.

> Failed to get queue length from Replica(id='...', deployment='ParentDeployment', app='...') within 0.1s.