[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead#61755
Conversation
…equest completion, reduce replica update overhead Signed-off-by: abrar <abrar@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a P99 latency regression by ensuring the replica queue length cache is decremented upon request completion. The implementation is robust, handling edge cases like expired cache entries and preventing negative queue lengths. The accompanying unit tests are comprehensive and validate the fix thoroughly. Additionally, the pull request includes valuable optimizations that reduce replica update overhead and log noise, improving overall system performance and observability during scaling events. The changes are well-structured and clear. Excellent work.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
alvinttang
left a comment
There was a problem hiding this comment.
Code Review
This is a well-analyzed fix with clear root cause identification and load test validation. The increment-without-decrement asymmetry in the queue length cache is a classic bug pattern.
Issues found
1. Thread safety of decrement_queue_len_cache called from add_done_callback
The result.add_done_callback(lambda _: self.request_router.decrement_queue_len_cache(...)) in router.py runs the callback from whatever thread/context completes the future. If the ReplicaResult future is resolved by a different asyncio task or thread (e.g., a gRPC callback thread), decrement_queue_len_cache accesses _replica_queue_len_cache without synchronization. The cache's get() and update() methods don't appear to be thread-safe (they're dict operations). This could cause rare data races under high concurrency.
Consider wrapping the decrement in loop.call_soon_threadsafe() or ensuring the callback always runs on the router's event loop.
2. Decrement on failed/cancelled requests
The add_done_callback fires regardless of whether the request succeeded, failed, or was cancelled. The docstring for decrement_queue_len_cache says "Should NOT be called for rejected requests" but doesn't mention failed/cancelled ones. If a request fails after being sent (e.g., replica crashes mid-request), the decrement still fires. This is probably correct (the request did leave the queue), but worth documenting explicitly.
3. The update_replica_info optimization assumes immutable fields stay immutable
update_replica_info only updates _replica_info and _multiplexed_model_ids. If other fields on RunningReplicaInfo change (e.g., max_ongoing_requests changes during a config update), the wrapper won't pick them up. The current RunningReplica base class caches max_ongoing_requests from replica_info in __init__ but update_replica_info doesn't refresh it. This is probably fine today if max_ongoing_requests truly never changes for a live replica, but it's a subtle contract.
4. Lambda closure captures replica by reference
result.add_done_callback(
lambda _: self.request_router.decrement_queue_len_cache(
replica.replica_id
)
)This is inside a loop (async for replica, result in ...), but since there's a return result immediately after, replica won't be re-bound before the lambda executes. So this is safe in practice. Just noting it since lambda-in-loop is a common Python gotcha.
Positive notes
- The replica wrapper reuse optimization (
_update_running_replicas) is a clean O(n) -> O(new) improvement - The log noise reduction (count-based instead of serializing all IDs) is a nice touch
- Comprehensive test coverage including edge cases (expired cache, negative clamp, multiple decrements)
- The
FakeRunningReplica.update_replica_infooverride in tests shows good attention to the test infrastructure
Solid PR overall. The thread safety concern (point 1) is the main thing I'd want verified before merging.
Signed-off-by: abrar <abrar@anyscale.com>
|
Thanks @alvinttang for the review, i have added the thread safety, i am a little concerned about the performance overhead from this. I will run a full set of benchmarks to confirm before merging. |
Signed-off-by: abrar <abrar@anyscale.com>
| callback_registered = True | ||
|
|
||
| if not with_rejection: | ||
| result.add_done_callback( |
There was a problem hiding this comment.
can we move this logic to _process_finished_request?
There was a problem hiding this comment.
no because _process_finished_request doesn't have access to if the request was rejected or not.
This reverts commit 864d358. Signed-off-by: abrar <abrar@anyscale.com>
b4bbca9 to
ba17a57
Compare
ba17a57 to
3d9bc39
Compare
Conclusion: |
|
link to release test https://buildkite.com/ray-project/release/builds/84297 |
… callbacks Remove unnecessary `call_soon_threadsafe` wrapping from request completion callbacks in `AsyncioRouter._try_assign_replica`. This wrapping was introduced in ray-project#61755 (6a43c01) and causes a measurable throughput regression on single-replica no-op benchmarks due to the per-request overhead of cross-thread event loop wakeups (mutex acquire + pipe write per callback). The `decrement_queue_len_cache` callbacks are retained but called directly instead of being scheduled via `call_soon_threadsafe`. ## Benchmark results (m7a.4xlarge, 16 vCPU) | Metric | Before (HEAD) | After (this fix) | Pre-regression baseline | |------------- |---------------|-------------------|-------------------------| | handle_100 | 1627 RPS | 1768 RPS (+8.7%) | 1775 RPS | | handle_800 | 1436 RPS | 1584 RPS (+10.3%) | 1584 RPS | | http | 475 RPS | 520 RPS (+9.5%) | 538 RPS | | http_100 | 710 RPS | 741 RPS (+4.4%) | 739 RPS | | http_800 | 882 RPS | 967 RPS (+9.6%) | 955 RPS | Signed-off-by: seiji <seiji@anyscale.com>
Remove call_soon_threadsafe wrapping from _process_finished_request by making the happy path thread-safe: - dec_num_running_requests_for_replica: already uses _queries_lock - on_request_completed: no-op in base RequestRouter - cache decrement: GIL-atomic deque.append, drained before routing Error paths (actor died/unavailable) are deferred to the event loop via call_soon_threadsafe since they mutate shared replica state. A single add_done_callback is used per request for both the no-rejection and with-rejection paths. The with-rejection path uses a mutable flag set after acceptance to control cache decrement. Recovers handle_100 throughput to within 1% of pre-ray-project#61755 baseline. Signed-off-by: seiji <seiji@anyscale.com>
…equest completion, reduce replica update overhead (ray-project#61755) Fixes elevated P99 latency observed when scaling Ray Serve deployments with `max_ongoing_requests=1`. The root cause is that the queue length cache is incremented when a request is sent (`on_send_request`) but never decremented when the request completes, causing cache entries to get "stuck" at values >= `max_ongoing_requests`. This forces every subsequent routing decision to fall back to blocking probe RPCs instead of using cached values. This regression was introduced when `on_send_request` was added in the router refactor (commit de1494e, Aug 2025). Prior to that (Ray <= 2.10), the cache was only updated from replica-reported values (probes and rejection protocol responses), so there was no increment-without-matching-decrement problem. ### Changes **1. Decrement queue length cache on request completion (primary fix)** Implements `decrement_queue_len_cache` in `RequestRouter` to decrement the cache entry by 1 when a request finishes. This restores the increment/decrement symmetry that was missing since `on_send_request` was introduced. With `max_ongoing_requests=1`, the routing algorithm in `_select_from_candidate_replicas` treats any cache entry >= `max_ongoing_requests` as "needs probing". Before this fix, every routed request would bump the cache to 1, and it would stay there until either the 10s TTL expired or a probe happened to refresh it. This meant the cache was nearly useless, and most routing decisions required a blocking probe RPC (~20-40ms round trip), directly explaining the observed P99 increase. **2. Reuse existing replica wrappers in `_update_running_replicas`** Previously, every replica update created new `RunningReplica` wrappers for *all* replicas, even those that hadn't changed. During scaling storms (100+ updates with 250+ replicas each), this caused O(n) synchronous work per update on the router's event loop. Now reuses existing wrappers for known replicas, only creating wrappers for genuinely new ones. This reduces per-update work from O(all_replicas) to O(new_replicas). **3. Reduce replica update log noise** The "Got updated replicas" log line previously serialized every replica ID (250+) into a string on every update. Changed to log only the total count and the added/removed counts, reducing both log volume and the synchronous formatting cost on the event loop. ## Load Test Results | Scale | Client | Master QPS | Master P99 Latency | Optimized QPS | Optimized P99 Latency | |------|--------|------------|--------------------|---------------|-----------------------| | **Up to 100 Users** | <img src="https://github.com/user-attachments/assets/d87573da-9fb8-4c22-b93d-04ce4cac2635" width="250"> | <img src="https://github.com/user-attachments/assets/d0de09c8-5932-4ea0-9988-cb798e8d6328" width="400"> | <img src="https://github.com/user-attachments/assets/c46dc371-c19d-4e2d-8327-b78648e2c393" width="400"> | <img src="https://github.com/user-attachments/assets/fccfd88c-6cd2-4d56-bf32-da5cec54e937" width="400"> | <img src="https://github.com/user-attachments/assets/6fc9260d-f164-415d-baa9-da8a837a1d23" width="400"> | | **Up to 200 Users** | <img src="https://github.com/user-attachments/assets/8455e66d-9945-480c-89eb-78c1d3641e4e" width="250"> | <img src="https://github.com/user-attachments/assets/5a755c67-3588-4add-bfa5-f7e974f8b547" width="400"> | <img src="https://github.com/user-attachments/assets/2a7cfe6d-c660-4433-b350-a2b406421a70" width="400"> | <img src="https://github.com/user-attachments/assets/1ecea44d-2eb0-445f-a9e2-70b61eb4eba9" width="400"> | <img src="https://github.com/user-attachments/assets/e683bffb-9c45-4755-a86c-c6f39f84568b" width="400"> | --------- Signed-off-by: abrar <abrar@anyscale.com>

Fixes elevated P99 latency observed when scaling Ray Serve deployments with
max_ongoing_requests=1. The root cause is that the queue length cache is incremented when a request is sent (on_send_request) but never decremented when the request completes, causing cache entries to get "stuck" at values >=max_ongoing_requests. This forces every subsequent routing decision to fall back to blocking probe RPCs instead of using cached values.This regression was introduced when
on_send_requestwas added in the router refactor (commit de1494e, Aug 2025). Prior to that (Ray <= 2.10), the cache was only updated from replica-reported values (probes and rejection protocol responses), so there was no increment-without-matching-decrement problem.Changes
1. Decrement queue length cache on request completion (primary fix)
Implements
decrement_queue_len_cacheinRequestRouterto decrement the cache entry by 1 when a request finishes. This restores the increment/decrement symmetry that was missing sinceon_send_requestwas introduced.With
max_ongoing_requests=1, the routing algorithm in_select_from_candidate_replicastreats any cache entry >=max_ongoing_requestsas "needs probing". Before this fix, every routed request would bump the cache to 1, and it would stay there until either the 10s TTL expired or a probe happened to refresh it. This meant the cache was nearly useless, and most routing decisions required a blocking probe RPC (~20-40ms round trip), directly explaining the observed P99 increase.2. Reuse existing replica wrappers in
_update_running_replicasPreviously, every replica update created new
RunningReplicawrappers for all replicas, even those that hadn't changed. During scaling storms (100+ updates with 250+ replicas each), this caused O(n) synchronous work per update on the router's event loop.Now reuses existing wrappers for known replicas, only creating wrappers for genuinely new ones. This reduces per-update work from O(all_replicas) to O(new_replicas).
3. Reduce replica update log noise
The "Got updated replicas" log line previously serialized every replica ID (250+) into a string on every update. Changed to log only the total count and the added/removed counts, reducing both log volume and the synchronous formatting cost on the event loop.
Load Test Results