[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead by abrarsheikh · Pull Request #61755 · ray-project/ray

abrarsheikh · 2026-03-15T23:00:58Z

Fixes elevated P99 latency observed when scaling Ray Serve deployments with max_ongoing_requests=1. The root cause is that the queue length cache is incremented when a request is sent (on_send_request) but never decremented when the request completes, causing cache entries to get "stuck" at values >= max_ongoing_requests. This forces every subsequent routing decision to fall back to blocking probe RPCs instead of using cached values.

This regression was introduced when on_send_request was added in the router refactor (commit de1494e, Aug 2025). Prior to that (Ray <= 2.10), the cache was only updated from replica-reported values (probes and rejection protocol responses), so there was no increment-without-matching-decrement problem.

Changes

1. Decrement queue length cache on request completion (primary fix)

Implements decrement_queue_len_cache in RequestRouter to decrement the cache entry by 1 when a request finishes. This restores the increment/decrement symmetry that was missing since on_send_request was introduced.

With max_ongoing_requests=1, the routing algorithm in _select_from_candidate_replicas treats any cache entry >= max_ongoing_requests as "needs probing". Before this fix, every routed request would bump the cache to 1, and it would stay there until either the 10s TTL expired or a probe happened to refresh it. This meant the cache was nearly useless, and most routing decisions required a blocking probe RPC (~20-40ms round trip), directly explaining the observed P99 increase.

2. Reuse existing replica wrappers in _update_running_replicas

Previously, every replica update created new RunningReplica wrappers for all replicas, even those that hadn't changed. During scaling storms (100+ updates with 250+ replicas each), this caused O(n) synchronous work per update on the router's event loop.

Now reuses existing wrappers for known replicas, only creating wrappers for genuinely new ones. This reduces per-update work from O(all_replicas) to O(new_replicas).

3. Reduce replica update log noise

The "Got updated replicas" log line previously serialized every replica ID (250+) into a string on every update. Changed to log only the total count and the added/removed counts, reducing both log volume and the synchronous formatting cost on the event loop.

Load Test Results

Scale	Client	Master QPS	Master P99 Latency	Optimized QPS	Optimized P99 Latency
Up to 100 Users
Up to 200 Users

…equest completion, reduce replica update overhead Signed-off-by: abrar <abrar@anyscale.com>

gemini-code-assist

Code Review

This pull request effectively addresses a P99 latency regression by ensuring the replica queue length cache is decremented upon request completion. The implementation is robust, handling edge cases like expired cache entries and preventing negative queue lengths. The accompanying unit tests are comprehensive and validate the fix thoroughly. Additionally, the pull request includes valuable optimizations that reduce replica update overhead and log noise, improving overall system performance and observability during scaling events. The changes are well-structured and clear. Excellent work.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

Signed-off-by: abrar <abrar@anyscale.com>

alvinttang

Code Review

This is a well-analyzed fix with clear root cause identification and load test validation. The increment-without-decrement asymmetry in the queue length cache is a classic bug pattern.

Issues found

1. Thread safety of decrement_queue_len_cache called from add_done_callback

The result.add_done_callback(lambda _: self.request_router.decrement_queue_len_cache(...)) in router.py runs the callback from whatever thread/context completes the future. If the ReplicaResult future is resolved by a different asyncio task or thread (e.g., a gRPC callback thread), decrement_queue_len_cache accesses _replica_queue_len_cache without synchronization. The cache's get() and update() methods don't appear to be thread-safe (they're dict operations). This could cause rare data races under high concurrency.

Consider wrapping the decrement in loop.call_soon_threadsafe() or ensuring the callback always runs on the router's event loop.

2. Decrement on failed/cancelled requests

The add_done_callback fires regardless of whether the request succeeded, failed, or was cancelled. The docstring for decrement_queue_len_cache says "Should NOT be called for rejected requests" but doesn't mention failed/cancelled ones. If a request fails after being sent (e.g., replica crashes mid-request), the decrement still fires. This is probably correct (the request did leave the queue), but worth documenting explicitly.

3. The update_replica_info optimization assumes immutable fields stay immutable

update_replica_info only updates _replica_info and _multiplexed_model_ids. If other fields on RunningReplicaInfo change (e.g., max_ongoing_requests changes during a config update), the wrapper won't pick them up. The current RunningReplica base class caches max_ongoing_requests from replica_info in __init__ but update_replica_info doesn't refresh it. This is probably fine today if max_ongoing_requests truly never changes for a live replica, but it's a subtle contract.

4. Lambda closure captures replica by reference

result.add_done_callback(
    lambda _: self.request_router.decrement_queue_len_cache(
        replica.replica_id
    )
)

This is inside a loop (async for replica, result in ...), but since there's a return result immediately after, replica won't be re-bound before the lambda executes. So this is safe in practice. Just noting it since lambda-in-loop is a common Python gotcha.

Positive notes

The replica wrapper reuse optimization (_update_running_replicas) is a clean O(n) -> O(new) improvement
The log noise reduction (count-based instead of serializing all IDs) is a nice touch
Comprehensive test coverage including edge cases (expired cache, negative clamp, multiple decrements)
The FakeRunningReplica.update_replica_info override in tests shows good attention to the test infrastructure

Solid PR overall. The thread safety concern (point 1) is the main thing I'd want verified before merging.

…_dec

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh · 2026-03-16T16:18:44Z

Thanks @alvinttang for the review, i have added the thread safety, i am a little concerned about the performance overhead from this. I will run a full set of benchmarks to confirm before merging.

Signed-off-by: abrar <abrar@anyscale.com>

akyang-anyscale · 2026-03-16T21:08:46Z

            callback_registered = True

            if not with_rejection:
+                result.add_done_callback(


can we move this logic to _process_finished_request?

no because _process_finished_request doesn't have access to if the request was rejected or not.

This reverts commit 864d358. Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh · 2026-03-17T00:42:35Z

Throughput metrics (http_avg_rps, grpc_avg_rps, handle_avg_rps) are within ~1–3%
Most latency metrics are within ±5%
A few p90/p95/p99 outliers (~10–25%) appear in both directions and are consistent with normal benchmark variance.

Conclusion:
There is no meaningful performance regression from the thread-safe implementation. Throughput remains effectively unchanged and latency differences are within expected run-to-run noise.

abrarsheikh · 2026-03-17T00:43:40Z

link to release test https://buildkite.com/ray-project/release/builds/84297
rerunning master release tests again to eliminate noise https://buildkite.com/ray-project/release/builds/84431

abrarsheikh · 2026-03-17T03:26:43Z

master_vs_thread_vs_nonthread_sorted.csv

… callbacks Remove unnecessary `call_soon_threadsafe` wrapping from request completion callbacks in `AsyncioRouter._try_assign_replica`. This wrapping was introduced in ray-project#61755 (6a43c01) and causes a measurable throughput regression on single-replica no-op benchmarks due to the per-request overhead of cross-thread event loop wakeups (mutex acquire + pipe write per callback). The `decrement_queue_len_cache` callbacks are retained but called directly instead of being scheduled via `call_soon_threadsafe`. ## Benchmark results (m7a.4xlarge, 16 vCPU) | Metric | Before (HEAD) | After (this fix) | Pre-regression baseline | |------------- |---------------|-------------------|-------------------------| | handle_100 | 1627 RPS | 1768 RPS (+8.7%) | 1775 RPS | | handle_800 | 1436 RPS | 1584 RPS (+10.3%) | 1584 RPS | | http | 475 RPS | 520 RPS (+9.5%) | 538 RPS | | http_100 | 710 RPS | 741 RPS (+4.4%) | 739 RPS | | http_800 | 882 RPS | 967 RPS (+9.6%) | 955 RPS | Signed-off-by: seiji <seiji@anyscale.com>

Remove call_soon_threadsafe wrapping from _process_finished_request by making the happy path thread-safe: - dec_num_running_requests_for_replica: already uses _queries_lock - on_request_completed: no-op in base RequestRouter - cache decrement: GIL-atomic deque.append, drained before routing Error paths (actor died/unavailable) are deferred to the event loop via call_soon_threadsafe since they mutate shared replica state. A single add_done_callback is used per request for both the no-rejection and with-rejection paths. The with-rejection path uses a mutable flag set after acceptance to control cache decrement. Recovers handle_100 throughput to within 1% of pre-ray-project#61755 baseline. Signed-off-by: seiji <seiji@anyscale.com>

…equest completion, reduce replica update overhead (ray-project#61755) Fixes elevated P99 latency observed when scaling Ray Serve deployments with `max_ongoing_requests=1`. The root cause is that the queue length cache is incremented when a request is sent (`on_send_request`) but never decremented when the request completes, causing cache entries to get "stuck" at values >= `max_ongoing_requests`. This forces every subsequent routing decision to fall back to blocking probe RPCs instead of using cached values. This regression was introduced when `on_send_request` was added in the router refactor (commit de1494e, Aug 2025). Prior to that (Ray <= 2.10), the cache was only updated from replica-reported values (probes and rejection protocol responses), so there was no increment-without-matching-decrement problem. ### Changes **1. Decrement queue length cache on request completion (primary fix)** Implements `decrement_queue_len_cache` in `RequestRouter` to decrement the cache entry by 1 when a request finishes. This restores the increment/decrement symmetry that was missing since `on_send_request` was introduced. With `max_ongoing_requests=1`, the routing algorithm in `_select_from_candidate_replicas` treats any cache entry >= `max_ongoing_requests` as "needs probing". Before this fix, every routed request would bump the cache to 1, and it would stay there until either the 10s TTL expired or a probe happened to refresh it. This meant the cache was nearly useless, and most routing decisions required a blocking probe RPC (~20-40ms round trip), directly explaining the observed P99 increase. **2. Reuse existing replica wrappers in `_update_running_replicas`** Previously, every replica update created new `RunningReplica` wrappers for *all* replicas, even those that hadn't changed. During scaling storms (100+ updates with 250+ replicas each), this caused O(n) synchronous work per update on the router's event loop. Now reuses existing wrappers for known replicas, only creating wrappers for genuinely new ones. This reduces per-update work from O(all_replicas) to O(new_replicas). **3. Reduce replica update log noise** The "Got updated replicas" log line previously serialized every replica ID (250+) into a string on every update. Changed to log only the total count and the added/removed counts, reducing both log volume and the synchronous formatting cost on the event loop. ## Load Test Results | Scale | Client | Master QPS | Master P99 Latency | Optimized QPS | Optimized P99 Latency | |------|--------|------------|--------------------|---------------|-----------------------| | **Up to 100 Users** | <img src="https://github.com/user-attachments/assets/d87573da-9fb8-4c22-b93d-04ce4cac2635" width="250"> | <img src="https://github.com/user-attachments/assets/d0de09c8-5932-4ea0-9988-cb798e8d6328" width="400"> | <img src="https://github.com/user-attachments/assets/c46dc371-c19d-4e2d-8327-b78648e2c393" width="400"> | <img src="https://github.com/user-attachments/assets/fccfd88c-6cd2-4d56-bf32-da5cec54e937" width="400"> | <img src="https://github.com/user-attachments/assets/6fc9260d-f164-415d-baa9-da8a837a1d23" width="400"> | | **Up to 200 Users** | <img src="https://github.com/user-attachments/assets/8455e66d-9945-480c-89eb-78c1d3641e4e" width="250"> | <img src="https://github.com/user-attachments/assets/5a755c67-3588-4add-bfa5-f7e974f8b547" width="400"> | <img src="https://github.com/user-attachments/assets/2a7cfe6d-c660-4433-b350-a2b406421a70" width="400"> | <img src="https://github.com/user-attachments/assets/1ecea44d-2eb0-445f-a9e2-70b61eb4eba9" width="400"> | <img src="https://github.com/user-attachments/assets/e683bffb-9c45-4755-a86c-c6f39f84568b" width="400"> | --------- Signed-off-by: abrar <abrar@anyscale.com>

[Serve] Fix P99 latency regression: decrement queue length cache on r…

f637554

…equest completion, reduce replica update overhead Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested a review from a team as a code owner March 15, 2026 23:00

abrarsheikh added the go add ONLY when ready to merge, run all tests label Mar 15, 2026

gemini-code-assist Bot reviewed Mar 15, 2026

View reviewed changes

cursor Bot reviewed Mar 15, 2026

View reviewed changes

Comment thread python/ray/serve/_private/request_router/request_router.py Outdated

Comment thread python/ray/serve/_private/request_router/request_router.py Outdated

fixes

1c63335

Signed-off-by: abrar <abrar@anyscale.com>

ray-gardener Bot added the serve Ray Serve Related Issue label Mar 16, 2026

alvinttang reviewed Mar 16, 2026

View reviewed changes

abrarsheikh added 2 commits March 16, 2026 15:38

Merge branch 'master' of github.com:ray-project/ray into router_cache…

85fafe6

…_dec

add thread safety

e3ef849

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested a review from akyang-anyscale March 16, 2026 16:36

revert thread safety

864d358

Signed-off-by: abrar <abrar@anyscale.com>

akyang-anyscale reviewed Mar 16, 2026

View reviewed changes

akyang-anyscale approved these changes Mar 16, 2026

View reviewed changes

Revert "revert thread safety"

3d9bc39

This reverts commit 864d358. Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh force-pushed the router_cache_dec branch from b4bbca9 to ba17a57 Compare March 17, 2026 00:38

abrarsheikh requested review from a team as code owners March 17, 2026 00:38

abrarsheikh force-pushed the router_cache_dec branch from ba17a57 to 3d9bc39 Compare March 17, 2026 00:40

abrarsheikh removed request for a team March 17, 2026 00:40

abrarsheikh merged commit 6a43c01 into master Mar 17, 2026
6 checks passed

abrarsheikh deleted the router_cache_dec branch March 17, 2026 03:33

eicherseiji mentioned this pull request Mar 30, 2026

[Serve] Fix throughput regression from call_soon_threadsafe in router callbacks #62218

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead#61755

[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead#61755
abrarsheikh merged 6 commits into
masterfrom
router_cache_dec

abrarsheikh commented Mar 15, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

cursor Bot left a comment

Uh oh!

Uh oh!

alvinttang left a comment

abrarsheikh commented Mar 16, 2026

akyang-anyscale Mar 16, 2026

abrarsheikh Mar 16, 2026

abrarsheikh commented Mar 17, 2026

abrarsheikh commented Mar 17, 2026

abrarsheikh commented Mar 17, 2026

Uh oh!

Labels

3 participants

Uh oh!

Conversation

abrarsheikh commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Load Test Results

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alvinttang left a comment

Choose a reason for hiding this comment

Code Review

Issues found

Positive notes

abrarsheikh commented Mar 16, 2026

akyang-anyscale Mar 16, 2026

Choose a reason for hiding this comment

abrarsheikh Mar 16, 2026

Choose a reason for hiding this comment

abrarsheikh commented Mar 17, 2026

abrarsheikh commented Mar 17, 2026

abrarsheikh commented Mar 17, 2026

Uh oh!

Labels

3 participants

abrarsheikh commented Mar 15, 2026 •

edited

Loading