Skip to content

[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead#61755

Merged
abrarsheikh merged 6 commits into
masterfrom
router_cache_dec
Mar 17, 2026
Merged

[Serve] Fix P99 latency regression: decrement queue length cache on request completion, reduce replica update overhead#61755
abrarsheikh merged 6 commits into
masterfrom
router_cache_dec

Conversation

@abrarsheikh

@abrarsheikh abrarsheikh commented Mar 15, 2026

Copy link
Copy Markdown
Contributor

Fixes elevated P99 latency observed when scaling Ray Serve deployments with max_ongoing_requests=1. The root cause is that the queue length cache is incremented when a request is sent (on_send_request) but never decremented when the request completes, causing cache entries to get "stuck" at values >= max_ongoing_requests. This forces every subsequent routing decision to fall back to blocking probe RPCs instead of using cached values.

This regression was introduced when on_send_request was added in the router refactor (commit de1494e, Aug 2025). Prior to that (Ray <= 2.10), the cache was only updated from replica-reported values (probes and rejection protocol responses), so there was no increment-without-matching-decrement problem.

Changes

1. Decrement queue length cache on request completion (primary fix)

Implements decrement_queue_len_cache in RequestRouter to decrement the cache entry by 1 when a request finishes. This restores the increment/decrement symmetry that was missing since on_send_request was introduced.

With max_ongoing_requests=1, the routing algorithm in _select_from_candidate_replicas treats any cache entry >= max_ongoing_requests as "needs probing". Before this fix, every routed request would bump the cache to 1, and it would stay there until either the 10s TTL expired or a probe happened to refresh it. This meant the cache was nearly useless, and most routing decisions required a blocking probe RPC (~20-40ms round trip), directly explaining the observed P99 increase.

2. Reuse existing replica wrappers in _update_running_replicas

Previously, every replica update created new RunningReplica wrappers for all replicas, even those that hadn't changed. During scaling storms (100+ updates with 250+ replicas each), this caused O(n) synchronous work per update on the router's event loop.

Now reuses existing wrappers for known replicas, only creating wrappers for genuinely new ones. This reduces per-update work from O(all_replicas) to O(new_replicas).

3. Reduce replica update log noise

The "Got updated replicas" log line previously serialized every replica ID (250+) into a string on every update. Changed to log only the total count and the added/removed counts, reducing both log volume and the synchronous formatting cost on the event loop.

Load Test Results

Scale Client Master QPS Master P99 Latency Optimized QPS Optimized P99 Latency
Up to 100 Users
Up to 200 Users
…equest completion, reduce replica update overhead

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner March 15, 2026 23:00
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Mar 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a P99 latency regression by ensuring the replica queue length cache is decremented upon request completion. The implementation is robust, handling edge cases like expired cache entries and preventing negative queue lengths. The accompanying unit tests are comprehensive and validate the fix thoroughly. Additionally, the pull request includes valuable optimizations that reduce replica update overhead and log noise, improving overall system performance and observability during scaling events. The changes are well-structured and clear. Excellent work.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Comment thread python/ray/serve/_private/request_router/request_router.py Outdated
Comment thread python/ray/serve/_private/request_router/request_router.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label Mar 16, 2026

@alvinttang alvinttang left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a well-analyzed fix with clear root cause identification and load test validation. The increment-without-decrement asymmetry in the queue length cache is a classic bug pattern.

Issues found

1. Thread safety of decrement_queue_len_cache called from add_done_callback

The result.add_done_callback(lambda _: self.request_router.decrement_queue_len_cache(...)) in router.py runs the callback from whatever thread/context completes the future. If the ReplicaResult future is resolved by a different asyncio task or thread (e.g., a gRPC callback thread), decrement_queue_len_cache accesses _replica_queue_len_cache without synchronization. The cache's get() and update() methods don't appear to be thread-safe (they're dict operations). This could cause rare data races under high concurrency.

Consider wrapping the decrement in loop.call_soon_threadsafe() or ensuring the callback always runs on the router's event loop.

2. Decrement on failed/cancelled requests

The add_done_callback fires regardless of whether the request succeeded, failed, or was cancelled. The docstring for decrement_queue_len_cache says "Should NOT be called for rejected requests" but doesn't mention failed/cancelled ones. If a request fails after being sent (e.g., replica crashes mid-request), the decrement still fires. This is probably correct (the request did leave the queue), but worth documenting explicitly.

3. The update_replica_info optimization assumes immutable fields stay immutable

update_replica_info only updates _replica_info and _multiplexed_model_ids. If other fields on RunningReplicaInfo change (e.g., max_ongoing_requests changes during a config update), the wrapper won't pick them up. The current RunningReplica base class caches max_ongoing_requests from replica_info in __init__ but update_replica_info doesn't refresh it. This is probably fine today if max_ongoing_requests truly never changes for a live replica, but it's a subtle contract.

4. Lambda closure captures replica by reference

result.add_done_callback(
    lambda _: self.request_router.decrement_queue_len_cache(
        replica.replica_id
    )
)

This is inside a loop (async for replica, result in ...), but since there's a return result immediately after, replica won't be re-bound before the lambda executes. So this is safe in practice. Just noting it since lambda-in-loop is a common Python gotcha.

Positive notes

  • The replica wrapper reuse optimization (_update_running_replicas) is a clean O(n) -> O(new) improvement
  • The log noise reduction (count-based instead of serializing all IDs) is a nice touch
  • Comprehensive test coverage including edge cases (expired cache, negative clamp, multiple decrements)
  • The FakeRunningReplica.update_replica_info override in tests shows good attention to the test infrastructure

Solid PR overall. The thread safety concern (point 1) is the main thing I'd want verified before merging.

@abrarsheikh

Copy link
Copy Markdown
Contributor Author

Thanks @alvinttang for the review, i have added the thread safety, i am a little concerned about the performance overhead from this. I will run a full set of benchmarks to confirm before merging.

Signed-off-by: abrar <abrar@anyscale.com>
callback_registered = True

if not with_rejection:
result.add_done_callback(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this logic to _process_finished_request?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no because _process_finished_request doesn't have access to if the request was rejected or not.

This reverts commit 864d358.

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested review from a team as code owners March 17, 2026 00:38
@abrarsheikh abrarsheikh removed request for a team March 17, 2026 00:40
@abrarsheikh

Copy link
Copy Markdown
Contributor Author
  • Throughput metrics (http_avg_rps, grpc_avg_rps, handle_avg_rps) are within ~1–3%
  • Most latency metrics are within ±5%
  • A few p90/p95/p99 outliers (~10–25%) appear in both directions and are consistent with normal benchmark variance.

Conclusion:
There is no meaningful performance regression from the thread-safe implementation. Throughput remains effectively unchanged and latency differences are within expected run-to-run noise.

@abrarsheikh

Copy link
Copy Markdown
Contributor Author

link to release test https://buildkite.com/ray-project/release/builds/84297
rerunning master release tests again to eliminate noise https://buildkite.com/ray-project/release/builds/84431

@abrarsheikh

Copy link
Copy Markdown
Contributor Author
@abrarsheikh abrarsheikh merged commit 6a43c01 into master Mar 17, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the router_cache_dec branch March 17, 2026 03:33
eicherseiji added a commit to eicherseiji/ray that referenced this pull request Mar 30, 2026
… callbacks

Remove unnecessary `call_soon_threadsafe` wrapping from request completion
callbacks in `AsyncioRouter._try_assign_replica`. This wrapping was introduced
in ray-project#61755 (6a43c01) and causes a measurable throughput regression on
single-replica no-op benchmarks due to the per-request overhead of cross-thread
event loop wakeups (mutex acquire + pipe write per callback).

The `decrement_queue_len_cache` callbacks are retained but called directly
instead of being scheduled via `call_soon_threadsafe`.

## Benchmark results (m7a.4xlarge, 16 vCPU)

| Metric       | Before (HEAD) | After (this fix) | Pre-regression baseline |
|------------- |---------------|-------------------|-------------------------|
| handle_100   | 1627 RPS      | 1768 RPS (+8.7%)  | 1775 RPS                |
| handle_800   | 1436 RPS      | 1584 RPS (+10.3%) | 1584 RPS                |
| http         | 475 RPS       | 520 RPS (+9.5%)   | 538 RPS                 |
| http_100     | 710 RPS       | 741 RPS (+4.4%)   | 739 RPS                 |
| http_800     | 882 RPS       | 967 RPS (+9.6%)   | 955 RPS                 |

Signed-off-by: seiji <seiji@anyscale.com>
eicherseiji added a commit to eicherseiji/ray that referenced this pull request Mar 31, 2026
Remove call_soon_threadsafe wrapping from _process_finished_request
by making the happy path thread-safe:

- dec_num_running_requests_for_replica: already uses _queries_lock
- on_request_completed: no-op in base RequestRouter
- cache decrement: GIL-atomic deque.append, drained before routing

Error paths (actor died/unavailable) are deferred to the event loop
via call_soon_threadsafe since they mutate shared replica state.

A single add_done_callback is used per request for both the
no-rejection and with-rejection paths. The with-rejection path uses
a mutable flag set after acceptance to control cache decrement.

Recovers handle_100 throughput to within 1% of pre-ray-project#61755 baseline.

Signed-off-by: seiji <seiji@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…equest completion, reduce replica update overhead (ray-project#61755)

Fixes elevated P99 latency observed when scaling Ray Serve deployments
with `max_ongoing_requests=1`. The root cause is that the queue length
cache is incremented when a request is sent (`on_send_request`) but
never decremented when the request completes, causing cache entries to
get "stuck" at values >= `max_ongoing_requests`. This forces every
subsequent routing decision to fall back to blocking probe RPCs instead
of using cached values.

This regression was introduced when `on_send_request` was added in the
router refactor (commit de1494e, Aug 2025). Prior to that (Ray <=
2.10), the cache was only updated from replica-reported values (probes
and rejection protocol responses), so there was no
increment-without-matching-decrement problem.

### Changes

**1. Decrement queue length cache on request completion (primary fix)**

Implements `decrement_queue_len_cache` in `RequestRouter` to decrement
the cache entry by 1 when a request finishes. This restores the
increment/decrement symmetry that was missing since `on_send_request`
was introduced.

With `max_ongoing_requests=1`, the routing algorithm in
`_select_from_candidate_replicas` treats any cache entry >=
`max_ongoing_requests` as "needs probing". Before this fix, every routed
request would bump the cache to 1, and it would stay there until either
the 10s TTL expired or a probe happened to refresh it. This meant the
cache was nearly useless, and most routing decisions required a blocking
probe RPC (~20-40ms round trip), directly explaining the observed P99
increase.

**2. Reuse existing replica wrappers in `_update_running_replicas`**

Previously, every replica update created new `RunningReplica` wrappers
for *all* replicas, even those that hadn't changed. During scaling
storms (100+ updates with 250+ replicas each), this caused O(n)
synchronous work per update on the router's event loop.

Now reuses existing wrappers for known replicas, only creating wrappers
for genuinely new ones. This reduces per-update work from
O(all_replicas) to O(new_replicas).

**3. Reduce replica update log noise**

The "Got updated replicas" log line previously serialized every replica
ID (250+) into a string on every update. Changed to log only the total
count and the added/removed counts, reducing both log volume and the
synchronous formatting cost on the event loop.

## Load Test Results

| Scale | Client | Master QPS | Master P99 Latency | Optimized QPS |
Optimized P99 Latency |

|------|--------|------------|--------------------|---------------|-----------------------|
| **Up to 100 Users** | <img
src="https://github.com/user-attachments/assets/d87573da-9fb8-4c22-b93d-04ce4cac2635"
width="250"> | <img
src="https://github.com/user-attachments/assets/d0de09c8-5932-4ea0-9988-cb798e8d6328"
width="400"> | <img
src="https://github.com/user-attachments/assets/c46dc371-c19d-4e2d-8327-b78648e2c393"
width="400"> | <img
src="https://github.com/user-attachments/assets/fccfd88c-6cd2-4d56-bf32-da5cec54e937"
width="400"> | <img
src="https://github.com/user-attachments/assets/6fc9260d-f164-415d-baa9-da8a837a1d23"
width="400"> |
| **Up to 200 Users** | <img
src="https://github.com/user-attachments/assets/8455e66d-9945-480c-89eb-78c1d3641e4e"
width="250"> | <img
src="https://github.com/user-attachments/assets/5a755c67-3588-4add-bfa5-f7e974f8b547"
width="400"> | <img
src="https://github.com/user-attachments/assets/2a7cfe6d-c660-4433-b350-a2b406421a70"
width="400"> | <img
src="https://github.com/user-attachments/assets/1ecea44d-2eb0-445f-a9e2-70b61eb4eba9"
width="400"> | <img
src="https://github.com/user-attachments/assets/e683bffb-9c45-4755-a86c-c6f39f84568b"
width="400"> |

---------

Signed-off-by: abrar <abrar@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

3 participants