
[Serve] Optimize replica routing request data structures#60139

Merged
abrarsheikh merged 11 commits into master from opt-routing
Jan 16, 2026

@abrarsheikh abrarsheikh commented Jan 14, 2026

Contributor
  1. O(1) Pending Request Lookups

    • Added dict indices (_pending_requests_by_id and _pending_requests_by_model_id) for fast lookups
    • Replaced O(n) linear scans with O(1) dict lookups when finding requests by ID or multiplexed model
  2. Cached Replica List

    • Added _replicas_list cache to avoid O(n) dict-to-list conversion on every routing iteration
    • List updated only when replicas change via update_replicas() or on_replica_actor_died()
  3. Lazy Cleanup Strategy

    • Done futures are lazily cleaned from _pending_requests_by_model_id during lookups using O(1) popleft()
    • Avoids expensive O(n) removal from deques
  4. Optimized Retry Insertion

    • Extracted sorted insertion logic into _insert_pending_request_sorted() helper
    • O(1) fast path for common case (recent retries append to end)
  5. Simplified pow_2_router

    • Removed redundant dict creation per routing call
    • Direct lookup via self._replicas[chosen_id] instead of building temporary map
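
As a rough illustration of items 1 and 3 above, the sketch below keeps one dict keyed by request ID and one dict of deques keyed by model ID, and defers deque cleanup to lookup time. It is not the code from this PR: `PendingRequest`, `PendingIndex`, and their methods are hypothetical names, and `concurrent.futures.Future` stands in for whatever future type the router actually uses.

```python
from collections import defaultdict, deque
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class PendingRequest:
    # Hypothetical stand-in for the router's pending-request record.
    request_id: str
    model_id: str
    future: Future


class PendingIndex:
    """Hypothetical index pair: by request ID and by multiplexed model ID."""

    def __init__(self) -> None:
        self._by_id: Dict[str, PendingRequest] = {}
        self._by_model_id: Dict[str, Deque[PendingRequest]] = defaultdict(deque)

    def add(self, req: PendingRequest) -> None:
        self._by_id[req.request_id] = req
        self._by_model_id[req.model_id].append(req)

    def get_by_id(self, request_id: str) -> Optional[PendingRequest]:
        # O(1) dict lookup instead of scanning a queue.
        return self._by_id.get(request_id)

    def mark_done(self, request_id: str) -> None:
        # Only the ID index is updated here. The deque entry is left in
        # place and dropped lazily, which avoids an O(n) deque removal.
        req = self._by_id.pop(request_id, None)
        if req is not None and not req.future.done():
            req.future.set_result(None)

    def first_pending_for_model(self, model_id: str) -> Optional[PendingRequest]:
        queue = self._by_model_id.get(model_id)
        if queue is None:
            return None
        # Lazy cleanup: discard finished entries from the front, O(1) each.
        while queue and queue[0].future.done():
            queue.popleft()
        if not queue:
            del self._by_model_id[model_id]
            return None
        return queue[0]
```

The trade-off in this sketch is that finished entries stay in the per-model deque until the next lookup for that model reaches them.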
  1. random.sample → Direct Selection
  2. Lazy Hash Caching (common.py)
  3. Metrics Throttling (request_router.py, constants.py)
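
The description does not spell out what "Direct Selection" looks like. One plausible reading, sketched here as an assumption, is to pick the two candidate replicas for power-of-two-choices with two `randrange` calls rather than a general-purpose `random.sample`. `choose_two` is a hypothetical helper, not a function from the PR.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def choose_two(items: Sequence[T], rng=random) -> List[T]:
    """Return two distinct, uniformly chosen items (hypothetical helper).

    `rng` may be the `random` module or a `random.Random` instance.
    """
    n = len(items)
    if n <= 2:
        # Nothing to choose between: return everything there is.
        return list(items)
    i = rng.randrange(n)
    # Draw from the remaining n - 1 slots, then shift past i so that
    # the second index can never equal the first.
    j = rng.randrange(n - 1)
    if j >= i:
        j += 1
    return [items[i], items[j]]
```

This relies on the candidates already being held in an indexable sequence, which is what a cached replica list would provide.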

Flamegraph of the router after all the optimizations: [image]

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request significantly optimizes the replica routing mechanism in Ray Serve by refactoring data structures and lookup logic. The changes introduce dictionary-based indices (_pending_requests_by_id, _pending_requests_by_model_id) for O(1) lookups of pending requests, replacing previous O(N) iterations over deques. Lazy cleanup of completed futures is implemented to prevent memory leaks, and a cached list of replicas (_replicas_list) is maintained to avoid redundant list conversions. These improvements enhance the efficiency of request matching, fulfillment, and replica selection, leading to better performance, especially in high-throughput or multiplexed model scenarios. The code is well-commented, explaining the rationale behind the optimizations.

@abrarsheikh abrarsheikh added the "go" label (add ONLY when ready to merge, run all tests) Jan 14, 2026
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh marked this pull request as ready for review January 14, 2026 20:04
@abrarsheikh abrarsheikh requested a review from a team as a code owner January 14, 2026 20:04
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Comment thread python/ray/serve/_private/request_router/request_router.py
Signed-off-by: abrar <abrar@anyscale.com>
@ray-gardener ray-gardener Bot added the "serve" label (Ray Serve Related Issue) Jan 15, 2026
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>

@harshit-anyscale harshit-anyscale left a comment

Contributor


great improvements, nice work!
left some comments, else LGTM

Comment thread python/ray/serve/_private/request_router/pow_2_router.py Outdated
Comment thread python/ray/serve/_private/request_router/request_router.py Outdated
Comment thread python/ray/serve/_private/request_router/request_router.py Outdated
Comment thread python/ray/serve/_private/request_router/request_router.py
Comment thread python/ray/serve/_private/request_router/request_router.py
Signed-off-by: abrar <abrar@anyscale.com>
Comment thread python/ray/serve/_private/request_router/request_router.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
Comment thread python/ray/serve/_private/request_router/request_router.py
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh merged commit 00c877d into master Jan 16, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the opt-routing branch January 16, 2026 18:10
aslonnie pushed a commit that referenced this pull request Jan 21, 2026
## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router
queue length gauge has a 100ms throttle
(`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates
when they happen too quickly.

When replica initialization sets the gauge to 0 and a request
immediately updates it to 1, the second update may be throttled, causing
the test to see 0 instead of 1.
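
The race described above can be reproduced with a small time-throttled gauge. This is a minimal sketch under assumptions, not Ray Serve's implementation: `ThrottledGauge` and its injectable `clock` are hypothetical, and the 0.1 s window mirrors the 100ms throttle mentioned above.

```python
import time
from typing import Callable, Optional


class ThrottledGauge:
    """Hypothetical gauge that drops updates arriving inside the throttle window."""

    def __init__(
        self, throttle_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._throttle_s = throttle_s
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.reported: Optional[int] = None

    def set(self, value: int) -> bool:
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._throttle_s:
            # Too soon after the previous emit: the update is skipped.
            return False
        self._last_emit = now
        self.reported = value
        return True


# A fake clock makes the timing deterministic.
now = [0.0]
gauge = ThrottledGauge(0.1, clock=lambda: now[0])

gauge.set(0)             # replica initialization reports 0
now[0] = 0.01
skipped = not gauge.set(1)   # request arrives 10 ms later, inside the window
stale_value = gauge.reported

now[0] = 0.2
gauge.set(1)             # a later update outside the window goes through
fresh_value = gauge.reported
```

In this sketch `stale_value` stays at 0 even though one request is queued, which is the kind of stale reading the flaky test could observe.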

## Related issue number

Fixes flaky test introduced in #59233 after #60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
