[Serve] Optimize replica routing request data structures #60139
Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request significantly optimizes the replica routing mechanism in Ray Serve by refactoring data structures and lookup logic. The changes introduce dictionary-based indices (_pending_requests_by_id, _pending_requests_by_model_id) for O(1) lookups of pending requests, replacing previous O(N) iterations over deques. Lazy cleanup of completed futures is implemented to prevent memory leaks, and a cached list of replicas (_replicas_list) is maintained to avoid redundant list conversions. These improvements enhance the efficiency of request matching, fulfillment, and replica selection, leading to better performance, especially in high-throughput or multiplexed model scenarios. The code is well-commented, explaining the rationale behind the optimizations.
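The indexing and lazy-cleanup ideas in this summary can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PendingRequest`, `PendingRequestIndex`, and all method names are hypothetical, and `concurrent.futures.Future` stands in for whatever future type the router actually uses. Only the two attribute names `_pending_requests_by_id` and `_pending_requests_by_model_id` come from the PR.

```python
from collections import defaultdict, deque
from concurrent.futures import Future
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional


@dataclass
class PendingRequest:
    # Hypothetical stand-in for the router's pending-request record.
    request_id: str
    model_id: str
    future: Future = field(default_factory=Future)


class PendingRequestIndex:
    """Hypothetical container showing dict indices plus lazy cleanup."""

    def __init__(self) -> None:
        # O(1) lookup of a pending request by its id.
        self._pending_requests_by_id: Dict[str, PendingRequest] = {}
        # O(1) lookup of the FIFO queue of pending requests per model id.
        self._pending_requests_by_model_id: Dict[str, Deque[PendingRequest]] = (
            defaultdict(deque)
        )

    def add(self, req: PendingRequest) -> None:
        self._pending_requests_by_id[req.request_id] = req
        self._pending_requests_by_model_id[req.model_id].append(req)

    def get_by_id(self, request_id: str) -> Optional[PendingRequest]:
        return self._pending_requests_by_id.get(request_id)

    def complete(self, request_id: str, result: Any) -> bool:
        # Remove from the id index right away; the per-model deque is
        # left alone and cleaned up lazily in next_for_model().
        req = self._pending_requests_by_id.pop(request_id, None)
        if req is None or req.future.done():
            return False
        req.future.set_result(result)
        return True

    def next_for_model(self, model_id: str) -> Optional[PendingRequest]:
        queue = self._pending_requests_by_model_id.get(model_id)
        if queue is None:
            return None
        # Lazy cleanup: drop already-completed entries from the front,
        # each with an O(1) popleft().
        while queue and queue[0].future.done():
            queue.popleft()
        if not queue:
            # Delete empty queues so the index does not grow unboundedly.
            del self._pending_requests_by_model_id[model_id]
            return None
        return queue[0]
```

The trade-off in this sketch is that completed entries may linger in a per-model deque until the next lookup for that model, in exchange for never scanning the deque on completion.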
harshit-anyscale
left a comment
great improvements, nice work!
left some comments, else LGTM
## Why are these changes needed?
The `test_router_queue_len_metric` test was flaky because the router queue length gauge has a 100ms throttle (`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates when they happen too quickly. When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.

## Related issue number
Fixes flaky test introduced in #59233 after #60139 added throttling.

---------
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
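The flakiness described in this commit message can be reproduced in miniature with a throttled gauge. The sketch below is an assumption about how such a throttle might behave (drop any update that arrives within the throttle window of the last recorded one); `ThrottledGauge` and `FakeClock` are hypothetical names, and the real Ray Serve implementation may differ.

```python
import time
from typing import Callable, Optional


class ThrottledGauge:
    """Hypothetical gauge that skips updates arriving within
    `throttle_s` seconds of the last recorded update."""

    def __init__(
        self, throttle_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._throttle_s = throttle_s
        self._clock = clock
        self._last_set: Optional[float] = None
        self.value: Optional[int] = None

    def set(self, value: int) -> bool:
        now = self._clock()
        if self._last_set is not None and now - self._last_set < self._throttle_s:
            return False  # update skipped by the throttle
        self._last_set = now
        self.value = value
        return True


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
gauge = ThrottledGauge(throttle_s=0.1, clock=clock)

gauge.set(0)          # replica initialization records 0
clock.advance(0.01)   # a request arrives 10 ms later
gauge.set(1)          # skipped: still inside the 100 ms window
stale = gauge.value   # what a test reading the gauge now would see

clock.advance(0.2)    # wait past the throttle window
gauge.set(1)          # this update is recorded
fresh = gauge.value
```

Under these assumptions, a test that reads the gauge immediately after the first request observes the stale value, which matches the failure mode described above.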
O(1) Pending Request Lookups
- Dictionary-based indices (`_pending_requests_by_id` and `_pending_requests_by_model_id`) for fast lookups

Cached Replica List
- `_replicas_list` cache to avoid O(n) dict-to-list conversion on every routing iteration
- `update_replicas()` or `on_replica_actor_died()`

Lazy Cleanup Strategy
- `_pending_requests_by_model_id` during lookups using O(1) `popleft()`

Optimized Retry Insertion
- `_insert_pending_request_sorted()` helper

Simplified `pow_2_router`
- `self._replicas[chosen_id]` instead of building temporary map

`random.sample` → Direct Selection
- (`common.py`)
- (`request_router.py`, `constants.py`)

[Image: flamegraph of the router after all the optimizations]
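A rough sketch of how a cached replica list can be combined with power-of-two-choices selection is shown below. It is illustrative only: `ReplicaSet`, `choose()`, and the mapping from replica id to queue length are hypothetical simplifications, and the way two candidates are drawn here is just one possible approach, not necessarily what the PR does. The names `_replicas`, `_replicas_list`, `update_replicas()`, and `on_replica_actor_died()` are taken from the PR description.

```python
import random
from typing import Dict, List, Optional


class ReplicaSet:
    """Hypothetical holder of replicas with a cached id list."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        # Assumption: map replica id -> current queue length.
        self._replicas: Dict[str, int] = {}
        # Cached list of ids, rebuilt only when membership changes.
        self._replicas_list: List[str] = []
        self._rng = rng or random.Random()

    def update_replicas(self, queue_lens: Dict[str, int]) -> None:
        self._replicas = dict(queue_lens)
        self._replicas_list = list(self._replicas)

    def on_replica_actor_died(self, replica_id: str) -> None:
        if self._replicas.pop(replica_id, None) is not None:
            self._replicas_list = list(self._replicas)

    def choose(self) -> Optional[str]:
        """Pick two distinct candidates and return the less loaded one."""
        n = len(self._replicas_list)
        if n == 0:
            return None
        if n == 1:
            return self._replicas_list[0]
        # Draw two distinct indices without copying the list.
        i = self._rng.randrange(n)
        j = self._rng.randrange(n - 1)
        if j >= i:
            j += 1
        a, b = self._replicas_list[i], self._replicas_list[j]
        # Look candidates up directly in the dict; no temporary map.
        return a if self._replicas[a] <= self._replicas[b] else b
```

In this sketch the dict-to-list conversion happens only in `update_replicas()` and `on_replica_actor_died()`, so each call to `choose()` does a constant amount of work regardless of the number of replicas.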
