[serve][3/N] Introduce experimental ConsistentHashRouter for session-sticky routing #62906
Code Review
This pull request introduces a ConsistentHashRouter to Ray Serve, providing session stickiness via a consistent-hash ring with virtual nodes. The implementation supports fallback replicas during backpressure and maintains affinity during scaling. The PR also includes extensive unit and integration tests, refactors test utilities, and adds the mmh3 dependency. Review feedback suggests optimizing the ring unzipping logic and the ranked replica lookup loop for better performance and idiomatic Python usage.
@cursor review
e360b97 to ffdbd63
ConsistentHashRouter for session-sticky routing…yers (#62905)

## Summary

Adds a new `session_id` field that flows from the client to `RequestMetadata`, giving session-aware request routers a stable key to hash on. In the follow-up [PR](#62906), we introduce a new router that applies consistent hashing based on `session_id`.

No router consumes `session_id` yet. This PR is pure plumbing -- behavior is unchanged.

## API

### 1. Python handle: `handle.options(session_id=...)`

```python
handle.options(session_id="user_123").remote(data)
```

Threaded through `DynamicHandleOptions.session_id` → `get_request_metadata` → `RequestMetadata.session_id`.

### 2. HTTP: `x-session-id` header

```
GET /chat HTTP/1.1
x-session-id: user_123
```

Extracted in `HTTPProxy.setup_request_context_and_handle`. Case-insensitive, accepts both `x-session-id` and `x_session_id`.

### 3. gRPC: `session_id` invocation metadata

```python
stub.__call__.with_call(
    request=req,
    metadata=(("session_id", "user_123"),),
)
```

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
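The HTTP section above says the proxy matches the header case-insensitively and accepts both `x-session-id` and `x_session_id`. A minimal sketch of that lookup rule follows. `extract_session_id` is a hypothetical helper written for illustration, not the code in `HTTPProxy.setup_request_context_and_handle`, and it assumes headers arrive as string pairs rather than the byte pairs an ASGI scope carries.

```python
from typing import Iterable, Optional, Tuple

# The two header spellings the PR description says are accepted.
_SESSION_HEADER_NAMES = {"x-session-id", "x_session_id"}


def extract_session_id(headers: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the first session header, or None if absent.

    Hypothetical helper: header names are compared in lowercase so that
    `X-Session-ID` and `x-session-id` are treated the same.
    """
    for name, value in headers:
        if name.lower() in _SESSION_HEADER_NAMES:
            return value
    return None
```

For example, `extract_session_id([("X-Session-ID", "user_123")])` returns `"user_123"`, and a request without either header yields `None`.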
d2a68e5 to c582bd4
## Description

We need a fast, deterministic hashing algorithm with good avalanche and uniformity, and `mmh3` (`MurmurHash3`) is a proven fit. For example, Cassandra uses `MurmurHash3` for partition tokens ([reference](https://javadoc.io/static/org.apache.cassandra/cassandra-all/3.11.4/org/apache/cassandra/dht/Murmur3Partitioner.html)). The next PR, #62906, uses `mmh3` to implement a consistent-hashing-based router that provides session affinity.

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
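To make "deterministic" and "good avalanche" concrete, here is a small sketch that hashes pairs of keys differing only in their last character and counts how many of the 32 output bits change. It uses `mmh3.hash(key, signed=False)` when the package is installed. The `blake2b` fallback is an assumption added only so the sketch runs without the dependency; it is not MurmurHash3. `hash32` and `mean_bit_flips` are hypothetical names.

```python
import hashlib

try:
    import mmh3  # third-party package that this PR adds as a dependency

    def hash32(key: str) -> int:
        # Unsigned 32-bit MurmurHash3 of the key.
        return mmh3.hash(key, signed=False)

except ImportError:

    def hash32(key: str) -> int:
        # Fallback so the sketch runs without mmh3; this is NOT MurmurHash3.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=4).digest()
        return int.from_bytes(digest, "big")


def mean_bit_flips(num_pairs: int = 500) -> float:
    """Average Hamming distance between hashes of near-identical keys."""
    total = 0
    for i in range(num_pairs):
        a = hash32(f"session-{i}-a")
        b = hash32(f"session-{i}-b")
        total += bin(a ^ b).count("1")
    return total / num_pairs
```

A hash with good avalanche behavior should flip about half of its output bits, so `mean_bit_flips()` is expected to land near 16 for a 32-bit hash.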
…e-session bursts Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
c582bd4 to 56d71d0
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d3e3ff9.
        pending_request.future.done()
        and len(self._routing_tasks) > self.target_num_routing_tasks
    ):
        break
Routing task probes cancelled request indefinitely when under target
Medium Severity
The break condition requires both pending_request.future.done() AND len(self._routing_tasks) > self.target_num_routing_tasks. When a request is externally cancelled but routing tasks are at or below target, the routing task continues probing replicas indefinitely (with backoff sleeps) for a request nobody is waiting for. Since the non-FIFO _fulfill_next_pending_request cannot reassign a found replica to another request, any probed replica with capacity is simply wasted. Under sustained load with cancellations and saturated replicas, this can keep a routing task slot occupied for extended periods, blocking real pending requests from being routed.
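The case this comment describes can be checked by isolating the quoted predicate. The sketch below is only a restatement of that two-clause condition; `should_stop_probing` and its parameters are hypothetical stand-ins for `pending_request.future.done()`, `len(self._routing_tasks)`, and `self.target_num_routing_tasks`, and it does not reproduce the surrounding routing loop.

```python
def should_stop_probing(future_done: bool, num_routing_tasks: int, target: int) -> bool:
    """Mirror of the quoted break condition: both clauses must hold."""
    return future_done and num_routing_tasks > target


# Request already done (for example, cancelled) and tasks above target: stop.
print(should_stop_probing(True, 3, 2))
# Request already done but tasks at or below target: the task keeps probing.
print(should_stop_probing(True, 2, 2))
```

The second call is the situation the review flags: the request's future is done, yet the predicate is false, so the loop would not break on this condition.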
…n-sticky routing (ray-project#62906)

## Summary

Adds `ConsistentHashRouter`, an experimental subclass of `RequestRouter` that maps `session_id` → replica via a consistent-hash ring for sticky-session request routing. **Client must send `session_id` along with its request header** to benefit from session-stickiness.

Preceding PRs:
- Plumbing: ray-project#62905
- `mmh3` dependency introduction: ray-project#63096

## Changes

- Build a ring with V=100 virtual nodes per replica. When the assigned replica rejects the request, walk clockwise for up to K=2 fallback replicas.
- Route session-less requests through the same ring using `internal_request_id`. **Do not** fall back to power-of-two-choices.
- Rebuild the ring only when the replica set changes.

## Implementation gotchas

- `choose_replicas` returns `[[primary], [fallback_1], [fallback_2]]` instead of one multi-element rank; otherwise the framework's `_select_from_candidate_replicas` would pick the lowest-queue-length replica, defeating stickiness.
- Override `_fulfill_pending_requests`: `ConsistentHashRouter` cannot safely use `RequestRouter`'s FIFO-style task-shedding behavior. Once a routing task pops a request from `_pending_requests_to_route`, that task owns the request metadata needed to compute the consistent-hash replica. If the base loop exits early because there are “too many” routing tasks, the popped request can remain unfulfilled but no longer be available for another task to route. Pow-2 can recover from that with FIFO fallback. Consistent hashing cannot, because assigning a replica chosen for one request/session to a different pending request would break stickiness. Therefore, the override enforces: if a task pops a request, it must keep trying until that exact request is fulfilled.

## Opt-in API

```python
from ray import serve
from ray.serve.config import RequestRouterConfig


@serve.deployment(
    request_router_config=RequestRouterConfig(
        # Import path of the experimental router, given as "module:ClassName".
        request_router_class=(
            "ray.serve.experimental.consistent_hash_router:ConsistentHashRouter"
        ),
        # Virtual nodes per replica and number of fallback replicas to try.
        request_router_kwargs={"num_virtual_nodes": 100, "num_fallback_replicas": 2},
    ),
)
class SessionAwareDeployment:
    ...
```

## Benchmarks

### Performance comparison w/ `PowerOfTwoChoicesRouter`

No overhead over power-of-two-choices router.

<img width="1183" height="574" alt="Screenshot 2026-05-05 at 5 33 04 PM" src="https://github.com/user-attachments/assets/0feacc4b-1700-4336-af12-ce31604bed64" />
<img width="1185" height="583" alt="Screenshot 2026-05-05 at 5 33 18 PM" src="https://github.com/user-attachments/assets/23f26f57-93fa-4a05-9639-b5b97a77db99" />

### Correctness

During a scaling event, the session affinity rate drops by `M/N+1` because `M/N+1` sessions are re-assigned to different replicas.

<img width="1522" height="698" alt="Screenshot 2026-05-01 at 5 10 26 PM" src="https://github.com/user-attachments/assets/f1a2f8a9-b038-45fd-8c86-2460db098110" />

### Effectiveness -- LLM session affinity

<img width="1117" height="362" alt="Screenshot 2026-05-06 at 2 21 35 PM" src="https://github.com/user-attachments/assets/730d33f7-d41b-4936-9736-b289fa627c6f" />

### Replica assignment distribution

With a higher number of virtual nodes, the request -> replica assignment is more uniform.

<img width="1325" height="673" alt="Screenshot 2026-05-04 at 3 01 20 PM" src="https://github.com/user-attachments/assets/3be2630d-afe9-4f8f-8f31-a5606553a8af" />

## Related issues

> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
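The ring described above (virtual nodes per replica, a clockwise walk to distinct fallback replicas, and one replica per rank) can be sketched in a few lines. This is an illustrative toy, not the `ConsistentHashRouter` implementation: `HashRing`, `ranked_replicas`, and `moved_fraction` are hypothetical names, and `hashlib.blake2b` stands in for `mmh3` so the sketch needs only the standard library.

```python
import bisect
import hashlib
from typing import List, Set, Tuple


def _hash(key: str) -> int:
    # Stand-in for mmh3: a deterministic 64-bit hash from the standard library.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, replica_ids: List[str], num_virtual_nodes: int = 100):
        # Each replica owns `num_virtual_nodes` points on the ring.
        points = sorted(
            (_hash(f"{rid}#{v}"), rid)
            for rid in replica_ids
            for v in range(num_virtual_nodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [rid for _, rid in points]
        self._num_replicas = len(set(self._owners))

    def ranked_replicas(self, key: str, num_fallback_replicas: int = 2) -> List[List[str]]:
        """Return [[primary], [fallback_1], ...] with one replica per rank."""
        if not self._hashes:
            return []
        want = min(1 + num_fallback_replicas, self._num_replicas)
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        ranked: List[str] = []
        while len(ranked) < want:
            owner = self._owners[i]
            if owner not in ranked:  # skip further points of a replica already chosen
                ranked.append(owner)
            i = (i + 1) % len(self._owners)
        return [[rid] for rid in ranked]


def moved_fraction(num_sessions: int = 2000) -> Tuple[float, Set[str]]:
    """Scale from 4 to 5 replicas; report the share of sessions whose primary changed."""
    before = HashRing(["r0", "r1", "r2", "r3"])
    after = HashRing(["r0", "r1", "r2", "r3", "r4"])
    moved, moved_to = 0, set()
    for s in range(num_sessions):
        key = f"session-{s}"
        old = before.ranked_replicas(key)[0][0]
        new = after.ranked_replicas(key)[0][0]
        if old != new:
            moved += 1
            moved_to.add(new)
    return moved / num_sessions, moved_to
```

Two properties are worth noting. Returning each replica in its own single-element rank keeps the order primary, then fallbacks, which is the behavior the first gotcha describes. When the ring grows from N to N+1 replicas, only sessions that now land on the new replica change primary, and the expected share is about 1/(N+1), so `moved_fraction()` should report roughly 0.2 with every moved session going to `r4`.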

