[serve][llm] Replace LLM ingress router replica selection with choose_replica#63280
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the LLMRouter to delegate replica selection to DeploymentHandle.choose_replica, replacing the manual round-robin logic and internal replica caching. The LLMServer deployment is now configured to use the RoundRobinRouter via RequestRouterConfig. Additionally, the temporary _get_request_router method in DeploymentHandle has been removed, and tests have been updated to reflect the new asynchronous routing logic and configuration. I have no feedback to provide.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
98726a2 to
36ee8f6
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 36ee8f6. Configure here.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…DER_KEY) Make the HTTP header that drives Serve's session-stickiness routing operator-configurable via the ``RAY_SERVE_SESSION_ID_HEADER_KEY`` env variable (default ``x-session-id``). Without this, ``ConsistentHashRouter`` can only honor a hard-coded ``x-session-id``/``x_session_id`` header, which doesn't match what clients like aiperf (``x-correlation-id``) or vllm-router's ``--request-id-headers`` plumbing send. Changes ------- - ``_private/constants.py``: ``SERVE_SESSION_ID`` now resolves from ``RAY_SERVE_SESSION_ID_HEADER_KEY`` at import time. - ``_private/http_util.py``: new ``_matches_session_id_header`` helper. Case-insensitive and ``-``↔``_`` tolerant, so intermediate proxies (nginx, AWS API Gateway, ...) that rewrite the separator don't silently drop affinity. ``parse_session_id_header`` switches to it. - ``_private/proxy.py``: HTTP proxy uses the same helper to recognize the configured header on the ingress hop. - ``_private/ingress_request_router.lua.tmpl`` + ``_private/haproxy.py``: HAProxy's Lua ``/internal/route`` action now reads the configured header off the client request and forwards it on the same name. Without this, direct-streaming bypasses ``proxy.py`` and ``session_id`` never reaches ``LLMRouter``. - ``llm/_internal/serve/core/ingress/router.py``: ``LLMRouter`` reads the forwarded header and calls ``handle.options(session_id=...).choose_replica(...)`` so session-aware request routers (``ConsistentHashRouter``, ``PrefixCacheAffinityRouter``) actually see the session id on ``RequestMetadata``. - ``llm/_internal/serve/core/ingress/ingress.py``: ``OpenAiIngress`` propagates the session header on the second handle hop (proxy -> ingress is auto, ingress -> LLMServer is not). - ``experimental/consistent_hash_router.py``: docstring + init-time log line confirming the configured header and kwargs. - ``tests/test_http_headers.py``: parametrized unit test for the matcher across separator/case/env-override permutations. Depends on ray-project#63280 (new ``LLMRouter`` shape). Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…e_replica` (ray-project#63280) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
…uters (#64328) ## Why Serve LLM direct streaming breaks with a content-based router such as `PrefixCacheAffinityRouter`. The trigger is `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with `RAY_SERVE_ENABLE_HA_PROXY=1`. See #64326. The router reads `.messages` or `.prompt` off the first positional routing arg, where the OpenAI ingress puts the parsed request. Direct streaming forwards the raw body bytes to `choose_replica` instead, so the router gets no routing key. Requests either hang or silently fall back to load-balancing. Introduced by #63280. It routed direct streaming through `choose_replica` but passed the body as a `request_body=` kwarg rather than a parsable positional arg. ## What Normalize the body at the ingress so the router contract is the same on both paths. - Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps it in a `SimpleNamespace` over every field, gated by a frozen tuple of routing-key fields (`messages`, `prompt`). It returns `None` for an empty, non-object, unparseable, or keyless body. - `route()` passes the namespace to `choose_replica` positionally, where body-aware routers already look. With no key, nothing is forwarded and the router load-balances. - Warn once per replica when no key is derived. `PrefixCacheAffinityRouter` is unchanged. ### Why a namespace over the whole body The router contract is that body-aware routers scan `pending_request.args` for an object exposing a routing field by attribute. The namespace mirrors that at `args[0]`, the same slot and access the normal ingress uses, so the direct path is indistinguishable from the normal path to any router, and custom routers that read request fields keep working unchanged. The discarded alternatives: - A plain dict fails attribute access, since `hasattr(some_dict, "messages")` is `False`, so every attribute-reading router would break. - A routing field on the request metadata would bypass routers that scan `args`. - Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest` works only for the two endpoints we enumerate, and the body alone can't identify the type for the others (`messages` is shared by chat-completion, chat-embedding, and tokenize). The namespace carries every field, so it serves any request type that has a routing key and exposes the rest to custom routers, with no endpoint lookup and no per-type branching. Bodies without a key degrade to load-balancing. ## Tests `test_router.py` covers `_parse_routing_payload`, a contract test through the real `PrefixCacheAffinityRouter._extract_text_from_request`, and `route()` and `_pick_replica` forwarding plus the warn-once path. A release test runs the prefix-cache-aware router with direct streaming end to end. ## Related issue number Fixes #64326 Introduced by #63280 ## Checks - [x] I've signed off every commit (`-s`) - [x] I've run `scripts/format.sh` to lint the changes in this PR - [ ] Testing done by reviewers / CI --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…uters (ray-project#64328) ## Why Serve LLM direct streaming breaks with a content-based router such as `PrefixCacheAffinityRouter`. The trigger is `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with `RAY_SERVE_ENABLE_HA_PROXY=1`. See ray-project#64326. The router reads `.messages` or `.prompt` off the first positional routing arg, where the OpenAI ingress puts the parsed request. Direct streaming forwards the raw body bytes to `choose_replica` instead, so the router gets no routing key. Requests either hang or silently fall back to load-balancing. Introduced by ray-project#63280. It routed direct streaming through `choose_replica` but passed the body as a `request_body=` kwarg rather than a parsable positional arg. ## What Normalize the body at the ingress so the router contract is the same on both paths. - Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps it in a `SimpleNamespace` over every field, gated by a frozen tuple of routing-key fields (`messages`, `prompt`). It returns `None` for an empty, non-object, unparseable, or keyless body. - `route()` passes the namespace to `choose_replica` positionally, where body-aware routers already look. With no key, nothing is forwarded and the router load-balances. - Warn once per replica when no key is derived. `PrefixCacheAffinityRouter` is unchanged. ### Why a namespace over the whole body The router contract is that body-aware routers scan `pending_request.args` for an object exposing a routing field by attribute. The namespace mirrors that at `args[0]`, the same slot and access the normal ingress uses, so the direct path is indistinguishable from the normal path to any router, and custom routers that read request fields keep working unchanged. The discarded alternatives: - A plain dict fails attribute access, since `hasattr(some_dict, "messages")` is `False`, so every attribute-reading router would break. - A routing field on the request metadata would bypass routers that scan `args`. - Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest` works only for the two endpoints we enumerate, and the body alone can't identify the type for the others (`messages` is shared by chat-completion, chat-embedding, and tokenize). The namespace carries every field, so it serves any request type that has a routing key and exposes the rest to custom routers, with no endpoint lookup and no per-type branching. Bodies without a key degrade to load-balancing. ## Tests `test_router.py` covers `_parse_routing_payload`, a contract test through the real `PrefixCacheAffinityRouter._extract_text_from_request`, and `route()` and `_pick_replica` forwarding plus the warn-once path. A release test runs the prefix-cache-aware router with direct streaming end to end. ## Related issue number Fixes ray-project#64326 Introduced by ray-project#63280 ## Checks - [x] I've signed off every commit (`-s`) - [x] I've run `scripts/format.sh` to lint the changes in this PR - [ ] Testing done by reviewers / CI --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Description
Removes the custom round-robin pick from
LLMRouterand delegates replica selection toDeploymentHandle.choose_replicaon the underlyingLLMServerdeployment. The pick policy is now whatever the user configured viallm_config.deployment_config.request_router_config; the builder fills inRoundRobinRouteras the default when nothing was set.API
Benchmark
No regression from master's implementation on L4, but there is regression identified on H100. Need to apply the changes in this commit to avoid regression, which @jeffreywang-anyscale will do in a follow-up PR.

Related issues
Additional information