[serve][llm] Replace LLM ingress router replica selection with `choose_replica` by jeffreywang88 · Pull Request #63280 · ray-project/ray

jeffreywang88 · 2026-05-11T20:23:40Z

Description

Removes the custom round-robin pick from LLMRouter and delegates replica selection to DeploymentHandle.choose_replica on the underlying LLMServer deployment. The pick policy is now whatever the user configured via llm_config.deployment_config.request_router_config; the builder fills in RoundRobinRouter as the default when nothing was set.

API

Declarative (YAML)

  applications:
  - import_path: ray.serve.llm:build_openai_app
    name: llm_app
    route_prefix: "/"
    args:
      llm_configs:
        - model_loading_config:
            model_id: qwen3-4b
            model_source: Qwen/Qwen3-4B
          deployment_config:
            autoscaling_config: { min_replicas: 8, max_replicas: 8 }
            request_router_config: 
              request_router_class: ray.serve.llm.request_router.PrefixCacheAffinityRouter
              request_router_kwargs:
                match_rate_threshold: 0.1

Imperative (Python)

  llm_config = LLMConfig(
      model_loading_config={"model_id": "qwen3-4b", "model_source": "Qwen/Qwen3-4B"},
      deployment_config={
          "autoscaling_config": {"min_replicas": 8, "max_replicas": 8},
          "request_router_config": RequestRouterConfig(
              request_router_class=PrefixCacheAffinityRouter,
          ),
      },
  )

Benchmark

No regression from master's implementation on L4, but there is regression identified on H100. Need to apply the changes in this commit to avoid regression, which @jeffreywang-anyscale will do in a follow-up PR.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

gemini-code-assist

Code Review

This pull request refactors the LLMRouter to delegate replica selection to DeploymentHandle.choose_replica, replacing the manual round-robin logic and internal replica caching. The LLMServer deployment is now configured to use the RoundRobinRouter via RequestRouterConfig. Additionally, the temporary _get_request_router method in DeploymentHandle has been removed, and tests have been updated to reflect the new asynchronous routing logic and configuration. I have no feedback to provide.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 36ee8f6. Configure here.}

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

…DER_KEY) Make the HTTP header that drives Serve's session-stickiness routing operator-configurable via the ``RAY_SERVE_SESSION_ID_HEADER_KEY`` env variable (default ``x-session-id``). Without this, ``ConsistentHashRouter`` can only honor a hard-coded ``x-session-id``/``x_session_id`` header, which doesn't match what clients like aiperf (``x-correlation-id``) or vllm-router's ``--request-id-headers`` plumbing send. Changes ------- - ``_private/constants.py``: ``SERVE_SESSION_ID`` now resolves from ``RAY_SERVE_SESSION_ID_HEADER_KEY`` at import time. - ``_private/http_util.py``: new ``_matches_session_id_header`` helper. Case-insensitive and ``-``↔``_`` tolerant, so intermediate proxies (nginx, AWS API Gateway, ...) that rewrite the separator don't silently drop affinity. ``parse_session_id_header`` switches to it. - ``_private/proxy.py``: HTTP proxy uses the same helper to recognize the configured header on the ingress hop. - ``_private/ingress_request_router.lua.tmpl`` + ``_private/haproxy.py``: HAProxy's Lua ``/internal/route`` action now reads the configured header off the client request and forwards it on the same name. Without this, direct-streaming bypasses ``proxy.py`` and ``session_id`` never reaches ``LLMRouter``. - ``llm/_internal/serve/core/ingress/router.py``: ``LLMRouter`` reads the forwarded header and calls ``handle.options(session_id=...).choose_replica(...)`` so session-aware request routers (``ConsistentHashRouter``, ``PrefixCacheAffinityRouter``) actually see the session id on ``RequestMetadata``. - ``llm/_internal/serve/core/ingress/ingress.py``: ``OpenAiIngress`` propagates the session header on the second handle hop (proxy -> ingress is auto, ingress -> LLMServer is not). - ``experimental/consistent_hash_router.py``: docstring + init-time log line confirming the configured header and kwargs. - ``tests/test_http_headers.py``: parametrized unit test for the matcher across separator/case/env-override permutations. Depends on ray-project#63280 (new ``LLMRouter`` shape). Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…e_replica` (ray-project#63280) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

…uters (#64328) ## Why Serve LLM direct streaming breaks with a content-based router such as `PrefixCacheAffinityRouter`. The trigger is `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with `RAY_SERVE_ENABLE_HA_PROXY=1`. See #64326. The router reads `.messages` or `.prompt` off the first positional routing arg, where the OpenAI ingress puts the parsed request. Direct streaming forwards the raw body bytes to `choose_replica` instead, so the router gets no routing key. Requests either hang or silently fall back to load-balancing. Introduced by #63280. It routed direct streaming through `choose_replica` but passed the body as a `request_body=` kwarg rather than a parsable positional arg. ## What Normalize the body at the ingress so the router contract is the same on both paths. - Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps it in a `SimpleNamespace` over every field, gated by a frozen tuple of routing-key fields (`messages`, `prompt`). It returns `None` for an empty, non-object, unparseable, or keyless body. - `route()` passes the namespace to `choose_replica` positionally, where body-aware routers already look. With no key, nothing is forwarded and the router load-balances. - Warn once per replica when no key is derived. `PrefixCacheAffinityRouter` is unchanged. ### Why a namespace over the whole body The router contract is that body-aware routers scan `pending_request.args` for an object exposing a routing field by attribute. The namespace mirrors that at `args[0]`, the same slot and access the normal ingress uses, so the direct path is indistinguishable from the normal path to any router, and custom routers that read request fields keep working unchanged. The discarded alternatives: - A plain dict fails attribute access, since `hasattr(some_dict, "messages")` is `False`, so every attribute-reading router would break. - A routing field on the request metadata would bypass routers that scan `args`. - Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest` works only for the two endpoints we enumerate, and the body alone can't identify the type for the others (`messages` is shared by chat-completion, chat-embedding, and tokenize). The namespace carries every field, so it serves any request type that has a routing key and exposes the rest to custom routers, with no endpoint lookup and no per-type branching. Bodies without a key degrade to load-balancing. ## Tests `test_router.py` covers `_parse_routing_payload`, a contract test through the real `PrefixCacheAffinityRouter._extract_text_from_request`, and `route()` and `_pick_replica` forwarding plus the warn-once path. A release test runs the prefix-cache-aware router with direct streaming end to end. ## Related issue number Fixes #64326 Introduced by #63280 ## Checks - [x] I've signed off every commit (`-s`) - [x] I've run `scripts/format.sh` to lint the changes in this PR - [ ] Testing done by reviewers / CI --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…uters (ray-project#64328) ## Why Serve LLM direct streaming breaks with a content-based router such as `PrefixCacheAffinityRouter`. The trigger is `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with `RAY_SERVE_ENABLE_HA_PROXY=1`. See ray-project#64326. The router reads `.messages` or `.prompt` off the first positional routing arg, where the OpenAI ingress puts the parsed request. Direct streaming forwards the raw body bytes to `choose_replica` instead, so the router gets no routing key. Requests either hang or silently fall back to load-balancing. Introduced by ray-project#63280. It routed direct streaming through `choose_replica` but passed the body as a `request_body=` kwarg rather than a parsable positional arg. ## What Normalize the body at the ingress so the router contract is the same on both paths. - Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps it in a `SimpleNamespace` over every field, gated by a frozen tuple of routing-key fields (`messages`, `prompt`). It returns `None` for an empty, non-object, unparseable, or keyless body. - `route()` passes the namespace to `choose_replica` positionally, where body-aware routers already look. With no key, nothing is forwarded and the router load-balances. - Warn once per replica when no key is derived. `PrefixCacheAffinityRouter` is unchanged. ### Why a namespace over the whole body The router contract is that body-aware routers scan `pending_request.args` for an object exposing a routing field by attribute. The namespace mirrors that at `args[0]`, the same slot and access the normal ingress uses, so the direct path is indistinguishable from the normal path to any router, and custom routers that read request fields keep working unchanged. The discarded alternatives: - A plain dict fails attribute access, since `hasattr(some_dict, "messages")` is `False`, so every attribute-reading router would break. - A routing field on the request metadata would bypass routers that scan `args`. - Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest` works only for the two endpoints we enumerate, and the body alone can't identify the type for the others (`messages` is shared by chat-completion, chat-embedding, and tokenize). The namespace carries every field, so it serves any request type that has a routing key and exposes the rest to custom routers, with no endpoint lookup and no per-type branching. Bodies without a key degrade to load-balancing. ## Tests `test_router.py` covers `_parse_routing_payload`, a contract test through the real `PrefixCacheAffinityRouter._extract_text_from_request`, and `route()` and `_pick_replica` forwarding plus the warn-once path. A release test runs the prefix-cache-aware router with direct streaming end to end. ## Related issue number Fixes ray-project#64326 Introduced by ray-project#63280 ## Checks - [x] I've signed off every commit (`-s`) - [x] I've run `scripts/format.sh` to lint the changes in this PR - [ ] Testing done by reviewers / CI --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>

jeffreywang88 added the go add ONLY when ready to merge, run all tests label May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026

View reviewed changes

Base automatically changed from decouple-routing-primitives-3 to master May 12, 2026 03:16

Replace LLM ingress router replica selection with choose_replica

36ee8f6

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha force-pushed the llm-routing-apply-decouple-primitives branch from 98726a2 to 36ee8f6 Compare May 15, 2026 06:36

kouroshHakha marked this pull request as ready for review May 15, 2026 06:37

kouroshHakha requested review from a team as code owners May 15, 2026 06:37

cursor Bot reviewed May 15, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/core/ingress/router.py

kouroshHakha mentioned this pull request May 15, 2026

[serve][llm] Configurable session-id header (RAY_SERVE_SESSION_ID_HEADER_KEY) #63362

Merged

4 tasks

ray-gardener Bot added serve Ray Serve Related Issue llm labels May 15, 2026

Self-review

05349f3

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

kouroshHakha approved these changes May 15, 2026

View reviewed changes

kouroshHakha enabled auto-merge (squash) May 15, 2026 16:41

Merge branch 'master' into llm-routing-apply-decouple-primitives

bcfadf9

github-actions Bot disabled auto-merge May 15, 2026 16:43

kouroshHakha enabled auto-merge (squash) May 15, 2026 17:27

kouroshHakha merged commit 31510bc into master May 15, 2026
7 checks passed

kouroshHakha deleted the llm-routing-apply-decouple-primitives branch May 15, 2026 18:18

eicherseiji mentioned this pull request Jun 4, 2026

[serve][llm][doc] Document direct streaming and ingress request router #63860

Merged

3 tasks

eicherseiji mentioned this pull request Jun 24, 2026

[serve][llm] Parse direct-streaming routing payload for body-aware routers #64328

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[serve][llm] Replace LLM ingress router replica selection with `choose_replica`#63280

[serve][llm] Replace LLM ingress router replica selection with `choose_replica`#63280
kouroshHakha merged 3 commits into
masterfrom
llm-routing-apply-decouple-primitives

jeffreywang88 commented May 11, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

cursor Bot left a comment

Uh oh!

Uh oh!

Labels

2 participants

Uh oh!

Conversation

jeffreywang88 commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

API

Benchmark

Related issues

Additional information

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Labels

2 participants

jeffreywang88 commented May 11, 2026 •

edited

Loading