Skip to content

[serve][llm] Replace LLM ingress router replica selection with choose_replica#63280

Merged
kouroshHakha merged 3 commits into
masterfrom
llm-routing-apply-decouple-primitives
May 15, 2026
Merged

[serve][llm] Replace LLM ingress router replica selection with choose_replica#63280
kouroshHakha merged 3 commits into
masterfrom
llm-routing-apply-decouple-primitives

Conversation

@jeffreywang88

@jeffreywang88 jeffreywang88 commented May 11, 2026

Copy link
Copy Markdown
Contributor

Description

Removes the custom round-robin pick from LLMRouter and delegates replica selection to DeploymentHandle.choose_replica on the underlying LLMServer deployment. The pick policy is now whatever the user configured via llm_config.deployment_config.request_router_config; the builder fills in RoundRobinRouter as the default when nothing was set.

API

  • Declarative (YAML)
  applications:
  - import_path: ray.serve.llm:build_openai_app
    name: llm_app
    route_prefix: "/"
    args:
      llm_configs:
        - model_loading_config:
            model_id: qwen3-4b
            model_source: Qwen/Qwen3-4B
          deployment_config:
            autoscaling_config: { min_replicas: 8, max_replicas: 8 }
            request_router_config: 
              request_router_class: ray.serve.llm.request_router.PrefixCacheAffinityRouter
              request_router_kwargs:
                match_rate_threshold: 0.1
  • Imperative (Python)
  llm_config = LLMConfig(
      model_loading_config={"model_id": "qwen3-4b", "model_source": "Qwen/Qwen3-4B"},
      deployment_config={
          "autoscaling_config": {"min_replicas": 8, "max_replicas": 8},
          "request_router_config": RequestRouterConfig(
              request_router_class=PrefixCacheAffinityRouter,
          ),
      },
  )

Benchmark

No regression from master's implementation on L4, but there is regression identified on H100. Need to apply the changes in this commit to avoid regression, which @jeffreywang-anyscale will do in a follow-up PR.
Screenshot 2026-05-11 at 4 12 50 PM

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label May 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the LLMRouter to delegate replica selection to DeploymentHandle.choose_replica, replacing the manual round-robin logic and internal replica caching. The LLMServer deployment is now configured to use the RoundRobinRouter via RequestRouterConfig. Additionally, the temporary _get_request_router method in DeploymentHandle has been removed, and tests have been updated to reflect the new asynchronous routing logic and configuration. I have no feedback to provide.

Base automatically changed from decouple-routing-primitives-3 to master May 12, 2026 03:16
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha force-pushed the llm-routing-apply-decouple-primitives branch from 98726a2 to 36ee8f6 Compare May 15, 2026 06:36
@kouroshHakha kouroshHakha marked this pull request as ready for review May 15, 2026 06:37
@kouroshHakha kouroshHakha requested review from a team as code owners May 15, 2026 06:37

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 36ee8f6. Configure here.

Comment thread python/ray/llm/_internal/serve/core/ingress/router.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) May 15, 2026 16:41
@github-actions github-actions Bot disabled auto-merge May 15, 2026 16:43
@kouroshHakha kouroshHakha enabled auto-merge (squash) May 15, 2026 17:27
@kouroshHakha kouroshHakha merged commit 31510bc into master May 15, 2026
7 checks passed
@kouroshHakha kouroshHakha deleted the llm-routing-apply-decouple-primitives branch May 15, 2026 18:18
kouroshHakha added a commit to kouroshHakha/ray that referenced this pull request May 15, 2026
…DER_KEY)

Make the HTTP header that drives Serve's session-stickiness routing
operator-configurable via the ``RAY_SERVE_SESSION_ID_HEADER_KEY`` env
variable (default ``x-session-id``). Without this, ``ConsistentHashRouter``
can only honor a hard-coded ``x-session-id``/``x_session_id`` header,
which doesn't match what clients like aiperf (``x-correlation-id``) or
vllm-router's ``--request-id-headers`` plumbing send.

Changes
-------
- ``_private/constants.py``: ``SERVE_SESSION_ID`` now resolves from
  ``RAY_SERVE_SESSION_ID_HEADER_KEY`` at import time.
- ``_private/http_util.py``: new ``_matches_session_id_header`` helper.
  Case-insensitive and ``-``↔``_`` tolerant, so intermediate proxies
  (nginx, AWS API Gateway, ...) that rewrite the separator don't
  silently drop affinity. ``parse_session_id_header`` switches to it.
- ``_private/proxy.py``: HTTP proxy uses the same helper to recognize
  the configured header on the ingress hop.
- ``_private/ingress_request_router.lua.tmpl`` +
  ``_private/haproxy.py``: HAProxy's Lua ``/internal/route`` action now
  reads the configured header off the client request and forwards it on
  the same name. Without this, direct-streaming bypasses ``proxy.py``
  and ``session_id`` never reaches ``LLMRouter``.
- ``llm/_internal/serve/core/ingress/router.py``: ``LLMRouter`` reads
  the forwarded header and calls
  ``handle.options(session_id=...).choose_replica(...)`` so
  session-aware request routers (``ConsistentHashRouter``,
  ``PrefixCacheAffinityRouter``) actually see the session id on
  ``RequestMetadata``.
- ``llm/_internal/serve/core/ingress/ingress.py``: ``OpenAiIngress``
  propagates the session header on the second handle hop (proxy ->
  ingress is auto, ingress -> LLMServer is not).
- ``experimental/consistent_hash_router.py``: docstring +
  init-time log line confirming the configured header and kwargs.
- ``tests/test_http_headers.py``: parametrized unit test for the
  matcher across separator/case/env-override permutations.

Depends on ray-project#63280 (new ``LLMRouter`` shape).

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…e_replica` (ray-project#63280)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
eicherseiji added a commit that referenced this pull request Jun 25, 2026
…uters (#64328)

## Why

Serve LLM direct streaming breaks with a content-based router such as
`PrefixCacheAffinityRouter`. The trigger is
`RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with
`RAY_SERVE_ENABLE_HA_PROXY=1`. See #64326.

The router reads `.messages` or `.prompt` off the first positional
routing arg, where the OpenAI ingress puts the parsed request. Direct
streaming forwards the raw body bytes to `choose_replica` instead, so
the router gets no routing key. Requests either hang or silently fall
back to load-balancing.

Introduced by #63280. It routed direct streaming through
`choose_replica` but passed the body as a `request_body=` kwarg rather
than a parsable positional arg.

## What

Normalize the body at the ingress so the router contract is the same on
both paths.

- Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps
it in a `SimpleNamespace` over every field, gated by a frozen tuple of
routing-key fields (`messages`, `prompt`). It returns `None` for an
empty, non-object, unparseable, or keyless body.
- `route()` passes the namespace to `choose_replica` positionally, where
body-aware routers already look. With no key, nothing is forwarded and
the router load-balances.
- Warn once per replica when no key is derived.

`PrefixCacheAffinityRouter` is unchanged.

### Why a namespace over the whole body

The router contract is that body-aware routers scan
`pending_request.args` for an object exposing a routing field by
attribute. The namespace mirrors that at `args[0]`, the same slot and
access the normal ingress uses, so the direct path is indistinguishable
from the normal path to any router, and custom routers that read request
fields keep working unchanged.

The discarded alternatives:

- A plain dict fails attribute access, since `hasattr(some_dict,
"messages")` is `False`, so every attribute-reading router would break.
- A routing field on the request metadata would bypass routers that scan
`args`.
- Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest`
works only for the two endpoints we enumerate, and the body alone can't
identify the type for the others (`messages` is shared by
chat-completion, chat-embedding, and tokenize). The namespace carries
every field, so it serves any request type that has a routing key and
exposes the rest to custom routers, with no endpoint lookup and no
per-type branching. Bodies without a key degrade to load-balancing.

## Tests

`test_router.py` covers `_parse_routing_payload`, a contract test
through the real `PrefixCacheAffinityRouter._extract_text_from_request`,
and `route()` and `_pick_replica` forwarding plus the warn-once path. A
release test runs the prefix-cache-aware router with direct streaming
end to end.

## Related issue number

Fixes #64326
Introduced by #63280

## Checks

- [x] I've signed off every commit (`-s`)
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [ ] Testing done by reviewers / CI

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…uters (ray-project#64328)

## Why

Serve LLM direct streaming breaks with a content-based router such as
`PrefixCacheAffinityRouter`. The trigger is
`RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1` with
`RAY_SERVE_ENABLE_HA_PROXY=1`. See ray-project#64326.

The router reads `.messages` or `.prompt` off the first positional
routing arg, where the OpenAI ingress puts the parsed request. Direct
streaming forwards the raw body bytes to `choose_replica` instead, so
the router gets no routing key. Requests either hang or silently fall
back to load-balancing.

Introduced by ray-project#63280. It routed direct streaming through
`choose_replica` but passed the body as a `request_body=` kwarg rather
than a parsable positional arg.

## What

Normalize the body at the ingress so the router contract is the same on
both paths.

- Add `_parse_routing_payload(body)`. It `json.loads` the body and wraps
it in a `SimpleNamespace` over every field, gated by a frozen tuple of
routing-key fields (`messages`, `prompt`). It returns `None` for an
empty, non-object, unparseable, or keyless body.
- `route()` passes the namespace to `choose_replica` positionally, where
body-aware routers already look. With no key, nothing is forwarded and
the router load-balances.
- Warn once per replica when no key is derived.

`PrefixCacheAffinityRouter` is unchanged.

### Why a namespace over the whole body

The router contract is that body-aware routers scan
`pending_request.args` for an object exposing a routing field by
attribute. The namespace mirrors that at `args[0]`, the same slot and
access the normal ingress uses, so the direct path is indistinguishable
from the normal path to any router, and custom routers that read request
fields keep working unchanged.

The discarded alternatives:

- A plain dict fails attribute access, since `hasattr(some_dict,
"messages")` is `False`, so every attribute-reading router would break.
- A routing field on the request metadata would bypass routers that scan
`args`.
- Reconstructing the typed `ChatCompletionRequest` / `CompletionRequest`
works only for the two endpoints we enumerate, and the body alone can't
identify the type for the others (`messages` is shared by
chat-completion, chat-embedding, and tokenize). The namespace carries
every field, so it serves any request type that has a routing key and
exposes the rest to custom routers, with no endpoint lookup and no
per-type branching. Bodies without a key degrade to load-balancing.

## Tests

`test_router.py` covers `_parse_routing_payload`, a contract test
through the real `PrefixCacheAffinityRouter._extract_text_from_request`,
and `route()` and `_pick_replica` forwarding plus the warn-once path. A
release test runs the prefix-cache-aware router with direct streaming
end to end.

## Related issue number

Fixes ray-project#64326
Introduced by ray-project#63280

## Checks

- [x] I've signed off every commit (`-s`)
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [ ] Testing done by reviewers / CI

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests llm serve Ray Serve Related Issue

2 participants