Skip to content

[Serve][LLM] Wait for request router init in LLMRouter constructor#63206

Merged
kouroshHakha merged 1 commit into
ray-project:masterfrom
kouroshHakha:fix/llm-router-wait-for-request-router-init
May 7, 2026
Merged

[Serve][LLM] Wait for request router init in LLMRouter constructor#63206
kouroshHakha merged 1 commit into
ray-project:masterfrom
kouroshHakha:fix/llm-router-wait-for-request-router-init

Conversation

@kouroshHakha

Copy link
Copy Markdown
Contributor

Why this is needed

After #63167, LLMRouter replicas started crashing on startup with:

File "/.../ray/llm/_internal/serve/core/ingress/router.py", line 71, in __init__
    raise RuntimeError(
RuntimeError: DeploymentHandle._get_request_router() returned None after _init(); Serve internals may have changed.

LLMRouter.__init__ called handle._init() then immediately handle._get_request_router() — but the underlying RequestRouter is lazy-created on the first DEPLOYMENT_CONFIG long-poll from the controller (see AsyncioRouter.request_router property in serve/_private/router.py). For a brief window after _init() returns, _get_request_router() is still None, so the constructor was racing the controller and reliably failing replica startup.

The cleanup commits in #63167 dropped the previous await self._handle.llm_config.remote() call that masked this race, because assign_request awaits _request_router_initialized internally — but reading _get_request_router() directly bypasses that synchronization point.

Summary

  • LLMRouter.__init__ now polls _get_request_router() until it returns non-None (60s timeout, fails the constructor and lets Serve retry on timeout).
  • Caches the resolved router on self._request_router for the hot path; check_health now consults the cached field.
  • This matches the existing pattern in python/ray/llm/_internal/serve/utils/broadcast.py:67-90, which already documents the same race for callers that bypass assign_request.

Test plan

  • pre-commit run --files python/ray/llm/_internal/serve/core/ingress/router.py
  • Existing unit tests in test_router.py instantiate via LLMRouter.__new__(LLMRouter) and pre-set _request_router, so they're unaffected by the constructor change.
  • CI (go label).

🤖 Generated with Claude Code

LLMRouter.__init__ called handle._init() then immediately
handle._get_request_router(), but the underlying RequestRouter is
lazy-created on the first DEPLOYMENT_CONFIG long-poll, so the
constructor was racing the controller and reliably failing replica
startup with `_get_request_router() returned None after _init()`.

Poll for `_get_request_router()` to become non-None (60s timeout)
before completing __init__. Mirrors the existing pattern in
`utils/broadcast.py`, which calls out the same race for the same
reason: callers reading the request router directly bypass
`assign_request`, which is where the `_request_router_initialized`
event is normally awaited.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 7, 2026 18:57
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label May 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a lazy initialization for the RequestRouter using an asynchronous polling mechanism to wait for controller configuration. The review feedback recommends initializing self._request_router to None at the start of the constructor to avoid a potential AttributeError if check_health is called while the initialization is suspended.

@@ -60,21 +69,36 @@ async def __init__(self, server: DeploymentHandle):
self._cached_endpoints: List[Tuple[str, int, str]] = []
self._handle: DeploymentHandle = server

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Initialize self._request_router to None before the await call. Since __init__ is an async method, it will be suspended at the await point on line 82. If check_health is invoked by the Serve replica controller while the constructor is suspended, it would raise an AttributeError because the attribute has not yet been defined on the instance. Explicitly initializing it to None ensures that check_health (line 101) can safely access the attribute and raise the intended RuntimeError instead.

Suggested change
self._handle: DeploymentHandle = server
self._handle: DeploymentHandle = server
self._request_router = None

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 2eb18cd. Configure here.


async def check_health(self):
if self._handle._get_request_router() is None:
if self._request_router is None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health check is dead code after successful initialization

Low Severity

check_health checks if self._request_router is None, but self._request_router is only ever assigned the return value of _wait_for_request_router, which guarantees a non-None result (it raises on timeout). Since Serve only calls check_health on successfully initialized replicas, this condition can never be True, making the method a no-op. The previous code dynamically queried self._handle._get_request_router(), which could theoretically detect if the router became unavailable post-init. The cached check provides no actual health signal.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2eb18cd. Configure here.

@jeffreywang88 jeffreywang88 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signing off to unblock, but is there an underlying problem in serve that we need to dig into, or is this purely an application problem?

@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 7, 2026
@kouroshHakha

Copy link
Copy Markdown
Contributor Author

Signing off to unblock, but is there an underlying problem in serve that we need to dig into, or is this purely an application problem?

this is an application bug that slipped through from #63167

@kouroshHakha kouroshHakha enabled auto-merge (squash) May 7, 2026 20:03
@jeffreywang88

Copy link
Copy Markdown
Contributor

Fixes LGTM.

@kouroshHakha kouroshHakha merged commit 6a9f86e into ray-project:master May 7, 2026
7 of 8 checks passed
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…uctor (ray-project#63206)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants