[Serve][LLM] Wait for request router init in LLMRouter constructor by kouroshHakha · Pull Request #63206 · ray-project/ray

kouroshHakha · 2026-05-07T18:57:31Z

Why this is needed

After #63167, LLMRouter replicas started crashing on startup with:

File "/.../ray/llm/_internal/serve/core/ingress/router.py", line 71, in __init__
    raise RuntimeError(
RuntimeError: DeploymentHandle._get_request_router() returned None after _init(); Serve internals may have changed.

LLMRouter.__init__ called handle._init() then immediately handle._get_request_router() — but the underlying RequestRouter is lazy-created on the first DEPLOYMENT_CONFIG long-poll from the controller (see AsyncioRouter.request_router property in serve/_private/router.py). For a brief window after _init() returns, _get_request_router() is still None, so the constructor was racing the controller and reliably failing replica startup.

The cleanup commits in #63167 dropped the previous await self._handle.llm_config.remote() call that masked this race, because assign_request awaits _request_router_initialized internally — but reading _get_request_router() directly bypasses that synchronization point.

Summary

LLMRouter.__init__ now polls _get_request_router() until it returns non-None (60s timeout, fails the constructor and lets Serve retry on timeout).
Caches the resolved router on self._request_router for the hot path; check_health now consults the cached field.
This matches the existing pattern in python/ray/llm/_internal/serve/utils/broadcast.py:67-90, which already documents the same race for callers that bypass assign_request.

Test plan

pre-commit run --files python/ray/llm/_internal/serve/core/ingress/router.py
Existing unit tests in test_router.py instantiate via LLMRouter.__new__(LLMRouter) and pre-set _request_router, so they're unaffected by the constructor change.
CI (go label).

🤖 Generated with Claude Code

LLMRouter.__init__ called handle._init() then immediately handle._get_request_router(), but the underlying RequestRouter is lazy-created on the first DEPLOYMENT_CONFIG long-poll, so the constructor was racing the controller and reliably failing replica startup with `_get_request_router() returned None after _init()`. Poll for `_get_request_router()` to become non-None (60s timeout) before completing __init__. Mirrors the existing pattern in `utils/broadcast.py`, which calls out the same race for the same reason: callers reading the request router directly bypass `assign_request`, which is where the `_request_router_initialized` event is normally awaited. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

gemini-code-assist

Code Review

This pull request implements a lazy initialization for the RequestRouter using an asynchronous polling mechanism to wait for controller configuration. The review feedback recommends initializing self._request_router to None at the start of the constructor to avoid a potential AttributeError if check_health is called while the initialization is suspended.

gemini-code-assist · 2026-05-07T18:59:12Z

@@ -60,21 +69,36 @@ async def __init__(self, server: DeploymentHandle):
        self._cached_endpoints: List[Tuple[str, int, str]] = []
        self._handle: DeploymentHandle = server


Initialize self._request_router to None before the await call. Since __init__ is an async method, it will be suspended at the await point on line 82. If check_health is invoked by the Serve replica controller while the constructor is suspended, it would raise an AttributeError because the attribute has not yet been defined on the instance. Explicitly initializing it to None ensures that check_health (line 101) can safely access the attribute and raise the intended RuntimeError instead.

Suggested change

self._handle: DeploymentHandle = server

self._handle: DeploymentHandle = server

self._request_router = None

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 2eb18cd. Configure here.}

cursor · 2026-05-07T19:01:10Z


    async def check_health(self):
-        if self._handle._get_request_router() is None:
+        if self._request_router is None:


Health check is dead code after successful initialization

Low Severity

check_health checks if self._request_router is None, but self._request_router is only ever assigned the return value of _wait_for_request_router, which guarantees a non-None result (it raises on timeout). Since Serve only calls check_health on successfully initialized replicas, this condition can never be True, making the method a no-op. The previous code dynamically queried self._handle._get_request_router(), which could theoretically detect if the router became unavailable post-init. The cached check provides no actual health signal.

^{Reviewed by Cursor Bugbot for commit 2eb18cd. Configure here.}

jeffreywang88

Signing off to unblock, but is there an underlying problem in serve that we need to dig into, or is this purely an application problem?

kouroshHakha · 2026-05-07T20:02:49Z

Signing off to unblock, but is there an underlying problem in serve that we need to dig into, or is this purely an application problem?

this is an application bug that slipped through from #63167

jeffreywang88 · 2026-05-07T20:17:35Z

Fixes LGTM.

…uctor (ray-project#63206) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha requested a review from a team as a code owner May 7, 2026 18:57

kouroshHakha added the go add ONLY when ready to merge, run all tests label May 7, 2026

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

cursor Bot reviewed May 7, 2026

View reviewed changes

jeffreywang88 approved these changes May 7, 2026

View reviewed changes

ray-gardener Bot added the serve Ray Serve Related Issue label May 7, 2026

kouroshHakha enabled auto-merge (squash) May 7, 2026 20:03

kouroshHakha merged commit 6a9f86e into ray-project:master May 7, 2026
7 of 8 checks passed

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026

[Serve][LLM][Bugfix] Wait for request router init in LLMRouter constr…

e18aa5b

…uctor (ray-project#63206) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Serve][LLM] Wait for request router init in LLMRouter constructor#63206

[Serve][LLM] Wait for request router init in LLMRouter constructor#63206
kouroshHakha merged 1 commit into
ray-project:masterfrom
kouroshHakha:fix/llm-router-wait-for-request-router-init

kouroshHakha commented May 7, 2026

gemini-code-assist Bot left a comment

gemini-code-assist Bot May 7, 2026

cursor Bot left a comment

cursor Bot May 7, 2026

jeffreywang88 left a comment •

edited

Loading

kouroshHakha commented May 7, 2026

jeffreywang88 commented May 7, 2026

Uh oh!

Labels

2 participants

		@@ -60,21 +69,36 @@ async def __init__(self, server: DeploymentHandle):
		self._cached_endpoints: List[Tuple[str, int, str]] = []
		self._handle: DeploymentHandle = server

	self._handle: DeploymentHandle = server
	self._handle: DeploymentHandle = server
	self._request_router = None

Uh oh!

Conversation

kouroshHakha commented May 7, 2026

Why this is needed

Summary

Test plan

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

cursor Bot left a comment

Choose a reason for hiding this comment

cursor Bot May 7, 2026

Choose a reason for hiding this comment

Health check is dead code after successful initialization

jeffreywang88 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

kouroshHakha commented May 7, 2026

jeffreywang88 commented May 7, 2026

Uh oh!

Labels

2 participants

jeffreywang88 left a comment •

edited

Loading