[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup by kouroshHakha · Pull Request #63065 · ray-project/ray

kouroshHakha · 2026-05-01T03:38:09Z

Bug

With min_replicas: 0, deploying an OpenAI ingress force-starts every LLMServer at deploy time, defeating scale-to-zero. The ingress also can't pass check_health until every model has successfully started at least once, so one broken model blocks the whole service from becoming healthy.

Root cause

OpenAiIngress.__init__ schedules _setup_handle_and_config_maps, which awaits handle.llm_config.remote() for each deployment. With 0 replicas the call queues at the router, bumps ongoing_requests to 1, and the autoscaler scales 0 → 1. The fetched LLMConfig is byte-identical to what build_openai_app already holds in memory.

Fix

Pass static metadata into the ingress directly. OpenAiIngress.__init__ now takes:

llm_deployments: Dict[str, DeploymentHandle]
model_cards: Dict[str, ModelCard]
lora_paths: Optional[Dict[str, str]] = None

build_openai_app, build_dev_openai_app, build_pd_openai_app, and build_dp_openai_app build these dicts up front. get_lora_model_metadata takes base_path: str instead of LLMConfig. No remote llm_config() call at startup; check_health is a no-op.

Test plan

pytest python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py
pytest python/ray/llm/tests/serve/cpu/deployments/routers/test_dev_ingress.py
pytest python/ray/llm/tests/serve/cpu/deployments/llm/multiplex/test_lora_deployment_base_client.py
Manually deploy multi-model app with min_replicas: 0, send no traffic, verify no replicas start and service reports healthy.

Bug: With min_replicas=0, deploying an OpenAI ingress force-starts every LLMServer at startup, defeating scale-to-zero and gating /-/healthz on every model successfully starting at least once. Root cause: OpenAiIngress.__init__ schedules _setup_handle_and_config_maps, which awaits handle.llm_config.remote() per deployment. With 0 replicas, each call queues at the router, bumps ongoing_requests to 1, and trips autoscaling 0 -> 1. The fetched LLMConfig is identical to what build_openai_app already holds in memory. Fix: Pass static metadata to the ingress directly. OpenAiIngress now takes llm_deployments: Dict[str, DeploymentHandle], model_cards: Dict[str, ModelCard], and lora_paths: Optional[Dict[str, str]]. build_openai_app, build_dev_openai_app, build_pd_openai_app, and build_dp_openai_app construct these dicts up front. get_lora_model_metadata takes base_path instead of LLMConfig. Tests updated. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

gemini-code-assist

Code Review

This pull request refactors the OpenAiIngress initialization to accept llm_deployments, model_cards, and lora_paths as dictionaries keyed by model ID, replacing the previous list-based approach. This change removes the asynchronous remote fetching of model configurations during setup, simplifying the ingress lifecycle and improving initialization reliability. Feedback suggests consolidating the duplicated logic for preparing these arguments in build_openai_app and build_dev_openai_app into a shared helper function to improve maintainability.

- test_3_tier_graph_structure now indexes the ingress's llm_deployments as a Dict keyed by model_id (was a List). - broadcast.py: use the public AsyncioRouter.request_router property, not the underscore field. The property lazily initializes _request_router from the populated replica set; the field is None until someone explicitly assigns to it. Previously the eager llm_config.remote() call in OpenAiIngress.__init__ implicitly triggered this assignment via assign_request(); after removing that, DevIngress sleep/pause isolation tests were the first to surface this latent bug. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

DEPLOYMENT_TARGETS (drives `running_replicas_populated`) and DEPLOYMENT_CONFIG (drives `_request_router_class`, which the lazy `request_router` property needs) are independent long-polls. Polling only the former races with the latter — the request_router property returns None even when replicas are populated. Wait for both. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{Reviewed by Cursor Bugbot for commit 652cc31. Configure here.}

cursor · 2026-05-01T21:32:26Z

-
-    if request_router is None:
-        raise RuntimeError("Request router not initialized. No replicas accessible.")
+    request_router = _get_request_router()


Property access may trigger router init from wrong thread

Low Severity

_get_request_router accesses router._asyncio_router.request_router (a property with lazy initialization) instead of the previous router._asyncio_router._request_router (a direct attribute read). The property's lazy init creates a RequestRouter and calls self._request_router_initialized.set() on an asyncio.Event that was created on a different thread's event loop. Calling set() from the broadcast thread is not thread-safe per asyncio's contract and could race with the event loop thread calling update_deployment_targets.

^{Reviewed by Cursor Bugbot for commit 652cc31. Configure here.}

eicherseiji · 2026-05-01T23:17:57Z


-    # Wait for the router to be populated with replicas
-    # We add a timeout to prevent infinite hanging
+    router = handle._router


Is this change meant to be included?

I have to check what claude did, it was not able to pass the ci/cd tests and now it's passing. will report back :D

yeah, I don't know why this has been triggered specifically during this pr, but it's basically safe guarding a race condition. without these test_dev_ingress would fail.

eicherseiji

lgtm

Serve core - Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a class supplies its own ASGI app via __serve_build_asgi_app__. Drops the placeholder FastAPI() and the dual _set_asgi_app() init. - Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so /direct ingress dispatch skips a per-request RequestMetadata construction and two generate_request_id() UUIDs. LLM engine - Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine implements it. LLMServer.__serve_build_asgi_app__ is a one-line delegate to self.engine.build_asgi_app(). Drops the public vllm_args/engine_client properties that leaked vLLM internals. LLM router - Bind via bind(server=llm_deployment) so the handle is injected as an init kwarg; drops the magic-string serve.get_deployment_handle(). - Replace the awaited llm_config.remote() priming RPC with a synchronous DeploymentHandle._init() + cached _get_request_router(). Removes the 60s timeout, the dependency on a method being removed in ray-project#63065, and the engine-cold-start coupling. - check_health() now asserts the request router is non-None as a liveness signal. - Drop GET / (HAProxy never calls it; Serve uses check_health for readiness). Keep GET /health as a human-operator convenience. - Cache the RequestRouter on self at init; precompute (host, port, full_id_str) tuples in _ready_endpoints to skip per-request formatting; two-tier cache (id(curr_replicas) fast path then frozenset(keys) fallback) avoids the frozenset allocation when curr_replicas is unchanged. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…ay-project#63065) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha requested a review from a team as a code owner May 1, 2026 03:38

kouroshHakha added the go add ONLY when ready to merge, run all tests label May 1, 2026

cursor Bot reviewed May 1, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/core/ingress/builder.py

gemini-code-assist Bot reviewed May 1, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/core/ingress/dev_ingress.py

ray-gardener Bot added the serve Ray Serve Related Issue label May 1, 2026

cursor Bot reviewed May 1, 2026

View reviewed changes

eicherseiji reviewed May 2, 2026

View reviewed changes

kouroshHakha enabled auto-merge (squash) May 5, 2026 02:00

eicherseiji approved these changes May 5, 2026

View reviewed changes

kouroshHakha merged commit 665ed74 into ray-project:master May 5, 2026
7 checks passed

kouroshHakha mentioned this pull request May 7, 2026

[Serve][5/5] Enable direct streaming for Ray Serve LLM apps #63167

Merged

4 tasks

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026

[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup (r…

1fc6beb

…ay-project#63065) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup#63065

[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup#63065
kouroshHakha merged 3 commits into
ray-project:masterfrom
kouroshHakha:kourosh/llm-ingress-no-eager-config-fetch

kouroshHakha commented May 1, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

cursor Bot left a comment

cursor Bot May 1, 2026

eicherseiji May 1, 2026

kouroshHakha May 2, 2026

kouroshHakha May 5, 2026 •

edited

Loading

eicherseiji left a comment

Uh oh!

Labels

2 participants

Uh oh!

Conversation

kouroshHakha commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug

Root cause

Fix

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

cursor Bot May 1, 2026

Choose a reason for hiding this comment

Property access may trigger router init from wrong thread

eicherseiji May 1, 2026

Choose a reason for hiding this comment

kouroshHakha May 2, 2026

Choose a reason for hiding this comment

kouroshHakha May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

eicherseiji left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

2 participants

kouroshHakha commented May 1, 2026 •

edited

Loading

kouroshHakha May 5, 2026 •

edited

Loading