Skip to content

[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup#63065

Merged
kouroshHakha merged 3 commits into
ray-project:masterfrom
kouroshHakha:kourosh/llm-ingress-no-eager-config-fetch
May 5, 2026
Merged

[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup#63065
kouroshHakha merged 3 commits into
ray-project:masterfrom
kouroshHakha:kourosh/llm-ingress-no-eager-config-fetch

Conversation

@kouroshHakha

@kouroshHakha kouroshHakha commented May 1, 2026

Copy link
Copy Markdown
Contributor

Bug

With min_replicas: 0, deploying an OpenAI ingress force-starts every LLMServer at deploy time, defeating scale-to-zero. The ingress also can't pass check_health until every model has successfully started at least once, so one broken model blocks the whole service from becoming healthy.

Root cause

OpenAiIngress.__init__ schedules _setup_handle_and_config_maps, which awaits handle.llm_config.remote() for each deployment. With 0 replicas the call queues at the router, bumps ongoing_requests to 1, and the autoscaler scales 0 → 1. The fetched LLMConfig is byte-identical to what build_openai_app already holds in memory.

Fix

Pass static metadata into the ingress directly. OpenAiIngress.__init__ now takes:

  • llm_deployments: Dict[str, DeploymentHandle]
  • model_cards: Dict[str, ModelCard]
  • lora_paths: Optional[Dict[str, str]] = None

build_openai_app, build_dev_openai_app, build_pd_openai_app, and build_dp_openai_app build these dicts up front. get_lora_model_metadata takes base_path: str instead of LLMConfig. No remote llm_config() call at startup; check_health is a no-op.

Test plan

  • pytest python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py
  • pytest python/ray/llm/tests/serve/cpu/deployments/routers/test_dev_ingress.py
  • pytest python/ray/llm/tests/serve/cpu/deployments/llm/multiplex/test_lora_deployment_base_client.py
  • Manually deploy multi-model app with min_replicas: 0, send no traffic, verify no replicas start and service reports healthy.
Bug: With min_replicas=0, deploying an OpenAI ingress force-starts every
LLMServer at startup, defeating scale-to-zero and gating /-/healthz on
every model successfully starting at least once.

Root cause: OpenAiIngress.__init__ schedules _setup_handle_and_config_maps,
which awaits handle.llm_config.remote() per deployment. With 0 replicas,
each call queues at the router, bumps ongoing_requests to 1, and trips
autoscaling 0 -> 1. The fetched LLMConfig is identical to what
build_openai_app already holds in memory.

Fix: Pass static metadata to the ingress directly. OpenAiIngress now takes
llm_deployments: Dict[str, DeploymentHandle], model_cards: Dict[str, ModelCard],
and lora_paths: Optional[Dict[str, str]]. build_openai_app, build_dev_openai_app,
build_pd_openai_app, and build_dp_openai_app construct these dicts up front.
get_lora_model_metadata takes base_path instead of LLMConfig. Tests updated.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 1, 2026 03:38
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label May 1, 2026
Comment thread python/ray/llm/_internal/serve/core/ingress/builder.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the OpenAiIngress initialization to accept llm_deployments, model_cards, and lora_paths as dictionaries keyed by model ID, replacing the previous list-based approach. This change removes the asynchronous remote fetching of model configurations during setup, simplifying the ingress lifecycle and improving initialization reliability. Feedback suggests consolidating the duplicated logic for preparing these arguments in build_openai_app and build_dev_openai_app into a shared helper function to improve maintainability.

Comment thread python/ray/llm/_internal/serve/core/ingress/dev_ingress.py
- test_3_tier_graph_structure now indexes the ingress's llm_deployments
  as a Dict keyed by model_id (was a List).
- broadcast.py: use the public AsyncioRouter.request_router property,
  not the underscore field. The property lazily initializes
  _request_router from the populated replica set; the field is None
  until someone explicitly assigns to it. Previously the eager
  llm_config.remote() call in OpenAiIngress.__init__ implicitly
  triggered this assignment via assign_request(); after removing that,
  DevIngress sleep/pause isolation tests were the first to surface
  this latent bug.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 1, 2026
DEPLOYMENT_TARGETS (drives `running_replicas_populated`) and
DEPLOYMENT_CONFIG (drives `_request_router_class`, which the lazy
`request_router` property needs) are independent long-polls. Polling
only the former races with the latter — the request_router property
returns None even when replicas are populated. Wait for both.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 652cc31. Configure here.


if request_router is None:
raise RuntimeError("Request router not initialized. No replicas accessible.")
request_router = _get_request_router()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Property access may trigger router init from wrong thread

Low Severity

_get_request_router accesses router._asyncio_router.request_router (a property with lazy initialization) instead of the previous router._asyncio_router._request_router (a direct attribute read). The property's lazy init creates a RequestRouter and calls self._request_router_initialized.set() on an asyncio.Event that was created on a different thread's event loop. Calling set() from the broadcast thread is not thread-safe per asyncio's contract and could race with the event loop thread calling update_deployment_targets.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 652cc31. Configure here.


# Wait for the router to be populated with replicas
# We add a timeout to prevent infinite hanging
router = handle._router

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change meant to be included?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check what claude did, it was not able to pass the ci/cd tests and now it's passing. will report back :D

@kouroshHakha kouroshHakha May 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I don't know why this has been triggered specifically during this pr, but it's basically safe guarding a race condition. without these test_dev_ingress would fail.

@kouroshHakha kouroshHakha enabled auto-merge (squash) May 5, 2026 02:00

@eicherseiji eicherseiji left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kouroshHakha kouroshHakha merged commit 665ed74 into ray-project:master May 5, 2026
7 checks passed
eicherseiji added a commit to eicherseiji/ray that referenced this pull request May 7, 2026
Serve core
- Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a
  class supplies its own ASGI app via __serve_build_asgi_app__. Drops
  the placeholder FastAPI() and the dual _set_asgi_app() init.
- Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so
  /direct ingress dispatch skips a per-request RequestMetadata
  construction and two generate_request_id() UUIDs.

LLM engine
- Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine
  implements it. LLMServer.__serve_build_asgi_app__ is a one-line
  delegate to self.engine.build_asgi_app(). Drops the public
  vllm_args/engine_client properties that leaked vLLM internals.

LLM router
- Bind via bind(server=llm_deployment) so the handle is injected as an
  init kwarg; drops the magic-string serve.get_deployment_handle().
- Replace the awaited llm_config.remote() priming RPC with a synchronous
  DeploymentHandle._init() + cached _get_request_router(). Removes the
  60s timeout, the dependency on a method being removed in ray-project#63065, and
  the engine-cold-start coupling.
- check_health() now asserts the request router is non-None as a
  liveness signal.
- Drop GET / (HAProxy never calls it; Serve uses check_health for
  readiness). Keep GET /health as a human-operator convenience.
- Cache the RequestRouter on self at init; precompute (host, port,
  full_id_str) tuples in _ready_endpoints to skip per-request
  formatting; two-tier cache (id(curr_replicas) fast path then
  frozenset(keys) fallback) avoids the frozenset allocation when
  curr_replicas is unchanged.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
kouroshHakha pushed a commit to eicherseiji/ray that referenced this pull request May 7, 2026
Serve core
- Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a
  class supplies its own ASGI app via __serve_build_asgi_app__. Drops
  the placeholder FastAPI() and the dual _set_asgi_app() init.
- Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so
  /direct ingress dispatch skips a per-request RequestMetadata
  construction and two generate_request_id() UUIDs.

LLM engine
- Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine
  implements it. LLMServer.__serve_build_asgi_app__ is a one-line
  delegate to self.engine.build_asgi_app(). Drops the public
  vllm_args/engine_client properties that leaked vLLM internals.

LLM router
- Bind via bind(server=llm_deployment) so the handle is injected as an
  init kwarg; drops the magic-string serve.get_deployment_handle().
- Replace the awaited llm_config.remote() priming RPC with a synchronous
  DeploymentHandle._init() + cached _get_request_router(). Removes the
  60s timeout, the dependency on a method being removed in ray-project#63065, and
  the engine-cold-start coupling.
- check_health() now asserts the request router is non-None as a
  liveness signal.
- Drop GET / (HAProxy never calls it; Serve uses check_health for
  readiness). Keep GET /health as a human-operator convenience.
- Cache the RequestRouter on self at init; precompute (host, port,
  full_id_str) tuples in _ready_endpoints to skip per-request
  formatting; two-tier cache (id(curr_replicas) fast path then
  frozenset(keys) fallback) avoids the frozenset allocation when
  curr_replicas is unchanged.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ay-project#63065)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants