[Serve][LLM] Don't fetch LLMConfig from replicas at ingress startup#63065
Conversation
Bug: With min_replicas=0, deploying an OpenAI ingress force-starts every LLMServer at startup, defeating scale-to-zero and gating /-/healthz on every model successfully starting at least once. Root cause: OpenAiIngress.__init__ schedules _setup_handle_and_config_maps, which awaits handle.llm_config.remote() per deployment. With 0 replicas, each call queues at the router, bumps ongoing_requests to 1, and trips autoscaling 0 -> 1. The fetched LLMConfig is identical to what build_openai_app already holds in memory. Fix: Pass static metadata to the ingress directly. OpenAiIngress now takes llm_deployments: Dict[str, DeploymentHandle], model_cards: Dict[str, ModelCard], and lora_paths: Optional[Dict[str, str]]. build_openai_app, build_dev_openai_app, build_pd_openai_app, and build_dp_openai_app construct these dicts up front. get_lora_model_metadata takes base_path instead of LLMConfig. Tests updated. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the OpenAiIngress initialization to accept llm_deployments, model_cards, and lora_paths as dictionaries keyed by model ID, replacing the previous list-based approach. This change removes the asynchronous remote fetching of model configurations during setup, simplifying the ingress lifecycle and improving initialization reliability. Feedback suggests consolidating the duplicated logic for preparing these arguments in build_openai_app and build_dev_openai_app into a shared helper function to improve maintainability.
- test_3_tier_graph_structure now indexes the ingress's llm_deployments as a Dict keyed by model_id (was a List). - broadcast.py: use the public AsyncioRouter.request_router property, not the underscore field. The property lazily initializes _request_router from the populated replica set; the field is None until someone explicitly assigns to it. Previously the eager llm_config.remote() call in OpenAiIngress.__init__ implicitly triggered this assignment via assign_request(); after removing that, DevIngress sleep/pause isolation tests were the first to surface this latent bug. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
DEPLOYMENT_TARGETS (drives `running_replicas_populated`) and DEPLOYMENT_CONFIG (drives `_request_router_class`, which the lazy `request_router` property needs) are independent long-polls. Polling only the former races with the latter — the request_router property returns None even when replicas are populated. Wait for both. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 652cc31. Configure here.
|
|
||
| if request_router is None: | ||
| raise RuntimeError("Request router not initialized. No replicas accessible.") | ||
| request_router = _get_request_router() |
There was a problem hiding this comment.
Property access may trigger router init from wrong thread
Low Severity
_get_request_router accesses router._asyncio_router.request_router (a property with lazy initialization) instead of the previous router._asyncio_router._request_router (a direct attribute read). The property's lazy init creates a RequestRouter and calls self._request_router_initialized.set() on an asyncio.Event that was created on a different thread's event loop. Calling set() from the broadcast thread is not thread-safe per asyncio's contract and could race with the event loop thread calling update_deployment_targets.
Reviewed by Cursor Bugbot for commit 652cc31. Configure here.
|
|
||
| # Wait for the router to be populated with replicas | ||
| # We add a timeout to prevent infinite hanging | ||
| router = handle._router |
There was a problem hiding this comment.
Is this change meant to be included?
There was a problem hiding this comment.
I have to check what claude did, it was not able to pass the ci/cd tests and now it's passing. will report back :D
There was a problem hiding this comment.
yeah, I don't know why this has been triggered specifically during this pr, but it's basically safe guarding a race condition. without these test_dev_ingress would fail.
Serve core - Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a class supplies its own ASGI app via __serve_build_asgi_app__. Drops the placeholder FastAPI() and the dual _set_asgi_app() init. - Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so /direct ingress dispatch skips a per-request RequestMetadata construction and two generate_request_id() UUIDs. LLM engine - Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine implements it. LLMServer.__serve_build_asgi_app__ is a one-line delegate to self.engine.build_asgi_app(). Drops the public vllm_args/engine_client properties that leaked vLLM internals. LLM router - Bind via bind(server=llm_deployment) so the handle is injected as an init kwarg; drops the magic-string serve.get_deployment_handle(). - Replace the awaited llm_config.remote() priming RPC with a synchronous DeploymentHandle._init() + cached _get_request_router(). Removes the 60s timeout, the dependency on a method being removed in ray-project#63065, and the engine-cold-start coupling. - check_health() now asserts the request router is non-None as a liveness signal. - Drop GET / (HAProxy never calls it; Serve uses check_health for readiness). Keep GET /health as a human-operator convenience. - Cache the RequestRouter on self at init; precompute (host, port, full_id_str) tuples in _ready_endpoints to skip per-request formatting; two-tier cache (id(curr_replicas) fast path then frozenset(keys) fallback) avoids the frozenset allocation when curr_replicas is unchanged. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Serve core - Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a class supplies its own ASGI app via __serve_build_asgi_app__. Drops the placeholder FastAPI() and the dual _set_asgi_app() init. - Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so /direct ingress dispatch skips a per-request RequestMetadata construction and two generate_request_id() UUIDs. LLM engine - Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine implements it. LLMServer.__serve_build_asgi_app__ is a one-line delegate to self.engine.build_asgi_app(). Drops the public vllm_args/engine_client properties that leaked vLLM internals. LLM router - Bind via bind(server=llm_deployment) so the handle is injected as an init kwarg; drops the magic-string serve.get_deployment_handle(). - Replace the awaited llm_config.remote() priming RPC with a synchronous DeploymentHandle._init() + cached _get_request_router(). Removes the 60s timeout, the dependency on a method being removed in ray-project#63065, and the engine-cold-start coupling. - check_health() now asserts the request router is non-None as a liveness signal. - Drop GET / (HAProxy never calls it; Serve uses check_health for readiness). Keep GET /health as a human-operator convenience. - Cache the RequestRouter on self at init; precompute (host, port, full_id_str) tuples in _ready_endpoints to skip per-request formatting; two-tier cache (id(curr_replicas) fast path then frozenset(keys) fallback) avoids the frozenset allocation when curr_replicas is unchanged. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ay-project#63065) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>


Bug
With
min_replicas: 0, deploying an OpenAI ingress force-starts everyLLMServerat deploy time, defeating scale-to-zero. The ingress also can't passcheck_healthuntil every model has successfully started at least once, so one broken model blocks the whole service from becoming healthy.Root cause
OpenAiIngress.__init__schedules_setup_handle_and_config_maps, which awaitshandle.llm_config.remote()for each deployment. With 0 replicas the call queues at the router, bumpsongoing_requeststo 1, and the autoscaler scales 0 → 1. The fetchedLLMConfigis byte-identical to whatbuild_openai_appalready holds in memory.Fix
Pass static metadata into the ingress directly.
OpenAiIngress.__init__now takes:llm_deployments: Dict[str, DeploymentHandle]model_cards: Dict[str, ModelCard]lora_paths: Optional[Dict[str, str]] = Nonebuild_openai_app,build_dev_openai_app,build_pd_openai_app, andbuild_dp_openai_appbuild these dicts up front.get_lora_model_metadatatakesbase_path: strinstead ofLLMConfig. No remotellm_config()call at startup;check_healthis a no-op.Test plan
pytest python/ray/llm/tests/serve/cpu/deployments/routers/test_router.pypytest python/ray/llm/tests/serve/cpu/deployments/routers/test_dev_ingress.pypytest python/ray/llm/tests/serve/cpu/deployments/llm/multiplex/test_lora_deployment_base_client.pymin_replicas: 0, send no traffic, verify no replicas start and service reports healthy.