[Serve][3/n] Deployment-scoped actor lifecycle and deferred replica creation#61664
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a significant new feature: deployment-scoped actors with lifecycle management and deferred replica creation. The implementation is extensive, covering actor creation, recovery, and failure handling. The changes are well-structured and include comprehensive tests. I've identified two main areas for improvement. First, the failure handling for deployment actors could be more efficient; currently, a single actor failure causes all actors for that version to be recreated. Second, the automatic restart policy for these actors could be risky for stateful use cases, as it might lead to silent state loss on crashes. Addressing these points would enhance the robustness and performance of this new feature.
| ) | ||
| if merged_runtime_env: | ||
| actor_options["runtime_env"] = merged_runtime_env | ||
| actor_options["max_restarts"] = -1 |
There was a problem hiding this comment.
Setting max_restarts to -1 for deployment-scoped actors is risky, especially for stateful actors like caches or state stores as described in the pull request. If an actor crashes after it has become ready, Ray will restart it, but its internal state will be lost. The Serve controller currently does not seem to monitor the health of ready deployment actors, so this state loss can happen silently, leading to inconsistent application behavior.
Consider setting max_restarts to 0 and implementing a mechanism in the controller to detect actor failure and recreate it. Alternatively, this behavior and its implications for stateful actors should be clearly documented.
There was a problem hiding this comment.
will revisit this later, after add integration tests
There was a problem hiding this comment.
could we create an issue so that this doesn't slip through?
Signed-off-by: abrar <abrar@anyscale.com>
jeffreywang88
left a comment
There was a problem hiding this comment.
implementation makes a lot of sense to me! i think we're just missing some tests:
DeploymentActorContainerunit tests (add / get / pop / count / get_wrapper)ActorReplicaWrapperunit tests- any integration tests (running in a ray cluster) that we plan to add?
| ) | ||
| if merged_runtime_env: | ||
| actor_options["runtime_env"] = merged_runtime_env | ||
| actor_options["max_restarts"] = -1 |
There was a problem hiding this comment.
could we create an issue so that this doesn't slip through?
Signed-off-by: abrar <abrar@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
jeffreywang88
left a comment
There was a problem hiding this comment.
lgtm, leaving a nit
Signed-off-by: abrar <abrar@anyscale.com>
…reation (ray-project#61664) Introduces lifecycle management for deployment-scoped actors and defers replica creation until those actors are ready. ### Changes - **Deployment-scoped actor lifecycle**: Adds `DeploymentActorWrapper` and `DeploymentActorContainer` to manage deployment actors (start, readiness checks, stop) with `STARTING` / `RUNNING` states. - **Deferred replica creation**: When `deployment_actors` is configured, replicas are created only after all deployment actors are ready. This avoids starting replicas before shared actors (e.g., model caches, state stores) are available. - **Recovery**: On controller restart, existing deployment actors are recovered via `_recover_deployment_actors()` instead of recreating them. - **Status handling**: Deployment actor startup failures are surfaced via `DeploymentStatus.DEPLOY_FAILED` with `DEPLOYMENT_ACTOR_FAILED` as the trigger. --------- Signed-off-by: abrar <abrar@anyscale.com>
…reation (ray-project#61664) Introduces lifecycle management for deployment-scoped actors and defers replica creation until those actors are ready. ### Changes - **Deployment-scoped actor lifecycle**: Adds `DeploymentActorWrapper` and `DeploymentActorContainer` to manage deployment actors (start, readiness checks, stop) with `STARTING` / `RUNNING` states. - **Deferred replica creation**: When `deployment_actors` is configured, replicas are created only after all deployment actors are ready. This avoids starting replicas before shared actors (e.g., model caches, state stores) are available. - **Recovery**: On controller restart, existing deployment actors are recovered via `_recover_deployment_actors()` instead of recreating them. - **Status handling**: Deployment actor startup failures are surfaced via `DeploymentStatus.DEPLOY_FAILED` with `DEPLOYMENT_ACTOR_FAILED` as the trigger. --------- Signed-off-by: abrar <abrar@anyscale.com>

Introduces lifecycle management for deployment-scoped actors and defers replica creation until those actors are ready.
Changes
DeploymentActorWrapperandDeploymentActorContainerto manage deployment actors (start, readiness checks, stop) withSTARTING/RUNNINGstates.deployment_actorsis configured, replicas are created only after all deployment actors are ready. This avoids starting replicas before shared actors (e.g., model caches, state stores) are available._recover_deployment_actors()instead of recreating them.DeploymentStatus.DEPLOY_FAILEDwithDEPLOYMENT_ACTOR_FAILEDas the trigger.related #61464