[Serve][3/n] Deployment-scoped actor lifecycle and deferred replica creation #61664

Merged
abrarsheikh merged 5 commits into master from pr-2a-activation-v2
Mar 19, 2026
Conversation

@abrarsheikh abrarsheikh commented Mar 11, 2026

Contributor

Introduces lifecycle management for deployment-scoped actors and defers replica creation until those actors are ready.

Changes

  • Deployment-scoped actor lifecycle: Adds DeploymentActorWrapper and DeploymentActorContainer to manage deployment actors (start, readiness checks, stop) with STARTING / RUNNING states.
  • Deferred replica creation: When deployment_actors is configured, replicas are created only after all deployment actors are ready. This avoids starting replicas before shared actors (e.g., model caches, state stores) are available.
  • Recovery: On controller restart, existing deployment actors are recovered via _recover_deployment_actors() instead of recreating them.
  • Status handling: Deployment actor startup failures are surfaced via DeploymentStatus.DEPLOY_FAILED with DEPLOYMENT_ACTOR_FAILED as the trigger.
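
The gating described in the list above can be pictured with a small, self-contained sketch. This is not the code from this PR: `SketchDeploymentActor`, `SketchDeployment`, and their methods are hypothetical names invented for illustration, and the status values are plain strings that only mirror the names mentioned in the description.

```python
from enum import Enum


class ActorState(Enum):
    # Mirrors the two states named in the PR description.
    STARTING = "STARTING"
    RUNNING = "RUNNING"


class SketchDeploymentActor:
    """Hypothetical stand-in for one deployment-scoped actor."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = ActorState.STARTING

    def mark_ready(self) -> None:
        # In a real system this would follow a successful readiness check.
        self.state = ActorState.RUNNING


class SketchDeployment:
    """Hypothetical reconciler: replicas wait for every deployment actor."""

    def __init__(self, actor_names, target_replicas: int) -> None:
        self.actors = {n: SketchDeploymentActor(n) for n in actor_names}
        self.target_replicas = target_replicas
        self.replicas = []
        self.status = "UPDATING"
        self.status_trigger = None

    def all_actors_ready(self) -> bool:
        # With no deployment actors configured this is trivially True.
        return all(a.state is ActorState.RUNNING for a in self.actors.values())

    def fail_actor_startup(self, name: str) -> None:
        # Surface the startup failure instead of creating replicas.
        self.status = "DEPLOY_FAILED"
        self.status_trigger = "DEPLOYMENT_ACTOR_FAILED"

    def reconcile(self) -> None:
        # Deferred replica creation: do nothing until all actors are RUNNING.
        if self.status == "DEPLOY_FAILED" or not self.all_actors_ready():
            return
        while len(self.replicas) < self.target_replicas:
            self.replicas.append(f"replica-{len(self.replicas)}")
        self.status = "HEALTHY"
```

In this sketch, calling `reconcile()` repeatedly is a no-op until every actor has been marked ready, which is the ordering guarantee the description is after.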

related #61464

@abrarsheikh abrarsheikh requested a review from a team as a code owner March 11, 2026 22:38
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Mar 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a significant new feature: deployment-scoped actors with lifecycle management and deferred replica creation. The implementation is extensive, covering actor creation, recovery, and failure handling. The changes are well-structured and include comprehensive tests. I've identified two main areas for improvement. First, the failure handling for deployment actors could be more efficient; currently, a single actor failure causes all actors for that version to be recreated. Second, the automatic restart policy for these actors could be risky for stateful use cases, as it might lead to silent state loss on crashes. Addressing these points would enhance the robustness and performance of this new feature.

)
if merged_runtime_env:
    actor_options["runtime_env"] = merged_runtime_env
actor_options["max_restarts"] = -1

Contributor

high

Setting max_restarts to -1 for deployment-scoped actors is risky, especially for stateful actors like caches or state stores as described in the pull request. If an actor crashes after it has become ready, Ray will restart it, but its internal state will be lost. The Serve controller currently does not seem to monitor the health of ready deployment actors, so this state loss can happen silently, leading to inconsistent application behavior.

Consider setting max_restarts to 0 and implementing a mechanism in the controller to detect actor failure and recreate it. Alternatively, this behavior and its implications for stateful actors should be clearly documented.
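
To make the concern concrete, here is a rough simulation in plain Python with no Ray involved. `SketchCacheActor` and `SketchSupervisor` are hypothetical names, and the `auto_restart` flag only imitates the two policies discussed above (unlimited automatic restarts versus no restarts plus a controller-side check). In both cases the in-memory state is gone after a crash; the difference is whether anything records that it happened.

```python
class SketchCacheActor:
    """Hypothetical stateful actor; a restart runs __init__ again."""

    def __init__(self) -> None:
        self.cache = {}

    def put(self, key, value) -> None:
        self.cache[key] = value

    def get(self, key):
        return self.cache.get(key)


class SketchSupervisor:
    """Hypothetical owner of the actor, imitating two restart policies."""

    def __init__(self, auto_restart: bool) -> None:
        self.auto_restart = auto_restart
        self.actor = SketchCacheActor()
        self.alive = True
        self.events = []

    def crash(self) -> None:
        if self.auto_restart:
            # Imitates unlimited restarts: a fresh instance replaces the old
            # one and nothing is recorded, so the state loss is silent.
            self.actor = SketchCacheActor()
        else:
            # Imitates no automatic restarts: the actor stays dead until
            # the supervisor notices.
            self.alive = False

    def health_check(self) -> None:
        # Controller-side detection: record the failure, then recreate.
        if not self.alive:
            self.events.append("actor_failed_recreated")
            self.actor = SketchCacheActor()
            self.alive = True
```

With `auto_restart=True` the `events` list stays empty after a crash even though the cache is empty again, which is the silent case described in the comment above.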

Contributor Author

will revisit this later, after adding integration tests

Contributor

could we create an issue so that this doesn't slip through?

Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
Comment thread python/ray/serve/_private/deployment_state.py Outdated
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label Mar 12, 2026
Comment thread python/ray/serve/_private/deployment_state.py

@jeffreywang88 jeffreywang88 left a comment

Contributor

implementation makes a lot of sense to me! i think we're just missing some tests:

  • DeploymentActorContainer unit tests (add / get / pop / count / get_wrapper)
  • ActorReplicaWrapper unit tests
  • any integration tests (running in a ray cluster) that we plan to add?
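
For the first bullet, a unit test could look roughly like the sketch below. The real `DeploymentActorContainer` signatures are not shown in this conversation, so `SketchActorContainer`, `SketchWrapper`, and every method signature here are assumptions made for illustration; only the operation names (add / get / pop / count / get_wrapper) come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

STATES = ("STARTING", "RUNNING")


@dataclass
class SketchWrapper:
    """Hypothetical stand-in for a deployment actor wrapper."""

    name: str


class SketchActorContainer:
    """Hypothetical container keyed by state; not the PR's implementation."""

    def __init__(self) -> None:
        self._by_state: Dict[str, List[SketchWrapper]] = {s: [] for s in STATES}

    def add(self, state: str, wrapper: SketchWrapper) -> None:
        self._by_state[state].append(wrapper)

    def get(self, states: Optional[List[str]] = None) -> List[SketchWrapper]:
        # No filter means "all states", in STATES order.
        return [w for s in (states or STATES) for w in self._by_state[s]]

    def pop(self, states: Optional[List[str]] = None) -> List[SketchWrapper]:
        # Remove and return every wrapper in the selected states.
        popped: List[SketchWrapper] = []
        for s in states or STATES:
            popped.extend(self._by_state[s])
            self._by_state[s] = []
        return popped

    def count(self, states: Optional[List[str]] = None) -> int:
        return len(self.get(states))

    def get_wrapper(self, name: str) -> Optional[SketchWrapper]:
        for w in self.get():
            if w.name == name:
                return w
        return None


def test_container_round_trip() -> None:
    c = SketchActorContainer()
    c.add("STARTING", SketchWrapper("model_cache"))
    c.add("RUNNING", SketchWrapper("state_store"))
    assert c.count() == 2
    assert c.count(["STARTING"]) == 1
    assert c.get_wrapper("state_store") == SketchWrapper("state_store")
    assert c.get_wrapper("missing") is None
    # pop moves STARTING wrappers out so they can be re-added as RUNNING.
    for w in c.pop(["STARTING"]):
        c.add("RUNNING", w)
    assert c.count(["STARTING"]) == 0
    assert [w.name for w in c.get(["RUNNING"])] == ["state_store", "model_cache"]
```

The test function is written so it can be collected by pytest or called directly.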
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Signed-off-by: abrar <abrar@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/serve/_private/common.py

@jeffreywang88 jeffreywang88 left a comment

Contributor

lgtm, leaving a nit

Comment thread python/ray/serve/_private/test_utils.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh changed the title [Serve] Deployment-scoped actor lifecycle and deferred replica creation Mar 18, 2026
@abrarsheikh abrarsheikh merged commit 20eae5b into master Mar 19, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the pr-2a-activation-v2 branch March 19, 2026 06:44
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
Labels

  • go: add ONLY when ready to merge, run all tests
  • serve: Ray Serve Related Issue

3 participants