[Serve] Drop and replace replicas that survive a controller crash without rank assignment#63139
…hout rank assignment Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request implements a non-blocking recovery mechanism for Ray Serve replicas to detect actors that failed to complete initialization due to a controller crash. It introduces an asynchronous was_initialized probe and logic to replace unrecoverable replicas without incrementing deployment failure counters. I have no feedback to provide.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit fcd9cc4.
    # `initialize_and_get_metadata` until the probe passes so we
    # don't accidentally drive the bad actor through initialization
    # only to kill it afterwards.
    self._was_initialized_obj_ref = self._actor_handle.was_initialized.remote()
Race: probe returns False during in-progress initialization
Medium Severity
The was_initialized() probe reads _user_callable_initialized without acquiring _user_callable_initialized_lock. Since async actors have max_concurrency=1000, this probe can execute concurrently with an in-progress initialize_and_get_metadata(rank=R) call dispatched by the previous controller. If the user's __init__ is slow (e.g., loading a large model), the probe returns False and the controller kills an actor that is legitimately initializing with a valid rank. Before this change, the recovery path called initialize_and_get_metadata.remote() which would block on the lock until the first call completed, then correctly return metadata with the assigned rank.
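The race described above can be reproduced outside Ray with plain `asyncio`. The following is a minimal sketch, not Ray Serve's actual replica code: `ReplicaSketch`, `was_initialized_unlocked`, `was_initialized_locked`, and `demo` are hypothetical names, and the `asyncio.sleep` stands in for a slow user `__init__`. Only the attribute names `_user_callable_initialized` and `_user_callable_initialized_lock` are taken from the comment above.

```python
import asyncio


class ReplicaSketch:
    """Hypothetical stand-in for a replica actor; not Ray Serve's real class."""

    def __init__(self, init_seconds: float):
        self._init_seconds = init_seconds
        self._user_callable_initialized = False
        self._user_callable_initialized_lock = asyncio.Lock()
        self.rank = None

    async def initialize_and_get_metadata(self, rank=None):
        # Initialization runs under the lock, so a second caller waits here.
        async with self._user_callable_initialized_lock:
            if not self._user_callable_initialized:
                await asyncio.sleep(self._init_seconds)  # simulated slow user __init__
                self.rank = rank
                self._user_callable_initialized = True
            return {"rank": self.rank}

    async def was_initialized_unlocked(self) -> bool:
        # Reads the flag without the lock: can observe False mid-initialization.
        return self._user_callable_initialized

    async def was_initialized_locked(self) -> bool:
        # Waits for any in-progress initialization before reading the flag.
        async with self._user_callable_initialized_lock:
            return self._user_callable_initialized


async def demo():
    replica = ReplicaSketch(init_seconds=0.05)
    # The "previous controller" dispatched an init call with a valid rank.
    init_task = asyncio.create_task(replica.initialize_and_get_metadata(rank=3))
    await asyncio.sleep(0)  # let the init call start and take the lock
    unlocked = await replica.was_initialized_unlocked()
    locked = await replica.was_initialized_locked()
    metadata = await init_task
    return unlocked, locked, metadata


result = asyncio.run(demo())
```

In this sketch the lock-free probe reports `False` while the ranked initialization is still in flight, which is the condition under which a controller would kill a healthy actor. The locked variant avoids that at the cost of blocking the probe for as long as initialization takes; whether that trade-off fits a non-blocking recovery path is a design question this sketch does not settle.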
…hout rank assignment (ray-project#63139)

related to ray-project#63118

After a controller restart, the controller occasionally floods its log with:

```
ERROR controller -- Error executing function _recover_rank_impl: 'NoneType' object has no attribute 'rank'
```

Trace: `recover_current_state_from_replica_actor_names → ActorReplicaWrapper.recover → check_ready` reads metadata back from the live actor and sets `self._rank = None`. The next reconcile cycle then calls `RankManager.recover_rank(replica_id, node_id, None)` and crashes when the impl dereferences `rank.rank`.

### Root cause

Ranks are not checkpointed; they live only in controller memory and on the actor side via `ray.serve.context._INTERNAL_REPLICA_CONTEXT`. The actor's context starts as `rank=None` and is only set when `initialize_and_get_metadata` is called *with* a rank.

If the previous controller crashes between actor creation and the first `initialize_and_get_metadata(rank=R, …)` call, the actor is alive but uninitialized. On recovery, `ActorReplicaWrapper.recover()` calls `initialize_and_get_metadata.remote()` with no args, which silently completes the actor's first init with `rank=None`, returns metadata containing `rank=None`, and breaks rank tracking for that replica permanently. Once an actor is in this state, every future controller restart hits the same crash, and the rank-related deploy retry counter eventually pushes the deployment to `DEPLOY_FAILED`.

### Fix

Detect uninitialized actors during recovery and replace them with fresh replicas, without bumping the deploy-failure counter.

---------

Signed-off-by: abrar <abrar@anyscale.com>


related to #63118

After a controller restart, the controller occasionally floods its log with `ERROR controller -- Error executing function _recover_rank_impl: 'NoneType' object has no attribute 'rank'`.

Trace: `recover_current_state_from_replica_actor_names → ActorReplicaWrapper.recover → check_ready` reads metadata back from the live actor and sets `self._rank = None`. The next reconcile cycle then calls `RankManager.recover_rank(replica_id, node_id, None)` and crashes when the impl dereferences `rank.rank`.

### Root cause

Ranks are not checkpointed; they live only in controller memory and on the actor side via `ray.serve.context._INTERNAL_REPLICA_CONTEXT`. The actor's context starts as `rank=None` and is only set when `initialize_and_get_metadata` is called with a rank.

If the previous controller crashes between actor creation and the first `initialize_and_get_metadata(rank=R, …)` call, the actor is alive but uninitialized. On recovery, `ActorReplicaWrapper.recover()` calls `initialize_and_get_metadata.remote()` with no args, which silently completes the actor's first init with `rank=None`, returns metadata containing `rank=None`, and breaks rank tracking for that replica permanently. Once an actor is in this state, every future controller restart hits the same crash, and the rank-related deploy retry counter eventually pushes the deployment to `DEPLOY_FAILED`.

### Fix

Detect uninitialized actors during recovery and replace them with fresh replicas, without bumping the deploy-failure counter.
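
The recovery decision can be illustrated with a small, Ray-free sketch. Everything here is hypothetical and for illustration only: `RecoveredReplica`, `RecoveryPlan`, and `plan_recovery` are not names from the Ray Serve codebase, and the sketch assumes the controller has already collected one `was_initialized` probe result per recovered actor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveredReplica:
    """Hypothetical record of what the new controller learned about a live actor."""
    replica_id: str
    was_initialized: bool  # result of the was_initialized probe


@dataclass
class RecoveryPlan:
    keep: List[str] = field(default_factory=list)
    replace: List[str] = field(default_factory=list)
    # Stand-in for the deploy-failure counter; recovery replacements leave it alone.
    deploy_failure_count: int = 0


def plan_recovery(replicas: List[RecoveredReplica]) -> RecoveryPlan:
    """Keep initialized replicas; schedule uninitialized ones for replacement."""
    plan = RecoveryPlan()
    for replica in replicas:
        if replica.was_initialized:
            # Safe to read metadata (including the rank) back from this actor.
            plan.keep.append(replica.replica_id)
        else:
            # Never drive this actor through a rank-less first init; replace it.
            # Deliberately no increment of deploy_failure_count here.
            plan.replace.append(replica.replica_id)
    return plan


plan = plan_recovery(
    [
        RecoveredReplica("replica-a", was_initialized=True),
        RecoveredReplica("replica-b", was_initialized=False),
    ]
)
```

The point of the sketch is the separation of outcomes: an uninitialized survivor is treated as a casualty of the controller crash rather than as a deployment failure, so replacing it does not move the deployment toward `DEPLOY_FAILED`.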