[Serve] Drop and replace replicas that survive a controller crash without rank assignment by abrarsheikh · Pull Request #63139 · ray-project/ray

abrarsheikh · 2026-05-05T18:55:18Z

related to #63118

After a controller restart, the controller occasionally floods its log with:

ERROR controller -- Error executing function _recover_rank_impl: 'NoneType' object has no attribute 'rank'

Trace: recover_current_state_from_replica_actor_names → ActorReplicaWrapper.recover → check_ready. check_ready reads metadata back from the live actor and sets self._rank = None. The next reconcile cycle then calls RankManager.recover_rank(replica_id, node_id, None) and crashes when _recover_rank_impl dereferences rank.rank.
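The failure mode in the trace can be reproduced in isolation. The following is a minimal sketch, not the Serve implementation: `ReplicaRank`, `recover_rank_impl`, and `try_recover` are hypothetical stand-ins that only mimic the pattern of dereferencing `rank.rank` without a `None` check.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ReplicaRank:
    # Hypothetical stand-in for the rank object the controller tracks.
    rank: int


def recover_rank_impl(replica_id: str, node_id: str, rank: Optional[ReplicaRank]) -> int:
    # Same pattern as the crash: the rank is dereferenced without a None check.
    return rank.rank


def try_recover(replica_id: str, node_id: str, rank: Optional[ReplicaRank]) -> Union[int, str]:
    # Catch the error and format it the way the log line in the report reads.
    try:
        return recover_rank_impl(replica_id, node_id, rank)
    except AttributeError as exc:
        return f"Error executing function _recover_rank_impl: {exc}"


print(try_recover("replica-1", "node-1", ReplicaRank(rank=0)))
print(try_recover("replica-2", "node-1", None))
```

The second call takes the error path, because the metadata read back from the actor carried no rank.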

Root cause

Ranks are not checkpointed; they live only in controller memory and on the actor side via ray.serve.context._INTERNAL_REPLICA_CONTEXT. The actor's context starts as rank=None and is only set when initialize_and_get_metadata is called with a rank.

If the previous controller crashes between actor creation and the first initialize_and_get_metadata(rank=R, …) call, the actor is alive but uninitialized. On recovery, ActorReplicaWrapper.recover() calls initialize_and_get_metadata.remote() with no args — which silently completes the actor's first init with rank=None, returns metadata containing rank=None, and breaks rank tracking for that replica permanently. Once an actor is in this state, every future controller restart hits the same crash, and the rank-related deploy retry counter eventually pushes the deployment to DEPLOY_FAILED.
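To make the sequence concrete, here is a toy model of the actor-side state described above. It is an illustrative sketch only: `FakeReplicaActor` is a hypothetical plain-Python class, and it assumes the first `initialize_and_get_metadata` call fixes the rank while later calls simply return the stored metadata.

```python
from typing import Optional


class FakeReplicaActor:
    """Toy model of the actor-side state (not Ray code)."""

    def __init__(self) -> None:
        self.rank: Optional[int] = None  # context starts as rank=None
        self.initialized = False

    def initialize_and_get_metadata(self, rank: Optional[int] = None) -> dict:
        # Assumption: the first call completes initialization with whatever
        # rank it receives; later calls only return the stored metadata.
        if not self.initialized:
            self.rank = rank
            self.initialized = True
        return {"rank": self.rank}


# Healthy replica: the old controller initialized it before crashing.
healthy = FakeReplicaActor()
healthy.initialize_and_get_metadata(rank=3)
healthy_metadata = healthy.initialize_and_get_metadata()  # recovery call, no args

# Orphaned replica: the old controller crashed before the first init call.
orphan = FakeReplicaActor()
first_recovery = orphan.initialize_and_get_metadata()   # completes init with rank=None
second_recovery = orphan.initialize_and_get_metadata()  # every later restart sees the same
```

In this model the healthy replica reports its assigned rank on recovery, while the orphaned replica is locked into `rank=None` by the recovery call itself.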

Fix

Detect uninitialized actors during recovery and replace them with fresh replicas, without bumping the deploy-failure counter.
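The shape of the fix can be sketched as a partition of the surviving replicas by the result of an initialization probe. This is a hedged illustration, not the code in this PR: `RecoveryResult` and `recover_replicas` are hypothetical names, and the probe results are passed in as a plain dict rather than fetched from actors.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecoveryResult:
    recovered: List[str] = field(default_factory=list)  # keep and re-adopt
    replaced: List[str] = field(default_factory=list)   # stop and start fresh
    deploy_failures: int = 0                             # left untouched by recovery


def recover_replicas(was_initialized: Dict[str, bool]) -> RecoveryResult:
    """Split surviving replicas by whether they completed initialization."""
    result = RecoveryResult()
    for replica_id, initialized in was_initialized.items():
        if initialized:
            result.recovered.append(replica_id)
        else:
            # An uninitialized survivor has no rank to recover, so it is
            # replaced. This is not counted as a deploy failure.
            result.replaced.append(replica_id)
    return result


result = recover_replicas({"replica-a": True, "replica-b": False})
```

The point of keeping `deploy_failures` at zero is that a replacement caused by a controller crash says nothing about whether the deployment itself is healthy.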


gemini-code-assist

Code Review

This pull request implements a non-blocking recovery mechanism for Ray Serve replicas to detect actors that failed to complete initialization due to a controller crash. It introduces an asynchronous was_initialized probe and logic to replace unrecoverable replicas without incrementing deployment failure counters. I have no feedback to provide.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit fcd9cc4.

cursor · 2026-05-05T19:01:57Z

+            # `initialize_and_get_metadata` until the probe passes so we
+            # don't accidentally drive the bad actor through initialization
+            # only to kill it afterwards.
+            self._was_initialized_obj_ref = self._actor_handle.was_initialized.remote()


Race: probe returns False during in-progress initialization

Medium Severity

The was_initialized() probe reads _user_callable_initialized without acquiring _user_callable_initialized_lock. Since async actors have max_concurrency=1000, this probe can execute concurrently with an in-progress initialize_and_get_metadata(rank=R) call dispatched by the previous controller. If the user's __init__ is slow (e.g., loading a large model), the probe returns False and the controller kills an actor that is legitimately initializing with a valid rank. Before this change, the recovery path called initialize_and_get_metadata.remote() which would block on the lock until the first call completed, then correctly return metadata with the assigned rank.

Additional Locations (1)

python/ray/serve/_private/replica.py#L2779-L2792
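The race described in this comment can be illustrated with plain asyncio. This is a sketch under stated assumptions, not the replica code: `ToyReplica` and both probe methods are hypothetical, the slow user `__init__` is simulated with a sleep, and the lock-respecting probe is shown only as one possible mitigation, not as what this PR does.

```python
import asyncio


class ToyReplica:
    """Toy async actor: initialization runs under a lock and may be slow."""

    def __init__(self) -> None:
        self._initialized = False
        self._lock = asyncio.Lock()
        self.rank = None

    async def initialize_and_get_metadata(self, rank=None, init_seconds=0.05):
        async with self._lock:
            if not self._initialized:
                await asyncio.sleep(init_seconds)  # simulated slow user __init__
                self.rank = rank
                self._initialized = True
            return {"rank": self.rank}

    async def was_initialized_unlocked(self) -> bool:
        # Reads the flag without the lock, so it can observe False while an
        # initialization call is still in flight.
        return self._initialized

    async def was_initialized_locked(self) -> bool:
        # Waits for any in-flight initialization before reading the flag.
        async with self._lock:
            return self._initialized


async def demo():
    replica = ToyReplica()
    # The "previous controller" dispatched an init call with a valid rank.
    init_task = asyncio.create_task(replica.initialize_and_get_metadata(rank=2))
    await asyncio.sleep(0)  # let the init call start and take the lock
    unlocked = await replica.was_initialized_unlocked()
    locked = await replica.was_initialized_locked()
    await init_task
    return unlocked, locked, replica.rank


unlocked_result, locked_result, final_rank = asyncio.run(demo())
```

Here the unlocked probe reports `False` for a replica that goes on to initialize with rank 2, while the locked probe waits and reports `True`. For an actor with no initialization call in flight, both probes return `False` immediately.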



[Serve] Drop and replace replicas that survive a controller crash wit…

fcd9cc4

…hout rank assignment Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested a review from a team as a code owner May 5, 2026 18:55

abrarsheikh requested review from akyang-anyscale and jeffreywang88 May 5, 2026 18:55

gemini-code-assist Bot reviewed May 5, 2026


cursor Bot reviewed May 5, 2026


abrarsheikh added the "go" label (add ONLY when ready to merge, run all tests) May 5, 2026

akyang-anyscale approved these changes May 5, 2026


Comment thread python/ray/serve/_private/deployment_state.py Outdated

ray-gardener Bot added the "serve" label (Ray Serve Related Issue) May 5, 2026

fix test

f21037a

Signed-off-by: abrar <abrar@anyscale.com>

jeffreywang88 approved these changes May 6, 2026


abrarsheikh merged commit 97ddb4c into master May 6, 2026
6 checks passed

abrarsheikh deleted the 63118-abrar-rank branch May 6, 2026 16:45

abrarsheikh mentioned this pull request May 6, 2026

[CORE] Exception in Serve Controller and Probable GCS memory leak leading to controller OOM-killed #63118

Open

This was referenced May 7, 2026

[Serve] Restart gang when recovery drops uninitialized member #63203

Closed

[Serve] Restart gang when recovery drops uninitialized member #63204

Closed

jeffreywang88 mentioned this pull request May 7, 2026

[serve] Recover gang context for orphaned replicas to restart the whole gang #63208

Open

This was referenced Jun 16, 2026

[Serve] State / Ranker issues when ServeController gets killed #64103

Closed

[Serve] OOMKiller with TimeBasedWorkerKillingPolicy in RayServe kills ServeController, Ranker issues follow #63862

Open


abrarsheikh merged 2 commits into master from 63118-abrar-rank
