[Serve] Drop and replace replicas that survive a controller crash without rank assignment #63139

Merged
abrarsheikh merged 2 commits into master from 63118-abrar-rank on May 6, 2026

@abrarsheikh

Related to #63118.

After a controller restart, the controller occasionally floods its log with:

ERROR controller -- Error executing function _recover_rank_impl: 'NoneType' object has no attribute 'rank'

Trace: recover_current_state_from_replica_actor_names → ActorReplicaWrapper.recover → check_ready reads metadata back from the live actor and sets self._rank = None. The next reconcile cycle then calls RankManager.recover_rank(replica_id, node_id, None) and crashes when the impl dereferences rank.rank.
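
The crash at the end of this trace can be reproduced in isolation. The following is a minimal sketch, not Ray Serve's actual code: `ReplicaRank` and `ToyRankManager` are hypothetical stand-ins, and only the method name `recover_rank` and the `rank.rank` dereference are taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReplicaRank:
    """Hypothetical stand-in for the rank object the controller tracks."""

    rank: int


class ToyRankManager:
    """Toy model of a rank manager; this is NOT the real RankManager."""

    def __init__(self) -> None:
        self._ranks: Dict[str, int] = {}

    def recover_rank(
        self, replica_id: str, node_id: str, rank: Optional[ReplicaRank]
    ) -> None:
        # Mirrors the reported bug: the rank is dereferenced with no None check.
        self._ranks[replica_id] = rank.rank


manager = ToyRankManager()

# A replica that was initialized with a rank recovers without trouble.
manager.recover_rank("replica-a", "node-1", ReplicaRank(rank=0))

# A replica whose metadata came back with rank=None triggers the error.
try:
    manager.recover_rank("replica-b", "node-1", None)
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

In this toy model the second call raises the same `AttributeError` text that appears in the controller log, which is consistent with the reconcile loop passing `None` straight through to the rank manager.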

Root cause

Ranks are not checkpointed; they live only in controller memory and on the actor side via ray.serve.context._INTERNAL_REPLICA_CONTEXT. The actor's context starts as rank=None and is only set when initialize_and_get_metadata is called with a rank.

If the previous controller crashes between actor creation and the first initialize_and_get_metadata(rank=R, …) call, the actor is alive but uninitialized. On recovery, ActorReplicaWrapper.recover() calls initialize_and_get_metadata.remote() with no args — which silently completes the actor's first init with rank=None, returns metadata containing rank=None, and breaks rank tracking for that replica permanently. Once an actor is in this state, every future controller restart hits the same crash, and the rank-related deploy retry counter eventually pushes the deployment to DEPLOY_FAILED.
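
The crash window can be illustrated with a toy actor. This is a sketch under stated assumptions, not the real replica class: `ToyReplicaActor` and its fields are hypothetical, and the "only the first call initializes" behavior is inferred from the description above rather than copied from Ray Serve.

```python
from typing import Optional


class ToyReplicaActor:
    """Toy model of a replica actor; NOT Ray Serve's replica class."""

    def __init__(self) -> None:
        self._initialized = False
        self._rank: Optional[int] = None  # the context starts with rank=None

    def initialize_and_get_metadata(self, rank: Optional[int] = None) -> dict:
        # Assumption: only the first call performs initialization. Later
        # calls return whatever metadata the first call established.
        if not self._initialized:
            self._rank = rank
            self._initialized = True
        return {"rank": self._rank}


# Normal startup: the old controller initializes the actor with a rank, so a
# recovering controller's no-arg call reads that rank back.
healthy = ToyReplicaActor()
healthy.initialize_and_get_metadata(rank=3)
healthy_metadata = healthy.initialize_and_get_metadata()

# Crash window: the old controller died before its first call, so the
# recovering controller's no-arg call becomes the first init.
orphan = ToyReplicaActor()
first_recovery = orphan.initialize_and_get_metadata()
second_recovery = orphan.initialize_and_get_metadata()

print(healthy_metadata, first_recovery, second_recovery)
```

The orphaned actor reports `rank=None` on every later recovery as well, which matches the observation that each subsequent controller restart hits the same crash.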

Fix

Detect uninitialized actors during recovery and replace them with fresh replicas, without bumping the deploy-failure counter.
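
The shape of this fix can be sketched as follows. Only the `was_initialized` probe name appears in this PR's review discussion; `ToyActor`, `ToyDeploymentState`, and every field on them are hypothetical, and the real change lives in `python/ray/serve/_private/deployment_state.py` and may differ in structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class ToyActor:
    """Hypothetical actor handle that exposes a was_initialized probe."""

    def __init__(self, name: str, rank: Optional[int] = None) -> None:
        self.name = name
        self.rank = rank
        self.initialized = rank is not None
        self.alive = True

    def was_initialized(self) -> bool:
        return self.initialized


@dataclass
class ToyDeploymentState:
    """Hypothetical recovery bookkeeping; not the real deployment state."""

    recovered: List[str] = field(default_factory=list)
    to_replace: List[str] = field(default_factory=list)
    deploy_failure_count: int = 0

    def recover(self, actors: List[ToyActor]) -> None:
        for actor in actors:
            if actor.was_initialized():
                # Safe to read metadata back: the actor already has a rank.
                self.recovered.append(actor.name)
            else:
                # Never drive this actor through initialization. Stop it and
                # schedule a fresh replica, leaving the failure counter alone.
                actor.alive = False
                self.to_replace.append(actor.name)


actors = [ToyActor("replica-a", rank=0), ToyActor("replica-b")]
state = ToyDeploymentState()
state.recover(actors)

print(state.recovered, state.to_replace, state.deploy_failure_count)
```

Leaving `deploy_failure_count` untouched in the replacement branch reflects the stated goal: an actor orphaned by a controller crash is not evidence that the deployment itself is broken.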

…hout rank assignment

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner May 5, 2026 18:55

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a non-blocking recovery mechanism for Ray Serve replicas to detect actors that failed to complete initialization due to a controller crash. It introduces an asynchronous was_initialized probe and logic to replace unrecoverable replicas without incrementing deployment failure counters. I have no feedback to provide.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit fcd9cc4.

# `initialize_and_get_metadata` until the probe passes so we
# don't accidentally drive the bad actor through initialization
# only to kill it afterwards.
self._was_initialized_obj_ref = self._actor_handle.was_initialized.remote()

Race: probe returns False during in-progress initialization

Medium Severity

The was_initialized() probe reads _user_callable_initialized without acquiring _user_callable_initialized_lock. Since async actors have max_concurrency=1000, this probe can execute concurrently with an in-progress initialize_and_get_metadata(rank=R) call dispatched by the previous controller. If the user's __init__ is slow (e.g., loading a large model), the probe returns False and the controller kills an actor that is legitimately initializing with a valid rank. Before this change, the recovery path called initialize_and_get_metadata.remote() which would block on the lock until the first call completed, then correctly return metadata with the assigned rank.
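
The reported race can be modeled with plain `asyncio`, without Ray. This is a sketch of the concern only, not the replica implementation and not necessarily how the PR addressed it: `ToyAsyncReplica` and its method names are hypothetical, and `asyncio.sleep` stands in for a slow user `__init__`.

```python
import asyncio
from typing import Optional


class ToyAsyncReplica:
    """Toy async replica; NOT Ray Serve's replica class."""

    def __init__(self) -> None:
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self._rank: Optional[int] = None

    async def initialize(self, rank: int, init_seconds: float) -> None:
        async with self._init_lock:
            await asyncio.sleep(init_seconds)  # slow user __init__
            self._rank = rank
            self._initialized = True

    async def was_initialized_unlocked(self) -> bool:
        # Reads the flag without the lock, as the review comment describes.
        return self._initialized

    async def was_initialized_locked(self) -> bool:
        # Waits for any in-progress initialization before reading the flag.
        async with self._init_lock:
            return self._initialized


async def probe_during_init(locked: bool) -> bool:
    replica = ToyAsyncReplica()
    init_task = asyncio.create_task(replica.initialize(rank=0, init_seconds=0.05))
    await asyncio.sleep(0)  # let the init task start and take the lock
    if locked:
        result = await replica.was_initialized_locked()
    else:
        result = await replica.was_initialized_unlocked()
    await init_task
    return result


unlocked_result = asyncio.run(probe_during_init(locked=False))
locked_result = asyncio.run(probe_during_init(locked=True))

print(unlocked_result, locked_result)
```

In this model the unlocked probe observes `False` while a legitimate initialization is still running, whereas the locked probe waits and then observes `True`. A locked probe still returns `False` immediately for an actor that never received an initialization call, since nothing holds the lock in that case.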

@abrarsheikh abrarsheikh added the "go" label (add ONLY when ready to merge, run all tests) on May 5, 2026
Comment thread python/ray/serve/_private/deployment_state.py Outdated
@ray-gardener ray-gardener Bot added the "serve" label (Ray Serve Related Issue) on May 5, 2026
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh merged commit 97ddb4c into master May 6, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the 63118-abrar-rank branch May 6, 2026 16:45
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…hout rank assignment (ray-project#63139)


Signed-off-by: abrar <abrar@anyscale.com>