Fix replica rank consistency check failure during node migration by harshit-anyscale · Pull Request #60365 · ray-project/ray

harshit-anyscale · 2026-01-21T14:17:47Z

Summary

Fixes the "Found active keys without ranks" error that occurs during node draining/migration when the rank consistency check runs with STARTING replicas that don't have ranks assigned yet.

Problem Description

When a node is drained, the Ray Serve controller migrates replicas to other nodes. During this migration, users encounter the following error:

ERROR controller -- Found active keys without ranks: {'2vuuwu18'}. This should never happen. Please report this as a bug.
ERROR controller -- Error executing function _check_rank_consistency_impl: Rank system is in an invalid state.

This error occurs because of a race condition between replica creation and the rank consistency check.

Root Cause Analysis

The issue stems from the order of operations in the DeploymentStateManager.update() cycle:

STEP 1: check_and_update_replicas()    ← Contains rank consistency check
STEP 2: check_curr_status()            ← Sets status to HEALTHY
STEP 3: migrate_replicas_on_draining_nodes()  ← Moves replica to PENDING_MIGRATION
STEP 4: scale_deployment_replicas()    ← Creates new STARTING replica
STEP 5: check_curr_status()            ← **Should update status, but doesn't**

During node migration:

STEP 2 sets deployment status to HEALTHY (all replicas running, target met)
STEP 3 moves the replica on the draining node from RUNNING → PENDING_MIGRATION

STEP 4 creates a replacement replica in STARTING state because PENDING_MIGRATION is not counted in current_replicas:

current_replicas = self._replicas.count(
    states=[ReplicaState.STARTING, ReplicaState.UPDATING, ReplicaState.RUNNING]
    # PENDING_MIGRATION not included!
)

STEP 5 calls check_curr_status() but the status remains HEALTHY
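The counting rule in STEP 4 can be illustrated with a small standalone sketch. This is not the Ray Serve implementation: the `ReplicaState` enum here is a toy stand-in, and `replicas_to_start` is a hypothetical helper introduced only to show why excluding PENDING_MIGRATION from the count leads to a replacement replica being started.

```python
from enum import Enum, auto


class ReplicaState(Enum):
    # Toy stand-in for the real enum; only the states discussed here.
    STARTING = auto()
    UPDATING = auto()
    RUNNING = auto()
    PENDING_MIGRATION = auto()


# States that count toward the target, mirroring the snippet above.
COUNTED_STATES = {
    ReplicaState.STARTING,
    ReplicaState.UPDATING,
    ReplicaState.RUNNING,
}


def replicas_to_start(states, target):
    """Hypothetical helper: how many new replicas are needed to reach target."""
    current = sum(1 for state in states if state in COUNTED_STATES)
    return max(target - current, 0)


before_drain = [ReplicaState.RUNNING] * 3
after_drain = [
    ReplicaState.RUNNING,
    ReplicaState.RUNNING,
    ReplicaState.PENDING_MIGRATION,  # replica on the draining node
]

print(replicas_to_start(before_drain, target=3))  # 0
print(replicas_to_start(after_drain, target=3))   # 1
```

With the migrating replica excluded from the count, the deployment appears one replica short of its target, so one replacement is started in STARTING state.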

In the next update cycle:

STEP 1 runs the rank consistency check because status == HEALTHY
The check includes the STARTING replica (via self._replicas.get() which returns all states)
But STARTING replicas don't have ranks - ranks are only assigned when transitioning to RUNNING
ERROR: "Found active keys without ranks"
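The failing condition in the next update cycle can be sketched as a set comparison between tracked replicas and assigned ranks. This is a simplified model, not the actual `_check_rank_consistency_impl` code: `keys_without_ranks` is a hypothetical helper, the replica IDs are made up, and the sketch assumes the PENDING_MIGRATION replica still holds its rank.

```python
from enum import Enum, auto


class ReplicaState(Enum):
    # Toy stand-in for the real enum.
    STARTING = auto()
    RUNNING = auto()
    PENDING_MIGRATION = auto()


def keys_without_ranks(replicas, ranks):
    """Hypothetical helper: tracked replica IDs that have no rank assigned."""
    return {replica_id for replica_id in replicas if replica_id not in ranks}


# Toy snapshot at the start of the next update cycle (IDs are illustrative).
replicas = {
    "a": ReplicaState.RUNNING,
    "b": ReplicaState.RUNNING,
    "c": ReplicaState.PENDING_MIGRATION,  # assumed to still hold its rank
    "d": ReplicaState.STARTING,           # replacement replica, no rank yet
}
ranks = {"a": 0, "b": 1, "c": 2}

missing = keys_without_ranks(replicas, ranks)
print(missing)  # {'d'}
```

Because the check looks at replicas in all states, the STARTING replacement shows up as a key without a rank even though nothing is actually wrong.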

Bug: `check_curr_status()` doesn't transition away from HEALTHY

When STARTING replicas exist, check_curr_status() does not update the status:

# Line 2971-2998 in deployment_state.py
if (
    self._replicas.count(
        states=[ReplicaState.STARTING, ReplicaState.UPDATING, ...]
    )
    == 0  # Only enters block if NO STARTING replicas
):
    if (target == running):
        self._curr_status_info = ... HEALTHY

return False, any_replicas_recovering  # If STARTING exists, just returns without changing status!

The method only transitions TO HEALTHY, but never transitions AWAY from HEALTHY when conditions change. If STARTING replicas are added after the deployment was healthy, the status incorrectly remains HEALTHY.
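The one-way transition described above can be modeled in a few lines. This is a toy sketch, not the real `check_curr_status()`: the `Status` enum and the `next_status` function are hypothetical names, and the real method tracks more states and conditions.

```python
from enum import Enum, auto


class Status(Enum):
    # Toy stand-in for the deployment status.
    UPDATING = auto()
    HEALTHY = auto()


def next_status(current, num_starting, num_running, target):
    """Toy model: the status can move to HEALTHY but never away from it."""
    if num_starting == 0 and num_running == target:
        return Status.HEALTHY
    # No branch here moves the status away from HEALTHY.
    return current


# All three replicas are running: the status becomes HEALTHY.
status = next_status(Status.UPDATING, num_starting=0, num_running=3, target=3)
print(status.name)  # HEALTHY

# A replacement replica appears (one STARTING, two RUNNING): still HEALTHY.
status = next_status(status, num_starting=1, num_running=2, target=3)
print(status.name)  # HEALTHY
```

In this model, once the status is HEALTHY it stays HEALTHY when a STARTING replica is added, which matches the behavior the description reports.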

I am not sure why this behavior was chosen, but I presume it was to avoid surfacing the transient migration state to users as the deployment status.

Fix

This PR fixes the bug by adding an explicit check to skip the rank consistency check when STARTING replicas exist:

active_replicas = self._replicas.get()
if (
    active_replicas
    and self._curr_status_info.status == DeploymentStatus.HEALTHY
    # Skip consistency check if there are STARTING replicas. During node
    # migration, new replicas are created in STARTING state (without ranks)
    # after the status is set to HEALTHY. Running the consistency check
    # with STARTING replicas causes "active keys without ranks" error.
    and self._replicas.count(states=[ReplicaState.STARTING]) == 0
):

This assumes that the rank consistency check should run only when no STARTING replicas are present.
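The guard added by the fix can be restated as a small pure function for illustration. This is a sketch under stated assumptions: `should_check_ranks` is a hypothetical name, the enum is a toy stand-in, and the real condition reads from `self._replicas` and `self._curr_status_info` rather than taking arguments.

```python
from enum import Enum, auto


class ReplicaState(Enum):
    # Toy stand-in for the real enum.
    STARTING = auto()
    RUNNING = auto()
    PENDING_MIGRATION = auto()


def should_check_ranks(replica_states, is_healthy):
    """Hypothetical restatement of the three-part condition in the fix."""
    has_replicas = len(replica_states) > 0
    no_starting = ReplicaState.STARTING not in replica_states
    return has_replicas and is_healthy and no_starting


steady = [ReplicaState.RUNNING] * 3
migrating = [
    ReplicaState.RUNNING,
    ReplicaState.RUNNING,
    ReplicaState.PENDING_MIGRATION,
    ReplicaState.STARTING,  # replacement without a rank
]

print(should_check_ranks(steady, is_healthy=True))     # True
print(should_check_ranks(migrating, is_healthy=True))  # False
```

The check still runs in the steady state and is skipped only while a STARTING replica exists, so it resumes once the replacement reaches RUNNING.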

Signed-off-by: harshit <harshit@anyscale.com>

gemini-code-assist

Code Review

This pull request provides a clear and effective fix for the race condition that occurs during node migration. The root cause analysis in the description is thorough and accurate. The proposed change, which skips the rank consistency check when there are STARTING replicas, is a safe and targeted solution to prevent the "active keys without ranks" error. The code change is correct, minimal, and includes a helpful comment explaining the reasoning. I approve of this change.

harshit-anyscale · 2026-01-23T08:21:31Z

Testing

Reproduction process: Deploy an application with one deployment, 3 replicas, and max replicas per node = 1.
Then manually delete a node, which forces a replica to migrate from node 1 to a new node.

Before the fix,

After the fix, the error about a replica with an unassigned rank no longer appears. The error that is still shown is already tracked in a separate issue.

abrarsheikh · 2026-01-23T17:10:41Z

+            # migration, new replicas are created in STARTING state (without ranks)
+            # after the status is set to HEALTHY. Running the consistency check
+            # with STARTING replicas causes "active keys without ranks" error.
+            and self._replicas.count(states=[ReplicaState.STARTING]) == 0


nice catch, let's add a test to catch this scenario.

Done, added the tests in rayturbo: https://github.com/anyscale/rayturbo/pull/2949


replica rank bugfix

32fab11

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale requested a review from a team as a code owner January 21, 2026 14:17

harshit-anyscale self-assigned this Jan 21, 2026

gemini-code-assist Bot reviewed Jan 21, 2026


ray-gardener Bot added the serve Ray Serve Related Issue label Jan 21, 2026

harshit-anyscale added the go add ONLY when ready to merge, run all tests label Jan 22, 2026

Merge branch 'master' into replica-rank-bugfix-v2

06e023a

abrarsheikh reviewed Jan 23, 2026


harshit-anyscale requested a review from abrarsheikh February 5, 2026 09:13

abrarsheikh approved these changes Feb 5, 2026


abrarsheikh merged commit 8c732fe into master Feb 5, 2026
6 checks passed

abrarsheikh deleted the replica-rank-bugfix-v2 branch February 5, 2026 19:06
