Fix replica rank consistency check failure during node migration #60365
Merged
Signed-off-by: harshit <harshit@anyscale.com>
Contributor
Code Review
This pull request provides a clear and effective fix for the race condition that occurs during node migration. The root cause analysis in the description is thorough and accurate. The proposed change, which skips the rank consistency check when there are STARTING replicas, is a safe and targeted solution to prevent the "active keys without ranks" error. The code change is correct, minimal, and includes a helpful comment explaining the reasoning. I approve of this change.
abrarsheikh
reviewed
Jan 23, 2026
# migration, new replicas are created in STARTING state (without ranks)
# after the status is set to HEALTHY. Running the consistency check
# with STARTING replicas causes "active keys without ranks" error.
and self._replicas.count(states=[ReplicaState.STARTING]) == 0
Contributor
nice catch, let's add a test to catch this scenario.
Contributor
Author
done, added them in the rayturbo https://github.com/anyscale/rayturbo/pull/2949
abrarsheikh
approved these changes
Feb 5, 2026
tiennguyentony
pushed a commit
to tiennguyentony/ray
that referenced
this pull request
Feb 7, 2026
…-project#60365)

## Summary

Fixes the "Found active keys without ranks" error that occurs during node draining/migration when the rank consistency check runs with STARTING replicas that don't have ranks assigned yet.

---

## Problem Description

When a node is drained, the Ray Serve controller migrates replicas to other nodes. During this migration, users encounter the following error:

```
ERROR controller -- Found active keys without ranks: {'2vuuwu18'}. This should never happen. Please report this as a bug.
ERROR controller -- Error executing function _check_rank_consistency_impl: Rank system is in an invalid state.
```

This error occurs because of a race condition between replica creation and the rank consistency check.

---

## Root Cause Analysis

The issue stems from the order of operations in the `DeploymentStateManager.update()` cycle:

```
STEP 1: check_and_update_replicas()           ← Contains rank consistency check
STEP 2: check_curr_status()                   ← Sets status to HEALTHY
STEP 3: migrate_replicas_on_draining_nodes()  ← Moves replica to PENDING_MIGRATION
STEP 4: scale_deployment_replicas()           ← Creates new STARTING replica
STEP 5: check_curr_status()                   ← **Should update status, but doesn't**
```

**During node migration:**

1. **STEP 2** sets deployment status to `HEALTHY` (all replicas running, target met)
2. **STEP 3** moves the replica on the draining node from `RUNNING` → `PENDING_MIGRATION`
3. **STEP 4** creates a replacement replica in `STARTING` state because `PENDING_MIGRATION` is not counted in `current_replicas`:

   ```python
   current_replicas = self._replicas.count(
       states=[ReplicaState.STARTING, ReplicaState.UPDATING, ReplicaState.RUNNING]
       # PENDING_MIGRATION not included!
   )
   ```
4. **STEP 5** calls `check_curr_status()` but the status **remains HEALTHY**

**In the next update cycle:**

1. **STEP 1** runs the rank consistency check because `status == HEALTHY`
2. The check includes the `STARTING` replica (via `self._replicas.get()` which returns all states)
3. But `STARTING` replicas don't have ranks - ranks are only assigned when transitioning to `RUNNING`
4. **ERROR: "Found active keys without ranks"**

---

### Bug: `check_curr_status()` doesn't transition away from HEALTHY

When `STARTING` replicas exist, `check_curr_status()` does not update the status:

```python
# Line 2971-2998 in deployment_state.py
if (
    self._replicas.count(
        states=[ReplicaState.STARTING, ReplicaState.UPDATING, ...]
    ) == 0  # Only enters block if NO STARTING replicas
):
    if (target == running):
        self._curr_status_info = ... HEALTHY
        return False, any_replicas_recovering
# If STARTING exists, just returns without changing status!
```

The method only transitions **TO** `HEALTHY`, but never transitions **AWAY** from `HEALTHY` when conditions change. If `STARTING` replicas are added after the deployment was healthy, the status incorrectly remains `HEALTHY`.

Not sure why this was the decided behavior, but I presume that we did it because we don't want to show this migration replica state to the users as the deployment status.

---

## Fix

This PR fixes the bug by adding an explicit check to skip the rank consistency check when `STARTING` replicas exist:

```python
active_replicas = self._replicas.get()
if (
    active_replicas
    and self._curr_status_info.status == DeploymentStatus.HEALTHY
    # Skip consistency check if there are STARTING replicas. During node
    # migration, new replicas are created in STARTING state (without ranks)
    # after the status is set to HEALTHY. Running the consistency check
    # with STARTING replicas causes "active keys without ranks" error.
    and self._replicas.count(states=[ReplicaState.STARTING]) == 0
):
```

This is being done under the assumption that we want to call the rank consistency function with running replicas only (they shouldn't have any starting replicas in it).

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
tiennguyentony
pushed a commit
to tiennguyentony/ray
that referenced
this pull request
Feb 7, 2026
…-project#60365)
Signed-off-by: harshit <harshit@anyscale.com>
elliot-barn
pushed a commit
that referenced
this pull request
Feb 9, 2026
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn
pushed a commit
that referenced
this pull request
Feb 9, 2026
Signed-off-by: harshit <harshit@anyscale.com>
ans9868
pushed a commit
to ans9868/ray
that referenced
this pull request
Feb 18, 2026
…-project#60365)
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab
pushed a commit
to kunling-anyscale/ray
that referenced
this pull request
Feb 20, 2026
…-project#60365)
Signed-off-by: harshit <harshit@anyscale.com>
## Summary

Fixes the "Found active keys without ranks" error that occurs during node draining/migration when the rank consistency check runs with STARTING replicas that don't have ranks assigned yet.

## Problem Description

When a node is drained, the Ray Serve controller migrates replicas to other nodes. During this migration, the controller logs `Found active keys without ranks: {'2vuuwu18'}. This should never happen. Please report this as a bug.`, followed by `Error executing function _check_rank_consistency_impl: Rank system is in an invalid state.`

This error occurs because of a race condition between replica creation and the rank consistency check.
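To make the failing condition concrete, here is a minimal sketch of the "active keys without ranks" situation. It is a toy model, not Ray Serve code: `active_keys_without_ranks`, the `replicas` and `ranks` dictionaries, and the replica ID `old-replica` are hypothetical names, and it assumes that only a replica that has already reached `RUNNING` holds a rank.

```python
from enum import Enum


class ReplicaState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    PENDING_MIGRATION = "PENDING_MIGRATION"


def active_keys_without_ranks(replicas: dict, ranks: dict) -> set:
    """Return IDs of tracked replicas that have no rank assigned."""
    return {replica_id for replica_id in replicas if replica_id not in ranks}


# State right after a node drain: the old replica is PENDING_MIGRATION
# and its replacement is still STARTING.
replicas = {
    "old-replica": ReplicaState.PENDING_MIGRATION,
    "2vuuwu18": ReplicaState.STARTING,
}
# Assumption for this toy model: only the replica that already reached
# RUNNING was given a rank.
ranks = {"old-replica": 0}

missing = active_keys_without_ranks(replicas, ranks)
if missing:
    print(f"Found active keys without ranks: {missing}")
```

The replacement replica is tracked but unranked, which is the state the consistency check reports as invalid.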
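## Root Cause Analysis

The issue stems from the order of operations in the `DeploymentStateManager.update()` cycle:

1. STEP 1: `check_and_update_replicas()` contains the rank consistency check
2. STEP 2: `check_curr_status()` sets the status to `HEALTHY`
3. STEP 3: `migrate_replicas_on_draining_nodes()` moves the replica to `PENDING_MIGRATION`
4. STEP 4: `scale_deployment_replicas()` creates a new `STARTING` replica
5. STEP 5: `check_curr_status()` should update the status, but doesn't

**During node migration:**

1. **STEP 2** sets deployment status to `HEALTHY` (all replicas running, target met)
2. **STEP 3** moves the replica on the draining node from `RUNNING` → `PENDING_MIGRATION`
3. **STEP 4** creates a replacement replica in `STARTING` state because `PENDING_MIGRATION` is not counted in `current_replicas`, which counts only `STARTING`, `UPDATING`, and `RUNNING` replicas
4. **STEP 5** calls `check_curr_status()` but the status **remains HEALTHY**

**In the next update cycle:**

1. **STEP 1** runs the rank consistency check because `status == HEALTHY`
2. The check includes the `STARTING` replica (via `self._replicas.get()` which returns all states)
3. But `STARTING` replicas don't have ranks - ranks are only assigned when transitioning to `RUNNING`
4. **ERROR: "Found active keys without ranks"**

### Bug: `check_curr_status()` doesn't transition away from HEALTHY

When `STARTING` replicas exist, `check_curr_status()` does not update the status. The block that sets `HEALTHY` is entered only when the count of replicas in the listed states (`STARTING`, `UPDATING`, ...) is 0; if a `STARTING` replica exists, the method returns without changing the status.

The method only transitions **TO** `HEALTHY`, but never transitions **AWAY** from `HEALTHY` when conditions change. If `STARTING` replicas are added after the deployment was healthy, the status incorrectly remains `HEALTHY`.

I'm not sure why this behavior was chosen, but I presume it is because we don't want to show this transient migration replica state to users as the deployment status.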
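The one-way status transition can be illustrated with a small simulation. This is a toy model under stated assumptions, not the controller's implementation: `ToyDeployment` and its fields are hypothetical, states are plain strings, and only the branch of `check_curr_status()` described above is modelled.

```python
from collections import Counter


class ToyDeployment:
    """Toy model of the status logic described above; not Ray Serve code."""

    def __init__(self, target: int):
        self.target = target
        self.status = "UPDATING"
        self.replicas = {}  # replica id -> state name

    def check_curr_status(self) -> None:
        counts = Counter(self.replicas.values())
        if counts["STARTING"] + counts["UPDATING"] == 0:
            if counts["RUNNING"] == self.target:
                self.status = "HEALTHY"
        # No else branch: with a STARTING replica present, the previous
        # status is left untouched.


dep = ToyDeployment(target=1)
dep.replicas["a"] = "RUNNING"
dep.check_curr_status()                  # STEP 2: status becomes HEALTHY
dep.replicas["a"] = "PENDING_MIGRATION"  # STEP 3: node is draining
dep.replicas["b"] = "STARTING"           # STEP 4: replacement is created
dep.check_curr_status()                  # STEP 5: status is not revisited
print(dep.status, sorted(dep.replicas.values()))
```

At the end of the cycle the toy deployment is still `HEALTHY` while holding a `STARTING` replica, which is the precondition for the failing check in the next cycle.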
## Fix

This PR fixes the bug by adding an explicit condition, `self._replicas.count(states=[ReplicaState.STARTING]) == 0`, so that the rank consistency check is skipped when `STARTING` replicas exist. This is being done under the assumption that we want to call the rank consistency function with running replicas only (there shouldn't be any starting replicas among them).
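As a rough sketch of the guard's logic in isolation, the condition can be written as a standalone predicate. The function name `should_check_rank_consistency` and the use of plain strings for statuses and states are assumptions made for illustration; the actual change is a single extra condition inside the controller's existing `if` statement.

```python
from typing import Iterable


def should_check_rank_consistency(status: str, replica_states: Iterable[str]) -> bool:
    """Hypothetical predicate: run the check only for a HEALTHY deployment
    that has replicas and none of them in the STARTING state."""
    states = list(replica_states)
    return bool(states) and status == "HEALTHY" and "STARTING" not in states


# Steady state: every replica is RUNNING, so the check runs.
print(should_check_rank_consistency("HEALTHY", ["RUNNING", "RUNNING"]))

# Mid-migration: a STARTING replacement exists, so the check is skipped.
print(should_check_rank_consistency("HEALTHY", ["PENDING_MIGRATION", "STARTING"]))
```

Once the replacement replica reaches `RUNNING` and receives its rank, the predicate holds again and the check resumes on a later update cycle.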