
Fix replica rank consistency check failure during node migration #60365

Merged: abrarsheikh merged 2 commits into master from replica-rank-bugfix-v2 on Feb 5, 2026

@harshit-anyscale harshit-anyscale commented Jan 21, 2026

Summary

Fixes the "Found active keys without ranks" error that occurs during node draining/migration when the rank consistency check runs with STARTING replicas that don't have ranks assigned yet.


Problem Description

When a node is drained, the Ray Serve controller migrates replicas to other nodes. During this migration, users encounter the following error:

ERROR controller -- Found active keys without ranks: {'2vuuwu18'}. This should never happen. Please report this as a bug.
ERROR controller -- Error executing function _check_rank_consistency_impl: Rank system is in an invalid state.

This error occurs because of a race condition between replica creation and the rank consistency check.


Root Cause Analysis

The issue stems from the order of operations in the DeploymentStateManager.update() cycle:

STEP 1: check_and_update_replicas()    ← Contains rank consistency check
STEP 2: check_curr_status()            ← Sets status to HEALTHY
STEP 3: migrate_replicas_on_draining_nodes()  ← Moves replica to PENDING_MIGRATION
STEP 4: scale_deployment_replicas()    ← Creates new STARTING replica
STEP 5: check_curr_status()            ← **Should update status, but doesn't**

During node migration:

  1. STEP 2 sets deployment status to HEALTHY (all replicas running, target met)
  2. STEP 3 moves the replica on the draining node from RUNNING → PENDING_MIGRATION
  3. STEP 4 creates a replacement replica in STARTING state because PENDING_MIGRATION is not counted in current_replicas:
    current_replicas = self._replicas.count(
        states=[ReplicaState.STARTING, ReplicaState.UPDATING, ReplicaState.RUNNING]
        # PENDING_MIGRATION not included!
    )
  4. STEP 5 calls check_curr_status() but the status remains HEALTHY

In the next update cycle:

  1. STEP 1 runs the rank consistency check because status == HEALTHY
  2. The check includes the STARTING replica (via self._replicas.get() which returns all states)
  3. But STARTING replicas don't have ranks - ranks are only assigned when transitioning to RUNNING
  4. ERROR: "Found active keys without ranks"
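The two cycles above can be replayed with a small toy model. This is only a sketch to illustrate the ordering problem, not the controller code: `ToyDeployment`, the `r0`..`r3` replica IDs, the string statuses, and `simulate()` are all hypothetical, and the real methods do much more than shown here.

```python
from enum import Enum


class ReplicaState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    PENDING_MIGRATION = "PENDING_MIGRATION"


class ToyDeployment:
    """Hypothetical stand-in for per-deployment state; not the Ray Serve class."""

    def __init__(self, target):
        self.target = target
        self.replicas = {f"r{i}": ReplicaState.RUNNING for i in range(target)}
        # In this toy model, only replicas that reached RUNNING hold a rank.
        self.ranks = {rid: rank for rank, rid in enumerate(self.replicas)}
        self.status = "UPDATING"
        self.errors = []

    def _count(self, *states):
        return sum(state in states for state in self.replicas.values())

    def check_and_update_replicas(self):
        # STEP 1: the consistency check runs only while the status is HEALTHY.
        if self.replicas and self.status == "HEALTHY":
            missing = sorted(set(self.replicas) - set(self.ranks))
            if missing:
                self.errors.append(f"Found active keys without ranks: {missing}")

    def check_curr_status(self):
        # STEP 2 and STEP 5: only ever moves the status to HEALTHY.
        no_starting = self._count(ReplicaState.STARTING) == 0
        if no_starting and self._count(ReplicaState.RUNNING) == self.target:
            self.status = "HEALTHY"

    def migrate_replicas_on_draining_nodes(self, draining):
        # STEP 3: replicas on draining nodes leave the RUNNING state.
        for rid in draining:
            self.replicas[rid] = ReplicaState.PENDING_MIGRATION

    def scale_deployment_replicas(self):
        # STEP 4: PENDING_MIGRATION is not counted, so a replacement is created.
        current = self._count(ReplicaState.STARTING, ReplicaState.RUNNING)
        for _ in range(self.target - current):
            self.replicas[f"r{len(self.replicas)}"] = ReplicaState.STARTING

    def update(self, draining=()):
        self.check_and_update_replicas()
        self.check_curr_status()
        self.migrate_replicas_on_draining_nodes(draining)
        self.scale_deployment_replicas()
        self.check_curr_status()


def simulate():
    deployment = ToyDeployment(target=3)
    deployment.update(draining=["r0"])  # cycle 1: node hosting r0 is drained
    deployment.update()                 # cycle 2: check sees the STARTING replica
    return deployment.status, deployment.errors
```

In this toy model, `simulate()` ends with the status still `HEALTHY` and one recorded error naming the replacement replica `r3`, which mirrors the sequence described above.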

Bug: check_curr_status() doesn't transition away from HEALTHY

When STARTING replicas exist, check_curr_status() does not update the status:

# Line 2971-2998 in deployment_state.py
if (
    self._replicas.count(
        states=[ReplicaState.STARTING, ReplicaState.UPDATING, ...]
    )
    == 0  # Only enters block if NO STARTING replicas
):
    if (target == running):
        self._curr_status_info = ... HEALTHY

return False, any_replicas_recovering  # If STARTING exists, just returns without changing status!

The method only transitions TO HEALTHY, but never transitions AWAY from HEALTHY when conditions change. If STARTING replicas are added after the deployment was healthy, the status incorrectly remains HEALTHY.
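The one-way transition can be written as a small pure function. This is a simplified sketch, assuming only STARTING and UPDATING count as in-flight states (the real list in deployment_state.py is longer and elided above); `next_status` and the string state names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def next_status(current_status: str, replica_states: Sequence[str], target: int) -> str:
    """Return the status after one simplified check_curr_status()-style pass."""
    counts = Counter(replica_states)
    in_flight = counts["STARTING"] + counts["UPDATING"]
    if in_flight == 0 and counts["RUNNING"] == target:
        return "HEALTHY"
    # No branch moves the status away from HEALTHY, so the previous
    # value is returned unchanged even when STARTING replicas exist.
    return current_status


# A deployment that was HEALTHY stays HEALTHY after a replacement replica
# is created in STARTING state during migration.
after_migration = ["RUNNING", "RUNNING", "PENDING_MIGRATION", "STARTING"]
status_after_migration = next_status("HEALTHY", after_migration, target=3)
```

Here `status_after_migration` is still `"HEALTHY"`, which is the stale status that lets the rank consistency check run in the next cycle.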

I am not sure why this behavior was chosen, but I presume it is because we don't want to surface the transient migration replica state to users as the deployment status.


Fix

This PR fixes the bug by adding an explicit check to skip the rank consistency check when STARTING replicas exist:

active_replicas = self._replicas.get()
if (
    active_replicas
    and self._curr_status_info.status == DeploymentStatus.HEALTHY
    # Skip consistency check if there are STARTING replicas. During node
    # migration, new replicas are created in STARTING state (without ranks)
    # after the status is set to HEALTHY. Running the consistency check
    # with STARTING replicas causes "active keys without ranks" error.
    and self._replicas.count(states=[ReplicaState.STARTING]) == 0
):

This assumes that the rank consistency check should only run when no STARTING replicas are present, since those replicas have not been assigned ranks yet.
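The guard can be isolated as a predicate and exercised without a cluster. This is a sketch of the condition only, not the actual controller code or its tests: `should_check_rank_consistency` and the plain string states are hypothetical stand-ins for `self._curr_status_info.status` and `self._replicas`.

```python
from typing import Sequence


def should_check_rank_consistency(status: str, replica_states: Sequence[str]) -> bool:
    """Mirror the three conditions of the guard added in this PR."""
    return (
        len(replica_states) > 0             # there are replicas to check
        and status == "HEALTHY"             # status says the deployment is healthy
        and "STARTING" not in replica_states  # new: no rankless STARTING replicas
    )


# Steady state: all replicas RUNNING, so the check runs.
steady = should_check_rank_consistency("HEALTHY", ["RUNNING", "RUNNING", "RUNNING"])

# Mid-migration: status is still HEALTHY, but a STARTING replacement exists,
# so the check is skipped until that replica is RUNNING and has a rank.
migrating = should_check_rank_consistency(
    "HEALTHY", ["RUNNING", "RUNNING", "PENDING_MIGRATION", "STARTING"]
)
```

With these inputs `steady` is `True` and `migrating` is `False`. The trade-off is that the consistency check is paused for as long as any replica is starting, rather than being narrowed to ranked replicas only.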

Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner January 21, 2026 14:17
@harshit-anyscale harshit-anyscale self-assigned this Jan 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request provides a clear and effective fix for the race condition that occurs during node migration. The root cause analysis in the description is thorough and accurate. The proposed change, which skips the rank consistency check when there are STARTING replicas, is a safe and targeted solution to prevent the "active keys without ranks" error. The code change is correct, minimal, and includes a helpful comment explaining the reasoning. I approve of this change.

@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label Jan 21, 2026
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Jan 22, 2026
@harshit-anyscale

Testing

Reproduction: deploy an application with one deployment, 3 replicas, and max replicas per node = 1.
Then manually delete a node, which forces a replica to migrate from that node to a new node.

Before the fix:
[Screenshot 2026-01-23 at 12 01 51]

After the fix, the error about a replica with an unassigned rank no longer appears; the error still shown is already tracked in a separate issue.
[Screenshot 2026-01-23 at 13 49 23]

# migration, new replicas are created in STARTING state (without ranks)
# after the status is set to HEALTHY. Running the consistency check
# with STARTING replicas causes "active keys without ranks" error.
and self._replicas.count(states=[ReplicaState.STARTING]) == 0

nice catch, let's add a test to catch this scenario.

@harshit-anyscale harshit-anyscale Feb 5, 2026

done, added them in the rayturbo https://github.com/anyscale/rayturbo/pull/2949

@abrarsheikh abrarsheikh merged commit 8c732fe into master Feb 5, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the replica-rank-bugfix-v2 branch February 5, 2026 19:06
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026