
[2/n] [Serve] Refactor replica rank to prepare for node local ranks #58473

Merged
abrarsheikh merged 10 commits into master from LLM-2497-abrar-rank-p2
Nov 14, 2025

Conversation

@abrarsheikh abrarsheikh commented Nov 8, 2025


Summary

This PR refactors the replica rank system to support multi-dimensional ranking (global, node-level, and local ranks) in preparation for node-local rank tracking. The ReplicaRank object now contains three fields instead of being a simple integer, enabling better coordination of replicas across nodes.

Motivation

Currently, Ray Serve only tracks a single global rank per replica. For advanced use cases like tensor parallelism, model sharding across nodes, and node-aware coordination, we need to track:

  • Global rank: Replica's rank across all nodes (0 to N-1)
  • Node rank: Which node the replica is on (0 to M-1)
  • Local rank: Replica's rank on its specific node (0 to K-1)

This PR lays the groundwork by introducing the expanded ReplicaRank schema while preserving existing functionality.
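To make the three rank dimensions concrete, here is a minimal sketch of how they relate for a given replica placement. The `Rank` tuple, the `assign_ranks` helper, and the node names are hypothetical and exist only for illustration; this is not the assignment logic used by Ray Serve, and this PR itself leaves the node and local ranks as placeholders.

```python
from typing import Dict, List, NamedTuple


class Rank(NamedTuple):
    """Hypothetical stand-in for the three rank dimensions described above."""

    rank: int        # global rank across all replicas (0 to N-1)
    node_rank: int   # index of the node hosting the replica (0 to M-1)
    local_rank: int  # rank among replicas on the same node (0 to K-1)


def assign_ranks(replica_nodes: List[str]) -> List[Rank]:
    """Assign ranks in placement order (an assumption made for this sketch).

    replica_nodes[i] is the node that hosts the i-th replica.
    """
    node_index: Dict[str, int] = {}      # node name -> node rank
    per_node_count: Dict[str, int] = {}  # node name -> replicas seen so far
    ranks: List[Rank] = []
    for global_rank, node in enumerate(replica_nodes):
        # The first time a node is seen, it gets the next free node rank.
        node_rank = node_index.setdefault(node, len(node_index))
        # The local rank counts replicas already placed on this node.
        local_rank = per_node_count.get(node, 0)
        per_node_count[node] = local_rank + 1
        ranks.append(Rank(global_rank, node_rank, local_rank))
    return ranks


example = assign_ranks(["node-a", "node-a", "node-b", "node-a", "node-b"])
```

With this placement, the third replica is the first one on `node-b`, so it gets global rank 2, node rank 1, and local rank 0.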

Changes

Core Implementation

  • schema.py: Extended ReplicaRank to include node_rank and local_rank fields (currently set to -1 as placeholders)
  • replica.py: Updated replica actors to handle ReplicaRank objects
  • context.py: Changed ReplicaContext.rank type from Optional[int] to ReplicaRank
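The shape of the expanded schema can be sketched as follows. This is an illustrative stand-in, not the class defined in schema.py: the name `ReplicaRankSketch`, the use of a dataclass, the `-1` defaults, and the `has_node_local_ranks` helper are all assumptions. Only the field names `rank`, `node_rank`, and `local_rank` and the `-1` placeholder value come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRankSketch:
    """Illustrative stand-in for the expanded rank schema (hypothetical)."""

    rank: int             # global rank, assigned as before
    node_rank: int = -1   # placeholder until node ranks are tracked
    local_rank: int = -1  # placeholder until local ranks are tracked


def has_node_local_ranks(r: ReplicaRankSketch) -> bool:
    """Return True only when both placeholder fields have been filled in."""
    return r.node_rank >= 0 and r.local_rank >= 0


current = ReplicaRankSketch(rank=3)
```

Under these assumptions, `has_node_local_ranks(current)` is false for every replica until a later PR assigns real node and local ranks.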

Current Behavior

  • node_rank and local_rank are set to -1 (placeholder values). This will change in a future PR.
  • Global rank assignment and management work as before
  • All existing functionality is preserved

Breaking Changes

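The type of ReplicaContext.rank changes from Optional[int] to ReplicaRank. Code that treated the rank as an integer needs to read the global rank from the ReplicaRank object instead.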
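For user code that previously used the rank as an integer, one possible migration is a small adapter that accepts both forms. The `global_rank` helper below is hypothetical, and a `SimpleNamespace` stands in for the real rank object; the sketch assumes only that the new object exposes the global rank as a `rank` attribute.

```python
from types import SimpleNamespace
from typing import Any


def global_rank(rank: Any) -> int:
    """Return the global rank from either an int or a rank object.

    Hypothetical helper: accepts the old int form as well as any object
    that carries the global rank in a `rank` attribute.
    """
    if isinstance(rank, int):
        return rank   # old behavior: the rank itself is the global rank
    return rank.rank  # new behavior: the global rank is one field of the object


# Stand-in for the new rank object, with placeholder node and local ranks.
new_style = SimpleNamespace(rank=2, node_rank=-1, local_rank=-1)

old_value = global_rank(5)
new_value = global_rank(new_style)
```

An adapter like this lets the same code run before and after the type change, which may help when upgrading across versions.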

Next PR #58477

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Nov 8, 2025
@abrarsheikh abrarsheikh changed the title [Serve] Refactor replica rank to prepare for node local ranks Nov 8, 2025

@kouroshHakha kouroshHakha left a comment


The changes make sense. This needs a more detailed review from the Serve team.

Signed-off-by: abrar <abrar@anyscale.com>
abrarsheikh added a commit that referenced this pull request Nov 13, 2025
…58471)

2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns

3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers

4. **Simplified error handling** - self-healing is not supported

5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support


Next PR #58473

---------

Signed-off-by: abrar <abrar@anyscale.com>
Base automatically changed from LLM-2497-abrar-rank-p1 to master November 13, 2025 19:41
@abrarsheikh abrarsheikh marked this pull request as ready for review November 13, 2025 20:05
@abrarsheikh abrarsheikh requested review from a team as code owners November 13, 2025 20:05
@abrarsheikh abrarsheikh requested a review from zcin November 13, 2025 20:06

@cursor cursor Bot left a comment


Bug: Replica Rank: Reconfiguration Parameter Type Error

In _stop_or_update_outdated_version_replicas, the code passes current_rank.rank (an integer) to replica.reconfigure(), but the method signature expects a ReplicaRank object. This causes a type mismatch since current_rank is already a ReplicaRank object returned from get_replica_rank(). The code should pass current_rank directly instead of extracting its .rank field.

python/ray/serve/_private/deployment_state.py#L2415-L2421

# Get current rank for the replica
current_rank = self._rank_manager.get_replica_rank(
    replica.replica_id.unique_id
)
actor_updating = replica.reconfigure(
    self._target_state.version, rank=current_rank.rank
)
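The change suggested in this comment can be sketched with stand-in classes. `ReplicaRankSketch` and `FakeReplica` are hypothetical and only mimic the described signature (a `reconfigure()` that expects a rank object rather than an int); they are not the classes in deployment_state.py, and the version string is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRankSketch:
    """Hypothetical stand-in for the rank object returned by the rank manager."""

    rank: int
    node_rank: int = -1
    local_rank: int = -1


class FakeReplica:
    """Hypothetical replica wrapper whose reconfigure() requires a rank object."""

    def reconfigure(self, version: str, rank: ReplicaRankSketch) -> bool:
        if not isinstance(rank, ReplicaRankSketch):
            raise TypeError(f"expected ReplicaRankSketch, got {type(rank).__name__}")
        self.version = version
        self.rank = rank
        return True


replica = FakeReplica()
current_rank = ReplicaRankSketch(rank=4)

# Pass the rank object itself, not its integer .rank field.
actor_updating = replica.reconfigure("v2", rank=current_rank)
```

In this sketch, passing `current_rank.rank` instead would raise a `TypeError`, which mirrors the type mismatch described above.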



Comment thread python/ray/serve/_private/deployment_state.py
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label Nov 14, 2025
@abrarsheikh abrarsheikh merged commit 80eb240 into master Nov 14, 2025
6 checks passed
@abrarsheikh abrarsheikh deleted the LLM-2497-abrar-rank-p2 branch November 14, 2025 06:05
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#58471)

SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…ay-project#58471)

SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…ay-project#58473)

Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#58471)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#58473)

Signed-off-by: Future-Outlier <eric901201@gmail.com>

Labels

go (add ONLY when ready to merge, run all tests)
serve (Ray Serve Related Issue)

3 participants