[2/n] [Serve] Refactor replica rank to prepare for node local ranks#58473
Merged
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
kouroshHakha
approved these changes
Nov 11, 2025
kouroshHakha
left a comment
Contributor
The changes make sense. This needs a more detailed review from the Serve team.
Signed-off-by: abrar <abrar@anyscale.com>
abrarsheikh
added a commit
that referenced
this pull request
Nov 13, 2025
…58471)

2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support

Next PR #58473

---------

Signed-off-by: abrar <abrar@anyscale.com>
Bug: Replica Rank: Reconfiguration Parameter Type Error
In `_stop_or_update_outdated_version_replicas`, the code passes `current_rank.rank` (an integer) to `replica.reconfigure()`, but the method signature expects a `ReplicaRank` object. This causes a type mismatch since `current_rank` is already a `ReplicaRank` object returned from `get_replica_rank()`. The code should pass `current_rank` directly instead of extracting its `.rank` field.
python/ray/serve/_private/deployment_state.py#L2415-L2421
ray/python/ray/serve/_private/deployment_state.py
Lines 2415 to 2421 in 372819f
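To make the mismatch described in this comment concrete, here is a minimal sketch. It does not use Ray's code: the `ReplicaRank` dataclass and the `reconfigure` function below are simplified stand-ins written for illustration, and the explicit `isinstance` check is an assumption added only so the wrong argument type fails loudly instead of surfacing later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRank:
    """Illustrative stand-in for the schema discussed in this PR (not Ray's class)."""

    rank: int
    node_rank: int = -1   # placeholder value, per the PR description
    local_rank: int = -1  # placeholder value, per the PR description


def reconfigure(rank: ReplicaRank) -> int:
    """Hypothetical stand-in for a reconfigure call that expects a ReplicaRank."""
    if not isinstance(rank, ReplicaRank):
        raise TypeError(f"expected ReplicaRank, got {type(rank).__name__}")
    return rank.rank


current_rank = ReplicaRank(rank=3)

# Buggy call shape: the rank object is unwrapped to a bare int before the call.
try:
    reconfigure(current_rank.rank)
    buggy_call_failed = False
except TypeError:
    buggy_call_failed = True

# Fixed call shape: the ReplicaRank object is passed through unchanged.
applied_rank = reconfigure(current_rank)
```

Without a runtime check like the one in this sketch, Python would accept the bare integer and the error would only appear when the callee reads an attribute such as `.rank` from it.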
zcin
reviewed
Nov 13, 2025
zcin
approved these changes
Nov 13, 2025
landscapepainter
pushed a commit
to landscapepainter/ray
that referenced
this pull request
Nov 17, 2025
…ay-project#58471)

2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support

Next PR ray-project#58473

---------

Signed-off-by: abrar <abrar@anyscale.com>
SheldonTsen
pushed a commit
to SheldonTsen/ray
that referenced
this pull request
Dec 1, 2025
…ay-project#58471) 2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns 3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers 4. **Simplified error handling** - not supporting self healing 5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases **Impact:** - Cleaner separation of concerns in rank management - Foundation for future multi-level rank support Next PR ray-project#58473 --------- Signed-off-by: abrar <abrar@anyscale.com>
SheldonTsen
pushed a commit
to SheldonTsen/ray
that referenced
this pull request
Dec 1, 2025
…ay-project#58473)

### Summary
This PR refactors the replica rank system to support multi-dimensional ranking (global, node-level, and local ranks) in preparation for node-local rank tracking. The `ReplicaRank` object now contains three fields instead of being a simple integer, enabling better coordination of replicas across nodes.

### Motivation
Currently, Ray Serve only tracks a single global rank per replica. For advanced use cases like tensor parallelism, model sharding across nodes, and node-aware coordination, we need to track:
- **Global rank**: Replica's rank across all nodes (0 to N-1)
- **Node rank**: Which node the replica is on (0 to M-1)
- **Local rank**: Replica's rank on its specific node (0 to K-1)

This PR lays the groundwork by introducing the expanded `ReplicaRank` schema while maintaining backward compatibility in feature.

### Changes
#### Core Implementation
- **`schema.py`**: Extended `ReplicaRank` to include `node_rank` and `local_rank` fields (currently set to -1 as placeholders)
- **`replica.py`**: Updated replica actors to handle `ReplicaRank` objects
- **`context.py`**: Changed `ReplicaContext.rank` type from `Optional[int]` to `ReplicaRank`

### Current Behavior
- `node_rank` and `local_rank` are set to `-1` (placeholder values). Will change in future
- Global rank assignment and management works as before
- All existing functionality is preserved

### Breaking Changes
Rank is changing from `int` to `ReplicaRank`

Next PR ray-project#58477

---------

Signed-off-by: abrar <abrar@anyscale.com>
Future-Outlier
pushed a commit
to Future-Outlier/ray
that referenced
this pull request
Dec 7, 2025
…ay-project#58471) 2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns 3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers 4. **Simplified error handling** - not supporting self healing 5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases **Impact:** - Cleaner separation of concerns in rank management - Foundation for future multi-level rank support Next PR ray-project#58473 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier
pushed a commit
to Future-Outlier/ray
that referenced
this pull request
Dec 7, 2025
…ay-project#58473) ### Summary This PR refactors the replica rank system to support multi-dimensional ranking (global, node-level, and local ranks) in preparation for node-local rank tracking. The `ReplicaRank` object now contains three fields instead of being a simple integer, enabling better coordination of replicas across nodes. ### Motivation Currently, Ray Serve only tracks a single global rank per replica. For advanced use cases like tensor parallelism, model sharding across nodes, and node-aware coordination, we need to track: - **Global rank**: Replica's rank across all nodes (0 to N-1) - **Node rank**: Which node the replica is on (0 to M-1) - **Local rank**: Replica's rank on its specific node (0 to K-1) This PR lays the groundwork by introducing the expanded `ReplicaRank` schema while maintaining backward compatibility in feature. ### Changes #### Core Implementation - **`schema.py`**: Extended `ReplicaRank` to include `node_rank` and `local_rank` fields (currently set to -1 as placeholders) - **`replica.py`**: Updated replica actors to handle `ReplicaRank` objects - **`context.py`**: Changed `ReplicaContext.rank` type from `Optional[int]` to `ReplicaRank` ### Current Behavior - `node_rank` and `local_rank` are set to `-1` (placeholder values). Will change in future - Global rank assignment and management works as before - All existing functionality is preserved ### Breaking Changes Rank is changing from `int` to `ReplicaRank` Next PR ray-project#58477 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
### Summary
This PR refactors the replica rank system to support multi-dimensional ranking (global, node-level, and local ranks) in preparation for node-local rank tracking. The `ReplicaRank` object now contains three fields instead of being a simple integer, enabling better coordination of replicas across nodes.

### Motivation
Currently, Ray Serve only tracks a single global rank per replica. For advanced use cases like tensor parallelism, model sharding across nodes, and node-aware coordination, we need to track:
- **Global rank**: Replica's rank across all nodes (0 to N-1)
- **Node rank**: Which node the replica is on (0 to M-1)
- **Local rank**: Replica's rank on its specific node (0 to K-1)

This PR lays the groundwork by introducing the expanded `ReplicaRank` schema while maintaining feature-level backward compatibility.

### Changes
#### Core Implementation
- **`schema.py`**: Extended `ReplicaRank` to include `node_rank` and `local_rank` fields (currently set to `-1` as placeholders)
- **`replica.py`**: Updated replica actors to handle `ReplicaRank` objects
- **`context.py`**: Changed `ReplicaContext.rank` type from `Optional[int]` to `ReplicaRank`

### Current Behavior
- `node_rank` and `local_rank` are set to `-1` (placeholder values). These will change in a future PR.
- Global rank assignment and management work as before
- All existing functionality is preserved

### Breaking Changes
Rank is changing from `int` to `ReplicaRank`.

Next PR #58477
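The three rank levels named in the description can be illustrated with a small sketch. This is not the code from this PR, which leaves `node_rank` and `local_rank` at `-1`. The `ReplicaRank` dataclass here is a simplified stand-in, and `assign_ranks` is a hypothetical helper whose assignment order (insertion order of the mapping) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicaRank:
    """Illustrative stand-in with the three fields named in the description."""

    rank: int             # global rank across all nodes (0 to N-1)
    node_rank: int = -1   # which node the replica is on (0 to M-1); -1 = placeholder
    local_rank: int = -1  # rank on that node (0 to K-1); -1 = placeholder


def assign_ranks(replica_to_node: Dict[str, str]) -> Dict[str, ReplicaRank]:
    """Hypothetical helper: derive all three ranks from a replica -> node mapping."""
    node_ranks: Dict[str, int] = {}   # node id -> node rank, in first-seen order
    next_local: Dict[str, int] = {}   # node id -> next unused local rank
    ranks: Dict[str, ReplicaRank] = {}
    for global_rank, (replica_id, node_id) in enumerate(replica_to_node.items()):
        node_rank = node_ranks.setdefault(node_id, len(node_ranks))
        local_rank = next_local.get(node_id, 0)
        next_local[node_id] = local_rank + 1
        ranks[replica_id] = ReplicaRank(global_rank, node_rank, local_rank)
    return ranks


placement = {"r0": "node-a", "r1": "node-b", "r2": "node-a"}
ranks = assign_ranks(placement)

# Behavior described for this PR: only the global rank is meaningful.
placeholder_rank = ReplicaRank(rank=0)

# Code that previously treated the rank as an int now reads the .rank field.
global_rank_as_int = ranks["r2"].rank
```

In this example, `r2` is the third replica overall and the second replica on `node-a`, so it gets global rank 2, node rank 0, and local rank 1.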