[serve][3/n] Gang scheduling -- core scheduling engine#61206
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the core logic for gang scheduling in Ray Serve, a significant feature that enables atomic scheduling of replica groups. The changes span across the deployment scheduler, state machine, and replica logic, and are well-supported by a comprehensive suite of new tests.
The implementation correctly introduces a new step for reserving placement groups, updates the scheduling logic to utilize them, and handles gang-level failures to ensure atomicity. The approach is robust and the code is generally of high quality.
I have one suggestion to improve error logging to prevent potential resource leaks from being silently ignored.
4a67c66 to
a83f768
Compare
adfcd37 to
d983408
Compare
a83f768 to
3d92384
Compare
3d92384 to
5678685
Compare
abrarsheikh
left a comment
There was a problem hiding this comment.
let's add a few more integration tests, ignore if any of these already exist
- serve.delete() on a gang app, verify PGs are cleaned up?
- Multiple gang deployments in one app
- One replica in a gang fails during startup. Both replicas in that gang are stopped; no partial gang left running.
- Running gang replica fails health check or crashes. Whole gang is torn down and restarted
9548459 to
6310a02
Compare
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
|
@abrarsheikh I addressed your comments regarding to tests -- ready for another pass. |
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| # Forcefully stop siblings to avoid partial gangs | ||
| self._stop_replica(replica, graceful_stop=False) | ||
| else: | ||
| self._replicas.add(state, replica) |
There was a problem hiding this comment.
Startup gang cleanup misses PENDING_MIGRATION siblings
Medium Severity
The gang sibling cleanup in _check_startup_replicas iterates over {original_state, ReplicaState.RUNNING} but omits ReplicaState.PENDING_MIGRATION. This is inconsistent with the health-check gang cleanup in check_and_update_replicas, which correctly iterates over [ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION]. If a gang member succeeds startup and transitions to RUNNING, then gets migrated to PENDING_MIGRATION, and its sibling subsequently fails startup, the migrating sibling won't be stopped — leaving a partial gang running.
Additional Locations (1)
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
|
@abrarsheikh I addressed your comments except for #61206 (comment) and also adjusted startup failure counting logic for gangs. Previously, in gang scheduling, each replica startup failure is counted towards the threshold, but I think counting failure per gang makes more sense. Startup failure occurs when there's an allocation (e.g. insufficient resources) or initialization (e.g. actor initialization) issue, and replicas in a gang could run into issues with the same root cause, and therefore the previous approach will inflate the failure count by gang_size. |
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
|
Addressed all comments :) |
## Description Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are. ### Approach - Scheduler (`deployment_scheduler.py`) - Added `schedule_gang_placement_groups` to DeploymentScheduler. - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang. - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists. - Gang reservation results are passed to the deployment state machine. - State Machine (`deployment_state.py`) - Introduced a new step in the update loop to reserve gang placement groups. - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids). - If any replica in a gang fails during startup, all replicas in that gang are stopped. - Gracefully handles placement group removal failures for shared gang placement groups. - Replica (`replica.py`) - Extended `ReplicaMetadata` to include `GangContext`. - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`. ## Related issues RFC: #60873 Precedent: #61205 Original PR: #60802 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: jeffreywang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: abrar <abrar@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description Adds fault tolerance for gang-scheduled deployments. - Implement RESTART_GANG runtime failure policy. - Exercise leaked gang placement group detection after controller recovery. ### Approach - Implement `RESTART_GANG` policy within health check handling (`deployment_state.py`) - Refactored the health check loop to track healthy and unhealthy replicas separately. - When RESTART_GANG is enabled and a replica fails its health check, all replicas, including the unhealthy ones and their healthy siblings, in the same gang are force-stopped so the entire gang can be rescheduled together. - Exercise leaked gang placement group detection `_detect_and_remove_leaked_placement_groups` - Extended existing leak detection to support gang placement groups. - A gang placement group is considered leaked only if no active actors reference it. Placement groups with live actors are preserved to avoid prematurely releasing resources. - GCS PG query failures are handled gracefully by skipping the leaked gang PG detection. ### Test Plan #### Unit Tests | Category | Test | Description | |-----------|------|-------------| | GangReservationResult fields | `TestScheduleGangPlacementGroups ::test_schedule_gang_placement_groups` | Calls real scheduler; asserts length, uniqueness, and `GANG_PG_NAME_PREFIX` | | GangReservationResult fields | `TestScaleDeploymentGangReplicas ::test_successful_gang_reservation` | Mocks result with `gang_ids` and `gang_pg_names`; asserts `gang_context.pg_name` in `gang_pg_names` | | Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_entire_gang_stopped` | Unhealthy replica → entire owning gang force-stopped & healthy gangs unaffected | | Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_force_stop_all_gang_replicas` | Unhealthy gang replicas are force-stopped regardless of `FORCE_STOP_UNHEALTHY_REPLICAS` | | Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_multiple_unhealthy_gang_replicas` | Multiple unhealthy replicas in same gang; verifies deduplication | | Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_multiple_gangs_failing` | Multiple gangs with unhealthy replicas are all stopped; verifies set accumulation | #### Integration Tests | Test | Description | |-------|----------| | `test_gang_health_check_restarts_gang` | Health check failure -> entire gang is torn down while surviving gang continues serving traffic with zero downtime -> deployment recovers to HEALTHY and both failed replicas are replaced | | `test_leaked_gang_pg_removed_on_controller_recovery` | Kill replicas on a gang PG -> restart controller -> leaked gang PG is detected and removed -> zero downtime throughout | | `TestGangControllerRecovery::test_gang_context_recovery` | Coexisting gang and non-gang deployments -> kill the controller -> GangContext and ReplicaContext are recovered -> apps / deployments return to RUNNING / HEALTHY state | | `TestGangPGLeakDetection ::test_gcs_failure_skip_pg_leak_detection` | GCS query failure -> cleanup skipped | #### Learnings from preceding PR - Integration tests now assert both deployment and app statuses - DeploymentScheduler tests now proceed the state machine to ensure that deployment returns to HEALTHY ## Related issues RFC: #60873 Precedent: #61206 Original PR: #60802 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: jeffreywang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>


Description
Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are.
Approach
Scheduler (
deployment_scheduler.py)schedule_gang_placement_groupsto DeploymentScheduler.State Machine (
deployment_state.py)_add_replicas_with_gang_scheduling()to start replicas with gang context (gang_id, rank, world_size, member_replica_ids).Replica (
replica.py)ReplicaMetadatato includeGangContext.check_ready().Related issues
RFC: #60873
Precedent: #61205
Original PR: #60802
Additional information