[serve][3/n] Gang scheduling -- core scheduling engine by jeffreywang88 · Pull Request #61206 · ray-project/ray

jeffreywang88 · 2026-02-20T19:25:50Z

Description

Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are.

Approach

- Scheduler (`deployment_scheduler.py`)
  - Added `schedule_gang_placement_groups` to `DeploymentScheduler`.
  - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang.
  - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists.
  - Gang reservation results are passed to the deployment state machine.
- State Machine (`deployment_state.py`)
  - Introduced a new step in the update loop to reserve gang placement groups.
  - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (`gang_id`, `rank`, `world_size`, `member_replica_ids`).
  - If any replica in a gang fails during startup, all replicas in that gang are stopped.
  - Gracefully handles placement group removal failures for shared gang placement groups.
- Replica (`replica.py`)
  - Extended `ReplicaMetadata` to include `GangContext`.
  - `ActorReplicaWrapper` now stores and exposes gang context and passes it through `check_ready()`.
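
To make the gang context and the all-or-nothing startup rule above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: `GangContextSketch`, `assign_gangs`, `replicas_to_stop`, and the `gang-<n>` naming are hypothetical stand-ins, and only the four field names come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class GangContextSketch:
    """Hypothetical stand-in for the gang context fields named in the description."""

    gang_id: str
    rank: int
    world_size: int
    member_replica_ids: List[str]


def assign_gangs(replica_ids: List[str], gang_size: int) -> Dict[str, GangContextSketch]:
    """Partition replicas into fixed-size gangs and assign a rank within each gang."""
    if gang_size <= 0 or len(replica_ids) % gang_size != 0:
        raise ValueError("replica count must be a positive multiple of gang_size")
    contexts: Dict[str, GangContextSketch] = {}
    for start in range(0, len(replica_ids), gang_size):
        members = replica_ids[start : start + gang_size]
        gang_id = f"gang-{start // gang_size}"  # assumed naming, for illustration only
        for rank, replica_id in enumerate(members):
            contexts[replica_id] = GangContextSketch(
                gang_id=gang_id,
                rank=rank,
                world_size=gang_size,
                member_replica_ids=list(members),
            )
    return contexts


def replicas_to_stop(
    contexts: Dict[str, GangContextSketch], failed_replica_id: str
) -> Set[str]:
    """If one member fails during startup, every member of its gang is stopped."""
    return set(contexts[failed_replica_id].member_replica_ids)
```

With four replicas and a gang size of two, a startup failure in `r2` selects both `r2` and `r3` for stopping, while the other gang is left alone.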

Related issues

RFC: #60873
Precedent: #61205
Original PR: #60802

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

gemini-code-assist

Code Review

This pull request introduces the core logic for gang scheduling in Ray Serve, a significant feature that enables atomic scheduling of replica groups. The changes span across the deployment scheduler, state machine, and replica logic, and are well-supported by a comprehensive suite of new tests.

The implementation correctly introduces a new step for reserving placement groups, updates the scheduling logic to utilize them, and handles gang-level failures to ensure atomicity. The approach is robust and the code is generally of high quality.

I have one suggestion to improve error logging to prevent potential resource leaks from being silently ignored.

abrarsheikh

let's add a few more integration tests, ignore if any of these already exist

serve.delete() on a gang app, verify PGs are cleaned up?
Multiple gang deployments in one app
One replica in a gang fails during startup. Both replicas in that gang are stopped; no partial gang left running.
Running gang replica fails health check or crashes. Whole gang is torn down and restarted

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

jeffreywang88 · 2026-02-26T20:34:23Z

@abrarsheikh I addressed your comments regarding tests -- ready for another pass.

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

…ing-part2-core

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-28T01:03:42Z

+                        # Forcefully stop siblings to avoid partial gangs
+                        self._stop_replica(replica, graceful_stop=False)
+                    else:
+                        self._replicas.add(state, replica)


Startup gang cleanup misses PENDING_MIGRATION siblings

Medium Severity

The gang sibling cleanup in _check_startup_replicas iterates over {original_state, ReplicaState.RUNNING} but omits ReplicaState.PENDING_MIGRATION. This is inconsistent with the health-check gang cleanup in check_and_update_replicas, which correctly iterates over [ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION]. If a gang member succeeds startup and transitions to RUNNING, then gets migrated to PENDING_MIGRATION, and its sibling subsequently fails startup, the migrating sibling won't be stopped — leaving a partial gang running.
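
The gap described above can be illustrated with a small standalone sketch. It does not reproduce `_check_startup_replicas`; the `State` enum, the `(replica_id, gang_id)` tuples, and `gang_siblings` are hypothetical, and only the state names come from the comment.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class State(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    PENDING_MIGRATION = "PENDING_MIGRATION"


Replica = Tuple[str, str]  # (replica_id, gang_id), a simplified stand-in


def gang_siblings(
    replicas_by_state: Dict[State, List[Replica]],
    gang_id: str,
    states: Iterable[State],
) -> List[str]:
    """Collect replica IDs of the given gang, looking only in the listed states."""
    return [
        replica_id
        for state in states
        for replica_id, replica_gang in replicas_by_state.get(state, [])
        if replica_gang == gang_id
    ]


replicas = {
    State.STARTING: [("r0", "g0")],           # this member fails startup
    State.PENDING_MIGRATION: [("r1", "g0")],  # sibling that is being migrated
    State.RUNNING: [("r2", "g1")],            # unrelated gang
}

# Searching only the original state and RUNNING misses the migrating sibling.
narrow = gang_siblings(replicas, "g0", [State.STARTING, State.RUNNING])
# Including PENDING_MIGRATION finds every member of the gang.
wide = gang_siblings(
    replicas, "g0", [State.STARTING, State.RUNNING, State.PENDING_MIGRATION]
)
```

Here `narrow` contains only `r0`, so `r1` would keep running as a partial gang, whereas `wide` contains both members.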

Additional Locations (1)

python/ray/serve/_private/deployment_state.py#L3617-L3633

This is resolved in #61216

abrarsheikh

nits

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 · 2026-03-01T07:17:21Z

@abrarsheikh I addressed your comments except for #61206 (comment) and also adjusted startup failure counting logic for gangs.

Previously, under gang scheduling, each replica startup failure counted toward the threshold, but I think counting failures per gang makes more sense. Startup failures come from allocation issues (e.g. insufficient resources) or initialization issues (e.g. actor initialization), and replicas in the same gang can fail from the same root cause, so the previous approach inflated the failure count by up to a factor of gang_size.
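
As a rough illustration of per-gang counting (not the logic in this PR), the sketch below counts each failed gang once and each non-gang replica once. The function name and the `(replica_id, gang_id)` input shape are assumptions made for the example.

```python
from typing import Iterable, Optional, Set, Tuple


def count_startup_failures(failures: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Count startup failures, treating all failures within one gang as a single failure.

    Each item is (replica_id, gang_id); gang_id is None for non-gang replicas.
    """
    seen_gangs: Set[str] = set()
    count = 0
    for _replica_id, gang_id in failures:
        if gang_id is None:
            count += 1  # non-gang replicas are still counted individually
        elif gang_id not in seen_gangs:
            seen_gangs.add(gang_id)
            count += 1  # first failure seen for this gang
    return count
```

With two failed replicas in gang `g0` and one failed standalone replica, this counts 2 failures, where per-replica counting would report 3.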

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 · 2026-03-01T07:46:40Z

Addressed all comments :)

## Description

Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are.

### Approach

- Scheduler (`deployment_scheduler.py`)
  - Added `schedule_gang_placement_groups` to DeploymentScheduler.
  - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang.
  - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists.
  - Gang reservation results are passed to the deployment state machine.
- State Machine (`deployment_state.py`)
  - Introduced a new step in the update loop to reserve gang placement groups.
  - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids).
  - If any replica in a gang fails during startup, all replicas in that gang are stopped.
  - Gracefully handles placement group removal failures for shared gang placement groups.
- Replica (`replica.py`)
  - Extended `ReplicaMetadata` to include `GangContext`.
  - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`.

## Related issues

RFC: #60873
Precedent: #61205
Original PR: #60802

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

## Description

Adds fault tolerance for gang-scheduled deployments.

- Implement RESTART_GANG runtime failure policy.
- Exercise leaked gang placement group detection after controller recovery.

### Approach

- Implement `RESTART_GANG` policy within health check handling (`deployment_state.py`)
  - Refactored the health check loop to track healthy and unhealthy replicas separately.
  - When RESTART_GANG is enabled and a replica fails its health check, all replicas in the same gang, both the unhealthy ones and their healthy siblings, are force-stopped so the entire gang can be rescheduled together.
- Exercise leaked gang placement group detection `_detect_and_remove_leaked_placement_groups`
  - Extended existing leak detection to support gang placement groups.
  - A gang placement group is considered leaked only if no active actors reference it. Placement groups with live actors are preserved to avoid prematurely releasing resources.
  - GCS PG query failures are handled gracefully by skipping the leaked gang PG detection.

### Test Plan

#### Unit Tests

| Category | Test | Description |
|-----------|------|-------------|
| GangReservationResult fields | `TestScheduleGangPlacementGroups ::test_schedule_gang_placement_groups` | Calls real scheduler; asserts length, uniqueness, and `GANG_PG_NAME_PREFIX` |
| GangReservationResult fields | `TestScaleDeploymentGangReplicas ::test_successful_gang_reservation` | Mocks result with `gang_ids` and `gang_pg_names`; asserts `gang_context.pg_name` in `gang_pg_names` |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_entire_gang_stopped` | Unhealthy replica → entire owning gang force-stopped & healthy gangs unaffected |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_force_stop_all_gang_replicas` | Unhealthy gang replicas are force-stopped regardless of `FORCE_STOP_UNHEALTHY_REPLICAS` |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_multiple_unhealthy_gang_replicas` | Multiple unhealthy replicas in same gang; verifies deduplication |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck ::test_restart_gang_multiple_gangs_failing` | Multiple gangs with unhealthy replicas are all stopped; verifies set accumulation |

#### Integration Tests

| Test | Description |
|-------|----------|
| `test_gang_health_check_restarts_gang` | Health check failure -> entire gang is torn down while surviving gang continues serving traffic with zero downtime -> deployment recovers to HEALTHY and both failed replicas are replaced |
| `test_leaked_gang_pg_removed_on_controller_recovery` | Kill replicas on a gang PG -> restart controller -> leaked gang PG is detected and removed -> zero downtime throughout |
| `TestGangControllerRecovery::test_gang_context_recovery` | Coexisting gang and non-gang deployments -> kill the controller -> GangContext and ReplicaContext are recovered -> apps / deployments return to RUNNING / HEALTHY state |
| `TestGangPGLeakDetection ::test_gcs_failure_skip_pg_leak_detection` | GCS query failure -> cleanup skipped |

#### Learnings from preceding PR

- Integration tests now assert both deployment and app statuses
- DeploymentScheduler tests now advance the state machine to confirm that the deployment returns to HEALTHY

## Related issues

RFC: #60873
Precedent: #61206
Original PR: #60802

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
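
The leak-detection rule in the fault tolerance description (a gang placement group is leaked only if no live actor references it, and detection is skipped when the GCS query fails) could be sketched roughly as follows. This is an assumption-laden illustration, not `_detect_and_remove_leaked_placement_groups`: the prefix value, the function name, and the callable-based query are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Set

# Assumed value for illustration; the real GANG_PG_NAME_PREFIX is not shown in this thread.
HYPOTHETICAL_GANG_PG_PREFIX = "gang_pg_"


def find_leaked_gang_pgs(
    list_pg_names: Callable[[], Iterable[str]],
    pg_names_with_live_actors: Set[str],
) -> Optional[Set[str]]:
    """Return gang PG names with no live actors, or None if the PG query fails.

    `list_pg_names` stands in for a GCS placement group query that may raise.
    """
    try:
        all_pg_names = set(list_pg_names())
    except Exception:
        # Query failed: skip detection rather than risk removing PGs that are in use.
        return None
    gang_pg_names = {
        name for name in all_pg_names if name.startswith(HYPOTHETICAL_GANG_PG_PREFIX)
    }
    # PGs that still host live actors are preserved.
    return gang_pg_names - pg_names_with_live_actors


def failing_query() -> Iterable[str]:
    raise RuntimeError("GCS unavailable")


leaked = find_leaked_gang_pgs(
    lambda: ["gang_pg_a", "gang_pg_b", "other_pg"], {"gang_pg_a"}
)
skipped = find_leaked_gang_pgs(failing_query, set())
```

Returning `None` on a failed query keeps "nothing leaked" (an empty set) distinct from "could not check", so a caller can skip cleanup in the second case.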

jeffreywang88 requested a review from a team as a code owner February 20, 2026 19:25

jeffreywang88 changed the title from "[serve][3/n] Gang scheduling - core scheduling engine" to "[serve][3/n] Gang scheduling -- core scheduling engine" Feb 20, 2026

gemini-code-assist Bot reviewed Feb 20, 2026

Comment thread python/ray/serve/_private/deployment_state.py Outdated

jeffreywang88 mentioned this pull request Feb 20, 2026

[serve][4/n] Gang scheduling -- fault tolerance #61207

Merged

cursor Bot reviewed Feb 20, 2026

Comment thread python/ray/serve/_private/deployment_state.py

Comment thread python/ray/serve/_private/deployment_state.py Outdated

jeffreywang88 requested a review from abrarsheikh February 20, 2026 22:29

ray-gardener Bot added the community-contribution Contributed by the community label Feb 21, 2026

jeffreywang88 added serve Ray Serve Related Issue and removed community-contribution Contributed by the community labels Feb 21, 2026

jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 4a67c66 to a83f768 Compare February 24, 2026 05:41

jeffreywang88 requested review from a team as code owners February 24, 2026 05:41

jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from adfcd37 to d983408 Compare February 24, 2026 05:42

jeffreywang88 removed request for a team February 24, 2026 05:43

jeffreywang88 force-pushed the gang-scheduling-part2-core branch from a83f768 to 3d92384 Compare February 26, 2026 02:59

jeffreywang88 requested review from a team as code owners February 26, 2026 02:59

jeffreywang88 changed the base branch from gang-scheduling-part1-validation to master February 26, 2026 03:00

jeffreywang88 removed request for a team February 26, 2026 03:06

jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 3d92384 to 5678685 Compare February 26, 2026 03:11

jeffreywang88 added the go add ONLY when ready to merge, run all tests label Feb 26, 2026

abrarsheikh reviewed Feb 26, 2026

jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 9548459 to 6310a02 Compare February 26, 2026 19:08

Self-review

e232ac7

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor Bot reviewed Feb 26, 2026

Comment thread python/ray/serve/_private/deployment_state.py

jeffreywang88 added 2 commits February 26, 2026 22:11

Fix CI

75b20de

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

Fix linter

84b99ae

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor Bot reviewed Feb 26, 2026

Comment thread python/ray/serve/_private/deployment_state.py Outdated

Comment thread python/ray/serve/_private/deployment_state.py

Comment thread python/ray/serve/_private/deployment_state.py

abrarsheikh reviewed Feb 27, 2026

Merge branch 'master' of github.com:ray-project/ray into gang-schedul…

d32a3e1

…ing-part2-core

cursor Bot reviewed Feb 27, 2026

Comment thread python/ray/serve/_private/deployment_state.py

jeffreywang88 mentioned this pull request Feb 27, 2026

[serve] Unify serve test synchronization patterns using Ray actors #61387

Closed

jeffreywang88 added 2 commits February 27, 2026 21:04

Move tests to suitable files

7da9497

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

Refactor & let state machine proceed to healthy state in unit tests

1070b08

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

jeffreywang88 commented Feb 27, 2026

Comment thread python/ray/serve/_private/deployment_state.py Outdated

cursor Bot reviewed Feb 27, 2026

Comment thread python/ray/serve/_private/deployment_state.py

Patch leftover CI fixt

3c998ab

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor Bot reviewed Feb 28, 2026

abrarsheikh reviewed Feb 28, 2026

Address nits & adjust gang startup failure count logic

cfa34d5

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Move schedule_gang_placement_groups to parent

8d70645

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

abrarsheikh approved these changes Mar 1, 2026

abrarsheikh merged commit b78afa9 into master Mar 1, 2026
6 checks passed

abrarsheikh deleted the gang-scheduling-part2-core branch March 1, 2026 20:59

2 participants
