
[serve][3/n] Gang scheduling -- core scheduling engine#61206

Merged
abrarsheikh merged 17 commits into master from gang-scheduling-part2-core
Mar 1, 2026

Conversation

@jeffreywang88

Contributor

Description

Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are.

Approach

  • Scheduler (deployment_scheduler.py)

    • Added schedule_gang_placement_groups to DeploymentScheduler.
    • The default scheduler now creates named gang placement groups and assigns replica ranks within each gang.
    • Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists.
    • Gang reservation results are passed to the deployment state machine.
  • State Machine (deployment_state.py)

    • Introduced a new step in the update loop to reserve gang placement groups.
    • Added _add_replicas_with_gang_scheduling() to start replicas with gang context (gang_id, rank, world_size, member_replica_ids).
    • If any replica in a gang fails during startup, all replicas in that gang are stopped.
    • Gracefully handles placement group removal failures for shared gang placement groups.
  • Replica (replica.py)

    • Extended ReplicaMetadata to include GangContext.
    • ActorReplicaWrapper now stores and exposes gang context and passes it through check_ready().
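
To make the gang context concrete, here is a minimal sketch of how replicas might be partitioned into gangs and ranked. The field names (gang_id, rank, world_size, member_replica_ids) come from the description above; the shape of the `GangContext` dataclass, the `assign_gang_contexts` helper, and the `gang-N` ID format are illustrative assumptions, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GangContext:
    """Hypothetical stand-in for the gang context attached to each replica."""

    gang_id: str
    rank: int
    world_size: int
    member_replica_ids: List[str]


def assign_gang_contexts(
    replica_ids: List[str], gang_size: int
) -> Dict[str, GangContext]:
    """Partition replica IDs into gangs of `gang_size` and rank each member."""
    if gang_size <= 0 or len(replica_ids) % gang_size != 0:
        raise ValueError("replica count must be a positive multiple of gang_size")

    contexts: Dict[str, GangContext] = {}
    for start in range(0, len(replica_ids), gang_size):
        members = replica_ids[start : start + gang_size]
        gang_id = f"gang-{start // gang_size}"  # assumed naming scheme
        for rank, replica_id in enumerate(members):
            contexts[replica_id] = GangContext(
                gang_id=gang_id,
                rank=rank,
                world_size=gang_size,
                member_replica_ids=list(members),
            )
    return contexts
```

Every member of a gang sees the same gang_id, world_size, and member list, and differs only in rank, which is what lets the members coordinate once they start together.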

Related issues

RFC: #60873
Precedent: #61205
Original PR: #60802

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@jeffreywang88 jeffreywang88 requested a review from a team as a code owner February 20, 2026 19:25
@jeffreywang88 jeffreywang88 changed the title [serve][3/n] Gang scheduling - core scheduling engine Feb 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces the core logic for gang scheduling in Ray Serve, a significant feature that enables atomic scheduling of replica groups. The changes span across the deployment scheduler, state machine, and replica logic, and are well-supported by a comprehensive suite of new tests.

The implementation correctly introduces a new step for reserving placement groups, updates the scheduling logic to utilize them, and handles gang-level failures to ensure atomicity. The approach is robust and the code is generally of high quality.

I have one suggestion to improve error logging to prevent potential resource leaks from being silently ignored.

Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/_private/deployment_state.py Outdated
@ray-gardener ray-gardener Bot added the community-contribution Contributed by the community label Feb 21, 2026
@jeffreywang88 jeffreywang88 added serve Ray Serve Related Issue and removed community-contribution Contributed by the community labels Feb 21, 2026
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 4a67c66 to a83f768 Compare February 24, 2026 05:41
@jeffreywang88 jeffreywang88 requested review from a team as code owners February 24, 2026 05:41
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from adfcd37 to d983408 Compare February 24, 2026 05:42
@jeffreywang88 jeffreywang88 removed request for a team February 24, 2026 05:43
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part2-core branch from a83f768 to 3d92384 Compare February 26, 2026 02:59
@jeffreywang88 jeffreywang88 requested review from a team as code owners February 26, 2026 02:59
@jeffreywang88 jeffreywang88 changed the base branch from gang-scheduling-part1-validation to master February 26, 2026 03:00
@jeffreywang88 jeffreywang88 removed request for a team February 26, 2026 03:06
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 3d92384 to 5678685 Compare February 26, 2026 03:11
@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label Feb 26, 2026

@abrarsheikh abrarsheikh left a comment

Contributor


let's add a few more integration tests, ignore if any of these already exist

  1. serve.delete() on a gang app, verify PGs are cleaned up?
  2. Multiple gang deployments in one app
  3. One replica in a gang fails during startup. Both replicas in that gang are stopped; no partial gang left running.
  4. Running gang replica fails health check or crashes. Whole gang is torn down and restarted
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/tests/test_gang_scheduling.py
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_scheduler.py
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part2-core branch from 9548459 to 6310a02 Compare February 26, 2026 19:08
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/deployment_state.py
@jeffreywang88

jeffreywang88 commented Feb 26, 2026

Contributor Author

@abrarsheikh I addressed your comments regarding tests -- ready for another pass.

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/tests/unit/common/mock_replica_actor_wrapper.py Outdated
Comment thread python/ray/serve/tests/unit/BUILD.bazel Outdated
Comment thread python/ray/serve/tests/unit/conftest.py
Comment thread python/ray/serve/tests/unit/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/test_gang_scheduling.py
Comment thread python/ray/serve/tests/test_gang_scheduling.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py
Comment thread python/ray/serve/_private/deployment_state.py
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/deployment_state.py Outdated

@cursor cursor Bot left a comment



Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread python/ray/serve/_private/deployment_state.py
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

@cursor cursor Bot left a comment



Cursor Bugbot has reviewed your changes and found 1 potential issue.


        # Forcefully stop siblings to avoid partial gangs
        self._stop_replica(replica, graceful_stop=False)
    else:
        self._replicas.add(state, replica)



Startup gang cleanup misses PENDING_MIGRATION siblings

Medium Severity

The gang sibling cleanup in _check_startup_replicas iterates over {original_state, ReplicaState.RUNNING} but omits ReplicaState.PENDING_MIGRATION. This is inconsistent with the health-check gang cleanup in check_and_update_replicas, which correctly iterates over [ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION]. If a gang member succeeds startup and transitions to RUNNING, then gets migrated to PENDING_MIGRATION, and its sibling subsequently fails startup, the migrating sibling won't be stopped — leaving a partial gang running.
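
The gap described above can be illustrated with a simplified model. This is only a sketch: the three-member `ReplicaState` enum, the `Replica` dataclass, and the `siblings_to_stop` helper are hypothetical stand-ins, not the actual Serve classes or the code in `_check_startup_replicas`.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, List, Set


class ReplicaState(Enum):
    # Simplified subset of states, assumed for illustration only.
    STARTING = auto()
    RUNNING = auto()
    PENDING_MIGRATION = auto()


@dataclass
class Replica:
    replica_id: str
    gang_id: str
    state: ReplicaState


def siblings_to_stop(
    replicas: Iterable[Replica], failed: Replica, states: Set[ReplicaState]
) -> List[Replica]:
    """Return gang siblings of `failed` whose state is in `states`."""
    return [
        r
        for r in replicas
        if r.gang_id == failed.gang_id
        and r.replica_id != failed.replica_id
        and r.state in states
    ]


replicas = [
    Replica("r0", "g0", ReplicaState.STARTING),  # this one fails startup
    Replica("r1", "g0", ReplicaState.PENDING_MIGRATION),  # migrating sibling
]

# Scanning only {STARTING, RUNNING} misses the migrating sibling.
missed = siblings_to_stop(
    replicas, replicas[0], {ReplicaState.STARTING, ReplicaState.RUNNING}
)

# Including PENDING_MIGRATION finds it, so no partial gang is left running.
found = siblings_to_stop(
    replicas,
    replicas[0],
    {ReplicaState.STARTING, ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION},
)
```

With the narrower state set `missed` is empty, while `found` contains `r1`, which mirrors the inconsistency between the startup path and the health-check path.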

Additional Locations (1)


Contributor Author


This is resolved in #61216

@abrarsheikh abrarsheikh left a comment

Contributor


nits

Comment thread python/ray/serve/_private/deployment_scheduler.py
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_state.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/unit/test_deployment_state.py
Comment thread python/ray/serve/tests/unit/test_deployment_state.py Outdated
Comment thread python/ray/serve/tests/BUILD.bazel Outdated
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88

Contributor Author

@abrarsheikh I addressed your comments except for #61206 (comment) and also adjusted startup failure counting logic for gangs.

Previously, under gang scheduling, every replica startup failure counted toward the threshold, but I think counting failures per gang makes more sense. A startup failure occurs when allocation (e.g. insufficient resources) or initialization (e.g. actor initialization) goes wrong, and replicas in the same gang can fail for the same root cause, so the previous approach could inflate the failure count by a factor of gang_size.
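
A rough sketch of the per-gang counting idea, for illustration only. The `count_startup_failures` helper and its `(replica_id, gang_id)` input format are hypothetical and do not reflect how the state machine actually tracks failures.

```python
from typing import Iterable, Optional, Set, Tuple


def count_startup_failures(
    failures: Iterable[Tuple[str, Optional[str]]]
) -> int:
    """Count startup failures, charging each gang at most once.

    Each item is (replica_id, gang_id); gang_id is None for replicas
    that are not gang-scheduled, which are counted individually.
    """
    seen_gangs: Set[str] = set()
    count = 0
    for _replica_id, gang_id in failures:
        if gang_id is None:
            count += 1
        elif gang_id not in seen_gangs:
            seen_gangs.add(gang_id)
            count += 1
    return count


# Both members of gang "g0" fail for the same reason: counted once,
# not twice. The non-gang replica is counted on its own.
total = count_startup_failures([("r0", "g0"), ("r1", "g0"), ("r2", None)])
```

Counting per replica would report three failures for this input; counting per gang reports two.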

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88

Contributor Author

Addressed all comments :)

@abrarsheikh abrarsheikh merged commit b78afa9 into master Mar 1, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the gang-scheduling-part2-core branch March 1, 2026 20:59
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
## Description
Implements the core gang scheduling logic: Gang-scheduled deployments
atomically reserve placement groups for groups of replicas and start
them together, ensuring all members of a gang are co-scheduled or none
are.

### Approach
- Scheduler (`deployment_scheduler.py`)
  - Added `schedule_gang_placement_groups` to DeploymentScheduler.
  - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang.
  - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists.
  - Gang reservation results are passed to the deployment state machine.

- State Machine (`deployment_state.py`)
  - Introduced a new step in the update loop to reserve gang placement groups.
  - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids).
  - If any replica in a gang fails during startup, all replicas in that gang are stopped.
  - Gracefully handles placement group removal failures for shared gang placement groups.

- Replica (`replica.py`)
  - Extended `ReplicaMetadata` to include `GangContext`.
  - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`.

## Related issues
RFC: #60873
Precedent: #61205
Original PR: #60802

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
abrarsheikh pushed a commit that referenced this pull request Mar 4, 2026
## Description
Adds fault tolerance for gang-scheduled deployments.
- Implement RESTART_GANG runtime failure policy.
- Exercise leaked gang placement group detection after controller
recovery.

### Approach
- Implement `RESTART_GANG` policy within health check handling (`deployment_state.py`)
  - Refactored the health check loop to track healthy and unhealthy replicas separately.
  - When RESTART_GANG is enabled and a replica fails its health check, all replicas, including the unhealthy ones and their healthy siblings, in the same gang are force-stopped so the entire gang can be rescheduled together.
- Exercise leaked gang placement group detection `_detect_and_remove_leaked_placement_groups`
  - Extended existing leak detection to support gang placement groups.
  - A gang placement group is considered leaked only if no active actors reference it. Placement groups with live actors are preserved to avoid prematurely releasing resources.
  - GCS PG query failures are handled gracefully by skipping the leaked gang PG detection.
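
The leak rule above can be sketched in plain Python. This is a simplified model under stated assumptions: `find_leaked_gang_pgs`, the `list_gang_pgs` callable, and the `actors_by_pg` mapping are hypothetical and do not correspond to real Ray or GCS APIs.

```python
from typing import Callable, Dict, List, Set


def find_leaked_gang_pgs(
    list_gang_pgs: Callable[[], List[str]],
    actors_by_pg: Dict[str, Set[str]],
) -> List[str]:
    """Return gang placement group names that no live actor references.

    `list_gang_pgs` stands in for the placement group query; if it
    raises, detection is skipped for this round rather than guessing.
    """
    try:
        pg_names = list_gang_pgs()
    except Exception:
        return []  # query failed: skip detection, remove nothing
    # A PG with at least one live actor is preserved.
    return [name for name in pg_names if not actors_by_pg.get(name)]


def failing_query() -> List[str]:
    raise RuntimeError("simulated placement group query failure")


leaked = find_leaked_gang_pgs(
    lambda: ["gang-pg-a", "gang-pg-b"],
    {"gang-pg-a": {"actor-1"}},
)
skipped = find_leaked_gang_pgs(failing_query, {})
```

Returning an empty list on a failed query is the conservative choice here: a group that is wrongly kept only holds resources a little longer, while a group that is wrongly removed would take live replicas down with it.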

### Test Plan
#### Unit Tests

| Category | Test | Description |
|-----------|------|-------------|
| GangReservationResult fields | `TestScheduleGangPlacementGroups::test_schedule_gang_placement_groups` | Calls real scheduler; asserts length, uniqueness, and `GANG_PG_NAME_PREFIX` |
| GangReservationResult fields | `TestScaleDeploymentGangReplicas::test_successful_gang_reservation` | Mocks result with `gang_ids` and `gang_pg_names`; asserts `gang_context.pg_name` in `gang_pg_names` |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck::test_restart_gang_entire_gang_stopped` | Unhealthy replica → entire owning gang force-stopped & healthy gangs unaffected |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck::test_restart_gang_force_stop_all_gang_replicas` | Unhealthy gang replicas are force-stopped regardless of `FORCE_STOP_UNHEALTHY_REPLICAS` |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck::test_restart_gang_multiple_unhealthy_gang_replicas` | Multiple unhealthy replicas in same gang; verifies deduplication |
| Gang-aware `check_and_update_replicas` | `TestGangHealthCheck::test_restart_gang_multiple_gangs_failing` | Multiple gangs with unhealthy replicas are all stopped; verifies set accumulation |
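
The deduplication and set-accumulation behavior that the `TestGangHealthCheck` rows describe could look roughly like the sketch below. The helper names and the `gang_of` mapping are hypothetical; this models only the gang-selection step, not the real `check_and_update_replicas` logic, and it ignores non-gang replicas.

```python
from typing import Dict, Iterable, List, Set


def gangs_to_restart(
    unhealthy_replica_ids: Iterable[str], gang_of: Dict[str, str]
) -> Set[str]:
    """Collect the gangs owning at least one unhealthy replica.

    Using a set means two unhealthy replicas in the same gang select
    that gang once, and several failing gangs accumulate together.
    """
    return {gang_of[r] for r in unhealthy_replica_ids if r in gang_of}


def replicas_to_force_stop(
    all_replica_ids: Iterable[str],
    unhealthy_replica_ids: Iterable[str],
    gang_of: Dict[str, str],
) -> List[str]:
    """Return every member, healthy or not, of each gang being restarted."""
    doomed = gangs_to_restart(unhealthy_replica_ids, gang_of)
    return [r for r in all_replica_ids if gang_of.get(r) in doomed]


gang_of = {"r0": "g0", "r1": "g0", "r2": "g1", "r3": "g1", "r4": "g2", "r5": "g2"}
stopped = replicas_to_force_stop(
    ["r0", "r1", "r2", "r3", "r4", "r5"],
    ["r0", "r1", "r2"],  # two failures in g0, one in g1
    gang_of,
)
```

In this example gangs `g0` and `g1` are stopped in full, including the healthy `r3`, while `g2` is left untouched.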


#### Integration Tests

| Test | Description |
|-------|----------|
| `test_gang_health_check_restarts_gang` | Health check failure -> entire gang is torn down while surviving gang continues serving traffic with zero downtime -> deployment recovers to HEALTHY and both failed replicas are replaced |
| `test_leaked_gang_pg_removed_on_controller_recovery` | Kill replicas on a gang PG -> restart controller -> leaked gang PG is detected and removed -> zero downtime throughout |
| `TestGangControllerRecovery::test_gang_context_recovery` | Coexisting gang and non-gang deployments -> kill the controller -> GangContext and ReplicaContext are recovered -> apps / deployments return to RUNNING / HEALTHY state |
| `TestGangPGLeakDetection::test_gcs_failure_skip_pg_leak_detection` | GCS query failure -> cleanup skipped |

#### Learnings from preceding PR
- Integration tests now assert both deployment and app statuses.
- DeploymentScheduler tests now advance the state machine to ensure that the deployment returns to HEALTHY.

## Related issues
RFC: #60873
Precedent: #61206
Original PR: #60802

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Labels

go (add ONLY when ready to merge, run all tests)
serve (Ray Serve Related Issue)

2 participants