
[serve][2/n] Gang scheduling -- validation and utilities#61205

Merged
abrarsheikh merged 11 commits into
ray-project:masterfrom
jeffreywang88:gang-scheduling-part1-validation
Feb 26, 2026

Conversation

@jeffreywang88

@jeffreywang88 jeffreywang88 commented Feb 20, 2026

Contributor

Description

This PR lays the groundwork for gang scheduling in Ray Serve by extending the data model, adding config validation rules, and introducing a utility for querying active placement group IDs from GCS.

Test plan

  • test_schema.py: invalid gang_size, mutual exclusivity with max_replicas_per_node and placement_group_strategy
  • test_api.py: @serve.deployment rejects gang config combined with max_replicas_per_node / placement_group_strategy
  • test_application_state.py: override gang config via .options(), reject invalid gang size multiple
  • test_util.py: verifies the GCS query utility
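
The rules exercised by these tests can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `GangConfigSketch` and `validate_deployment` are hypothetical names, the error messages are made up, and the lower bound on `gang_size` is an assumption (the PR only says that invalid values are rejected).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GangConfigSketch:
    """Hypothetical stand-in for the gang config; not the Ray Serve class."""

    gang_size: int


def validate_deployment(
    num_replicas: int,
    gang: Optional[GangConfigSketch] = None,
    max_replicas_per_node: Optional[int] = None,
    placement_group_strategy: Optional[str] = None,
) -> None:
    """Raise ValueError if the gang settings conflict with other options."""
    if gang is None:
        return  # nothing gang-related to validate
    # Assumed bound: a gang must contain at least one replica.
    if gang.gang_size < 1:
        raise ValueError("gang_size must be a positive integer")
    # Replicas are started in whole gangs, so the count must divide evenly.
    if num_replicas % gang.gang_size != 0:
        raise ValueError("num_replicas must be a multiple of gang_size")
    # Mutual exclusivity with the two options named in the test plan.
    if max_replicas_per_node is not None:
        raise ValueError("max_replicas_per_node cannot be set with a gang config")
    if placement_group_strategy is not None:
        raise ValueError("placement_group_strategy cannot be set with a gang config")
```

For example, `validate_deployment(4, GangConfigSketch(gang_size=2))` passes, while `validate_deployment(3, GangConfigSketch(gang_size=2))` raises.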

Related issues

RFC: #60873
Original PR: #60802

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@jeffreywang88 jeffreywang88 requested a review from a team as a code owner February 20, 2026 19:12
@ray-gardener ray-gardener Bot added the community-contribution Contributed by the community label Feb 20, 2026
Comment thread python/ray/serve/_private/utils.py Outdated
Comment thread python/ray/serve/_private/common.py

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request lays the groundwork for gang scheduling in Ray Serve by extending the data model and adding essential validation rules. The enforcement of mutual exclusivity between gang scheduling and conflicting placement group strategies is well-implemented. However, there is a critical bug in the get_active_placement_group_ids utility function where it attempts to use HasField on a scalar bytes field, which is not supported in proto3 and will cause a crash. Additionally, the utility should filter out Nil IDs to accurately identify active placement groups. There is also a minor validation gap in the deployment decorator regarding replica multiples.
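
One way to avoid a presence check on a scalar field is to compare the raw value against the "no placement group" sentinel. The sketch below shows that filter on plain `bytes` values. The helper names are hypothetical, and treating empty or all-0xFF bytes as Nil is an assumption about the ID encoding, not something confirmed in this thread.

```python
from typing import Iterable, Set


def is_nil_id(raw: bytes) -> bool:
    """Return True for empty or all-0xFF bytes (assumed Nil convention)."""
    return len(raw) == 0 or all(b == 0xFF for b in raw)


def active_pg_ids(raw_ids: Iterable[bytes]) -> Set[str]:
    """Return hex IDs for entries that actually reference a placement group."""
    return {raw.hex() for raw in raw_ids if not is_nil_id(raw)}
```

With this approach no `HasField` call is needed, because an unset `bytes` field simply reads back as its default value.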

Comment thread python/ray/serve/_private/utils.py Outdated
@jeffreywang88 jeffreywang88 changed the title [serve][2/n] Gang scheduling: validation and utilities Feb 20, 2026
@jeffreywang88 jeffreywang88 added go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue and removed community-contribution Contributed by the community labels Feb 20, 2026
Comment thread python/ray/serve/deployment.py
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from 45ead0a to d983408 Compare February 24, 2026 05:09
Comment thread python/ray/serve/tests/test_util.py
Comment thread python/ray/serve/deployment.py
return values

@root_validator
def validate_placement_group_strategy_and_gang_scheduling_config(cls, values):

Contributor

I am assuming this will be added later?

@jeffreywang88 jeffreywang88 Feb 24, 2026

Contributor Author

Actually no, we'd prefer setting the placement strategy through GangSchedulingConfig.gang_placement_strategy.

return live_pg_names


def get_active_placement_group_ids() -> Set[str]:

Contributor

I need to learn more about why we need this. We need to figure out a way around it: (a) it uses a private API from serve, which is disallowed at the project level, and (b) it appears to make an RPC call to GCS, so we cannot call it on every iteration of the controller loop.

Contributor Author

Great callout, I hadn't thought about the GCS aspect. Luckily, we will not call this in every controller loop. Starting from #61207, get_active_placement_group_ids will be invoked in _detect_and_remove_leaked_placement_groups only upon controller recovery, so I don't think we need to worry about the performance aspect.

It'll look something like this:

        gang_pg_names_in_cluster = [
            name
            for name in all_current_placement_group_names
            if name.startswith(GANG_PG_NAME_PREFIX)
        ]
        if gang_pg_names_in_cluster:
            pg_table = ray.util.placement_group_table()
            gang_pg_name_to_id: Dict[str, str] = {}
            for pg_id_hex, entry in pg_table.items():
                name = entry.get("name", "")
                if name.startswith(GANG_PG_NAME_PREFIX):
                    gang_pg_name_to_id[name] = pg_id_hex

            try:
                occupied_pg_ids = get_active_placement_group_ids()
            except Exception:
                logger.warning(
                    "Skipping gang PG leak detection due to GCS query failure.",
                    exc_info=True,
                )
            else:
                for gang_pg_name in gang_pg_names_in_cluster:
                    pg_id = gang_pg_name_to_id.get(gang_pg_name)
                    if pg_id is not None and pg_id not in occupied_pg_ids:
                        leaked_pg_names.append(gang_pg_name)
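
The core of that check can be isolated into a pure function, which makes the rule easier to see and to test: a gang PG counts as leaked when its ID is not in the set of IDs still referenced by actors. This is a sketch only. `find_leaked_gang_pgs` is a hypothetical helper, the prefix value is invented for illustration, and `pg_table` is assumed to map a hex ID to a dict with an optional `"name"` key, as in the snippet above.

```python
from typing import Dict, List, Set

GANG_PG_NAME_PREFIX = "SERVE_GANG_"  # hypothetical value, for illustration only


def find_leaked_gang_pgs(
    pg_table: Dict[str, dict], occupied_pg_ids: Set[str]
) -> List[str]:
    """Return names of gang PGs that no live actor references."""
    leaked = []
    for pg_id_hex, entry in pg_table.items():
        name = entry.get("name", "")
        # Only gang PGs are candidates; unnamed or foreign PGs are ignored.
        if name.startswith(GANG_PG_NAME_PREFIX) and pg_id_hex not in occupied_pg_ids:
            leaked.append(name)
    return sorted(leaked)
```

Keeping the GCS query outside this function means a query failure can be handled by the caller, as the `try`/`except` above does, without touching the comparison logic.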

@jeffreywang88 jeffreywang88 Feb 24, 2026

Contributor Author

Which private APIs from serve are you referring to?

Contributor

from ray._private.state import state

Contributor Author

gotcha, let me fix this

Contributor Author

Addressed in the latest commit. Also added logic to detect only PGs referenced by serve actors, to avoid deleting non-serve PGs. (The deletion logic will be introduced in a follow-up PR.)

@abrarsheikh

Contributor

please merge master

…rt per-replica PG

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from d983408 to 38191f1 Compare February 24, 2026 18:32
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/utils.py

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

"gang_scheduling_config is provided."
),
):
DeploymentSchema.parse_obj(deployment_schema)


Schema test missing num_replicas causes wrong validation error

Medium Severity

Both test_mutually_exclusive_max_replicas_per_node_and_gang_scheduling_config and test_mutually_exclusive_placement_group_strategy_and_gang_scheduling_config set gang_size=2 without setting num_replicas. If get_minimal_deployment_schema() returns a schema with default num_replicas=1, then 1 % 2 != 0 triggers the "num_replicas must be a multiple of gang_size" root validator before the mutual-exclusivity validators, causing both tests to fail with an unexpected error message. Compare with the nearby test_gang_scheduling_config_invalid_num_replicas which correctly sets num_replicas=4.
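
The ordering effect described here can be reproduced with two toy checks run in sequence: the first failing check decides the error message. The check functions, messages, and the default of one replica are assumptions made for illustration, not the actual schema validators.

```python
from typing import Callable, List, Optional

Check = Callable[[dict], Optional[str]]


def multiple_of_gang_size(cfg: dict) -> Optional[str]:
    """Fail when the replica count does not divide evenly into gangs."""
    if cfg.get("num_replicas", 1) % cfg["gang_size"] != 0:
        return "num_replicas must be a multiple of gang_size"
    return None


def exclusive_with_max_replicas_per_node(cfg: dict) -> Optional[str]:
    """Fail when max_replicas_per_node is combined with a gang config."""
    if cfg.get("max_replicas_per_node") is not None:
        return "max_replicas_per_node cannot be set with a gang config"
    return None


CHECKS: List[Check] = [multiple_of_gang_size, exclusive_with_max_replicas_per_node]


def first_error(cfg: dict) -> Optional[str]:
    """Run the checks in order and return the first failure message, if any."""
    for check in CHECKS:
        message = check(cfg)
        if message is not None:
            return message
    return None
```

Under these assumptions, a config with `gang_size=2`, `max_replicas_per_node=1`, and no `num_replicas` reports the multiple-of-gang-size error; adding `num_replicas=4` lets the mutual-exclusivity error surface instead.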

Additional Locations (1)


Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
@jeffreywang88

jeffreywang88 commented Feb 25, 2026

Contributor Author

In utils.py, importing from ray.util.state at the top level causes the linter to discover and scan ray.util.state.common when it traverses ray.serve, and many APIs under ray.util.state.common violate the API annotation linter.

Therefore, I moved the imports inside get_active_placement_group_ids. This helper is only called upon controller recovery, so the deferred import should be fine, and adding API annotations for classes under ray.util.state.common is out of scope for this PR.

Here's the buildkite failure: https://buildkite.com/ray-project/premerge/builds/60808/steps/canvas?sid=019c9261-214a-4c16-b979-950cf4674171&tab=output.
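
As a generic illustration of the deferred-import pattern (using a standard-library module rather than `ray.util.state`, so the function name and module choice are purely illustrative): the import statement runs only when the function is first called, so importing the enclosing module does not load the dependency.

```python
import sys


def hue_of(r: float, g: float, b: float) -> float:
    """Convert RGB to HSV and return the hue component."""
    # Deferred import: colorsys is loaded on the first call,
    # not when the enclosing module is imported.
    import colorsys

    return colorsys.rgb_to_hsv(r, g, b)[0]


hue = hue_of(1.0, 0.0, 0.0)
# After the call, the module is registered in sys.modules.
loaded_after_call = "colorsys" in sys.modules
```

The trade-off is a small one-time cost on the first call and an import that is less visible to readers, which is why the pattern suits rarely used helpers.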

Comment on lines +533 to +534
from ray.util.state import list_actors
from ray.util.state.common import RAY_MAX_LIMIT_FROM_API_SERVER

Contributor

move imports to top

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh

Contributor

> In utils.py, importing from ray.util.state at the top level causes the linter to discover and scan ray.util.state.common when it traverses ray.serve, and there are a lot of APIs under ray.util.state.common that violate API annotation linter.

Why does the docs API annotation linter discover ray.util.state?

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
@jeffreywang88

Contributor Author

Ray Train adopts the same pattern of importing ray.util.state.list_actors. The reason why the linter doesn't complain about that is:

  • import ray.train (train.__init__.py) does NOT transitively load ray.train.v2._internal.state.util, so ray.util.state never enters sys.modules. The linter never discovers it.
  • import ray.serve DOES transitively load ray.serve._private.utils (code), which has the top-level from ray.util.state import list_actors. This pulls ray.util.state into sys.modules. Then when the linter scans ray.util, it finds ray.util.state and flags all its unannotated symbols.
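
The effect of a transitive import on `sys.modules` can be seen with a standard-library package. This sketch assumes, as the explanation above does, that discovery is driven by what is already loaded; `loaded_under` is a hypothetical helper and says nothing about how the real linter is implemented.

```python
import json  # json/__init__.py itself imports json.decoder and json.encoder
import sys
from typing import List


def loaded_under(prefix: str) -> List[str]:
    """List already-imported modules that are the package or live under it."""
    return sorted(
        name
        for name in list(sys.modules)  # copy keys so iteration is stable
        if name == prefix or name.startswith(prefix + ".")
    )


json_modules = loaded_under("json")
```

Here `json.decoder` appears in the result even though this file never imported it directly, which mirrors how a top-level import in one module can expose another package to a scanner.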
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
@abrarsheikh abrarsheikh merged commit 2745feb into ray-project:master Feb 26, 2026
6 checks passed
abrarsheikh added a commit that referenced this pull request Mar 1, 2026
## Description
Implements the core gang scheduling logic: Gang-scheduled deployments
atomically reserve placement groups for groups of replicas and start
them together, ensuring all members of a gang are co-scheduled or none
are.

### Approach
- Scheduler (`deployment_scheduler.py`)
  - Added `schedule_gang_placement_groups` to DeploymentScheduler.
  - The default scheduler now creates named gang placement groups and
    assigns replica ranks within each gang.
  - Replica scheduling checks for a gang placement group first, and falls
    back to per-replica placement if none exists.
  - Gang reservation results are passed to the deployment state machine.

- State Machine (`deployment_state.py`)
  - Introduced a new step in the update loop to reserve gang placement
    groups.
  - Added `_add_replicas_with_gang_scheduling()` to start replicas with
    gang context (gang_id, rank, world_size, member_replica_ids).
  - If any replica in a gang fails during startup, all replicas in that
    gang are stopped.
  - Gracefully handles placement group removal failures for shared gang
    placement groups.

- Replica (`replica.py`)
  - Extended `ReplicaMetadata` to include `GangContext`.
  - ActorReplicaWrapper now stores and exposes gang context and passes it
    through `check_ready()`.
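
A rough sketch of the rank assignment and all-or-nothing stop rule described above, under stated assumptions: `GangContextSketch`, `assign_gangs`, and `replicas_to_stop` are hypothetical names, the `gang-N` ID format is invented, and only the four field names come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class GangContextSketch:
    """Illustrative container using the field names from the commit message."""

    gang_id: str
    rank: int
    world_size: int
    member_replica_ids: Tuple[str, ...]


def assign_gangs(replica_ids: List[str], gang_size: int) -> List[GangContextSketch]:
    """Split replicas into consecutive gangs and give each member a rank."""
    if gang_size < 1 or len(replica_ids) % gang_size != 0:
        raise ValueError("replica count must be a positive multiple of gang_size")
    contexts = []
    for start in range(0, len(replica_ids), gang_size):
        members = tuple(replica_ids[start : start + gang_size])
        gang_id = f"gang-{start // gang_size}"  # hypothetical ID format
        for rank in range(len(members)):
            contexts.append(GangContextSketch(gang_id, rank, gang_size, members))
    return contexts


def replicas_to_stop(
    contexts: List[GangContextSketch], failed_replica_id: str
) -> Set[str]:
    """Return every member of the gang that contains the failed replica."""
    to_stop: Set[str] = set()
    for ctx in contexts:
        if failed_replica_id in ctx.member_replica_ids:
            to_stop.update(ctx.member_replica_ids)
    return to_stop
```

For four replicas and `gang_size=2`, a startup failure of the third replica marks only the second gang's two members for stopping; the first gang is unaffected.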

## Related issues
RFC: #60873
Precedent: #61205
Original PR: #60802

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants