[serve][2/n] Gang scheduling -- validation and utilities by jeffreywang88 · Pull Request #61205 · ray-project/ray

jeffreywang88 · 2026-02-20T19:12:38Z

Description

This PR lays the groundwork for gang scheduling in Ray Serve by extending the data model, adding config validation rules, and introducing a utility for querying active placement group IDs from GCS.

Test plan

test_schema.py: invalid gang_size, mutual exclusivity with max_replicas_per_node and placement_group_strategy
test_api.py: @serve.deployment rejects gang config combined with max_replicas_per_node / placement_group_strategy
test_application_state.py: override gang config via .options(), reject invalid gang size multiple
test_util.py: verifies the GCS query utility
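The rules these tests exercise can be sketched as a plain function. This is a minimal illustration, not the Serve implementation: `GangConfigSketch` and `validate_deployment_options` are hypothetical names, the positive-integer rule for `gang_size` is an assumption about what "invalid gang_size" means, and the order of the checks is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GangConfigSketch:
    """Hypothetical stand-in for the gang scheduling config (not the Serve class)."""

    gang_size: int


def validate_deployment_options(
    num_replicas: int,
    gang_config: Optional[GangConfigSketch] = None,
    max_replicas_per_node: Optional[int] = None,
    placement_group_strategy: Optional[str] = None,
) -> None:
    """Raise ValueError if the options break one of the gang scheduling rules."""
    if gang_config is None:
        # Without a gang config, none of the gang rules apply.
        return
    # Assumption: a valid gang_size is a positive integer.
    if gang_config.gang_size < 1:
        raise ValueError("gang_size must be a positive integer.")
    # Gang scheduling is mutually exclusive with max_replicas_per_node ...
    if max_replicas_per_node is not None:
        raise ValueError(
            "max_replicas_per_node cannot be set when "
            "gang_scheduling_config is provided."
        )
    # ... and with placement_group_strategy.
    if placement_group_strategy is not None:
        raise ValueError(
            "placement_group_strategy cannot be set when "
            "gang_scheduling_config is provided."
        )
    # Replicas must divide evenly into gangs.
    if num_replicas % gang_config.gang_size != 0:
        raise ValueError("num_replicas must be a multiple of gang_size.")
```

Calling `validate_deployment_options(num_replicas=4, gang_config=GangConfigSketch(gang_size=2))` returns without error, while adding `max_replicas_per_node=1` to the same call raises `ValueError`.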

Related issues

RFC: #60873
Original PR: #60802

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

gemini-code-assist

Code Review

This pull request lays the groundwork for gang scheduling in Ray Serve by extending the data model and adding essential validation rules. The enforcement of mutual exclusivity between gang scheduling and conflicting placement group strategies is well-implemented. However, there is a critical bug in the get_active_placement_group_ids utility function where it attempts to use HasField on a scalar bytes field, which is not supported in proto3 and will cause a crash. Additionally, the utility should filter out Nil IDs to accurately identify active placement groups. There is also a minor validation gap in the deployment decorator regarding replica multiples.
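One way to act on the review's two points without calling `HasField` is to inspect the raw bytes directly. In proto3, an unset scalar `bytes` field reads back as `b""`, so an emptiness check covers the "unset" case. The sketch below also treats an ID made up entirely of `0xFF` bytes as Nil; that encoding is an assumption about Ray's ID format, and both function names are hypothetical.

```python
from typing import Iterable, Set

# Assumption: a Nil ID is encoded as a run of 0xFF bytes.
NIL_BYTE = 0xFF


def is_real_placement_group_id(raw_id: bytes) -> bool:
    """Return True if raw_id is neither unset (empty) nor a Nil ID."""
    if not raw_id:
        # An unset proto3 bytes field reads back as b"".
        return False
    # A Nil ID has every byte equal to NIL_BYTE.
    return any(byte != NIL_BYTE for byte in raw_id)


def active_placement_group_ids(raw_ids: Iterable[bytes]) -> Set[str]:
    """Collect the hex form of every ID that is set and not Nil."""
    return {raw.hex() for raw in raw_ids if is_real_placement_group_id(raw)}
```

For example, `active_placement_group_ids([b"", b"\xff" * 18, bytes.fromhex("0a0b")])` keeps only the last entry.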

abrarsheikh · 2026-02-24T05:57:14Z

+        return values
+
+    @root_validator
+    def validate_placement_group_strategy_and_gang_scheduling_config(cls, values):


i am assuming this will be added later?

Actually no, we'd prefer setting the placement strategy through GangSchedulingConfig.gang_placement_strategy.

abrarsheikh · 2026-02-24T06:08:57Z

    return live_pg_names


+def get_active_placement_group_ids() -> Set[str]:


I need to learn more about why we need this. We need to figure out a way around it: (a) it uses a private API from serve, which is disallowed at the project level, and (b) it appears to make an RPC call to GCS, so we cannot call it on every controller loop iteration.

Great callout, I hadn't thought about the GCS aspect. Luckily, we will not call this in every controller loop. Starting from #61207, get_active_placement_group_ids will be invoked in _detect_and_remove_leaked_placement_groups only upon controller recovery, so I guess we don't need to worry about the performance aspect.

It'll look something like this:

    gang_pg_names_in_cluster = [
        name
        for name in all_current_placement_group_names
        if name.startswith(GANG_PG_NAME_PREFIX)
    ]
    if gang_pg_names_in_cluster:
        pg_table = ray.util.placement_group_table()
        gang_pg_name_to_id: Dict[str, str] = {}
        for pg_id_hex, entry in pg_table.items():
            name = entry.get("name", "")
            if name.startswith(GANG_PG_NAME_PREFIX):
                gang_pg_name_to_id[name] = pg_id_hex
        try:
            occupied_pg_ids = get_active_placement_group_ids()
        except Exception:
            logger.warning(
                "Skipping gang PG leak detection due to GCS query failure.",
                exc_info=True,
            )
        else:
            for gang_pg_name in gang_pg_names_in_cluster:
                pg_id = gang_pg_name_to_id.get(gang_pg_name)
                if pg_id is not None and pg_id not in occupied_pg_ids:
                    leaked_pg_names.append(gang_pg_name)
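The core of that check is a set difference between named gang PGs and the PG IDs still referenced by actors. The sketch below isolates that logic as a pure function so it can be read and tested without a cluster. It is a simplification: `find_leaked_gang_pgs` is a hypothetical name, the prefix value is a placeholder rather than the real constant, and the table shape (PG ID hex mapped to a dict with a `"name"` key) is taken from the snippet above.

```python
from typing import Dict, List, Set

# Placeholder value for illustration; the real GANG_PG_NAME_PREFIX is not shown here.
GANG_PG_NAME_PREFIX = "EXAMPLE_GANG_PG_"


def find_leaked_gang_pgs(
    pg_table: Dict[str, dict],
    occupied_pg_ids: Set[str],
    prefix: str = GANG_PG_NAME_PREFIX,
) -> List[str]:
    """Return names of gang PGs whose ID is not referenced by any live actor."""
    leaked = []
    for pg_id_hex, entry in pg_table.items():
        name = entry.get("name", "")
        # Only gang PGs are candidates; other PGs are never reported.
        if name.startswith(prefix) and pg_id_hex not in occupied_pg_ids:
            leaked.append(name)
    return sorted(leaked)
```

Because the function only reports names that start with the prefix, placement groups created outside of gang scheduling are left alone even when no actor references them.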

Which private APIs from serve are you referring to?

from ray._private.state import state

gotcha, let me fix this

Addressed in the latest commit. Also added logic to only detect PGs referenced by Serve actors, to avoid deleting non-Serve PGs. (The deletion logic will be introduced in a follow-up PR.)

abrarsheikh · 2026-02-24T06:09:39Z

please merge master

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-25T01:30:26Z

+                "gang_scheduling_config is provided."
+            ),
+        ):
+            DeploymentSchema.parse_obj(deployment_schema)


Schema test missing num_replicas causes wrong validation error

Medium Severity

Both test_mutually_exclusive_max_replicas_per_node_and_gang_scheduling_config and test_mutually_exclusive_placement_group_strategy_and_gang_scheduling_config set gang_size=2 without setting num_replicas. If get_minimal_deployment_schema() returns a schema with default num_replicas=1, then 1 % 2 != 0 triggers the "num_replicas must be a multiple of gang_size" root validator before the mutual-exclusivity validators, causing both tests to fail with an unexpected error message. Compare with the nearby test_gang_scheduling_config_invalid_num_replicas which correctly sets num_replicas=4.
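The failure mode described here comes down to "first failing check wins". A minimal sketch, assuming the multiple-of-gang_size check runs before the mutual-exclusivity check as the comment supposes; the check functions and `first_error` are hypothetical helpers, not the Serve validators.

```python
from typing import Any, Callable, Dict, List, Optional

Options = Dict[str, Any]
Check = Callable[[Options], Optional[str]]


def check_multiple(opts: Options) -> Optional[str]:
    """Fail when num_replicas does not divide evenly into gangs."""
    if opts["num_replicas"] % opts["gang_size"] != 0:
        return "num_replicas must be a multiple of gang_size"
    return None


def check_max_replicas_per_node(opts: Options) -> Optional[str]:
    """Fail when max_replicas_per_node is combined with a gang config."""
    if opts.get("max_replicas_per_node") is not None:
        return "max_replicas_per_node cannot be set with gang_scheduling_config"
    return None


def first_error(opts: Options, checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first error message, if any."""
    for check in checks:
        message = check(opts)
        if message is not None:
            return message
    return None


checks = [check_multiple, check_max_replicas_per_node]
# Default num_replicas=1 with gang_size=2: the multiple check fires first.
masked = first_error(
    {"num_replicas": 1, "gang_size": 2, "max_replicas_per_node": 1}, checks
)
# With num_replicas=4 the multiple check passes and the intended error surfaces.
intended = first_error(
    {"num_replicas": 4, "gang_size": 2, "max_replicas_per_node": 1}, checks
)
```

Setting `num_replicas` to a multiple of `gang_size` in the test fixture removes the dependency on validator order.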

Additional Locations (1)

python/ray/serve/tests/unit/test_schema.py#L497-L511

jeffreywang88 · 2026-02-25T02:09:09Z

In utils.py, importing from ray.util.state at the top level causes the linter to discover and scan ray.util.state.common when it traverses ray.serve, and many APIs under ray.util.state.common violate the API annotation linter.

Therefore, I moved the imports inside get_active_placement_group_ids. This helper is only called upon controller recovery, so the deferred import should be fine. Adding API annotations for the classes under ray.util.state.common is out of scope for this PR.

Here's the buildkite failure: https://buildkite.com/ray-project/premerge/builds/60808/steps/canvas?sid=019c9261-214a-4c16-b979-950cf4674171&tab=output.

abrarsheikh · 2026-02-25T03:12:10Z

+    from ray.util.state import list_actors
+    from ray.util.state.common import RAY_MAX_LIMIT_FROM_API_SERVER


move imports to top

abrarsheikh · 2026-02-25T05:30:22Z

In utils.py, importing from ray.util.state at the top level causes the linter to discover and scan ray.util.state.common when it traverses ray.serve, and there are a lot of APIs under ray.util.state.common that violate API annotation linter.

why does docs api annotation discover ray.util.state?

jeffreywang88 · 2026-02-25T20:20:03Z

Ray Train adopts the same pattern of importing ray.util.state.list_actors. The reason why the linter doesn't complain about that is:

import ray.train (train.__init__.py) does NOT transitively load ray.train.v2._internal.state.util, so ray.util.state never enters sys.modules. The linter never discovers it.
import ray.serve DOES transitively load ray.serve._private.utils (code), which has the top-level from ray.util.state import list_actors. This pulls ray.util.state into sys.modules. Then when the linter scans ray.util, it finds ray.util.state and flags all its unannotated symbols.
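The mechanism behind this is general Python behavior: a module enters `sys.modules` only when an import statement for it actually executes. The sketch below shows the effect of a function-level import using the standard-library `colorsys` module as a stand-in for `ray.util.state`; `hue_of_red` and `is_loaded` are hypothetical names for illustration.

```python
import sys


def hue_of_red() -> float:
    """Compute the HSV hue of pure red, importing colorsys only when called."""
    # Deferred import: this statement runs on each call, so the module is
    # loaded on the first call rather than when the enclosing file is imported.
    import colorsys

    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)[0]


def is_loaded(module_name: str) -> bool:
    """Report whether a module is currently present in sys.modules."""
    return module_name in sys.modules
```

If nothing else in the process has imported `colorsys`, `is_loaded("colorsys")` is false until `hue_of_red()` runs. A tool that walks `sys.modules` after importing the enclosing file would therefore not see the deferred dependency.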

## Description

Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are.

### Approach

- Scheduler (`deployment_scheduler.py`)
  - Added `schedule_gang_placement_groups` to DeploymentScheduler.
  - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang.
  - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists.
  - Gang reservation results are passed to the deployment state machine.
- State Machine (`deployment_state.py`)
  - Introduced a new step in the update loop to reserve gang placement groups.
  - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids).
  - If any replica in a gang fails during startup, all replicas in that gang are stopped.
  - Gracefully handles placement group removal failures for shared gang placement groups.
- Replica (`replica.py`)
  - Extended `ReplicaMetadata` to include `GangContext`.
  - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`.

## Related issues

RFC: #60873
Precedent: #61205
Original PR: #60802

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
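The gang context fields named above (gang_id, rank, world_size, member_replica_ids) can be pictured with a small sketch that partitions replica IDs into gangs and assigns ranks. This is an illustration only: `GangContextSketch` and `assign_gang_contexts` are hypothetical, the `gang-N` ID format is made up, and the real scheduler also reserves placement groups, which is omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GangContextSketch:
    """Hypothetical mirror of the gang context fields listed above."""

    gang_id: str
    rank: int
    world_size: int
    member_replica_ids: List[str]


def assign_gang_contexts(
    replica_ids: List[str], gang_size: int
) -> List[GangContextSketch]:
    """Split replica IDs into consecutive gangs and give each member a rank."""
    if gang_size < 1 or len(replica_ids) % gang_size != 0:
        raise ValueError("replica count must be a positive multiple of gang_size")
    contexts = []
    for start in range(0, len(replica_ids), gang_size):
        members = replica_ids[start : start + gang_size]
        # Illustrative ID format; not the naming scheme used by Serve.
        gang_id = f"gang-{start // gang_size}"
        for rank in range(len(members)):
            contexts.append(
                GangContextSketch(gang_id, rank, gang_size, list(members))
            )
    return contexts
```

With four replicas and `gang_size=2`, this yields two gangs of two members each, with ranks 0 and 1 inside each gang.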

jeffreywang88 requested a review from a team as a code owner February 20, 2026 19:12

ray-gardener Bot added the community-contribution (Contributed by the community) label Feb 20, 2026

cursor Bot reviewed Feb 20, 2026

Comment thread python/ray/serve/_private/utils.py Outdated

Comment thread python/ray/serve/_private/common.py

gemini-code-assist Bot reviewed Feb 20, 2026

Comment thread python/ray/serve/_private/utils.py Outdated

jeffreywang88 mentioned this pull request Feb 20, 2026

[serve][3/n] Gang scheduling -- core scheduling engine #61206

Merged

jeffreywang88 changed the title ~~[serve][2/n] Gang scheduling: validation and utilities~~ Feb 20, 2026

jeffreywang88 requested a review from abrarsheikh February 20, 2026 22:28

jeffreywang88 added the go (add ONLY when ready to merge, run all tests) and serve (Ray Serve Related Issue) labels and removed the community-contribution (Contributed by the community) label Feb 20, 2026

cursor Bot reviewed Feb 20, 2026

Comment thread python/ray/serve/deployment.py

jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from 45ead0a to d983408 Compare February 24, 2026 05:09

cursor Bot reviewed Feb 24, 2026

Comment thread python/ray/serve/tests/test_util.py

abrarsheikh approved these changes Feb 24, 2026

jeffreywang88 added 5 commits February 24, 2026 17:55

Add GANG_PG_NAME_PREFIX and extend GangPlacementGroupRequest to support per-replica PG

ff4411a

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>

Add validation rules for gang scheduling config mutual exclusivity

bed4227

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>

Add get_active_placement_group_ids utility

62758f3

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>

Fix black

524cbde

Signed-off-by: jeffreywang <jeffreywang@anyscale.com>

CR feedback

38191f1

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the gang-scheduling-part1-validation branch from d983408 to 38191f1 Compare February 24, 2026 18:32

Move away from _private APIs

2256479

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

cursor Bot reviewed Feb 25, 2026

Comment thread python/ray/serve/_private/utils.py

Merge branch 'master' into gang-scheduling-part1-validation

352194e

cursor Bot reviewed Feb 25, 2026

Fix unhappy api_annotations linter

2b5c43a

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

abrarsheikh approved these changes Feb 25, 2026

move import to top

04b0ad2

Signed-off-by: abrar <abrar@anyscale.com>

Fix linter

7e2aa35

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

jeffreywang88 mentioned this pull request Feb 25, 2026

[core] Missing API annotations for APIs under public modules (ray.util, ray.dashboard) #61330

Open

Fix linter

4a71ece

Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>

abrarsheikh approved these changes Feb 26, 2026

abrarsheikh merged commit 2745feb into ray-project:master Feb 26, 2026
6 checks passed

abrarsheikh merged 11 commits into ray-project:master from jeffreywang88:gang-scheduling-part1-validation
