[serve][2/n] Gang scheduling -- validation and utilities#61205
Conversation
There was a problem hiding this comment.
Code Review
This pull request lays the groundwork for gang scheduling in Ray Serve by extending the data model and adding essential validation rules. The enforcement of mutual exclusivity between gang scheduling and conflicting placement group strategies is well-implemented. However, there is a critical bug in the get_active_placement_group_ids utility function where it attempts to use HasField on a scalar bytes field, which is not supported in proto3 and will cause a crash. Additionally, the utility should filter out Nil IDs to accurately identify active placement groups. There is also a minor validation gap in the deployment decorator regarding replica multiples.
45ead0a to
d983408
Compare
| return values | ||
|
|
||
| @root_validator | ||
| def validate_placement_group_strategy_and_gang_scheduling_config(cls, values): |
There was a problem hiding this comment.
i am assuming this will be added later?
There was a problem hiding this comment.
Actually no, we'd prefer setting the placement strategy through GangSchedulingConfig.gang_placement_strategy.
| return live_pg_names | ||
|
|
||
|
|
||
| def get_active_placement_group_ids() -> Set[str]: |
There was a problem hiding this comment.
need to learn more about why we need this. We need to figure out a way around this, one it uses private API from serve and this is disallowed at project level. (b) it is making RPC call to GSC from what it looks like, so we cannot be calling this from every controller iteration loop.
There was a problem hiding this comment.
Great callout, I haven't thought about the GCS aspect. Luckily, we will not call this in every controller loop. Starting from #61207, get_active_placement_group_ids will be invoked in _detect_and_remove_leaked_placement_groups only upon controller recovery, so I guess we don't need worry about the performance aspect.
It'll look something like this:
gang_pg_names_in_cluster = [
name
for name in all_current_placement_group_names
if name.startswith(GANG_PG_NAME_PREFIX)
]
if gang_pg_names_in_cluster:
pg_table = ray.util.placement_group_table()
gang_pg_name_to_id: Dict[str, str] = {}
for pg_id_hex, entry in pg_table.items():
name = entry.get("name", "")
if name.startswith(GANG_PG_NAME_PREFIX):
gang_pg_name_to_id[name] = pg_id_hex
try:
occupied_pg_ids = get_active_placement_group_ids()
except Exception:
logger.warning(
"Skipping gang PG leak detection due to GCS query failure.",
exc_info=True,
)
else:
for gang_pg_name in gang_pg_names_in_cluster:
pg_id = gang_pg_name_to_id.get(gang_pg_name)
if pg_id is not None and pg_id not in occupied_pg_ids:
leaked_pg_names.append(gang_pg_name)
There was a problem hiding this comment.
Which private APIs from serve are you referring to?
There was a problem hiding this comment.
from ray._private.state import state
There was a problem hiding this comment.
gotcha, let me fix this
There was a problem hiding this comment.
Addressed in the latest commit. Also added logic to only detect pgs reference by serve actors to avoid from deleting non-serve PGs. (The deletion logic will be introduced in a follow-up PR.)
|
please merge master |
…rt per-replica PG Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
d983408 to
38191f1
Compare
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
| "gang_scheduling_config is provided." | ||
| ), | ||
| ): | ||
| DeploymentSchema.parse_obj(deployment_schema) |
There was a problem hiding this comment.
Schema test missing num_replicas causes wrong validation error
Medium Severity
Both test_mutually_exclusive_max_replicas_per_node_and_gang_scheduling_config and test_mutually_exclusive_placement_group_strategy_and_gang_scheduling_config set gang_size=2 without setting num_replicas. If get_minimal_deployment_schema() returns a schema with default num_replicas=1, then 1 % 2 != 0 triggers the "num_replicas must be a multiple of gang_size" root validator before the mutual-exclusivity validators, causing both tests to fail with an unexpected error message. Compare with the nearby test_gang_scheduling_config_invalid_num_replicas which correctly sets num_replicas=4.
Additional Locations (1)
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
|
In utils.py, importing from Therefore, I moved the imports inside Here's the buildkite failure: https://buildkite.com/ray-project/premerge/builds/60808/steps/canvas?sid=019c9261-214a-4c16-b979-950cf4674171&tab=output. |
| from ray.util.state import list_actors | ||
| from ray.util.state.common import RAY_MAX_LIMIT_FROM_API_SERVER |
Signed-off-by: abrar <abrar@anyscale.com>
why does docs api annotation discover |
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
|
Ray Train adopts the same pattern of importing
|
Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com>
## Description Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are. ### Approach - Scheduler (`deployment_scheduler.py`) - Added `schedule_gang_placement_groups` to DeploymentScheduler. - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang. - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists. - Gang reservation results are passed to the deployment state machine. - State Machine (`deployment_state.py`) - Introduced a new step in the update loop to reserve gang placement groups. - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids). - If any replica in a gang fails during startup, all replicas in that gang are stopped. - Gracefully handles placement group removal failures for shared gang placement groups. - Replica (`replica.py`) - Extended `ReplicaMetadata` to include `GangContext`. - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`. ## Related issues RFC: #60873 Precedent: #61205 Original PR: #60802 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: jeffreywang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: abrar <abrar@anyscale.com>
## Description Implements the core gang scheduling logic: Gang-scheduled deployments atomically reserve placement groups for groups of replicas and start them together, ensuring all members of a gang are co-scheduled or none are. ### Approach - Scheduler (`deployment_scheduler.py`) - Added `schedule_gang_placement_groups` to DeploymentScheduler. - The default scheduler now creates named gang placement groups and assigns replica ranks within each gang. - Replica scheduling checks for a gang placement group first, and falls back to per-replica placement if none exists. - Gang reservation results are passed to the deployment state machine. - State Machine (`deployment_state.py`) - Introduced a new step in the update loop to reserve gang placement groups. - Added `_add_replicas_with_gang_scheduling()` to start replicas with gang context (gang_id, rank, world_size, member_replica_ids). - If any replica in a gang fails during startup, all replicas in that gang are stopped. - Gracefully handles placement group removal failures for shared gang placement groups. - Replica (`replica.py`) - Extended `ReplicaMetadata` to include `GangContext`. - ActorReplicaWrapper now stores and exposes gang context and passes it through `check_ready()`. ## Related issues RFC: #60873 Precedent: #61205 Original PR: #60802 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: jeffreywang <jeffreywang@anyscale.com> Signed-off-by: jeffreywang-anyscale <jeffreywang@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: abrar <abrar@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


Description
This PR lays the groundwork for gang scheduling in Ray Serve by extending the data model, adding config validation rules, and introducing a utility for querying active placement group IDs from GCS.
Test plan
@serve.deploymentrejects gang config combined with max_replicas_per_node / placement_group_strategyRelated issues
RFC: #60873
Original PR: #60802
Additional information