Skip to content

[Serve] feat: scale down non matching primary labels replica first#61488

Merged
abrarsheikh merged 24 commits into
ray-project:masterfrom
manhld0206:feat/prioritize_scaling_down_fallback_labels_replica
Apr 3, 2026
Merged

[Serve] feat: scale down non matching primary labels replica first#61488
abrarsheikh merged 24 commits into
ray-project:masterfrom
manhld0206:feat/prioritize_scaling_down_fallback_labels_replica

Conversation

@manhld0206

@manhld0206 manhld0206 commented Mar 4, 2026

Copy link
Copy Markdown
Contributor

Fixes #58959

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a change to the deployment scale-down logic to prioritize stopping replicas on nodes that do not match the deployment's primary label_selector. This is achieved by modifying the sorting key in _get_replicas_to_stop. While the overall approach is correct, I found a critical issue in the implementation of the sorting logic that inverts the intended priority. My review includes a specific comment with a suggested fix for this issue.

Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated

@abrarsheikh abrarsheikh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like the approach overall, it's reasonable. Please add e2e tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function has grown a bit complicated, please update the docstring explain the algorithm

Comment on lines 1121 to 1123
match_primary_labels = int(
node_labels_match_selector(node_labels, primary_labels)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should also take into account placement group label selector.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think we can check node_labels against self._deployments[deployment_id].placement_group_bundle_label_selector if it's set, label_selector is only for the deployment Actor while placement_group_bundle_label_selector is applied to each of the bundles that dictate where the replicas are placed. If the node labels don't match any bundles/bundle_label_selector in placement_group_bundle_label_selector, then it's infeasible for scheduling and should be scaled down first.

@abrarsheikh

Copy link
Copy Markdown
Contributor

please also fix DCO

@@ -1089,6 +1090,8 @@ def _get_replicas_to_stop(
if len(replicas_to_stop) == max_num_to_stop:
return replicas_to_stop

actor_options = self._deployments[deployment_id].actor_options or {}
primary_labels: Dict[str, str] = actor_options.get("label_selector", {})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do primary_label_selector or similar, I think an Actor can also have labels so it might get confusing.

@manhld0206 manhld0206 changed the title feat: scale down non matching primary labels first Mar 5, 2026
@manhld0206 manhld0206 force-pushed the feat/prioritize_scaling_down_fallback_labels_replica branch from 722d7af to aae4e32 Compare March 5, 2026 10:14
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@manhld0206 manhld0206 force-pushed the feat/prioritize_scaling_down_fallback_labels_replica branch from aae4e32 to 62cab53 Compare March 5, 2026 10:15
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@manhld0206 manhld0206 marked this pull request as ready for review March 12, 2026 14:22
@manhld0206 manhld0206 requested a review from a team as a code owner March 12, 2026 14:22
@manhld0206 manhld0206 requested a review from abrarsheikh March 12, 2026 14:25
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Mar 12, 2026
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@manhld0206

Copy link
Copy Markdown
Contributor Author

@abrarsheikh The PR is ready for review. Could you take a look?

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Mar 18, 2026
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
assert dep_id in deployment_to_replicas_to_stop
return deployment_to_replicas_to_stop[dep_id]

def test_downscale_prioritizes_non_running_replicas(self, ray_cluster):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current tests are wriiten as unit tests, if you see the rest of the tests in the file they are all written as integration tests. Let's maintain consistency with other tests.

Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>

@jeffreywang88 jeffreywang88 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great contribution! let's add some more tests:

  • no tests for the bundle_label_selector path. both integ tests use label_selector.
  • test for head node interaction: verify that the head node replicas are still deprioritized even when they're on a fallback node
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/tests/test_deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Comment thread python/ray/serve/_private/deployment_scheduler.py Outdated
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@manhld0206

Copy link
Copy Markdown
Contributor Author
  • test for head node interaction: verify that the head node replicas are still deprioritized even when they're on a fallback node

@jeffreywang-anyscale We should prioritize fallback node condition over head node shouldn't we? I think it makes more sense to always scale down node which doesn't match the primary label selector regradless it's the head node or not

Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Comment thread python/ray/serve/_private/deployment_scheduler.py
@jeffreywang88

jeffreywang88 commented Mar 20, 2026

Copy link
Copy Markdown
Contributor
  • test for head node interaction: verify that the head node replicas are still deprioritized even when they're on a fallback node

@jeffreywang-anyscale We should prioritize fallback node condition over head node shouldn't we? I think it makes more sense to always scale down node which doesn't match the primary label selector regardless it's the head node or not

The main goal of this PR is to free up spot instance resources. Head node hosts critical components like GCS, autoscaler, and Serve controller, so even though we evict replicas, we can't delete the underlying spot instance. Because of this, I think replicas on the head node should be deprioritized for eviction during scale-down. They are last in line, since removing them provides no benefit in terms of draining nodes.

While Ray offers GCS fault tolerance via external Redis, allowing workers to continue serving traffic during head node recovery, the control plane is unavailable during this period (e.g. no deployments can be updated). Avoiding head node loss remains important for cluster stability.

We can avoid scheduling head node on spot instances by configuring nodeSelector in the KubeRay manifest. But even if we do schedule head node on spot instances, I think we should keep the head node, and therefore there isn't much benefit to evict replicas on the head node.

Let's get a second opinion, @abrarsheikh WDYT?

@jeffreywang88

Copy link
Copy Markdown
Contributor

Could you please fix premerge test failures?

Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@abrarsheikh

Copy link
Copy Markdown
Contributor
  • test for head node interaction: verify that the head node replicas are still deprioritized even when they're on a fallback node

@jeffreywang-anyscale We should prioritize fallback node condition over head node shouldn't we? I think it makes more sense to always scale down node which doesn't match the primary label selector regardless it's the head node or not

The main goal of this PR is to free up spot instance resources. Head node hosts critical components like GCS, autoscaler, and Serve controller, so even though we evict replicas, we can't delete the underlying spot instance. Because of this, I think replicas on the head node should be deprioritized for eviction during scale-down. They are last in line, since removing them provides no benefit in terms of draining nodes.

While Ray offers GCS fault tolerance via external Redis, allowing workers to continue serving traffic during head node recovery, the control plane is unavailable during this period (e.g. no deployments can be updated). Avoiding head node loss remains important for cluster stability.

We can avoid scheduling head node on spot instances by configuring nodeSelector in the KubeRay manifest. But even if we do schedule head node on spot instances, I think we should keep the head node, and therefore there isn't much benefit to evict replicas on the head node.

Let's get a second opinion, @abrarsheikh WDYT?

I agree with @jeffreywang-anyscale. The head node should be the last to go, behavior is consistent with what currently exists. Also, no serious user will run the head node of a spot instance, so this decision is moot.

Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
@manhld0206

Copy link
Copy Markdown
Contributor Author

@abrarsheikh @jeffreywang-anyscale Could you check again? I have updated code to make replicas on head node go down last

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

@jeffreywang88 jeffreywang88 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the work! pushed a minor fix for the tests to ensure we're testing the intended priorities.

@abrarsheikh abrarsheikh merged commit b610044 into ray-project:master Apr 3, 2026
6 checks passed
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ay-project#61488)

> 💡 Tip: Mark as draft if you want early feedback, or ready for review
when it's complete.

## Description
> Briefly describe what this PR accomplishes and why it's needed.
- Prioritze sclaing down replica which doesn't fit label_selector
(fallback_strategy enabled)

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".
- Related to ray-project#58959

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

4 participants