
[core][ippr][7/N] Initial implementation for resizing pods in-place on Kubernetes 1.35#55961

Merged
edoakes merged 8 commits into
ray-project:masterfrom
rueian:autoscaler-ippr
May 6, 2026

Conversation

@rueian

@rueian rueian commented Aug 26, 2025

Contributor

In-place Pod Resizing (IPPR) Integration on Kubernetes 1.35

Things that have been done:

  1. IPPR JSON configuration and validation for users to enable the IPPR integration with Autoscaler v2.
  2. Resize a Pod's CPU and memory requests and limits, in one step, to the maximums specified in the config.

Configuration and Validation

Users can provide a ray.io/ippr annotation on their RayCluster CR to enable IPPR with Autoscaler v2:

{
  "groups": {
    "<groupName>": {
      "max-cpu":     string|number,  # K8s quantity (e.g. "2", "1500m")
      "max-memory":  string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer      # Seconds to wait for a pod resize to
                                     # complete before considering it timed out
    },
    ...
  }
}

groupName should match the names of Ray worker groups. In each group, max-cpu, max-memory, and resize-timeout are mandatory.
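
As a concrete illustration, the snippet below builds an annotation value for a hypothetical worker group named `small-group` and checks the three mandatory keys. The helper `check_ippr_config` and the group name are assumptions made for this sketch; it is not the validation code added in this PR.

```python
import json

REQUIRED_KEYS = ("max-cpu", "max-memory", "resize-timeout")


def check_ippr_config(raw: str, worker_group_names: set) -> dict:
    """Parse a ray.io/ippr annotation value and check the documented rules.

    Hypothetical helper for illustration only.
    """
    config = json.loads(raw)
    groups = config.get("groups", {})
    for name, spec in groups.items():
        # Each key under "groups" must name an existing Ray worker group.
        if name not in worker_group_names:
            raise ValueError(f"unknown worker group: {name}")
        missing = [key for key in REQUIRED_KEYS if key not in spec]
        if missing:
            raise ValueError(f"group {name} is missing: {missing}")
        timeout = spec["resize-timeout"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(timeout, bool) or not isinstance(timeout, int):
            raise ValueError(f"group {name}: resize-timeout must be an integer")
    return groups


annotation = json.dumps(
    {
        "groups": {
            "small-group": {
                "max-cpu": "4",
                "max-memory": "8Gi",
                "resize-timeout": 60,
            }
        }
    }
)
groups = check_ippr_config(annotation, {"small-group"})
```

The string passed as `annotation` is what would go into the `ray.io/ippr` annotation on the RayCluster CR.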

Besides the above configuration, we also validate:

  1. The corresponding worker groups must not set num-cpus or memory in their rayStartParams, because those settings can make Ray's logical resources diverge from the pod's actual resources.
  2. Worker groups must have cpu and memory resource requests specified in their container specs.
  3. Their containers must have resizePolicy.restartPolicy set to NotRequired.
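
The three checks above can be sketched against a worker group spec expressed as a plain dict. This is an illustrative sketch, not the PR's validator: the function name `find_ippr_violations` is hypothetical, it only inspects the first container (assumed here to be the Ray container), and it assumes the resize policy must be set explicitly for both cpu and memory.

```python
def find_ippr_violations(worker_group: dict) -> list:
    """Return human-readable violations of the three rules listed above.

    `worker_group` is one entry of a RayCluster's workerGroupSpecs.
    """
    violations = []

    # Rule 1: logical resources must not be pinned through rayStartParams.
    start_params = worker_group.get("rayStartParams", {})
    for param in ("num-cpus", "memory"):
        if param in start_params:
            violations.append(f"rayStartParams must not set {param}")

    # Rule 2: the container must declare cpu and memory requests.
    container = worker_group["template"]["spec"]["containers"][0]
    requests = container.get("resources", {}).get("requests", {})
    for resource in ("cpu", "memory"):
        if resource not in requests:
            violations.append(f"container must request {resource}")

    # Rule 3: resizing must not require a container restart.
    policies = {
        policy.get("resourceName"): policy.get("restartPolicy")
        for policy in container.get("resizePolicy", [])
    }
    for resource in ("cpu", "memory"):
        if policies.get(resource) != "NotRequired":
            violations.append(f"resizePolicy for {resource} must be NotRequired")
    return violations


group = {
    "groupName": "small-group",
    "rayStartParams": {"num-cpus": "1"},
    "template": {
        "spec": {
            "containers": [
                {
                    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                    "resizePolicy": [
                        {"resourceName": "cpu", "restartPolicy": "NotRequired"},
                        {"resourceName": "memory", "restartPolicy": "NotRequired"},
                    ],
                }
            ]
        }
    },
}
violations = find_ippr_violations(group)
```

Here the only violation reported is the `num-cpus` entry in `rayStartParams`.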

Resize Behavior

The current implementation tries to resize existing nodes, in one step, to the maximum specified by the user when there are pending tasks that would fit on those nodes after resizing. Gradual resizing and downsizing will be implemented in later PRs. The detailed behavior is as follows:

  1. After placing pending tasks on the existing nodes at their current capacities, the autoscaler tries again to place the remaining pending tasks on nodes that have no ongoing resize, this time using the maximum capacities specified in the ray.io/ippr annotation. If any remaining pending task fits on a node this way, the autoscaler sends a k8s resize request for that node and records the resize status in a pod annotation, ray.io/ippr-status, at the end of the current reconciliation.
  2. If pending tasks still remain, the autoscaler falls back to the original horizontal scale-out, but takes the maximum capacity of each worker type into account.
  3. At the beginning of the next reconciliation, the autoscaler determines the next step for the resizes that were sent at the end of the previous reconciliation by inspecting their statuses. The next step is one of two cases:
    a) Finish the resize by adjusting the logical resources on the Raylet and updating its ray.io/ippr-status.
    b) Adjust the resize by queueing a new k8s resize request due to a timeout or an error.
    Note that if the RPC to adjust the logical resources on the Raylet fails, the autoscaler retries in the next reconciliation, because the failed attempt leaves the corresponding ray.io/ippr-status unchanged.
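
Step 3 can be read as a small decision function that runs once per in-flight resize at the start of each reconciliation. The sketch below is a simplified model under stated assumptions: `ResizeState`, its fields, and the returned labels are hypothetical and do not mirror the actual IPPRStatus struct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResizeState:
    """Simplified stand-in for what ray.io/ippr-status tracks (assumed fields)."""

    resizing_at: Optional[float]  # when the k8s resize request was sent
    k8s_done: bool  # the pod's resources now match the request
    k8s_error: bool  # k8s reported that the resize failed
    raylet_synced: bool  # the Raylet's logical resources were updated
    timeout_s: int  # resize-timeout from the ray.io/ippr config


def next_step(state: ResizeState, now: float) -> str:
    """Pick the next action for one node at the start of a reconciliation."""
    if state.resizing_at is None:
        return "idle"
    if state.k8s_done:
        # Case (a). If an earlier Raylet RPC failed, raylet_synced is still
        # False, so the sync is simply attempted again.
        return "done" if state.raylet_synced else "sync_raylet"
    if state.k8s_error or now - state.resizing_at > state.timeout_s:
        # Case (b): queue a new k8s resize request.
        return "queue_new_resize"
    return "wait"
```

Because the retry falls out of the stored state rather than from an explicit retry counter, a failed Raylet RPC needs no special handling in this model.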

Before: without IPPR

image

After: red boxes are new behaviors for IPPR

image

Example cases

The following are different cases for IPPR. They are all based on this cluster shape:

HeadNode * 1 with CPU=0 for simplicity.
WorkerGroup1:
CPU: 1 (can be resized to 4)
MaxReplicas >= 1

Case 1: Resizing an existing node

T1: (There is 1 worker idle and we have a pending request for 2 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

T2: (The worker is resized and the pending request is consumed)
WorkerGroup1=[{CPU: 4, Available: 2}]
Pending Reqs=[]
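
The arithmetic of Case 1 can be reproduced with a toy single-node planner. This is only a sketch of the two-pass idea (current capacity first, then maximum capacity); `plan_resize` is a hypothetical name, it handles CPU only, and it ignores everything the real scheduler does beyond this case.

```python
def plan_resize(total_cpu, available_cpu, max_cpu, pending):
    """Place CPU requests on one node, resizing to max_cpu only if that helps.

    Returns (new_total, new_available, unplaced_requests).
    """
    used = total_cpu - available_cpu
    unplaced = []
    # First pass: try the node at its current capacity.
    for req in pending:
        if req <= total_cpu - used:
            used += req
        else:
            unplaced.append(req)
    if not unplaced:
        return total_cpu, total_cpu - used, []

    # Second pass: treat the node as if it already had its maximum capacity.
    still_unplaced = []
    placed_any = False
    for req in unplaced:
        if req <= max_cpu - used:
            used += req
            placed_any = True
        else:
            still_unplaced.append(req)
    # Only resize when the extra capacity actually absorbs a request.
    new_total = max_cpu if placed_any else total_cpu
    return new_total, new_total - used, still_unplaced


result = plan_resize(total_cpu=1, available_cpu=1, max_cpu=4, pending=[2])
```

For the Case 1 inputs, `result` is `(4, 2, [])`, matching the T2 state `{CPU: 4, Available: 2}` with no pending requests.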

Case 2: Resizing an existing worker and scaling out a new worker.

T1: (There is 1 worker idle and we have pending requests for 2 and 4 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}, {CPU: 4}]

Autoscaler will try to resize the node to 4 CPUs and add a new worker:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}, {CPU: 1, Available: 1}]

T2: (The first worker is resized and the 4-CPU request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will then try to resize the second worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1 -> 4, Available: 1 -> 4}]

T3: (The second worker is resized and the remaining request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 4, Available: 2}]
Pending Reqs=[]

Case 3: Scaling out with IPPR capacity considerations

T1: (no worker initially but there is a pending request for 2 CPUs)
WorkerGroup1=[]
Pending Reqs=[{CPU: 2}]

Autoscaler will still add a new node because the worker has the IPPR capacity:
=> WorkerGroup1=[{CPU: 1, Available: 1}]

T2:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Then this will be the same as Case 1.

Case 4: IPPR timeout

T1:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler first tries to resize the node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

T2: (The IPPR times out)
WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]
Pending Reqs=[{CPU: 2}]

Autoscaler will roll the IPPR back and scale out:
=> WorkerGroup1=[{CPU: 4 -> 1, Available: 4 -> 1}, {CPU: 1, Available: 1}]

T3:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the second node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1 -> 4, Available: 1 -> 4}]

T4:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 4, Available: 2}]
Pending Reqs=[]

Additional notes

Things that are added in this PR:

  1. KubeRayIPPRProvider: provides the autoscaler with helper methods for IPPR, such as get_ippr_specs, get_ippr_statuses, and do_ippr_requests, along with the related structs IPPRSpecs and IPPRStatus.
  2. An AsyncResizeLocalResourceInstances method has been added to RayletClientWithIoContext.

The above are unit tested.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Tested with Kubernetes 1.35
    • Release tests
    • This PR is not tested :(
@rueian rueian force-pushed the autoscaler-ippr branch 3 times, most recently from 07edb0f to cd521c0 Compare August 26, 2025 20:55
@rueian rueian requested a review from Copilot August 27, 2025 00:10

Copilot AI left a comment

Contributor

Pull Request Overview

This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.

Key changes:

  • Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
  • Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
  • Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/ray/autoscaler/v2/schema.py | Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus |
| python/ray/autoscaler/v2/scheduler.py | Integrates IPPR into scheduling logic to consider resizing existing pods before launching new ones |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py | New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py | Integrates IPPR provider into KubeRay cloud provider |
| python/ray/autoscaler/v2/instance_manager/reconciler.py | Connects IPPR functionality to the main autoscaler reconciliation loop |
| python/ray/autoscaler/v2/tests/test_ippr_provider.py | Comprehensive test suite for IPPR provider functionality |
| python/ray/autoscaler/v2/tests/test_scheduler.py | Tests for IPPR integration in the scheduler |


@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing Pods in-place to the maximum configured by the user Aug 27, 2025
@rueian rueian added the go add ONLY when ready to merge, run all tests label Aug 27, 2025
@rueian rueian force-pushed the autoscaler-ippr branch 4 times, most recently from f661d9b to 388ba37 Compare August 27, 2025 06:15
@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user Aug 27, 2025
Comment thread ci/env/install-core-prerelease-dependencies.sh Outdated
DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature.
DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]].

Contributor Author

fix lint.

  url,
  json.dumps(payload),
- headers={**headers, "Content-type": "application/json-patch+json"},
+ headers={**headers, "Content-type": content_type},

Contributor Author

Make content-type adjustable for different patch strategies.
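
For context, the Kubernetes API server distinguishes patch strategies by the request's content type. A minimal sketch of a header helper follows; `patch_headers` and the short strategy names are hypothetical and are not part of this PR's client.

```python
# Content types the Kubernetes API server accepts for PATCH requests.
PATCH_CONTENT_TYPES = {
    "json": "application/json-patch+json",
    "merge": "application/merge-patch+json",
    "strategic": "application/strategic-merge-patch+json",
}


def patch_headers(base_headers: dict, strategy: str = "json") -> dict:
    """Return request headers for a PATCH that uses the given strategy."""
    return {**base_headers, "Content-type": PATCH_CONTENT_TYPES[strategy]}


headers = patch_headers({"Authorization": "Bearer <token>"}, "merge")
```

Defaulting to the JSON patch type keeps existing callers on the content type that was previously hardcoded.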

self._ray_cluster = None
self._cached_instances: Dict[CloudInstanceId, CloudInstance]
self._ippr_provider = KubeRayIPPRProvider(
gcs_client=gcs_client, k8s_api_client=self._k8s_api_client

@rueian rueian Aug 28, 2025

Contributor Author

The KubeRayIPPRProvider needs a gcs_client to adjust the size of a Raylet, and it also needs a k8s_api_client to patch pods.

@rueian rueian marked this pull request as ready for review August 28, 2025 16:07
@rueian rueian requested a review from a team as a code owner August 28, 2025 16:07
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core kubernetes labels Aug 28, 2025
@jackfrancis

Contributor

I validated this on a 3 node (16 CPU cores each) cluster in Azure:

$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE   NAME                                STATUS    CPU    CPU
default     kuberay-operator-79947594b8-zbklb   Running   100m   100m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Running   1      500m
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13

Using a ray-operator image built from this PR commit. You can see above the 3 long-lived pods, and their CPU requests/limits values increasing over time (without a lifecycle event creating a new pod).

tl;dr IPPR confirmed

@edoakes

edoakes commented Sep 5, 2025

Collaborator

Still in my review queue, sorry haven't gotten to it yet (it's a big one!)

@jjyao can you help review as well? I need to re-read a lot of autoscaler code.

@rueian

rueian commented Sep 5, 2025

Contributor Author

Hi @edoakes @jjyao, the previous feedback I got is to replace grpcio with cython bindings, which I am currently working on, and I am also working on a new autoscaler document, which should also help walk through the autoscaler code. So, I think this PR is not in a hurry this week, but it would be really appreciated if I could get early feedback. 😃

@edoakes

edoakes commented Sep 5, 2025

Collaborator

> Hi @edoakes @jjyao, the previous feedback I got is to replace grpcio with cython bindings, which I am currently working on, and I am also working on a new autoscaler document, which should also help walk through the autoscaler code. So, I think this PR is not in a hurry this week, but it would be really appreciated if I could get early feedback. 😃

Sounds good. I will do a quick scan then and hold off on diving into the details.

@jackfrancis

Contributor

@edoakes @jjyao @rueian anything I can do to help move this forward?

cc @marosset

@jackfrancis

Contributor

PSA: in-place is planned for graduation to GA in v1.35.0: kubernetes/enhancements#5562

@github-actions

github-actions Bot commented Oct 8, 2025


This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 8, 2025
@jackfrancis

Contributor

@jjyao @edoakes bumping this to undo stale status

"""
return (
self.resizing_at is not None
and not self.need_sync_with_raylet()

Contributor Author

Comment on lines +1644 to +1657
for node in existing_nodes:
if node.ippr_status is not None:
if ( # Reflect finished / ongoing IPPR in node capacity
node.ippr_status.is_k8s_resize_finished()
or node.ippr_status.is_in_progress()
):
# While a resize is ongoing or just completed, use desired values
# as the node's capacity so binpacking can consider the change.
node.update_total_resources(
{
"CPU": node.ippr_status.desired_cpu,
"memory": node.ippr_status.desired_memory,
}
)

Contributor Author

This step maps to the "Extend node's availabilities for finished/ongoing IPPR" in the diagram in the PR description.

Comment on lines +1683 to +1702
ippr_candidates = []

for node in existing_nodes:
if node.ippr_status is not None and node.ippr_status.can_resize_up():
ippr_candidates.append(node)
else:
target_nodes.append(node)

original_ippr_candidates = {
node.im_instance_id: copy.deepcopy(node) for node in ippr_candidates
}
for node in ippr_candidates:
# Expose per-node maximums so binpacking can evaluate placing more work
# by upsizing in-place rather than launching new nodes.
node.update_total_resources(
{
"CPU": node.ippr_status.max_cpu(),
"memory": node.ippr_status.max_memory(),
}
)

Contributor Author

This step maps to the "Extend node's availabilities if they are able to do IPPR" in the diagram in the PR description.

Comment on lines +1704 to +1723
while len(requests_to_sched) > 0 and len(ippr_candidates) > 0:
(
best_node,
requests_to_sched,
ippr_candidates,
) = ResourceDemandScheduler._sched_best_node(
requests_to_sched,
ippr_candidates,
resource_request_source,
ctx.get_cloud_resource_availabilities(),
)
if best_node is None:
# No ippr nodes can schedule any more requests.
break

# Commit an IPPR action on the selected node to its max effective caps.
best_node.ippr_status.queue_resize_request(
desired_cpu=best_node.ippr_status.max_cpu(),
desired_memory=best_node.ippr_status.max_memory(),
)

Contributor Author

This step maps to the "Bin pack remaining demands to extended nodes" in the diagram in the PR description.

@rueian rueian changed the title [core][autoscaler][IPPR] Initial implementation for resizing pods in-place on Kubernetes 1.35 Apr 27, 2026

@Yicheng-Lu-llll Yicheng-Lu-llll left a comment

Member

Thanks for the pr! left some nits

Comment thread python/ray/autoscaler/v2/scheduler.py
Comment thread python/ray/autoscaler/v2/scheduler.py
Comment thread python/ray/autoscaler/v2/scheduler.py Outdated
Comment thread python/ray/autoscaler/v2/tests/test_scheduler.py Outdated
status=SchedulingNodeStatus.TO_LAUNCH,
node_kind=NodeKind.WORKER,
)
# If the new node can be resized, consider its maximum IPPR capacity.

Member

nit: it feels like we could directly create pods with the desired IPPR resource size if needed. Right now, we first start/run a pod and then wait for a follow up reconcile to perform IPPR. We could optimize this in a separate PR if needed.

I'm also thinking through a scenario.

Suppose K8s only has 2 CPUs left, and we get a request to schedule a task that needs 2 CPUs. Our available nodes have a default resource of 1 CPU but can be IPPR'd up to 3 CPUs.

Before IPPR, I think we would simply say we can't schedule this task, right?

After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).

At that point we'd realize we can't schedule the task, but would we then clean up these two extra nodes?

If we do clean up, will we end up restarting the pods and getting stuck in a loop?

Contributor Author

> Before IPPR, I think we would simply say we can't schedule this task, right?

Correct.

> After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).

With this IPPR, the autoscaler will create 1 Pod, starting with 1 CPU, and then try to resize it to 3 CPUs. K8s will tell the autoscaler it has 2 CPUs left, and the autoscaler will resize the pod again to 2 CPUs.
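
The numbers in this scenario can be checked with a one-line clamp. This is a sketch of the arithmetic only, under the assumption that the adjusted size is the pod's current size plus whatever the node still has free; how k8s reports the shortfall and how the autoscaler computes the new target are not shown in this thread, and `adjusted_cpu` is a hypothetical name.

```python
def adjusted_cpu(current_cpu: float, desired_cpu: float, node_free_cpu: float) -> float:
    """Largest size not above desired_cpu that fits in the node's free CPU.

    node_free_cpu excludes the CPU the pod already holds.
    """
    return min(desired_cpu, current_cpu + node_free_cpu)


# 2 CPUs were free before the pod started; the 1-CPU pod leaves 1 CPU free.
new_size = adjusted_cpu(current_cpu=1, desired_cpu=3, node_free_cpu=1)
```

With these inputs `new_size` is 2, which is the second resize target described above.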

Member

Got it, so if a task requires 3 CPUs, we might end up in a loop where we create a pod, resize it to 2 CPUs, let it sit idle, terminate it, and then start over again after some time.

That said, I think this is a corner case and I don't have a strong opinion on it. Just flagging it.

Contributor Author

Yeah, if we truly have no capacity, we can have a loop like this, and to be honest, we don't have too many choices.

One benefit of creating a Pod with 3 CPUs directly is that it can trigger horizontal scaling on the underlying k8s, while the current approach relies much more on the vertical scaling capability of a k8s node.

Comment thread python/ray/autoscaler/v2/tests/test_scheduler.py Outdated
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

@Kunchd Kunchd left a comment

Contributor

Glad to see everything come together! Left some nits and questions.

self.total_resources[resource_name] = max(0.0, new_total)
for available in self.available_resources_for_sched.values():
available[resource_name] = max(
0.0, available.get(resource_name, 0.0) + delta

Contributor

If we support downsizing in the future, how will we decide where the existing resources on this node that can no longer fit onto this downsized node will go?

Contributor Author

Those resources will be out of our control if we downsize the node. That will allow other pods to use them.
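
To make the downsizing question concrete, here is a standalone sketch of the delta-and-clamp update from the snippet under review. It is a simplification, not the scheduler's method: `apply_new_total` is a hypothetical name, and it tracks a single availability map rather than one per request source.

```python
def apply_new_total(total: dict, available: dict, resource: str, new_total: float) -> None:
    """Set a resource's total and shift its availability by the same delta."""
    delta = new_total - total.get(resource, 0.0)
    total[resource] = max(0.0, new_total)
    # Clamp at zero so a downsize below current usage never goes negative.
    available[resource] = max(0.0, available.get(resource, 0.0) + delta)


# Upsizing an idle 1-CPU node to 4 CPUs: availability grows with the total.
total_up, avail_up = {"CPU": 1.0}, {"CPU": 1.0}
apply_new_total(total_up, avail_up, "CPU", 4.0)

# Downsizing a 4-CPU node with 3 CPUs in use to 2 CPUs: availability is
# clamped to 0 even though usage now exceeds the new total.
total_down, avail_down = {"CPU": 4.0}, {"CPU": 1.0}
apply_new_total(total_down, avail_down, "CPU", 2.0)
```

In the downsizing example the clamp hides that 3 CPUs are in use on a 2-CPU node, which is the situation the question above is asking about.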

Comment thread python/ray/autoscaler/v2/scheduler.py
assert len(reply.to_ippr) == 1
assert reply.to_ippr[0].cloud_instance_id == "pod-1"
assert reply.to_ippr[0].desired_cpu == 4.0
assert reply.to_ippr[0].desired_memory == 8 * 1024 * 1024 * 1024

Contributor

Personal preference: Could we also ensure no new node creation was issued?

Contributor Author

👍

# Pending nodes without a ray_node_id should not be selected for IPPR.
assert reply.to_ippr == []
# Scheduler should launch a new node instead of overestimating pending capacity.
to_launch, _ = _launch_and_terminate(reply)

Contributor

Should the scheduler actually create a new node if the pending node will be able to meet the request with IPPR once it starts? Couldn't we also wait until the node starts and trigger IPPR on a future iteration?

Contributor Author

Makes sense. Good catch! We should not create a new node at this point if the pending node will be able to meet the request.

Comment thread python/ray/autoscaler/v2/tests/test_scheduler.py Outdated
# Only one IPPR candidate is selected for this gang request.
assert len(reply.to_ippr) == 1
assert {status.cloud_instance_id for status in reply.to_ippr} == {"pod-1"}
assert {status.desired_cpu for status in reply.to_ippr} == {4.0}

Contributor

Do we want to check that instance 2 is not modified here?

Contributor Author

👍

rueian added 2 commits April 29, 2026 14:48
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 79a1405.

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.



@Kunchd Kunchd left a comment

Contributor

LGTM thanks!

@rueian

rueian commented May 5, 2026

Contributor Author

Hi @edoakes, please take a look and merge when you have a chance. Thank you!

@edoakes edoakes merged commit 8445b42 into ray-project:master May 6, 2026
6 checks passed
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…n Kubernetes 1.35 (ray-project#55961)

## In-place Pod Resizing (IPPR) Integration on Kubernetes 1.35


Things that have been done:
1. IPPR JSON configuration and validation for users to enable the IPPR
integration with Autoscaler v2.
2. Resize a Pod's CPU and memory resource requests and limits to the
maximums specified in one step by the config.

## Configuration and Validation

Users can provide a `ray.io/ippr` annotation on their RayCluster CR to
enable IPPR with Autoscaler v2:
```
{
  "groups": {
    "<groupName>": {
      "max-cpu":     string|number,  # K8s quantity (e.g. "2", "1500m")
      "max-memory":  string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer      # Seconds to wait for a pod resize to
                                     # complete before considering it timed out
    },
    ...
  }
}
```
`groupName` should match the names of Ray worker groups. In each group,
`max-cpu`, `max-memory`, and `resize-timeout` are mandatory.

Besides the above configuration, we also validate:
1. The corresponding worker groups can't have `num-cpus` and `memory` in
their `rayStartParams` because they can cause Ray logical resource
mismatch with pod resources.
2. Worker groups should also have `cpu` and `memory` resource requests
specified in their container specs.
3. In addition, their container should have `resizePolicy.restartPolicy`
set to `NotRequired`.

 ## Resize Behavior

The current implementation will try to resize the existing nodes to the
maximum specified by the user in one step if there are pending tasks
that can fit on those nodes after resizing. We will implement gradual
resizing and downsizing in later PRs. The detailed behavior is

1. After fitting pending tasks onto the existing nodes at their current
capacities, the autoscaler tries again to fit the remaining pending tasks
onto the nodes that have no ongoing resize, this time using the maximum
capacities specified in the `ray.io/ippr` annotation. If any remaining
pending tasks fit on a node, the autoscaler sends a k8s resize request
for that node and records the resize status in a pod annotation,
`ray.io/ippr-status`, at the end of the current reconciliation.
2. If pending tasks still remain, the autoscaler performs the original
horizontal scale-out, but takes the maximum capacity of each worker type
into account.
3. At the beginning of the next reconciliation, the autoscaler inspects
the status of each resize that was sent at the end of the previous
reconciliation and determines its next step. There are two cases:
**a)** Finish the resize by adjusting the logical resources on the
Raylet and updating its `ray.io/ippr-status`.
**b)** Adjust the resize by queueing a new k8s resize request because of
a timeout or an error.
Note that if the RPC to adjust the logical resources on the Raylet
fails, the autoscaler retries in the next reconciliation because the
corresponding `ray.io/ippr-status` has not been updated.
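
Steps 1 and 2 can be sketched as a small CPU-only planner. This is a simplified illustration, not the autoscaler's scheduling code: the names `Node` and `plan`, the first-fit packing order, and the `max_new_nodes` limit are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    cpu: float              # current total CPUs of the pod
    available: float        # CPUs currently free on the pod
    resizing: bool = False  # True while a resize is still in flight


def plan(nodes: List[Node], pending: List[float], max_cpu: float,
         max_new_nodes: int) -> Tuple[List[int], int, List[float]]:
    """Return (indices of nodes to resize, new nodes to add, unplaced requests)."""
    # Step 1, first pass: first-fit onto the current capacities.
    free = [n.available for n in nodes]
    remaining = []
    for req in pending:
        for i, f in enumerate(free):
            if req <= f:
                free[i] -= req
                break
        else:
            remaining.append(req)

    # Step 1, second pass: retry with maximum capacities, skipping nodes
    # that already have a resize in flight.
    headroom = {i: free[i] + (max_cpu - n.cpu)
                for i, n in enumerate(nodes) if not n.resizing}
    resize, leftover = [], []
    for req in remaining:
        for i in headroom:
            if req <= headroom[i]:
                headroom[i] -= req
                if i not in resize:
                    resize.append(i)
                break
        else:
            leftover.append(req)

    # Step 2: horizontal scale-out, counting each new worker at max capacity.
    new_free, unplaced = [], []
    for req in leftover:
        for j, f in enumerate(new_free):
            if req <= f:
                new_free[j] -= req
                break
        else:
            if req <= max_cpu and len(new_free) < max_new_nodes:
                new_free.append(max_cpu - req)
            else:
                unplaced.append(req)
    return resize, len(new_free), unplaced
```

With one idle 1-CPU worker that can grow to 4 CPUs, `plan([Node(1, 1)], [2], 4, 1)` asks to resize worker 0 and adds no nodes, while `plan([Node(1, 1)], [2, 4], 4, 1)` resizes worker 0 and adds one worker. The real autoscaler may assign individual requests to nodes in a different order than this first-fit sketch.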

### Before: without IPPR
<img width="940" height="428" alt="image"
src="https://github.com/user-attachments/assets/deba6ece-4936-4658-ab29-a6eea5ec49ad"
/>

### After: red boxes are new behaviors for IPPR
<img width="947" height="830" alt="image"
src="https://github.com/user-attachments/assets/1ddd03fe-74bc-45d3-a384-895fa5c52dbc"
/>


### Example cases

The following are different cases for IPPR. They are all based on this
cluster shape:

HeadNode * 1 with CPU=0 for simplicity.
WorkerGroup1:
	CPU: 1 (can be resized to 4)
	MaxReplicas >= 1

#### Case 1: Resizing an existing node

**T1**: (There is 1 worker idle and we have a pending request for 2
CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

**T2**: (The worker is resized and the pending request is consumed)
WorkerGroup1=[{CPU: 4, Available: 2}]
Pending Reqs=[]


#### Case 2: Resizing an existing worker and scaling out a new worker

**T1**: (There is 1 worker idle and we have pending requests for 2 and 4
CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}, {CPU: 4}]

Autoscaler will try to resize the worker to 4 CPUs and add a new worker:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}, {CPU: 1, Available:
1}]

**T2**: (The first worker is resized and the 4-CPU request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will then try to resize the second worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1 -> 4, Available: 1 ->
4}]

**T3**: (The second worker is resized and the remaining request is
consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 4, Available: 2}]
Pending Reqs=[]


#### Case 3: Scaling out with IPPR capacity considerations

**T1**: (no worker initially but there is a pending request for 2 CPUs)
WorkerGroup1=[]
Pending Reqs=[{CPU: 2}]

Autoscaler will still add a new node because a worker in this group can
reach the requested capacity through IPPR:
=> WorkerGroup1=[{CPU: 1, Available: 1}]

**T2**:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Then this will be the same as Case 1.


#### Case 4: IPPR timeout

**T1**: 
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler first tries to resize the node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

**T2**: (The IPPR times out)
WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]
Pending Reqs=[{CPU: 2}]


Autoscaler will roll the IPPR back and scale out:
=> WorkerGroup1=[{CPU: 4 -> 1, Available: 4 -> 1}, {CPU: 1, Available:
1}]

**T3**: 
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the second node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1 -> 4, Available: 1 ->
4}]

**T4**:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 4, Available: 2}]
Pending Reqs=[]
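
The decision made at the start of each reconciliation for an in-flight resize, including the timeout in Case 4, can be sketched as follows. The `ResizeStatus` fields and the `NextStep` names are hypothetical; the PR does not show the layout of `ray.io/ippr-status`, so this only illustrates the finish, wait, and roll-back branches under those assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NextStep(Enum):
    WAIT = "wait"            # resize still in progress and within the timeout
    FINISH = "finish"        # adjust logical resources on the Raylet
    ROLL_BACK = "roll_back"  # queue a new k8s resize request to revert


@dataclass
class ResizeStatus:
    # Hypothetical fields; the real annotation layout is not shown in the PR.
    requested_at: float          # time the resize request was sent, in seconds
    completed: bool              # True once the pod reports the new resources
    error: Optional[str] = None  # error reported for the resize, if any


def next_step(status: ResizeStatus, now: float, resize_timeout: int) -> NextStep:
    """Pick the next action for one in-flight resize."""
    if status.completed:
        return NextStep.FINISH
    timed_out = now - status.requested_at > resize_timeout
    if status.error is not None or timed_out:
        return NextStep.ROLL_BACK
    return NextStep.WAIT
```

In Case 4, the first worker's resize would reach the `ROLL_BACK` branch at T2 once `resize-timeout` seconds have passed, after which the autoscaler scales out instead.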


## Additional notes

Things that are added in this PR:
1. `KubeRayIPPRProvider`: A provider that gives the autoscaler helper
methods such as `get_ippr_specs`, `get_ippr_statuses`, and
`do_ippr_requests` for performing IPPR, along with the related structs
`IPPRSpecs` and `IPPRStatus`.
2. `AsyncResizeLocalResourceInstances` method has been added to
`RayletClientWithIoContext`.

The above are unit tested.


## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [x] Tested with Kubernetes 1.35
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com>
edoakes pushed a commit that referenced this pull request Jun 16, 2026
## Description
User facing docs for
#55961 (comment)

## Related issues
Docs for #55961

## Additional information
- TODO: For master should I say v1.7 for KubeRay or not yet?
- For the different cases I generated those text examples but would
diagrams be better? Happy to create diagrams unless we feel this is ok

---------

Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Labels

- core: Issues that should be addressed in Ray Core
- go: add ONLY when ready to merge, run all tests
- kubernetes
- unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

6 participants