[core][ippr][7/N] Initial implementation for resizing pods in-place on Kubernetes 1.35 by rueian · Pull Request #55961 · ray-project/ray

rueian · 2025-08-26T18:37:04Z

In-place Pod Resizing (IPPR) Integration on Kubernetes 1.35

Things that have been done:

1. IPPR JSON configuration and validation, which let users enable the IPPR integration with Autoscaler v2.
2. Resizing a Pod's CPU and memory resource requests and limits, in one step, to the maximums specified in the config.

Configuration and Validation

Users can provide a ray.io/ippr annotation on their RayCluster CR to enable IPPR with Autoscaler v2:

{
  "groups": {
    "<groupName>": {
      "max-cpu":     string|number,  # K8s quantity (e.g. "2", "1500m")
      "max-memory":  string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer      # Seconds to wait for a pod resize to
                                     # complete before considering it timed out
    },
    ...
  }
}

Each groupName must match the name of a Ray worker group. In each group, max-cpu, max-memory, and resize-timeout are mandatory.
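As a rough illustration of the shape checks implied by the schema above, the sketch below validates an annotation value. The function name `validate_ippr_annotation` and its error messages are hypothetical and are not taken from this PR; the actual validation in the PR may be stricter or structured differently.

```python
import json

REQUIRED_KEYS = ("max-cpu", "max-memory", "resize-timeout")


def validate_ippr_annotation(raw: str, worker_group_names: set) -> dict:
    """Parse a ray.io/ippr annotation value and check its basic shape.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    config = json.loads(raw)
    groups = config.get("groups")
    if not isinstance(groups, dict):
        raise ValueError("'groups' must be an object keyed by worker group name")
    for name, spec in groups.items():
        if name not in worker_group_names:
            raise ValueError(f"unknown worker group: {name!r}")
        missing = [key for key in REQUIRED_KEYS if key not in spec]
        if missing:
            raise ValueError(f"group {name!r} is missing {missing}")
        # bool is a subclass of int in Python, so reject it explicitly.
        cpu, memory, timeout = (spec[key] for key in REQUIRED_KEYS)
        if isinstance(cpu, bool) or not isinstance(cpu, (str, int, float)):
            raise ValueError(f"group {name!r}: max-cpu must be a string or number")
        if isinstance(memory, bool) or not isinstance(memory, (str, int)):
            raise ValueError(f"group {name!r}: max-memory must be a string or integer")
        if isinstance(timeout, bool) or not isinstance(timeout, int):
            raise ValueError(f"group {name!r}: resize-timeout must be an integer")
    return groups


example = json.dumps(
    {"groups": {"small-group": {"max-cpu": "1500m", "max-memory": "8Gi", "resize-timeout": 60}}}
)
groups = validate_ippr_annotation(example, {"small-group"})
```

The sketch only checks types and required keys; it does not parse the K8s quantities themselves.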

Besides the above configuration, we also validate:

1. The corresponding worker groups must not set num-cpus or memory in their rayStartParams, because those settings can make Ray's logical resources diverge from the pod's resources.
2. The worker groups must specify cpu and memory resource requests in their container specs.
3. Their containers must have resizePolicy.restartPolicy set to NotRequired.
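The three checks above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `check_worker_group_for_ippr` is a hypothetical name, the Ray container is assumed to be the first container in the pod template, and the sketch assumes the resize policy must be set explicitly for both cpu and memory.

```python
def check_worker_group_for_ippr(group: dict) -> list:
    """Return a list of problems for one worker group spec; empty means it passes.

    Hypothetical helper; `group` is one entry of a RayCluster's workerGroupSpecs.
    """
    problems = []

    # Check 1: rayStartParams must not pin logical CPU or memory.
    params = group.get("rayStartParams", {})
    for key in ("num-cpus", "memory"):
        if key in params:
            problems.append(f"rayStartParams must not set {key!r}")

    containers = group.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        problems.append("worker group has no containers")
        return problems
    container = containers[0]  # assumption: the Ray container comes first

    # Check 2: cpu and memory requests must be present.
    requests = container.get("resources", {}).get("requests", {})
    for resource in ("cpu", "memory"):
        if resource not in requests:
            problems.append(f"container must set a {resource} request")

    # Check 3: resizing must not require a container restart.
    policies = {
        policy.get("resourceName"): policy.get("restartPolicy")
        for policy in container.get("resizePolicy", [])
    }
    for resource in ("cpu", "memory"):
        if policies.get(resource) != "NotRequired":
            problems.append(f"resizePolicy for {resource} must be NotRequired")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfiguration at once.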

Resize Behavior

The current implementation tries to resize existing nodes, in one step, to the maximum specified by the user whenever there are pending tasks that would fit on those nodes after resizing. Gradual resizing and downsizing will be implemented in later PRs. The detailed behavior is:

1. After fitting pending tasks onto the existing nodes at their current capacities, the autoscaler tries again to fit the remaining pending tasks onto the nodes that have no ongoing resize, this time using the maximum capacities specified in the ray.io/ippr annotation. If any remaining pending task fits on a node this way, the autoscaler sends a k8s resize request for that node and records the resize status in a pod annotation, ray.io/ippr-status, at the end of the current reconciliation.
2. If there are still pending tasks left, the autoscaler falls back to the original horizontal scale-out, but takes the maximum capacity of each worker type into account.
3. At the beginning of the next reconciliation, the autoscaler inspects the status of each resize sent at the end of the previous reconciliation and determines its next step, which is one of:
   a) Finish the resize by adjusting the logical resources on the Raylet and updating its ray.io/ippr-status.
   b) Adjust the resize by queueing a new k8s resize request because of a timeout or an error.
   Note that if the RPC to adjust the logical resources on the Raylet fails, the autoscaler retries in the next reconciliation, because the corresponding ray.io/ippr-status has not been updated.
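The three scheduling passes described above can be sketched with a toy, CPU-only model. Everything here is hypothetical (`ToyNode`, `plan`, first-fit placement in request order); the real scheduler scores nodes, handles memory and other resources, and respects replica limits, none of which is modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyNode:
    cpu: float              # current total CPU of the pod
    available: float        # CPU not yet claimed by tasks
    max_cpu: float          # maximum from the ray.io/ippr annotation
    resizing: bool = False  # True while a resize request is in flight


def plan(
    nodes: List[ToyNode], pending: List[float], max_cpu: float
) -> Tuple[List[ToyNode], int, List[float]]:
    """Return (nodes to resize, number of new nodes, unschedulable requests)."""
    # Pass 1: fit requests onto nodes at their current capacities.
    remaining = []
    for req in pending:
        node = next((n for n in nodes if n.available >= req), None)
        if node is None:
            remaining.append(req)
        else:
            node.available -= req

    # Pass 2: retry on nodes with no ongoing resize, at their maximum capacity.
    candidates = [n for n in nodes if not n.resizing and n.cpu < n.max_cpu]
    to_resize, leftover = [], []
    for req in remaining:
        for n in candidates:
            extended = n.available + (n.max_cpu - n.cpu)
            if extended >= req:
                if not n.resizing:
                    to_resize.append(n)
                n.available, n.cpu, n.resizing = extended - req, n.max_cpu, True
                break
        else:
            leftover.append(req)

    # Pass 3: horizontal scale-out, counting each new node at its maximum.
    new_free: List[float] = []
    unschedulable = []
    for req in leftover:
        for i, free in enumerate(new_free):
            if free >= req:
                new_free[i] = free - req
                break
        else:
            if req <= max_cpu:
                new_free.append(max_cpu - req)
            else:
                unschedulable.append(req)
    return to_resize, len(new_free), unschedulable
```

With one idle 1-CPU worker that can grow to 4 CPUs and a single pending 2-CPU request, the sketch picks the in-place resize and launches no new node, which matches Case 1 in the example cases.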

Before: without IPPR

After: red boxes are new behaviors for IPPR

Example cases

The following are different cases for IPPR. They are all based on this cluster shape:

HeadNode * 1 with CPU=0 for simplicity.
WorkerGroup1:
CPU: 1 (can be resized to 4)
MaxReplicas >= 1

Case 1: Resizing an existing node

T1: (There is 1 worker idle and we have a pending request for 2 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

T2: (The worker is resized and the pending request is consumed)
WorkerGroup1=[{CPU: 4, Available: 2}]
Pending Reqs=[]

Case 2: Resizing an existing worker and scaling out a new worker.

T1: (There is 1 worker idle and we have pending requests for 2 and 4 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}, {CPU: 4}]

Autoscaler will try to resize the node to 4 CPUs and add a new worker:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}, {CPU: 1, Available: 1}]

T2: (The first worker is resized and the 4-CPU request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will then try to resize the second worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1 -> 4, Available: 1 -> 4}]

T3: (The second worker is resized and the remaining request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 4, Available: 2}]
Pending Reqs=[]

Case 3: Scaling out with IPPR capacity considerations

T1: (no worker initially but there is a pending request for 2 CPUs)
WorkerGroup1=[]
Pending Reqs=[{CPU: 2}]

Autoscaler will still add a new node because the worker has the IPPR capacity:
=> WorkerGroup1=[{CPU: 1, Available: 1}]

T2:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Then this will be the same as Case 1.

Case 4: IPPR timeout

T1:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler first tries to resize the node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

T2: (The IPPR times out)
WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]
Pending Reqs=[{CPU: 2}]

Autoscaler will roll the IPPR back and scale out:
=> WorkerGroup1=[{CPU: 4 -> 1, Available: 4 -> 1}, {CPU: 1, Available: 1}]

T3:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]

Autoscaler will try to resize the second node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1 -> 4, Available: 1 -> 4}]

T4:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 4, Available: 2}]
Pending Reqs=[]
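The timeout handling in Case 4 can be pictured as a small decision function evaluated at the start of each reconciliation. This is a sketch under assumptions: `ToyResizeStatus`, its fields other than `resizing_at`, and the step names are hypothetical and do not reflect the PR's IPPRStatus API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyResizeStatus:
    current_cpu: float
    desired_cpu: float
    resize_timeout: int                  # seconds, from the ray.io/ippr config
    resizing_at: Optional[float] = None  # epoch seconds when the resize was sent
    k8s_finished: bool = False           # True once K8s reports the new size


def next_step(status: ToyResizeStatus, now: float) -> str:
    """Decide what to do with one resize record (hypothetical step names)."""
    if status.resizing_at is None:
        return "idle"         # no resize has been sent for this pod
    if status.k8s_finished:
        return "sync-raylet"  # adjust the Raylet's logical resources
    if now - status.resizing_at > status.resize_timeout:
        return "roll-back"    # queue a new request back to the current size
    return "wait"             # still within the timeout window


status = ToyResizeStatus(current_cpu=1, desired_cpu=4, resize_timeout=60, resizing_at=1000.0)
```

In this model, a resize sent at t=1000 with a 60-second timeout is still awaited at t=1030 and rolled back at t=1100, after which the scale-out path in Case 4 takes over.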

Additional notes

Things that are added in this PR:

1. KubeRayIPPRProvider: a provider that supplies the autoscaler with helper methods such as get_ippr_specs, get_ippr_statuses, and do_ippr_requests for performing IPPR, along with the related structs IPPRSpecs and IPPRStatus.
2. An AsyncResizeLocalResourceInstances method, added to RayletClientWithIoContext.

The above are unit tested.

Related issue number

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [x] Tested with Kubernetes 1.35
    - [ ] Release tests
    - [ ] This PR is not tested :(

Copilot

Pull Request Overview

This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.

Key changes:

Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `python/ray/autoscaler/v2/schema.py` | Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus |
| `python/ray/autoscaler/v2/scheduler.py` | Integrates IPPR into scheduling logic to consider resizing existing pods before launching new ones |
| `python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py` | New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization |
| `python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py` | Integrates IPPR provider into KubeRay cloud provider |
| `python/ray/autoscaler/v2/instance_manager/reconciler.py` | Connects IPPR functionality to the main autoscaler reconciliation loop |
| `python/ray/autoscaler/v2/tests/test_ippr_provider.py` | Comprehensive test suite for IPPR provider functionality |
| `python/ray/autoscaler/v2/tests/test_scheduler.py` | Tests for IPPR integration in the scheduler |

rueian · 2025-08-28T05:43:20Z

-    DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
-    DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
-    DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature.
-    DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]].


rueian · 2025-08-28T05:44:08Z

            url,
            json.dumps(payload),
-            headers={**headers, "Content-type": "application/json-patch+json"},
+            headers={**headers, "Content-type": content_type},


Make content-type adjustable for different patch strategies.
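To illustrate why an adjustable content type is useful: the Kubernetes API selects the patch strategy from the request's content type. The sketch below maps three patch strategies to their content types and builds the request arguments. `build_patch_request` and the `strategy` keys are hypothetical names, not the signature used in this PR.

```python
import json

# Content types the Kubernetes API server uses to select a patch strategy.
PATCH_CONTENT_TYPES = {
    "json": "application/json-patch+json",
    "merge": "application/merge-patch+json",
    "strategic": "application/strategic-merge-patch+json",
}


def build_patch_request(headers: dict, payload, strategy: str = "json") -> dict:
    """Return the headers and body for a PATCH call (hypothetical helper)."""
    content_type = PATCH_CONTENT_TYPES[strategy]
    return {
        # Copy the caller's headers so the original dict is not mutated.
        "headers": {**headers, "Content-type": content_type},
        "data": json.dumps(payload),
    }


request = build_patch_request(
    {"Authorization": "Bearer <token>"},
    {"metadata": {"annotations": {"example": "value"}}},
    strategy="merge",
)
```

A JSON Patch payload is a list of operations, while merge and strategic-merge payloads are partial objects, so the payload shape has to match the chosen strategy.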

rueian · 2025-08-28T05:45:19Z

        self._ray_cluster = None
        self._cached_instances: Dict[CloudInstanceId, CloudInstance]
+        self._ippr_provider = KubeRayIPPRProvider(
+            gcs_client=gcs_client, k8s_api_client=self._k8s_api_client


The KubeRayIPPRProvider needs a gcs_client to adjust the size of a Raylet, and it also needs a k8s_api_client to patch pods.
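For context on what patching a pod for an in-place resize can look like, here is a minimal sketch that builds the URL path and body for the pod's resize subresource. It is an assumption that the provider targets this subresource with a body of this shape; `build_resize_request` and its parameters are hypothetical, and the PR's actual request may differ.

```python
def build_resize_request(
    namespace: str,
    pod_name: str,
    container_name: str,
    requests: dict,
    limits: dict,
) -> tuple:
    """Return (path, body) for a pod resize patch (hypothetical helper)."""
    # Assumption: the resize goes through the pod's "resize" subresource.
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/resize"
    body = {
        "spec": {
            "containers": [
                {
                    "name": container_name,
                    "resources": {
                        "requests": dict(requests),
                        "limits": dict(limits),
                    },
                }
            ]
        }
    }
    return path, body


path, body = build_resize_request(
    "default",
    "example-worker-pod",  # hypothetical pod name
    "ray-worker",          # hypothetical container name
    requests={"cpu": "13", "memory": "8Gi"},
    limits={"cpu": "14", "memory": "8Gi"},
)
```

Requests and limits are passed separately because they do not have to be equal.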

jackfrancis · 2025-09-04T23:48:01Z

I validated this on a 3 node (16 CPU cores each) cluster in Azure:

$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE   NAME                                STATUS    CPU    CPU
default     kuberay-operator-79947594b8-zbklb   Running   100m   100m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Running   1      500m
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13

Using a ray-operator image built from this PR commit. You can see above the 3 long-lived pods, and their CPU requests/limits values increasing over time (without a lifecycle event creating a new pod).

tl;dr IPPR confirmed

edoakes · 2025-09-05T21:25:51Z

Still in my review queue, sorry haven't gotten to it yet (it's a big one!)

@jjyao can you help review as well? I need to re-read a lot of autoscaler code.

rueian · 2025-09-05T22:01:22Z

Hi @edoakes @jjyao, the previous feedback I got is to replace grpcio with cython bindings, which I am currently working on, and I am also working on a new autoscaler document, which should also help walk through the autoscaler code. So, I think this PR is not in a hurry this week, but it would be really appreciated if I could get early feedback. 😃

edoakes · 2025-09-05T22:42:57Z

Sounds good. I will do a quick scan then and hold off on diving into the details.

jackfrancis · 2025-09-18T21:21:24Z

@edoakes @jjyao @rueian anything I can do to help move this forward?

cc @marosset

jackfrancis · 2025-09-23T23:10:29Z

PSA: in-place is planned for graduation to GA in v1.35.0: kubernetes/enhancements#5562

github-actions · 2025-10-08T00:35:25Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

jackfrancis · 2025-10-09T17:53:32Z

@jjyao @edoakes bumping this to undo stale status

rueian · 2026-04-27T18:38:05Z

        """
        return (
            self.resizing_at is not None
+            and not self.need_sync_with_raylet()


This is to address https://github.com/ray-project/ray/pull/55961/changes#r3149423782

rueian · 2026-04-27T18:47:31Z

+        for node in existing_nodes:
+            if node.ippr_status is not None:
+                if (  # Reflect finished / ongoing IPPR in node capacity
+                    node.ippr_status.is_k8s_resize_finished()
+                    or node.ippr_status.is_in_progress()
+                ):
+                    # While a resize is ongoing or just completed, use desired values
+                    # as the node's capacity so binpacking can consider the change.
+                    node.update_total_resources(
+                        {
+                            "CPU": node.ippr_status.desired_cpu,
+                            "memory": node.ippr_status.desired_memory,
+                        }
+                    )


This step maps to the "Extend node's availabilities for finished/ongoing IPPR" in the diagram in the PR description.

rueian · 2026-04-27T18:49:43Z

+        ippr_candidates = []
+
+        for node in existing_nodes:
+            if node.ippr_status is not None and node.ippr_status.can_resize_up():
+                ippr_candidates.append(node)
+            else:
+                target_nodes.append(node)
+
+        original_ippr_candidates = {
+            node.im_instance_id: copy.deepcopy(node) for node in ippr_candidates
+        }
+        for node in ippr_candidates:
+            # Expose per-node maximums so binpacking can evaluate placing more work
+            # by upsizing in-place rather than launching new nodes.
+            node.update_total_resources(
+                {
+                    "CPU": node.ippr_status.max_cpu(),
+                    "memory": node.ippr_status.max_memory(),
+                }
+            )


This step maps to the "Extend node's availabilities if they are able to do IPPR" in the diagram in the PR description.

rueian · 2026-04-27T18:50:56Z

+        while len(requests_to_sched) > 0 and len(ippr_candidates) > 0:
+            (
+                best_node,
+                requests_to_sched,
+                ippr_candidates,
+            ) = ResourceDemandScheduler._sched_best_node(
+                requests_to_sched,
+                ippr_candidates,
+                resource_request_source,
+                ctx.get_cloud_resource_availabilities(),
+            )
+            if best_node is None:
+                # No ippr nodes can schedule any more requests.
+                break
+
+            # Commit an IPPR action on the selected node to its max effective caps.
+            best_node.ippr_status.queue_resize_request(
+                desired_cpu=best_node.ippr_status.max_cpu(),
+                desired_memory=best_node.ippr_status.max_memory(),
+            )


This step maps to the "Bin pack remaining demands to extended nodes" in the diagram in the PR description.

Yicheng-Lu-llll

Thanks for the PR! Left some nits.

Yicheng-Lu-llll · 2026-04-28T23:56:38Z

                status=SchedulingNodeStatus.TO_LAUNCH,
                node_kind=NodeKind.WORKER,
            )
+            # If the new node can be resized, consider its maximum IPPR capacity.


nit: it feels like we could directly create pods with the desired IPPR resource size if needed. Right now, we first start/run a pod and then wait for a follow up reconcile to perform IPPR. We could optimize this in a separate PR if needed.

I'm also thinking through a scenario.

Suppose K8s only has 2 CPUs left, and we get a request to schedule a task that needs 2 CPUs. Our available nodes have a default resource of 1 CPU but can be IPPR'd up to 3 CPUs.

Before IPPR, I think we would simply say we can't schedule this task, right?

After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).

At that point we'd realize we can't schedule the task, but would we then clean up these two extra nodes?

If we do clean up, will we end up restarting the pods and getting stuck in a loop?

> Before IPPR, I think we would simply say we can't schedule this task, right?

Correct.

> After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).

With this IPPR, the autoscaler will create 1 Pod, starting with 1 CPU, and then try to resize it to 3 CPUs. K8s will tell the autoscaler it has 2 CPUs left, and the autoscaler will resize the pod again to 2 CPUs.

Got it, so if a task requires 3 CPUs, we might end up in a loop where we create a pod, resize it to 2 CPUs, let it sit idle, terminate it, and then start over again after some time.

That said, I think this is a corner case and I don't have a strong opinion on it. Just flagging it.

Yeah, if we truly have no capacity, we can have a loop like this, and to be honest, we don't have too many choices.

One benefit of creating a Pod with 3 CPUs directly is that it can trigger horizontal scaling on the underlying k8s, while the current approach relies much more on the vertical scaling capability of a k8s node.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

Kunchd

Glad to see everything come together! Left some nits and questions.

Kunchd · 2026-04-29T18:02:54Z

+            self.total_resources[resource_name] = max(0.0, new_total)
+            for available in self.available_resources_for_sched.values():
+                available[resource_name] = max(
+                    0.0, available.get(resource_name, 0.0) + delta


If we support downsizing in the future, how will we decide where the existing resources on this node that can no longer fit onto this downsized node will go?

Those resources will be out of our control if we downsize the node. That will allow other pods to use them.
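The clamping in the snippet above can be illustrated with a simplified, single-dict version. `apply_new_total` is a hypothetical stand-in; the real method updates every entry of `available_resources_for_sched`, which this sketch reduces to one dict.

```python
def apply_new_total(total: dict, available: dict, resource_name: str, new_total: float) -> None:
    """Set a new total for one resource and shift availability by the same delta."""
    delta = new_total - total.get(resource_name, 0.0)
    total[resource_name] = max(0.0, new_total)
    # Clamp at zero: after a downsize, work already placed on the node may
    # exceed the new total, so availability must not go negative.
    available[resource_name] = max(0.0, available.get(resource_name, 0.0) + delta)


# Upsize: an idle 1-CPU node grows to 4 CPUs.
total, available = {"CPU": 1.0}, {"CPU": 1.0}
apply_new_total(total, available, "CPU", 4.0)

# Hypothetical downsize: a 4-CPU node with 1 CPU free shrinks to 1 CPU.
small_total, small_available = {"CPU": 4.0}, {"CPU": 1.0}
apply_new_total(small_total, small_available, "CPU", 1.0)
```

In the downsize example the unclamped availability would be -2, which is the situation the question above is about: the sketch reports zero free CPU rather than deciding where the already-placed work goes.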

Kunchd · 2026-04-29T20:00:07Z

+    assert len(reply.to_ippr) == 1
+    assert reply.to_ippr[0].cloud_instance_id == "pod-1"
+    assert reply.to_ippr[0].desired_cpu == 4.0
+    assert reply.to_ippr[0].desired_memory == 8 * 1024 * 1024 * 1024


Personal preference: Could we also ensure no new node creation was issued?

Kunchd · 2026-04-29T20:18:40Z

+    # Pending nodes without a ray_node_id should not be selected for IPPR.
+    assert reply.to_ippr == []
+    # Scheduler should launch a new node instead of overestimating pending capacity.
+    to_launch, _ = _launch_and_terminate(reply)


Should the scheduler actually create a new node if the pending node will be able to meet the request with IPPR once it starts? Couldn't we also wait until the node starts and trigger IPPR on a future iteration?

Makes sense. Good catch! We should not create a new node at this point if the pending node will be able to meet the request.

Kunchd · 2026-04-29T20:23:49Z

+    # Only one IPPR candidate is selected for this gang request.
+    assert len(reply.to_ippr) == 1
+    assert {status.cloud_instance_id for status in reply.to_ippr} == {"pod-1"}
+    assert {status.desired_cpu for status in reply.to_ippr} == {4.0}


Do we want to check that instance 2 is not modified here?

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 79a1405.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Kunchd

LGTM thanks!

rueian · 2026-05-05T16:32:42Z

Hi @edoakes, please take a look and merge when you have a chance. Thank you!

…n Kubernetes 1.35 (ray-project#55961)

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com>

> Thank you for contributing to Ray! 🚀
> Please review the [Ray Contribution Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html) before opening a pull request.

> ⚠️ Remove these instructions before submitting your PR.

> 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

## Description

> Briefly describe what this PR accomplishes and why it's needed.

User facing docs for #55961 (comment)

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Docs for #55961

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

- TODO: For master should I say v1.7 for KubeRay or not yet?
- For the different cases I generated those text examples but would diagrams be better? Happy to create diagrams unless we feel this is ok

---------

Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>

rueian force-pushed the autoscaler-ippr branch 3 times, most recently from 07edb0f to cd521c0 Compare August 26, 2025 20:55

rueian requested a review from Copilot August 27, 2025 00:10

Copilot AI reviewed Aug 27, 2025

rueian force-pushed the autoscaler-ippr branch from cd521c0 to 05b44b9 Compare August 27, 2025 00:21

rueian changed the title ~~[core][autoscaler][IPPR] Initial impl for resizing Pods in-place to the maximum configured by the user~~ Aug 27, 2025

rueian added the `go` label (add ONLY when ready to merge, run all tests) Aug 27, 2025

rueian force-pushed the autoscaler-ippr branch 4 times, most recently from f661d9b to 388ba37 Compare August 27, 2025 06:15

rueian changed the title ~~[core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user~~ Aug 27, 2025

rueian force-pushed the autoscaler-ippr branch from f36b114 to c11febf Compare August 28, 2025 03:59

rueian commented Aug 28, 2025

Comment thread ci/env/install-core-prerelease-dependencies.sh Outdated

rueian commented Aug 28, 2025

rueian marked this pull request as ready for review August 28, 2025 16:07

rueian requested a review from a team as a code owner August 28, 2025 16:07

ray-gardener Bot added the `core` (Issues that should be addressed in Ray Core) and `kubernetes` labels Aug 28, 2025

rueian mentioned this pull request Sep 1, 2025

[Umbrella] Autoscaler IPPR E2E test ray-project/kuberay#4028

Open

github-actions Bot added the `stale` label (The issue is stale. It will be closed within 7 days unless there are further conversation) Oct 8, 2025

rueian commented Apr 27, 2026

rueian changed the title ~~[core][autoscaler][IPPR] Initial implementation for resizing pods in-place on Kubernetes 1.35~~ Apr 27, 2026

Yicheng-Lu-llll reviewed Apr 29, 2026

adress reviews

dd6017e

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py

Kunchd reviewed Apr 29, 2026

rueian added 2 commits April 29, 2026 14:48

address reviews

bd97a07

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

address review

79a1405

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated

address review

2af27c9

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

rueian requested review from Kunchd, Yicheng-Lu-llll and Copilot April 30, 2026 16:32

Copilot started reviewing on behalf of rueian April 30, 2026 16:33 View session

Copilot AI reviewed Apr 30, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py

Merge branch 'master' into autoscaler-ippr

80fdbfd

Kunchd approved these changes May 1, 2026

Yicheng-Lu-llll approved these changes May 5, 2026

edoakes merged commit 8445b42 into ray-project:master May 6, 2026
6 checks passed

alimaazamat mentioned this pull request May 7, 2026

[docs] KubeRay IPPR User Guide #63212

Merged

rueian commented Aug 26, 2025 • edited

Copilot AI left a comment

rueian Aug 28, 2025 • edited

jackfrancis commented Sep 4, 2025

edoakes commented Sep 5, 2025

rueian commented Sep 5, 2025

edoakes commented Sep 5, 2025

jackfrancis commented Sep 18, 2025

jackfrancis commented Sep 23, 2025

github-actions Bot commented Oct 8, 2025

jackfrancis commented Oct 9, 2025

Yicheng-Lu-llll left a comment • edited

Kunchd left a comment

cursor Bot left a comment

Copilot AI left a comment

Kunchd left a comment

rueian commented May 5, 2026
