[core][ippr][7/N] Initial implementation for resizing pods in-place on Kubernetes 1.35 #55961
Pull Request Overview
This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.
Key changes:
- Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
- Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
- Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/ray/autoscaler/v2/schema.py | Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus |
| python/ray/autoscaler/v2/scheduler.py | Integrates IPPR into scheduling logic to consider resizing existing pods before launching new ones |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py | New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py | Integrates IPPR provider into KubeRay cloud provider |
| python/ray/autoscaler/v2/instance_manager/reconciler.py | Connects IPPR functionality to the main autoscaler reconciliation loop |
| python/ray/autoscaler/v2/tests/test_ippr_provider.py | Comprehensive test suite for IPPR provider functionality |
| python/ray/autoscaler/v2/tests/test_scheduler.py | Tests for IPPR integration in the scheduler |
| DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": "" | ||
| DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.) | ||
| DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature. | ||
| DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]]. |
| url, | ||
| json.dumps(payload), | ||
| headers={**headers, "Content-type": "application/json-patch+json"}, | ||
| headers={**headers, "Content-type": content_type}, |
Make content-type adjustable for different patch strategies.
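As a rough illustration of why the content type needs to be a parameter: the Kubernetes API server accepts several PATCH body formats, and the `Content-Type` header selects which one applies. The sketch below is not the PR's client code. `build_patch_request` and its arguments are hypothetical names, and it only assembles the request pieces without sending anything.

```python
import json
from typing import Any, Dict, Tuple

# Content types the Kubernetes API server accepts for PATCH requests.
JSON_PATCH = "application/json-patch+json"
MERGE_PATCH = "application/merge-patch+json"
STRATEGIC_MERGE_PATCH = "application/strategic-merge-patch+json"


def build_patch_request(
    url: str,
    payload: Any,
    headers: Dict[str, str],
    content_type: str = JSON_PATCH,
) -> Tuple[str, str, Dict[str, str]]:
    """Hypothetical helper: return (url, body, headers) for a PATCH call."""
    # The caller's headers are copied, then the patch strategy is set last
    # so it cannot be overridden by a stale "Content-type" entry.
    merged = {**headers, "Content-type": content_type}
    return url, json.dumps(payload), merged


# A JSON Patch body is a list of operations (the default strategy here).
_, _, json_patch_headers = build_patch_request(
    "/example/path",  # placeholder path, for illustration only
    [{"op": "replace", "path": "/metadata/labels/example", "value": "x"}],
    {"Authorization": "Bearer <token>"},
)

# A merge patch body is a partial object, e.g. an annotation update.
_, merge_body, merge_headers = build_patch_request(
    "/example/path",
    {"metadata": {"annotations": {"ray.io/ippr-status": "{}"}}},
    {"Authorization": "Bearer <token>"},
    content_type=MERGE_PATCH,
)
```

Defaulting to JSON Patch keeps existing callers unchanged, while a caller that writes annotations can opt into a merge patch.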
| self._ray_cluster = None | ||
| self._cached_instances: Dict[CloudInstanceId, CloudInstance] | ||
| self._ippr_provider = KubeRayIPPRProvider( | ||
| gcs_client=gcs_client, k8s_api_client=self._k8s_api_client |
The KubeRayIPPRProvider needs a gcs_client to adjust the size of a Raylet, and it also needs a k8s_api_client to patch pods.
|
I validated this on a 3-node cluster (16 CPU cores each) in Azure:
$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE NAME STATUS CPU CPU
default kuberay-operator-79947594b8-zbklb Running 100m 100m
default tpch-q1-sf-10-mtqx7-head-84n2m Running 2 250m
default tpch-q1-sf-10-mtqx7-head-84n2m Running 2 250m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Running 1 500m
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
This was run using a ray-operator image built from this PR commit. You can see above the 3 long-lived pods, and their CPU requests/limits values increasing over time (without a lifecycle event creating a new pod). tl;dr: IPPR confirmed.

Still in my review queue, sorry, I haven't gotten to it yet (it's a big one!). @jjyao can you help review as well? I need to re-read a lot of autoscaler code.

Hi @edoakes @jjyao, the previous feedback I got is to replace

Sounds good. I will do a quick scan then and hold off on diving into the details.

PSA: in-place is planned for graduation to GA in v1.35.0: kubernetes/enhancements#5562

| """ | ||
| return ( | ||
| self.resizing_at is not None | ||
| and not self.need_sync_with_raylet() |
This is to address https://github.com/ray-project/ray/pull/55961/changes#r3149423782
| for node in existing_nodes: | ||
| if node.ippr_status is not None: | ||
| if ( # Reflect finished / ongoing IPPR in node capacity | ||
| node.ippr_status.is_k8s_resize_finished() | ||
| or node.ippr_status.is_in_progress() | ||
| ): | ||
| # While a resize is ongoing or just completed, use desired values | ||
| # as the node's capacity so binpacking can consider the change. | ||
| node.update_total_resources( | ||
| { | ||
| "CPU": node.ippr_status.desired_cpu, | ||
| "memory": node.ippr_status.desired_memory, | ||
| } | ||
| ) |
This step maps to the "Extend node's availabilities for finished/ongoing IPPR" in the diagram in the PR description.
| ippr_candidates = [] | ||
|
|
||
| for node in existing_nodes: | ||
| if node.ippr_status is not None and node.ippr_status.can_resize_up(): | ||
| ippr_candidates.append(node) | ||
| else: | ||
| target_nodes.append(node) | ||
|
|
||
| original_ippr_candidates = { | ||
| node.im_instance_id: copy.deepcopy(node) for node in ippr_candidates | ||
| } | ||
| for node in ippr_candidates: | ||
| # Expose per-node maximums so binpacking can evaluate placing more work | ||
| # by upsizing in-place rather than launching new nodes. | ||
| node.update_total_resources( | ||
| { | ||
| "CPU": node.ippr_status.max_cpu(), | ||
| "memory": node.ippr_status.max_memory(), | ||
| } | ||
| ) |
This step maps to the "Extend node's availabilities if they are able to do IPPR" in the diagram in the PR description.
| while len(requests_to_sched) > 0 and len(ippr_candidates) > 0: | ||
| ( | ||
| best_node, | ||
| requests_to_sched, | ||
| ippr_candidates, | ||
| ) = ResourceDemandScheduler._sched_best_node( | ||
| requests_to_sched, | ||
| ippr_candidates, | ||
| resource_request_source, | ||
| ctx.get_cloud_resource_availabilities(), | ||
| ) | ||
| if best_node is None: | ||
| # No ippr nodes can schedule any more requests. | ||
| break | ||
|
|
||
| # Commit an IPPR action on the selected node to its max effective caps. | ||
| best_node.ippr_status.queue_resize_request( | ||
| desired_cpu=best_node.ippr_status.max_cpu(), | ||
| desired_memory=best_node.ippr_status.max_memory(), | ||
| ) |
This step maps to the "Bin pack remaining demands to extended nodes" in the diagram in the PR description.
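The three steps in these review comments (reflect current capacity, expose per-node maximums, bin pack the remaining demand and queue a resize) can be sketched for a CPU-only cluster as follows. This is a simplified illustration under stated assumptions, not the scheduler's code: `Node`, `first_fit`, and `schedule_with_ippr` are hypothetical names, and plain first-fit stands in for the real `_sched_best_node` scoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    total_cpu: float
    available_cpu: float
    max_cpu: Optional[float] = None  # None: the node cannot be resized
    queued_resize_cpu: Optional[float] = None

    def can_resize_up(self) -> bool:
        return self.max_cpu is not None and self.max_cpu > self.total_cpu


def first_fit(requests: List[float], nodes: List[Node]) -> List[float]:
    """Place each request on the first node with room; return the leftovers."""
    remaining = []
    for cpu in requests:
        for node in nodes:
            if node.available_cpu >= cpu:
                node.available_cpu -= cpu
                break
        else:
            remaining.append(cpu)
    return remaining


def schedule_with_ippr(requests: List[float], nodes: List[Node]) -> List[float]:
    # Pass 1: bin pack against the nodes' current capacities.
    remaining = first_fit(requests, nodes)
    # Pass 2: retry on resizable nodes as if they were already at their maximum.
    for node in nodes:
        if not remaining:
            break
        if not node.can_resize_up():
            continue
        delta = node.max_cpu - node.total_cpu
        node.available_cpu += delta  # tentatively expose the maximum
        left = first_fit(remaining, [node])
        if len(left) < len(remaining):
            # Something fit: commit a resize to the maximum in one step.
            node.total_cpu = node.max_cpu
            node.queued_resize_cpu = node.max_cpu
            remaining = left
        else:
            node.available_cpu -= delta  # nothing fit, roll back
    return remaining


# Case 1 from the PR description: one idle 1-CPU worker (max 4), one 2-CPU request.
worker = Node("worker-1", total_cpu=1.0, available_cpu=1.0, max_cpu=4.0)
leftover = schedule_with_ippr([2.0], [worker])
print(leftover, worker.total_cpu, worker.available_cpu, worker.queued_resize_cpu)
# prints: [] 4.0 2.0 4.0
```

Any requests still in the returned list would go to the horizontal scale-out path.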
| status=SchedulingNodeStatus.TO_LAUNCH, | ||
| node_kind=NodeKind.WORKER, | ||
| ) | ||
| # If the new node can be resized, consider its maximum IPPR capacity. |
nit: it feels like we could directly create pods with the desired IPPR resource size if needed. Right now, we first start/run a pod and then wait for a follow up reconcile to perform IPPR. We could optimize this in a separate PR if needed.
I'm also thinking through a scenario.
Suppose K8s only has 2 CPUs left, and we get a request to schedule a task that needs 2 CPUs. Our available nodes have a default resource of 1 CPU but can be IPPR'd up to 3 CPUs.
Before IPPR, I think we would simply say we can't schedule this task, right?
After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).
At that point we'd realize we can't schedule the task, but would we then clean up these two extra nodes?
If we do clean up, will we end up restarting the pods and getting stuck in a loop?
> Before IPPR, I think we would simply say we can't schedule this task, right?

Correct.

> After IPPR, it feels like we might end up starting two nodes each with 1 CPU, and then both IPPR attempts would fail (since K8s is out of CPU).

With this IPPR, the autoscaler will create 1 Pod, starting with 1 CPU, and then try to resize it to 3 CPUs. K8s will tell the autoscaler it has 2 CPUs left, and the autoscaler will resize the pod again to 2 CPUs.
Got it, so if a task requires 3 CPUs, we might end up in a loop where we create a pod, resize it to 2 CPUs, let it sit idle, terminate it, and then start over again after some time.
That said, I think this is a corner case and I don't have a strong opinion on it. Just flagging it.
Yeah, if we truly have no capacity, we can have a loop like this, and to be honest, we don't have too many choices.
One benefit of creating a Pod with 3 CPUs directly is that it can trigger horizontal scaling on the underlying k8s, while the current approach relies much more on the vertical scaling capability of a k8s node.
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Kunchd left a comment
Glad to see everything come together! Left some nits and questions.
| self.total_resources[resource_name] = max(0.0, new_total) | ||
| for available in self.available_resources_for_sched.values(): | ||
| available[resource_name] = max( | ||
| 0.0, available.get(resource_name, 0.0) + delta |
If we support downsizing in the future, how will we decide where the existing resources on this node that can no longer fit onto this downsized node will go?
Those resources will be out of our control if we downsize the node. That will allow other pods to use them.
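To make the downsizing question concrete, here is a minimal sketch of the delta-and-clamp update shown in the diff fragment above. It is an assumption-laden simplification: `apply_new_total` is a hypothetical name, and a single `available` map stands in for the per-source maps in `available_resources_for_sched`.

```python
from typing import Dict


def apply_new_total(
    total: Dict[str, float],
    available: Dict[str, float],
    new_totals: Dict[str, float],
) -> None:
    """Shift `available` by the same delta as `total`, clamping both at zero."""
    for name, new_total in new_totals.items():
        delta = new_total - total.get(name, 0.0)
        total[name] = max(0.0, new_total)
        available[name] = max(0.0, available.get(name, 0.0) + delta)


# Upsizing an idle 1-CPU node to 4 CPUs frees 3 more CPUs for scheduling.
total_up, avail_up = {"CPU": 1.0}, {"CPU": 1.0}
apply_new_total(total_up, avail_up, {"CPU": 4.0})

# Downsizing a 4-CPU node with 3 CPUs in use to 2 CPUs: availability clamps
# at 0, so the sketch cannot represent that 1 CPU of running work no longer fits.
total_down, avail_down = {"CPU": 4.0}, {"CPU": 1.0}
apply_new_total(total_down, avail_down, {"CPU": 2.0})
```

The second call shows why downsizing needs a separate policy: the clamp hides the over-commit rather than deciding where the displaced work goes.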
| assert len(reply.to_ippr) == 1 | ||
| assert reply.to_ippr[0].cloud_instance_id == "pod-1" | ||
| assert reply.to_ippr[0].desired_cpu == 4.0 | ||
| assert reply.to_ippr[0].desired_memory == 8 * 1024 * 1024 * 1024 |
Personal preference: Could we also ensure no new node creation was issued?
| # Pending nodes without a ray_node_id should not be selected for IPPR. | ||
| assert reply.to_ippr == [] | ||
| # Scheduler should launch a new node instead of overestimating pending capacity. | ||
| to_launch, _ = _launch_and_terminate(reply) |
Should the scheduler actually create a new node if the pending node will be able to meet the request with IPPR once it starts? Couldn't we also wait until the node starts and trigger IPPR on a future iteration?
Makes sense. Good catch! We should not create a new node at this point if the pending node will be able to meet the request.
| # Only one IPPR candidate is selected for this gang request. | ||
| assert len(reply.to_ippr) == 1 | ||
| assert {status.cloud_instance_id for status in reply.to_ippr} == {"pod-1"} | ||
| assert {status.desired_cpu for status in reply.to_ippr} == {4.0} |
Do we want to check that instance 2 is not modified here?
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 79a1405.
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Hi @edoakes, please take a look and merge when you have a chance. Thank you!
…n Kubernetes 1.35 (ray-project#55961)

## In-place Pod Resizing (IPPR) Integration on Kubernetes 1.35

Things that have been done:

1. IPPR JSON configuration and validation for users to enable the IPPR integration with Autoscaler v2.
2. Resize a Pod's CPU and memory resource requests and limits, in one step, to the maximums specified by the config.

## Configuration and Validation

Users can provide a `ray.io/ippr` annotation on their RayCluster CR to enable IPPR with Autoscaler v2:

```
{
  "groups": {
    "<groupName>": {
      "max-cpu": string|number,     # K8s quantity (e.g. "2", "1500m")
      "max-memory": string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer     # Seconds to wait for a pod resize to
                                    # complete before considering it timed out
    },
    ...
  }
}
```

`groupName` should match the names of Ray worker groups. In each group, `max-cpu`, `max-memory`, and `resize-timeout` are mandatory.

Besides the above configuration, we also validate:

1. The corresponding worker groups can't have `num-cpus` and `memory` in their `rayStartParams` because they can cause a mismatch between Ray logical resources and pod resources.
2. Worker groups should also have `cpu` and `memory` resource requests specified in their container specs.
3. In addition, their container should have `resizePolicy.restartPolicy` set to `NotRequired`.

## Resize Behavior

The current implementation will try to resize the existing nodes to the maximum specified by the user in one step if there are pending tasks that can fit on those nodes after resizing. We will implement gradual resizing and downsizing in later PRs. The detailed behavior is:

1. After filling pending tasks to the existing nodes with their current capacities, the autoscaler will try again to fill the remaining pending tasks to those nodes that have no ongoing resize, but with their maximum capacities specified in the `ray.io/ippr` annotation. If there are remaining pending tasks that can fit on a node, the autoscaler will send its k8s resize request and record the resize status in a pod annotation, `ray.io/ippr-status`, at the end of the current reconciliation.
2. If there are still pending tasks left, the autoscaler will do the original horizontal scale-out, but with the maximum capacity of each worker type in consideration.
3. At the beginning of the next reconciliation, the autoscaler will determine the next step for those resizes that were sent at the end of the previous reconciliation by looking into their statuses. The next step is one of two cases:
   **a)** Finish the resize by adjusting the logical resources on the Raylet and updating its `ray.io/ippr-status`.
   **b)** Adjust the resize by queueing a new k8s resize request due to a timeout or an error.

Note that if the RPC to adjust the logical resources on the Raylet fails, the autoscaler will retry in the next reconciliation because it doesn't update the corresponding `ray.io/ippr-status`.

### Before: without IPPR

<img width="940" height="428" alt="image" src="https://github.com/user-attachments/assets/deba6ece-4936-4658-ab29-a6eea5ec49ad" />

### After: red boxes are new behaviors for IPPR

<img width="947" height="830" alt="image" src="https://github.com/user-attachments/assets/1ddd03fe-74bc-45d3-a384-895fa5c52dbc" />

### Example cases

The following are different cases for IPPR. They are all based on this cluster shape:

HeadNode * 1 with CPU=0 for simplicity.
WorkerGroup1:
CPU: 1 (can be resized to 4)
MaxReplicas >= 1

#### Case 1: Resizing an existing node

**T1**: (There is 1 idle worker and we have a pending request for 2 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]
Autoscaler will try to resize the worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

**T2**: (The worker is resized and the pending request is consumed)
WorkerGroup1=[{CPU: 4, Available: 2}]
Pending Reqs=[]

#### Case 2: Resizing an existing worker and scaling out a new worker

**T1**: (There is 1 idle worker and we have pending requests for 2 and 4 CPUs)
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}, {CPU: 4}]
Autoscaler will try to resize the node to 4 CPUs and add a new worker:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}, {CPU: 1, Available: 1}]

**T2**: (The first worker is resized and the 4-CPU request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]
Autoscaler will then try to resize the second worker to 4 CPUs:
=> WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 1 -> 4, Available: 1 -> 4}]

**T3**: (The second worker is resized and the remaining request is consumed)
WorkerGroup1=[{CPU: 4, Available: 0}, {CPU: 4, Available: 2}]
Pending Reqs=[]

#### Case 3: Scaling out with IPPR capacity considerations

**T1**: (No worker initially, but there is a pending request for 2 CPUs)
WorkerGroup1=[]
Pending Reqs=[{CPU: 2}]
Autoscaler will still add a new node because the worker has the IPPR capacity:
=> WorkerGroup1=[{CPU: 1, Available: 1}]

**T2**:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]
Then this will be the same as Case 1.

#### Case 4: IPPR timeout

**T1**:
WorkerGroup1=[{CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]
Autoscaler first tries to resize the node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]

**T2**: (But if the IPPR times out,)
WorkerGroup1=[{CPU: 1 -> 4, Available: 1 -> 4}]
Pending Reqs=[{CPU: 2}]
Autoscaler will roll the IPPR back and scale out:
=> WorkerGroup1=[{CPU: 4 -> 1, Available: 4 -> 1}, {CPU: 1, Available: 1}]

**T3**:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1, Available: 1}]
Pending Reqs=[{CPU: 2}]
Autoscaler will try to resize the second node to 4 CPUs:
=> WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 1 -> 4, Available: 1 -> 4}]

**T4**:
WorkerGroup1=[{CPU: 1, Available: 1}, {CPU: 4, Available: 2}]
Pending Reqs=[]

## Additional notes

Things that are added in this PR:

1. `KubeRayIPPRProvider`: An instance that provides helper methods like `get_ippr_specs`, `get_ippr_statuses`, and `do_ippr_requests` to the autoscaler for doing IPPR, with the related structs `IPPRSpecs` and `IPPRStatus`.
2. An `AsyncResizeLocalResourceInstances` method has been added to `RayletClientWithIoContext`.

The above are unit tested.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [x] Tested with Kubernetes 1.35
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com>
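The `ray.io/ippr` annotation rules in the commit message above (mandatory `max-cpu`, `max-memory`, and `resize-timeout` per group, and group names matching Ray worker groups) could be checked along these lines. This is a minimal sketch, not the PR's schema validation: `parse_ippr_annotation`, `parse_cpu`, and `parse_memory` are hypothetical names, the quantity parsers cover only a subset of Kubernetes quantity suffixes, and `small-group` is just an example group name.

```python
import json
from typing import Any, Dict, Iterable, Union

REQUIRED_KEYS = ("max-cpu", "max-memory", "resize-timeout")

# Assumption: only these suffixes are handled; the full Kubernetes
# quantity grammar allows more forms (for example, exponents).
MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(value: Union[str, int, float]) -> float:
    """Convert a CPU quantity such as "2", "1500m", or 1.5 to cores."""
    if isinstance(value, (int, float)):
        return float(value)
    if value.endswith("m"):  # millicores
        return int(value[:-1]) / 1000.0
    return float(value)


def parse_memory(value: Union[str, int]) -> int:
    """Convert a memory quantity such as "8Gi" or 2147483648 to bytes."""
    if isinstance(value, int):
        return value
    for suffix, factor in MEMORY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)


def parse_ippr_annotation(
    raw: str, worker_groups: Iterable[str]
) -> Dict[str, Dict[str, Any]]:
    """Validate the annotation JSON and normalize each group's limits."""
    groups = json.loads(raw).get("groups")
    if not isinstance(groups, dict):
        raise ValueError('"groups" must be a JSON object')
    known = set(worker_groups)
    parsed: Dict[str, Dict[str, Any]] = {}
    for name, group in groups.items():
        if name not in known:
            raise ValueError(f"{name!r} does not match any worker group")
        missing = [key for key in REQUIRED_KEYS if key not in group]
        if missing:
            raise ValueError(f"group {name!r} is missing {missing}")
        if not isinstance(group["resize-timeout"], int):
            raise ValueError("resize-timeout must be an integer number of seconds")
        parsed[name] = {
            "max_cpu": parse_cpu(group["max-cpu"]),
            "max_memory": parse_memory(group["max-memory"]),
            "resize_timeout": group["resize-timeout"],
        }
    return parsed


annotation = (
    '{"groups": {"small-group": '
    '{"max-cpu": "1500m", "max-memory": "8Gi", "resize-timeout": 60}}}'
)
specs = parse_ippr_annotation(annotation, ["small-group"])
print(specs)
```

The other checks listed in the description (`rayStartParams`, container resource requests, `resizePolicy.restartPolicy`) need the worker group spec itself and are left out of this sketch.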
## Description

User facing docs for #55961 (comment)

## Related issues

Docs for #55961

## Additional information

- TODO: For master should I say v1.7 for KubeRay or not yet?
- For the different cases I generated those text examples but would diagrams be better? Happy to create diagrams unless we feel this is ok

---------

Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
